
LLM-Powered Agents: Technical Deep Dive

May 21, 2025 · 7 min read · Team 400

This post is technical. If you're a business stakeholder wanting to understand AI agents, read our non-technical guide instead.

Still here? Good. Let's get into the details of how LLM-powered agents actually work.

The Basic Agent Loop

At its core, an LLM agent is a loop:

while not task_complete:
    1. Observe: Gather current state and context
    2. Reason: LLM decides what to do next
    3. Act: Execute the chosen action
    4. Evaluate: Check if goal is achieved

The sophistication comes from how each step is implemented.
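In code, the skeleton looks something like this. It's a minimal sketch: llm_decide and execute_tool are stand-ins for your model call and tool layer (both covered below), and the step cap is the important part:

MAX_STEPS = 10  # hard cap so a confused agent can't loop forever

def run_agent(task: str) -> str:
    observations = [f"Task: {task}"]
    for _ in range(MAX_STEPS):
        # Reason: the model picks the next action given everything observed so far
        action = llm_decide(observations)
        if action["type"] == "final_answer":  # Evaluate: goal achieved
            return action["content"]
        # Act: run the chosen tool, then Observe: feed the result back in
        result = execute_tool(action["tool"], action["parameters"])
        observations.append(f"{action['tool']} -> {result}")
    return "Step limit reached - escalating to a human"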

Model Selection

Not all LLMs are equal for agent tasks. Considerations:

Reasoning Capability

Agents need models that can:

  • Follow multi-step instructions
  • Use tools correctly
  • Know when they don't know something
  • Recover from errors

As of early 2025, our typical choices:

  • GPT-4 / GPT-4 Turbo: Strong reasoning, good tool use, expensive
  • Claude 3 Opus/Sonnet: Excellent reasoning, better at nuance, good safety properties
  • Llama 3 70B: Capable, can self-host, lower inference cost
  • Mistral Large: Good balance of capability and cost

We avoid GPT-3.5 and smaller models for autonomous agent tasks. The error rate is too high.

Context Window

Agents often need long context:

  • Conversation history
  • Retrieved documents
  • Tool definitions
  • System instructions

GPT-4 Turbo's 128K context and Claude's 200K context are game-changers. You can fit substantial knowledge without aggressive truncation.

Tool Use Support

Native function calling is now standard:

  • OpenAI function calling
  • Anthropic tool use
  • Structured output modes

These are more reliable than prompting the model to output JSON. Use them.
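As a reference point, here's the shape of a tool-enabled call with the OpenAI Python SDK (v1-style API; the Anthropic SDK's tool use is analogous). The schema here is abbreviated; a full example appears in the Tool Design section below.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Please update my shipping address"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "update_customer_address",
            "description": "Updates a customer's shipping address.",
            "parameters": {
                "type": "object",
                "properties": {"customer_id": {"type": "string"}},
                "required": ["customer_id"],
            },
        },
    }],
)

# Tool calls come back as structured objects, not free-text JSON
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)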

Cost at Scale

At high volume, cost matters. Rough pricing (early 2025):

  • GPT-4 Turbo: ~$10-30 per 1M tokens
  • Claude Sonnet: ~$3-15 per 1M tokens
  • Self-hosted Llama: Infrastructure cost only

For an agent handling 10K interactions/month at an average of 4K tokens each, that's 40M tokens a month:

  • GPT-4 Turbo: ~$400-1200/month on API alone
  • Self-hosted: ~$200-500/month on infrastructure

Worth modelling before committing.
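The arithmetic is simple enough to keep in a script and rerun as prices change. A sketch using the rough early-2025 figures above:

interactions = 10_000          # per month
tokens_each = 4_000
price_per_1m = 10.0            # USD; low end of the GPT-4 Turbo pricing above

monthly_tokens = interactions * tokens_each           # 40M tokens
monthly_cost = monthly_tokens / 1_000_000 * price_per_1m
print(f"~${monthly_cost:,.0f}/month")                 # ~$400 at the low end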

Prompt Architecture

The system prompt is the foundation. Structure it carefully.

Core Components

# Role
You are [agent name], an AI assistant that [core purpose].

# Capabilities
You have access to these tools:
- [tool 1]: [description, when to use]
- [tool 2]: [description, when to use]

# Constraints
- Never [hard boundaries]
- Always [required behaviors]
- When uncertain, [fallback behavior]

# Process
1. [Step-by-step workflow]
2. [Decision points]
3. [Escalation criteria]

# Communication Style
- [Tone guidelines]
- [Formatting preferences]

Dynamic Context

Added to each request:

  • Current conversation history
  • Retrieved relevant information
  • Current state/variables
  • User context (if available)
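Assembling the final prompt is mostly concatenation in a fixed order. A sketch, where format_user and format_docs are illustrative helpers rather than any particular library's API:

def build_messages(system_prompt, state, history, docs, user=None):
    context_parts = [f"## Current state\n{state}"]
    if user:
        context_parts.append(f"## User context\n{format_user(user)}")
    if docs:
        context_parts.append(f"## Relevant information\n{format_docs(docs)}")
    system = system_prompt + "\n\n" + "\n\n".join(context_parts)
    # Conversation history rides along as prior user/assistant turns
    return [{"role": "system", "content": system}, *history]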

Token Budget Management

With limited context windows, you need strategies:

  • Conversation summarisation: periodically compress old conversation turns
  • Sliding window: keep N recent turns, summarise the rest (see the sketch after this list)
  • Selective retrieval: only include retrieved docs that score above a threshold
  • Dynamic few-shot: include examples relevant to the current query
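A minimal sliding-window sketch combining the first two strategies. total_tokens and llm_summarise are illustrative helpers (a tokeniser count and a summarisation call):

def fit_history(turns: list[dict], budget: int, keep_recent: int = 6) -> list[dict]:
    # If everything fits, send it all
    if total_tokens(turns) <= budget:
        return turns
    # Otherwise keep the last `keep_recent` turns verbatim and compress the rest
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = llm_summarise(old)  # one LLM call to compress the old turns
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}, *recent]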

Tool Design

Tools are how agents interact with the world.

Tool Definition

Each tool needs:

{
  "name": "update_customer_address",
  "description": "Updates a customer's shipping address. Use when customer requests address change.",
  "parameters": {
    "type": "object",
    "properties": {
      "customer_id": {
        "type": "string",
        "description": "The unique customer identifier"
      },
      "new_address": {
        "type": "object",
        "properties": {
          "street": {"type": "string"},
          "city": {"type": "string"},
          "state": {"type": "string"},
          "postcode": {"type": "string"}
        },
        "required": ["street", "city", "state", "postcode"]
      }
    },
    "required": ["customer_id", "new_address"]
  }
}

Key principles:

  • Clear, specific descriptions
  • Strongly typed parameters
  • Validation constraints where possible
  • Examples in description if helpful

Tool Execution Layer

The tool gateway handles:

def execute_tool(tool_name: str, parameters: dict) -> ToolResult:
    # 1. Validate parameters
    validate_parameters(tool_name, parameters)

    # 2. Check permissions
    check_permissions(current_user, tool_name, parameters)

    # 3. Log the attempt
    log_tool_call(tool_name, parameters)

    # 4. Execute with timeout
    with timeout(TOOL_TIMEOUT):
        result = tools[tool_name].execute(parameters)

    # 5. Log the result
    log_tool_result(tool_name, result)

    # 6. Return structured result
    return ToolResult(
        success=result.success,
        data=result.data,
        error=result.error
    )

Never let the LLM execute arbitrary code or call APIs directly.

Memory Systems

Agents need memory beyond single interactions.

Short-Term Memory

The current conversation. Usually kept in full within the context window.

Working Memory

Task-specific state during a multi-turn interaction:

from dataclasses import dataclass, field

@dataclass
class ConversationState:
    intent: str = ""
    collected_data: dict = field(default_factory=dict)
    steps_completed: list = field(default_factory=list)
    pending_actions: list = field(default_factory=list)
    confidence: float = 0.0

Persisted between turns, included in prompt.

Long-Term Memory

Persistent information about users, preferences, past interactions:

# Retrieve relevant memory
memories = memory_store.search(
    user_id=user.id,
    query=current_query,
    limit=5,
    recency_weight=0.3
)

# Include in prompt
context += format_memories(memories)

Vector databases (Pinecone, Weaviate, Qdrant) work well for this.

Memory Updates

After each interaction:

# Extract memorable information
important_info = extract_memories(conversation)

# Store with metadata
for info in important_info:
    memory_store.add(
        user_id=user.id,
        content=info.content,
        embedding=embed(info.content),
        metadata={
            "timestamp": now(),
            "conversation_id": conv_id,
            "type": info.type
        }
    )

Be selective. Not everything should be remembered.

Retrieval Architecture

RAG (Retrieval-Augmented Generation) gives agents access to knowledge.

Indexing Pipeline

Documents → Chunking → Embedding → Vector Store
                ↓
          Metadata Extraction

Chunking strategy matters:

  • Too small: loses context
  • Too large: dilutes relevance
  • Overlapping windows often help

We typically use 500-1000 token chunks with 100 token overlap.
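A token-based chunker matching those numbers, using tiktoken for counting (window and overlap defaults are the figures above; tune per corpus):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    tokens = enc.encode(text)
    chunks = []
    step = size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # last window already reached the end
    return chunks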

Retrieval Pipeline

Query → Query Embedding → Vector Search → Reranking → Selection
                                              ↓
                                    Diversity Filter
                                              ↓
                                    Recency Filter

Don't just take top-K by similarity. Consider:

  • Result diversity (avoid redundancy)
  • Source authority (official docs > wiki > emails)
  • Recency (for time-sensitive info)

Hybrid Search

Combine vector search with keyword search:

vector_results = vector_db.search(query_embedding, k=20)
keyword_results = keyword_db.search(query_text, k=20)

# Merge with reciprocal rank fusion
combined = reciprocal_rank_fusion(vector_results, keyword_results)
return combined[:5]

Often outperforms pure vector search.
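reciprocal_rank_fusion is small enough to write yourself: each document scores the sum of 1/(k + rank) across the result lists it appears in, with k ≈ 60 by convention. This sketch assumes both searches return comparable document IDs:

def reciprocal_rank_fusion(*result_lists, k: int = 60) -> list:
    scores: dict = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)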

Orchestration Patterns

Simple Chain

For straightforward tasks:

User Input → LLM Reasoning → Tool Call → Response

ReAct Pattern

Reasoning and acting interleaved:

while not done:
    thought = llm.think(observation)
    action = llm.decide_action(thought)
    observation = execute(action)
    if is_final_answer(observation):
        done = True

Good for multi-step reasoning with tools.

Plan-and-Execute

For complex tasks:

1. LLM creates full plan
2. Execute each step
3. Re-plan if needed

Better for longer tasks where upfront planning helps.
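A sketch of that loop; llm_plan, needs_replan, and synthesise_answer are illustrative helpers, and run_agent could be the ReAct loop from the previous pattern:

from collections import deque

def plan_and_execute(task: str) -> str:
    steps = deque(llm_plan(task))            # 1. LLM creates full plan
    results = []
    while steps:
        result = run_agent(steps.popleft())  # 2. Execute each step
        results.append(result)
        if needs_replan(result):
            # 3. Re-plan the remaining steps in light of what we've learned
            steps = deque(llm_plan(task, completed=results))
    return synthesise_answer(task, results)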

Multi-Agent

Specialised agents coordinating:

                    ┌─ Research Agent
Orchestrator Agent ─┼─ Writer Agent
                    └─ Reviewer Agent

Use when tasks have distinct phases requiring different capabilities.

Error Handling

Agents fail. Design for it.

Graceful Degradation

try:
    result = agent.run(task)
except ToolError as e:
    # Tool failed - acknowledge and try alternative
    return fallback_response(e)
except ModelError as e:
    # LLM failed - retry with backoff
    return retry_with_backoff(task)
except TimeoutError:
    # Taking too long - partial response + escalate
    return partial_response_with_escalation()

Confidence Calibration

Have the agent assess its own confidence:

Before taking action, rate your confidence 1-10.
Below 6: Ask for clarification
6-8: Proceed but note uncertainty
Above 8: Proceed confidently

Not perfect, but helps catch uncertain responses.

Observability

You cannot improve what you cannot observe.

Logging

Log everything:

  • Full prompts and responses
  • Tool calls and results
  • State changes
  • Timing information
  • Token counts
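One structured record per model call keeps all of this queryable. A sketch with the standard logging module, assuming an OpenAI-style response object with a usage field:

import json, logging, time

logger = logging.getLogger("agent")

def log_llm_call(prompt: str, response, model: str, started: float) -> None:
    logger.info(json.dumps({
        "event": "llm_call",
        "model": model,
        "latency_ms": round((time.time() - started) * 1000),
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "prompt": prompt,
        "response": response.choices[0].message.content,
    }))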

Metrics

Track:

  • Task completion rate
  • Average turns to completion
  • Tool use patterns
  • Error rates by type
  • Latency distribution
  • Cost per interaction

Debugging

Build tools for:

  • Replaying conversations
  • Comparing prompt variations
  • Tracing decision paths
  • Identifying failure patterns

Production Considerations

Rate Limiting

Protect against:

  • API cost explosions
  • Runaway agent loops
  • Abuse

Implement per-user, per-minute, and per-day limits.
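An in-memory sliding-window limiter shows the shape (the limits here are illustrative; in production you'd back this with Redis so it works across processes):

import time
from collections import defaultdict, deque

LIMITS = {"minute": (60, 20), "day": (86_400, 2_000)}  # window seconds -> max calls
calls: dict = defaultdict(deque)

def allow(user_id: str) -> bool:
    now = time.time()
    checked = []
    for name, (window, limit) in LIMITS.items():
        q = calls[(user_id, name)]
        while q and now - q[0] > window:
            q.popleft()              # evict calls that fell out of the window
        if len(q) >= limit:
            return False             # over a limit: reject, queue, or escalate
        checked.append(q)
    for q in checked:
        q.append(now)                # record the call in every window
    return True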

Caching

Cache where possible:

  • Embedding calculations
  • Retrieval results
  • Common responses

Significantly reduces cost and latency.
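Embeddings are the easiest win, since identical text always embeds to the same vector. A minimal in-process version, where embed is the embedding call from the earlier examples (production systems usually use Redis keyed on a hash of model + text):

from functools import lru_cache

@lru_cache(maxsize=50_000)
def cached_embed(text: str) -> tuple:
    # lru_cache needs hashable values, so store the vector as a tuple
    return tuple(embed(text))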

Fallback Models

If primary model is unavailable:

def complete_with_fallback(prompt: str) -> str:
    models = [
        ("gpt-4-turbo", primary_client),
        ("claude-3-sonnet", fallback_client),
        ("llama-3-70b", local_client),
    ]
    for model, client in models:
        try:
            return client.complete(model=model, prompt=prompt)
        except ServiceUnavailable:
            continue  # try the next provider in the chain
    raise AllModelsUnavailable()

Further Reading

This is necessarily a survey. For deeper dives:

  • LangChain / LlamaIndex docs for framework approaches
  • Anthropic's tool use documentation
  • OpenAI's function calling guide
  • Our architecture post for enterprise patterns

We build production AI agents with these patterns. Happy to discuss your technical challenges.

Contact us