LLM-Powered Agents: Technical Deep Dive
This post is technical. If you're a business stakeholder wanting to understand AI agents, read our non-technical guide instead.
Still here? Good. Let's get into the details of how LLM-powered agents actually work.
The Basic Agent Loop
At its core, an LLM agent is a loop:
while not task_complete:
1. Observe: Gather current state and context
2. Reason: LLM decides what to do next
3. Act: Execute the chosen action
4. Evaluate: Check if goal is achieved
The sophistication comes from how each step is implemented.
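In code, the loop might look like the sketch below; observe, llm_decide, execute, update_context and goal_achieved are placeholders for application-specific implementations, and the step cap guards against runaway loops:

def run_agent(task: str, max_steps: int = 20):
    context = observe(task)                     # 1. gather current state and context
    for _ in range(max_steps):
        action = llm_decide(task, context)      # 2. the LLM chooses the next action
        result = execute(action)                # 3. run the tool or produce output
        context = update_context(context, action, result)
        if goal_achieved(task, context):        # 4. evaluate against the goal
            return result
    raise RuntimeError("Agent exceeded its step budget")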
Model Selection
Not all LLMs are equally suited to agent tasks. Key considerations:
Reasoning Capability
Agents need models that can:
- Follow multi-step instructions
- Use tools correctly
- Know when they don't know something
- Recover from errors
As of early 2025, our typical choices:
- GPT-4 / GPT-4 Turbo: Strong reasoning, good tool use, expensive
- Claude 3 Opus/Sonnet: Excellent reasoning, better at nuance, good safety properties
- Llama 3 70B: Capable, can self-host, lower inference cost
- Mistral Large: Good balance of capability and cost
We avoid GPT-3.5 and smaller models for autonomous agent tasks. The error rate is too high.
Context Window
Agents often need long context:
- Conversation history
- Retrieved documents
- Tool definitions
- System instructions
GPT-4 Turbo's 128K context and Claude's 200K context are game-changers. You can fit substantial knowledge without aggressive truncation.
Tool Use Support
Native function calling is now standard:
- OpenAI function calling
- Anthropic tool use
- Structured output modes
These are more reliable than prompting the model to output JSON. Use them.
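As an illustration, a tool-enabled call with the OpenAI Python SDK looks roughly like the sketch below (Anthropic's tool use follows the same shape via messages.create(..., tools=...)); the abbreviated schema and the example request are ours:

import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Please update my shipping address."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "update_customer_address",
            "description": "Updates a customer's shipping address.",
            "parameters": {"type": "object", "properties": {"customer_id": {"type": "string"}}},
        },
    }],
)

message = response.choices[0].message
if message.tool_calls:                               # the model chose to call a tool
    call = message.tool_calls[0]
    arguments = json.loads(call.function.arguments)  # arguments arrive as a JSON string
    # ...dispatch to the tool execution layer described later in this post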
Cost at Scale
At high volume, cost matters. Rough pricing (early 2025):
- GPT-4 Turbo: ~$10 per 1M input tokens / ~$30 per 1M output tokens
- Claude 3 Sonnet: ~$3 per 1M input tokens / ~$15 per 1M output tokens
- Self-hosted Llama: Infrastructure cost only
For an agent handling 10K interactions/month at an average of 4K tokens each (40M tokens/month), that's:
- GPT-4 Turbo: ~$400-1200/month on API alone
- Self-hosted: ~$200-500/month on infrastructure
Worth modelling before committing.
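For example, a back-of-envelope model of the figures above (prices and volumes are illustrative):

interactions_per_month = 10_000
tokens_per_interaction = 4_000
total_tokens = interactions_per_month * tokens_per_interaction  # 40M tokens/month

# The blended price per token depends on the input/output mix, so show both ends
for label, price_per_million in [("all input ($10/1M)", 10), ("all output ($30/1M)", 30)]:
    monthly_cost = total_tokens / 1_000_000 * price_per_million
    print(f"{label}: ${monthly_cost:,.0f}/month")  # $400 and $1,200 respectively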
Prompt Architecture
The system prompt is the foundation. Structure it carefully.
Core Components
# Role
You are [agent name], an AI assistant that [core purpose].
# Capabilities
You have access to these tools:
- [tool 1]: [description, when to use]
- [tool 2]: [description, when to use]
# Constraints
- Never [hard boundaries]
- Always [required behaviors]
- When uncertain, [fallback behavior]
# Process
1. [Step-by-step workflow]
2. [Decision points]
3. [Escalation criteria]
# Communication Style
- [Tone guidelines]
- [Formatting preferences]
Dynamic Context
Added to each request:
- Current conversation history
- Retrieved relevant information
- Current state/variables
- User context (if available)
Token Budget Management
With limited context windows, you need strategies:
- Conversation summarisation: periodically compress old conversation turns
- Sliding window: keep the N most recent turns, summarise the rest (see the sketch after this list)
- Selective retrieval: only include retrieved docs that score above a threshold
- Dynamic few-shot: include examples relevant to the current query
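A minimal sketch of the sliding-window strategy, where summarise_turns is a hypothetical LLM-backed helper that compresses the older turns:

def build_history(turns: list[str], keep_recent: int = 10) -> list[str]:
    # Keep the most recent turns verbatim; fold everything older into one summary
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = summarise_turns(old)  # hypothetical LLM call
    return [f"Summary of earlier conversation: {summary}"] + recent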
Tool Design
Tools are how agents interact with the world.
Tool Definition
Each tool needs:
{
  "name": "update_customer_address",
  "description": "Updates a customer's shipping address. Use when customer requests address change.",
  "parameters": {
    "type": "object",
    "properties": {
      "customer_id": {
        "type": "string",
        "description": "The unique customer identifier"
      },
      "new_address": {
        "type": "object",
        "properties": {
          "street": {"type": "string"},
          "city": {"type": "string"},
          "state": {"type": "string"},
          "postcode": {"type": "string"}
        },
        "required": ["street", "city", "state", "postcode"]
      }
    },
    "required": ["customer_id", "new_address"]
  }
}
Key principles:
- Clear, specific descriptions
- Strongly typed parameters
- Validation constraints where possible
- Examples in description if helpful
Tool Execution Layer
The tool gateway handles:
def execute_tool(tool_name: str, parameters: dict) -> ToolResult:
    # 1. Validate parameters
    validate_parameters(tool_name, parameters)

    # 2. Check permissions
    check_permissions(current_user, tool_name, parameters)

    # 3. Log the attempt
    log_tool_call(tool_name, parameters)

    # 4. Execute with timeout
    with timeout(TOOL_TIMEOUT):
        result = tools[tool_name].execute(parameters)

    # 5. Log the result
    log_tool_result(tool_name, result)

    # 6. Return structured result
    return ToolResult(
        success=result.success,
        data=result.data,
        error=result.error
    )
Never let the LLM execute arbitrary code or call APIs directly.
Memory Systems
Agents need memory beyond single interactions.
Short-Term Memory
The current conversation. Usually kept in full within context window.
Working Memory
Task-specific state during a multi-turn interaction:
from dataclasses import dataclass

@dataclass
class ConversationState:
    intent: str
    collected_data: dict
    steps_completed: list
    pending_actions: list
    confidence: float
Persisted between turns, included in prompt.
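A minimal sketch of that persistence step, assuming a hypothetical state_store key-value layer (Redis, a database row, or similar) keyed by conversation ID:

import json
from dataclasses import asdict

def load_state(conversation_id: str) -> ConversationState:
    # Fall back to a fresh state on the first turn
    return state_store.get(conversation_id) or ConversationState(
        intent="", collected_data={}, steps_completed=[], pending_actions=[], confidence=0.0
    )

def state_as_context(state: ConversationState) -> str:
    # Serialised state is appended to the dynamic context for the next LLM call
    return "# Current State\n" + json.dumps(asdict(state), indent=2)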
Long-Term Memory
Persistent information about users, preferences, past interactions:
# Retrieve relevant memory
memories = memory_store.search(
    user_id=user.id,
    query=current_query,
    limit=5,
    recency_weight=0.3
)

# Include in prompt
context += format_memories(memories)
Vector databases (Pinecone, Weaviate, Qdrant) work well for this.
Memory Updates
After each interaction:
# Extract memorable information
important_info = extract_memories(conversation)

# Store with metadata
for info in important_info:
    memory_store.add(
        user_id=user.id,
        content=info.content,
        embedding=embed(info.content),
        metadata={
            "timestamp": now(),
            "conversation_id": conv_id,
            "type": info.type
        }
    )
Be selective. Not everything should be remembered.
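One hedged way to implement the extract_memories step used above is to ask the model itself for a short, typed list of durable facts; the prompt wording, the Memory shape and the llm_complete helper are our assumptions, not a specific library API:

import json
from dataclasses import dataclass

@dataclass
class Memory:
    content: str
    type: str  # e.g. "preference", "account_detail", "commitment"

EXTRACT_PROMPT = (
    "List up to 3 facts from this conversation worth remembering long-term "
    "(stable preferences, account details, commitments). Respond with a JSON "
    'array of {"content": ..., "type": ...} objects, or [] if nothing qualifies.'
)

def extract_memories(conversation: str) -> list[Memory]:
    raw = llm_complete(EXTRACT_PROMPT + "\n\n" + conversation)  # hypothetical LLM call
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # treat unparseable output as "nothing worth remembering"
    return [
        Memory(item["content"], item["type"])
        for item in items
        if isinstance(item, dict) and "content" in item and "type" in item
    ]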
Retrieval Architecture
RAG (Retrieval-Augmented Generation) gives agents access to knowledge.
Indexing Pipeline
Documents → Chunking → Embedding → Vector Store
                ↓
        Metadata Extraction
Chunking strategy matters:
- Too small: loses context
- Too large: dilutes relevance
- Overlapping windows often help
We typically use 500-1000 token chunks with 100 token overlap.
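A minimal chunker along those lines, assuming tiktoken for token counting (any tokenizer with encode/decode would do):

import tiktoken

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = size - overlap  # each window starts `overlap` tokens before the previous one ends
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(enc.decode(window))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks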
Retrieval Pipeline
Query → Query Embedding → Vector Search → Reranking → Selection
                                              ↓
                                      Diversity Filter
                                              ↓
                                       Recency Filter
Don't just take top-K by similarity. Consider:
- Result diversity (avoid redundancy)
- Source authority (official docs > wiki > emails)
- Recency (for time-sensitive info)
Hybrid Search
Combine vector search with keyword search:
vector_results = vector_db.search(query_embedding, k=20)
keyword_results = keyword_db.search(query_text, k=20)
# Merge with reciprocal rank fusion
combined = reciprocal_rank_fusion(vector_results, keyword_results)
return combined[:5]
Often outperforms pure vector search.
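For reference, reciprocal rank fusion itself is small: each result's score is the sum of 1/(k + rank) over the result lists it appears in, with k = 60 as the commonly cited default. A sketch, assuming results are hashable IDs:

def reciprocal_rank_fusion(*result_lists: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Documents that appear high in several lists accumulate the largest scores
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)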
Orchestration Patterns
Simple Chain
For straightforward tasks:
User Input → LLM Reasoning → Tool Call → Response
ReAct Pattern
Reasoning and acting interleaved:
observation = initial_input  # e.g. the user's request
done = False
while not done:
    thought = llm.think(observation)        # reason about the current state
    action = llm.decide_action(thought)     # pick a tool call or a final answer
    observation = execute(action)
    if is_final_answer(observation):
        done = True
Good for multi-step reasoning with tools.
Plan-and-Execute
For complex tasks:
1. LLM creates full plan
2. Execute each step
3. Re-plan if needed
Better for longer tasks where upfront planning helps.
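A minimal sketch of the pattern, with plan_steps and execute_step standing in for LLM-backed helpers rather than any particular framework API:

def plan_and_execute(task: str, max_replans: int = 2) -> list:
    plan = plan_steps(task)  # LLM drafts an ordered list of steps up front
    done, replans = [], 0
    while plan:
        step, *plan = plan
        result = execute_step(step, done)  # may call tools; sees prior results
        if result.failed and replans < max_replans:
            plan = plan_steps(task, completed=done)  # re-plan the remaining work
            replans += 1
            continue
        done.append(result)
    return done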
Multi-Agent
Specialised agents coordinating:
                    ┌─ Research Agent
Orchestrator Agent ─┼─ Writer Agent
                    └─ Reviewer Agent
Use when tasks have distinct phases requiring different capabilities.
Error Handling
Agents fail. Design for it.
Graceful Degradation
try:
    result = agent.run(task)
except ToolError as e:
    # Tool failed - acknowledge and try alternative
    return fallback_response(e)
except ModelError as e:
    # LLM failed - retry with backoff
    return retry_with_backoff(task)
except TimeoutError:
    # Taking too long - partial response + escalate
    return partial_response_with_escalation()
Confidence Calibration
Have the agent assess its own confidence:
Before taking action, rate your confidence 1-10.
Below 6: Ask for clarification
6-8: Proceed but note uncertainty
Above 8: Proceed confidently
Not perfect, but helps catch uncertain responses.
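One way to wire that in, assuming the model returns its draft action and confidence as JSON (the field names and helper functions here are our convention, not a standard):

import json

decision = json.loads(
    llm_complete(agent_prompt + '\nBefore acting, respond as JSON: {"action": ..., "confidence": 1-10}')
)  # llm_complete and agent_prompt are hypothetical

if decision["confidence"] < 6:
    reply = ask_clarifying_question(decision)   # too uncertain to act
elif decision["confidence"] <= 8:
    reply = proceed_with_caveat(decision)       # act, but flag the uncertainty
else:
    reply = proceed(decision)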
Observability
You cannot improve what you cannot observe.
Logging
Log everything:
- Full prompts and responses
- Tool calls and results
- State changes
- Timing information
- Token counts
Metrics
Track:
- Task completion rate
- Average turns to completion
- Tool use patterns
- Error rates by type
- Latency distribution
- Cost per interaction
Debugging
Build tools for:
- Replaying conversations
- Comparing prompt variations
- Tracing decision paths
- Identifying failure patterns
Production Considerations
Rate Limiting
Protect against:
- API cost explosions
- Runaway agent loops
- Abuse
Implement per-user, per-minute, and per-day limits.
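A minimal in-process sliding-window limiter as a sketch; in production this usually lives in Redis or behind an API gateway, with separate per-minute and per-day instances:

import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        q = self.calls[user_id]
        while q and now - q[0] > self.window:
            q.popleft()  # drop calls that have aged out of the window
        if len(q) >= self.max_calls:
            return False  # over the limit; reject or queue the request
        q.append(now)
        return True

per_minute = RateLimiter(max_calls=20, window_seconds=60)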
Caching
Cache where possible:
- Embedding calculations
- Retrieval results
- Common responses
Significantly reduces cost and latency.
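For example, embedding calls are trivially cacheable on a hash of the input text; the in-memory dict and the embed() call below are placeholders for your cache backend and embedding provider:

import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cached_embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed(text)  # the real embedding call happens only on a miss
    return _embedding_cache[key]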
Fallback Models
If primary model is unavailable:
models = [
    ("gpt-4-turbo", primary_client),
    ("claude-3-sonnet", fallback_client),
    ("llama-3-70b", local_client),
]

for model, client in models:
    try:
        return client.complete(model=model, prompt=prompt)
    except ServiceUnavailable:
        continue  # try the next provider in the list

raise AllModelsUnavailable()
Further Reading
This is necessarily a survey. For deeper dives:
- LangChain / LlamaIndex docs for framework approaches
- Anthropic's tool use documentation
- OpenAI's function calling guide
- Our architecture post for enterprise patterns
We build production AI agents with these patterns. Happy to discuss your technical challenges.