LLM-Powered Agents: Technical Deep Dive
This post is technical. If you're a business stakeholder wanting to understand AI agents, read our non-technical guide instead.
Still here? Good. Let's get into the details of how LLM-powered agents actually work.
The Basic Agent Loop
At its core, an LLM agent is a loop:
while not task_complete:
1. Observe: Gather current state and context
2. Reason: LLM decides what to do next
3. Act: Execute the chosen action
4. Evaluate: Check if goal is achieved
The sophistication comes from how each step is implemented.
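In code, the loop might look like the sketch below; observe, llm_decide, execute, update_context and goal_achieved are placeholders for application-specific implementations, and the step cap guards against runaway loops:

def run_agent(task: str, max_steps: int = 20):
    context = observe(task)                     # 1. gather current state and context
    for _ in range(max_steps):
        action = llm_decide(task, context)      # 2. the LLM chooses the next action
        result = execute(action)                # 3. run the tool or produce output
        context = update_context(context, action, result)
        if goal_achieved(task, context):        # 4. evaluate against the goal
            return result
    raise RuntimeError("Agent exceeded its step budget")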
Model Selection
Not all LLMs are equally suited to agent tasks. Key considerations:
Reasoning Capability
Agents need models that can:
- Follow multi-step instructions
- Use tools correctly
- Know when they don't know something
- Recover from errors
As of early 2025, our typical choices:
- GPT-4 / GPT-4 Turbo: Strong reasoning, good tool use, expensive
- Claude 3 Opus/Sonnet: Excellent reasoning, better at nuance, good safety properties
- Llama 3 70B: Capable, can self-host, lower inference cost
- Mistral Large: Good balance of capability and cost
We avoid GPT-3.5 and smaller models for autonomous agent tasks. The error rate is too high.
Context Window
Agents often need long context:
- Conversation history
- Retrieved documents
- Tool definitions
- System instructions
GPT-4 Turbo's 128K context and Claude's 200K context are game-changers. You can fit substantial knowledge without aggressive truncation.
Tool Use Support
Native function calling is now standard:
- OpenAI function calling
- Anthropic tool use
- Structured output modes
These are more reliable than prompting the model to output JSON. Use them.
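As an illustration, a tool-enabled call with the OpenAI Python SDK looks roughly like the sketch below (Anthropic's tool use follows the same shape via messages.create(..., tools=...)); the abbreviated schema and the example request are ours:

import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Please update my shipping address."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "update_customer_address",
            "description": "Updates a customer's shipping address.",
            "parameters": {"type": "object", "properties": {"customer_id": {"type": "string"}}},
        },
    }],
)

message = response.choices[0].message
if message.tool_calls:                               # the model chose to call a tool
    call = message.tool_calls[0]
    arguments = json.loads(call.function.arguments)  # arguments arrive as a JSON string
    # ...dispatch to the tool execution layer described later in this post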
Cost at Scale
At high volume, cost matters. Rough pricing (early 2025):
- GPT-4 Turbo: ~$10 per 1M input tokens / ~$30 per 1M output tokens
- Claude 3 Sonnet: ~$3 per 1M input tokens / ~$15 per 1M output tokens
- Self-hosted Llama: Infrastructure cost only
For an agent handling 10K interactions/month at an average of 4K tokens each (40M tokens/month), that's:
- GPT-4 Turbo: ~$400-1200/month on API alone
- Self-hosted: ~$200-500/month on infrastructure
Worth modelling before committing.
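For example, a back-of-envelope model of the figures above (prices and volumes are illustrative):

interactions_per_month = 10_000
tokens_per_interaction = 4_000
total_tokens = interactions_per_month * tokens_per_interaction  # 40M tokens/month

# The blended price per token depends on the input/output mix, so show both ends
for label, price_per_million in [("all input ($10/1M)", 10), ("all output ($30/1M)", 30)]:
    monthly_cost = total_tokens / 1_000_000 * price_per_million
    print(f"{label}: ${monthly_cost:,.0f}/month")  # $400 and $1,200 respectively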
Prompt Architecture
The system prompt is the foundation. Structure it carefully.
Core Components
# Role
You are [agent name], an AI assistant that [core purpose].
# Capabilities
You have access to these tools:
- [tool 1]: [description, when to use]
- [tool 2]: [description, when to use]
# Constraints
- Never [hard boundaries]
- Always [required behaviors]
- When uncertain, [fallback behavior]
# Process
1. [Step-by-step workflow]
2. [Decision points]
3. [Escalation criteria]
# Communication Style
- [Tone guidelines]
- [Formatting preferences]
Dynamic Context
Added to each request:
- Current conversation history
- Retrieved relevant information
- Current state/variables
- User context (if available)
Token Budget Management
With limited context windows, you need strategies:
- Conversation summarisation: periodically compress old conversation turns
- Sliding window: keep the N most recent turns, summarise the rest (see the sketch after this list)
- Selective retrieval: only include retrieved docs that score above a threshold
- Dynamic few-shot: include examples relevant to the current query
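A minimal sketch of the sliding-window strategy, where summarise_turns is a hypothetical LLM-backed helper that compresses the older turns:

def build_history(turns: list[str], keep_recent: int = 10) -> list[str]:
    # Keep the most recent turns verbatim; fold everything older into one summary
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = summarise_turns(old)  # hypothetical LLM call
    return [f"Summary of earlier conversation: {summary}"] + recent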
Tool Design
Tools are how agents interact with the world.
Tool Definition
Each tool needs:
{
  "name": "update_customer_address",
  "description": "Updates a customer's shipping address. Use when customer requests address change.",
  "parameters": {
    "type": "object",
    "properties": {
      "customer_id": {
        "type": "string",
        "description": "The unique customer identifier"
      },
      "new_address": {
        "type": "object",
        "properties": {
          "street": {"type": "string"},
          "city": {"type": "string"},
          "state": {"type": "string"},
          "postcode": {"type": "string"}
        },
        "required": ["street", "city", "state", "postcode"]
      }
    },
    "required": ["customer_id", "new_address"]
  }
}
Key principles:
- Clear, specific descriptions
- Strongly typed parameters
- Validation constraints where possible
- Examples in description if helpful
Tool Execution Layer
The tool gateway handles:
def execute_tool(tool_name: str, parameters: dict) -> ToolResult:
    # 1. Validate parameters
    validate_parameters(tool_name, parameters)

    # 2. Check permissions
    check_permissions(current_user, tool_name, parameters)

    # 3. Log the attempt
    log_tool_call(tool_name, parameters)

    # 4. Execute with timeout
    with timeout(TOOL_TIMEOUT):
        result = tools[tool_name].execute(parameters)

    # 5. Log the result
    log_tool_result(tool_name, result)

    # 6. Return structured result
    return ToolResult(
        success=result.success,
        data=result.data,
        error=result.error
    )
Never let the LLM execute arbitrary code or call APIs directly.
Memory Systems
Agents need memory beyond single interactions.
Short-Term Memory
The current conversation. Usually kept in full within context window.
Working Memory
Task-specific state during a multi-turn interaction:
from dataclasses import dataclass

@dataclass
class ConversationState:
    intent: str
    collected_data: dict
    steps_completed: list
    pending_actions: list
    confidence: float
Persisted between turns, included in prompt.
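A minimal sketch of that persistence step, assuming a hypothetical state_store key-value layer (Redis, a database row, or similar) keyed by conversation ID:

import json
from dataclasses import asdict

def load_state(conversation_id: str) -> ConversationState:
    # Fall back to a fresh state on the first turn
    return state_store.get(conversation_id) or ConversationState(
        intent="", collected_data={}, steps_completed=[], pending_actions=[], confidence=0.0
    )

def state_as_context(state: ConversationState) -> str:
    # Serialised state is appended to the dynamic context for the next LLM call
    return "# Current State\n" + json.dumps(asdict(state), indent=2)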
Long-Term Memory
Persistent information about users, preferences, past interactions:
# Retrieve relevant memory
memories = memory_store.search(
    user_id=user.id,
    query=current_query,
    limit=5,
    recency_weight=0.3
)

# Include in prompt
context += format_memories(memories)
Vector databases (Pinecone, Weaviate, Qdrant) work well for this.
Memory Updates
After each interaction:
# Extract memorable information
important_info = extract_memories(conversation)

# Store with metadata
for info in important_info:
    memory_store.add(
        user_id=user.id,
        content=info.content,
        embedding=embed(info.content),
        metadata={
            "timestamp": now(),
            "conversation_id": conv_id,
            "type": info.type
        }
    )
Be selective. Not everything should be remembered.
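One hedged way to implement the extract_memories step used above is to ask the model itself for a short, typed list of durable facts; the prompt wording, the Memory shape and the llm_complete helper are our assumptions, not a specific library API:

import json
from dataclasses import dataclass

@dataclass
class Memory:
    content: str
    type: str  # e.g. "preference", "account_detail", "commitment"

EXTRACT_PROMPT = (
    "List up to 3 facts from this conversation worth remembering long-term "
    "(stable preferences, account details, commitments). Respond with a JSON "
    'array of {"content": ..., "type": ...} objects, or [] if nothing qualifies.'
)

def extract_memories(conversation: str) -> list[Memory]:
    raw = llm_complete(EXTRACT_PROMPT + "\n\n" + conversation)  # hypothetical LLM call
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # treat unparseable output as "nothing worth remembering"
    return [
        Memory(item["content"], item["type"])
        for item in items
        if isinstance(item, dict) and "content" in item and "type" in item
    ]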
Retrieval Architecture
RAG (Retrieval-Augmented Generation) gives agents access to knowledge.
Indexing Pipeline
Documents → Chunking → Embedding → Vector Store
                ↓
        Metadata Extraction
Chunking strategy matters:
- Too small: loses context
- Too large: dilutes relevance
- Overlapping windows often help
We typically use 500-1000 token chunks with 100 token overlap.
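A minimal chunker along those lines, assuming tiktoken for token counting (any tokenizer with encode/decode would do):

import tiktoken

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = size - overlap  # each window starts `overlap` tokens before the previous one ends
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(enc.decode(window))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks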
Retrieval Pipeline
Query → Query Embedding → Vector Search → Reranking → Selection
                                              ↓
                                      Diversity Filter
                                              ↓
                                       Recency Filter
Don't just take top-K by similarity. Consider:
- Result diversity (avoid redundancy)
- Source authority (official docs > wiki > emails)
- Recency (for time-sensitive info)
Hybrid Search
Combine vector search with keyword search:
vector_results = vector_db.search(query_embedding, k=20)
keyword_results = keyword_db.search(query_text, k=20)
# Merge with reciprocal rank fusion
combined = reciprocal_rank_fusion(vector_results, keyword_results)
return combined[:5]
Often outperforms pure vector search.
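For reference, reciprocal rank fusion itself is small: each result's score is the sum of 1/(k + rank) over the result lists it appears in, with k = 60 as the commonly cited default. A sketch, assuming results are hashable IDs:

def reciprocal_rank_fusion(*result_lists: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Documents that appear high in several lists accumulate the largest scores
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)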
Orchestration Patterns
Simple Chain
For straightforward tasks:
User Input → LLM Reasoning → Tool Call → Response
ReAct Pattern
Reasoning and acting interleaved:
observation = initial_input  # e.g. the user's request
done = False
while not done:
    thought = llm.think(observation)        # reason about the current state
    action = llm.decide_action(thought)     # pick a tool call or a final answer
    observation = execute(action)
    if is_final_answer(observation):
        done = True
Good for multi-step reasoning with tools.
Plan-and-Execute
For complex tasks:
1. LLM creates full plan
2. Execute each step
3. Re-plan if needed
Better for longer tasks where upfront planning helps.
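A minimal sketch of the pattern, with plan_steps and execute_step standing in for LLM-backed helpers rather than any particular framework API:

def plan_and_execute(task: str, max_replans: int = 2) -> list:
    plan = plan_steps(task)  # LLM drafts an ordered list of steps up front
    done, replans = [], 0
    while plan:
        step, *plan = plan
        result = execute_step(step, done)  # may call tools; sees prior results
        if result.failed and replans < max_replans:
            plan = plan_steps(task, completed=done)  # re-plan the remaining work
            replans += 1
            continue
        done.append(result)
    return done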
Multi-Agent
Specialised agents coordinating:
                    ┌─ Research Agent
Orchestrator Agent ─┼─ Writer Agent
                    └─ Reviewer Agent
Use when tasks have distinct phases requiring different capabilities.
Error Handling
Agents fail. Design for it.
Graceful Degradation
try:
    result = agent.run(task)
except ToolError as e:
    # Tool failed - acknowledge and try alternative
    return fallback_response(e)
except ModelError as e:
    # LLM failed - retry with backoff
    return retry_with_backoff(task)
except TimeoutError:
    # Taking too long - partial response + escalate
    return partial_response_with_escalation()
Confidence Calibration
Have the agent assess its own confidence:
Before taking action, rate your confidence 1-10.
Below 6: Ask for clarification
6-8: Proceed but note uncertainty
Above 8: Proceed confidently
Not perfect, but helps catch uncertain responses.
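One way to wire that in, assuming the model returns its draft action and confidence as JSON (the field names and helper functions here are our convention, not a standard):

import json

decision = json.loads(
    llm_complete(agent_prompt + '\nBefore acting, respond as JSON: {"action": ..., "confidence": 1-10}')
)  # llm_complete and agent_prompt are hypothetical

if decision["confidence"] < 6:
    reply = ask_clarifying_question(decision)   # too uncertain to act
elif decision["confidence"] <= 8:
    reply = proceed_with_caveat(decision)       # act, but flag the uncertainty
else:
    reply = proceed(decision)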
Observability
You cannot improve what you cannot observe.
Logging
Log everything:
- Full prompts and responses
- Tool calls and results
- State changes
- Timing information
- Token counts
Metrics
Track:
- Task completion rate
- Average turns to completion
- Tool use patterns
- Error rates by type
- Latency distribution
- Cost per interaction
Debugging
Build tools for:
- Replaying conversations
- Comparing prompt variations
- Tracing decision paths
- Identifying failure patterns
Production Considerations
Rate Limiting
Protect against:
- API cost explosions
- Runaway agent loops
- Abuse
Implement per-user, per-minute, and per-day limits.
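A minimal in-process sliding-window limiter as a sketch; in production this usually lives in Redis or behind an API gateway, with separate per-minute and per-day instances:

import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window = window_seconds
        self.calls: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = time.monotonic()
        q = self.calls[user_id]
        while q and now - q[0] > self.window:
            q.popleft()  # drop calls that have aged out of the window
        if len(q) >= self.max_calls:
            return False  # over the limit; reject or queue the request
        q.append(now)
        return True

per_minute = RateLimiter(max_calls=20, window_seconds=60)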
Caching
Cache where possible:
- Embedding calculations
- Retrieval results
- Common responses
Significantly reduces cost and latency.
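For example, embedding calls are trivially cacheable on a hash of the input text; the in-memory dict and the embed() call below are placeholders for your cache backend and embedding provider:

import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cached_embed(text: str) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed(text)  # the real embedding call happens only on a miss
    return _embedding_cache[key]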
Fallback Models
If primary model is unavailable:
models = [
    ("gpt-4-turbo", primary_client),
    ("claude-3-sonnet", fallback_client),
    ("llama-3-70b", local_client),
]

for model, client in models:
    try:
        return client.complete(model=model, prompt=prompt)
    except ServiceUnavailable:
        continue  # try the next provider in the list

raise AllModelsUnavailable()
Further Reading
This is necessarily a survey. For deeper dives:
- LangChain / LlamaIndex docs for framework approaches
- Anthropic's tool use documentation
- OpenAI's function calling guide
- Our architecture post for enterprise patterns
We build production AI agents with these patterns. Happy to discuss your technical challenges.