AI Agent Architecture: Patterns for Production Systems
Building a demo AI agent is easy. Building one that runs reliably in production is hard.
The difference isn't the AI—it's the architecture around it.
Here's what we've learned building AI agents for Australian businesses.
The Simple Agent Pattern
Most agents should start here:
User Input → Agent (LLM + Tools) → Response
                    ↓
               State/Memory
Components:
- Input processing: Parse and validate user intent
- Agent core: LLM with system prompt and available tools
- Tool layer: Functions the agent can call
- State management: Context across interactions
- Output formatting: Consistent response structure
This handles most business use cases. Don't over-architect until you need to.
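A minimal sketch of the loop those components form; call_llm(), SYSTEM_PROMPT, and the tool function below are illustrative placeholders, not a specific SDK:

from dataclasses import dataclass, field

SYSTEM_PROMPT = "You are a booking assistant."  # illustrative only

def check_availability(date: str) -> str:
    # Tool layer: a placeholder business function the agent can call.
    return f"Slots on {date}: 10:00, 14:00"

TOOLS = {"check_availability": check_availability}

def call_llm(messages: list, tools: dict) -> str:
    # Agent core: stand-in for your model provider's chat API.
    return "Which date suits you?"

@dataclass
class AgentState:
    # State management: context carried across interactions.
    history: list = field(default_factory=list)

def handle(user_input: str, state: AgentState) -> str:
    # Input processing: validate before anything reaches the model.
    text = user_input.strip()
    if not text:
        return "Please enter a request."
    state.history.append({"role": "user", "content": text})
    reply = call_llm([{"role": "system", "content": SYSTEM_PROMPT}] + state.history, TOOLS)
    state.history.append({"role": "assistant", "content": reply})
    # Output formatting: every response comes back as plain text here.
    return reply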
Why Simple Agents Fail
Simple agents break when:
Context exceeds limits: Conversation history fills the context window
Tasks require coordination: Multiple steps with dependencies
Reliability becomes critical: Single failure point brings everything down
Scale requirements emerge: Can't parallelise effectively
When you hit these, you need more sophisticated patterns.
Pattern 1: Agent with Memory Tiers
For long-running conversations or persistent context:
                    ┌─────────────────┐
User Input ─────────│      Agent      │─────── Response
                    │                 │
                    │ Working Memory  │ ← Current conversation
                    │    (context)    │
                    │        ↓        │
                    │   Short-term    │ ← Recent relevant history
                    │    (summary)    │
                    │        ↓        │
                    │    Long-term    │ ← Vector DB / persistent
                    │   (retrieval)   │
                    └─────────────────┘
Working memory: Current context (what the LLM sees)
Short-term: Summarised recent interactions (hours/days)
Long-term: Searchable history, retrieved on relevance
This lets agents maintain context across sessions without exceeding the context window.
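A sketch of how the tiers might fit together; summarise(), store_long_term(), and vector_search() are placeholders for an LLM summarisation call and your vector store:

def summarise(summary: str, overflow: list) -> str:
    return summary + f" [+{len(overflow)} older messages]"  # really an LLM call

def store_long_term(messages: list) -> None:
    pass  # really a write to your vector DB

def vector_search(query: str, top_k: int = 3) -> list:
    return []  # really a similarity search over stored memories

class TieredMemory:
    def __init__(self, window: int = 10):
        self.working: list = []  # working memory: what the LLM sees
        self.summary = ""        # short-term: compressed recent history
        self.window = window

    def add(self, message: dict) -> None:
        self.working.append(message)
        if len(self.working) > self.window:
            # Overflow is folded into the summary and archived long-term.
            overflow = self.working[:-self.window]
            self.working = self.working[-self.window:]
            self.summary = summarise(self.summary, overflow)
            store_long_term(overflow)

    def build_context(self, query: str) -> list:
        # Long-term memories are pulled in only when relevant to the query.
        recalled = vector_search(query)
        return ([{"role": "system", "content": f"Summary: {self.summary}"}]
                + [{"role": "system", "content": m} for m in recalled]
                + self.working)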
Pattern 2: Router Agent
For handling varied request types:
                         ┌───────────────┐
                    ┌────│  Scheduling   │
                    │    │    Agent      │
User ──→ Router ────┤    └───────────────┘
         Agent      │    ┌───────────────┐
                    ├────│     FAQ       │
                    │    │    Agent      │
                    │    └───────────────┘
                    │    ┌───────────────┐
                    └────│  Escalation   │
                         │   Handler     │
                         └───────────────┘
Router agent's job:
- Classify intent
- Route to appropriate specialist agent
- Handle cases that don't fit cleanly
Specialist agents:
- Focused system prompt
- Specific tools for their domain
- Optimised for one task type
This works better than one agent trying to do everything.
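In code, the router can be as small as a classification call plus a dispatch table; classify_intent() and the specialist handlers below are placeholders (a keyword stub stands in for a small, cheap model):

def classify_intent(user_input: str) -> str:
    # In production this is typically a small/cheap LLM call.
    text = user_input.lower()
    if "book" in text or "appointment" in text:
        return "scheduling"
    if "?" in text:
        return "faq"
    return "unknown"

def scheduling_agent(user_input: str) -> str:
    return "Let's find you a time."     # focused prompt + scheduling tools

def faq_agent(user_input: str) -> str:
    return "Here's what I found."       # focused prompt + knowledge base

def escalate_to_human(user_input: str) -> str:
    return "Passing you to a person."   # cases that don't fit cleanly

SPECIALISTS = {"scheduling": scheduling_agent, "faq": faq_agent}

def route(user_input: str) -> str:
    label = classify_intent(user_input)
    return SPECIALISTS.get(label, escalate_to_human)(user_input)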
Pattern 3: Multi-Step Orchestrator
For complex tasks requiring multiple operations:
          ┌─────────────────────────┐
          │   Orchestrator Agent    │
          │                         │
          │  Plan: [Step1, Step2,   │
          │         Step3, Step4]   │
          │                         │
          │  Current: Step 2        │
          └───────────┬─────────────┘
                      │
      ┌───────────────┼───────────────┐
      │               │               │
┌─────┴────┐    ┌─────┴────┐    ┌─────┴────┐
│  Step 1  │    │  Step 2  │    │  Step 3  │
│ Complete │    │ Running  │    │ Pending  │
└──────────┘    └──────────┘    └──────────┘
The orchestrator:
- Breaks complex requests into steps
- Tracks progress through steps
- Handles failures and retries
- Reports status
Each step can be:
- An LLM call
- A tool execution
- A sub-agent invocation
Use this for workflows like order processing, document workflows, or approval chains.
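A sketch of the core loop, assuming each step exposes a run() callable; retries are per step so one flaky operation doesn't restart the whole plan:

from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]   # an LLM call, a tool, or a sub-agent
    status: Status = Status.PENDING

def orchestrate(steps: list, context: dict, max_retries: int = 2) -> dict:
    for step in steps:
        step.status = Status.RUNNING
        for attempt in range(max_retries + 1):
            try:
                context = step.run(context)    # output feeds the next step
                step.status = Status.COMPLETE
                break
            except Exception:
                if attempt == max_retries:     # retries exhausted
                    step.status = Status.FAILED
                    raise
    return context

The step statuses double as the progress report: dump them to see exactly where a workflow stalled.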
Pattern 4: Supervisor with Workers
For parallel processing or redundant execution:
              ┌─────────────────┐
              │   Supervisor    │
              │                 │
              │   Distributes   │
              │   Aggregates    │
              │   Validates     │
              └────────┬────────┘
                       │
      ┌────────────────┼────────────────┐
      │                │                │
┌─────┴────┐     ┌─────┴────┐     ┌─────┴────┐
│ Worker 1 │     │ Worker 2 │     │ Worker 3 │
│          │     │          │     │          │
└──────────┘     └──────────┘     └──────────┘
Use cases:
- Parallel document processing
- Consensus-based decisions (multiple perspectives)
- Redundancy for reliability
- Scaling throughput
The supervisor handles distribution, aggregation, and quality control.
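A sketch using asyncio as the worker pool; process_document() is a placeholder for whatever each worker does:

import asyncio

async def process_document(doc: str) -> dict:
    # Placeholder worker: in practice, an agent instance handling one item.
    await asyncio.sleep(0)
    return {"doc": doc, "summary": f"Summary of {doc}"}

async def supervise(documents: list) -> dict:
    # Distribute: one concurrent task per document.
    results = await asyncio.gather(
        *(process_document(d) for d in documents), return_exceptions=True
    )
    # Validate and aggregate: keep successes, count failures for retry/alerting.
    good = [r for r in results if not isinstance(r, Exception)]
    return {"results": good, "failed": len(results) - len(good)}

# asyncio.run(supervise(["a.pdf", "b.pdf", "c.pdf"]))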
Error Handling Patterns
Graceful Degradation
When AI fails, have fallbacks:
async def handle_request(request):
    try:
        # Primary: Full AI handling
        return await ai_agent.process(request)
    except AIUnavailableError:
        # Fallback 1: Simpler AI model
        return await simple_model.process(request)
    except Exception:
        # Fallback 2: Rule-based response
        return rule_based_handler(request)
Retry with Backoff
LLM calls fail. Retry sensibly:
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
)
async def call_llm(prompt):
    return await llm.generate(prompt)
Human Escalation
Know when to give up:
if confidence < 0.7 or attempts > 3:
    return escalate_to_human(
        context=conversation_history,
        reason="Low confidence response",
    )
State Management
Agents need state. Where to keep it:
In-memory: Fast, loses on restart. Good for development.
Cache (Redis): Fast, survives restarts. Good for session state.
Database: Slower, persistent. Good for conversation history.
Vector DB: For semantic retrieval. Good for long-term memory.
Typical production setup:
- Redis for active session state
- Postgres for conversation logs
- Pinecone/Weaviate for memory retrieval
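A sketch of the session-state half of that setup using redis-py; the key naming and TTL are illustrative:

import json

import redis  # assumes a reachable Redis instance

r = redis.Redis(decode_responses=True)
SESSION_TTL = 3600  # active sessions expire after an hour of inactivity

def save_session(session_id: str, state: dict) -> None:
    # Redis for active session state: fast, and it survives agent restarts.
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(state))

def load_session(session_id: str) -> dict:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else {}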
Monitoring and Observability
Production agents need visibility:
Log everything:
- Input received
- Agent reasoning (if available)
- Tools called
- Output generated
- Time taken
- Errors encountered
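One way to capture all of that is a single structured log line per request; the agent.process() call and its tool-callback hook are assumptions about your agent interface:

import json
import logging
import time

logger = logging.getLogger("agent")

def process_logged(agent, request_id: str, user_input: str) -> str:
    record = {"request_id": request_id, "input": user_input, "tools": []}
    start = time.perf_counter()
    try:
        output = agent.process(user_input, on_tool_call=record["tools"].append)
        record["output"] = output
        return output
    except Exception as exc:
        record["error"] = repr(exc)  # errors encountered
        raise
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000)
        logger.info(json.dumps(record))  # one parseable line per request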
Track metrics:
- Response latency (p50, p95, p99)
- Success rate
- Escalation rate
- Tool call patterns
- Cost per request
Alert on anomalies:
- Latency spikes
- Error rate increases
- Unusual patterns
- Cost overruns
We've written about monitoring AI agents in detail.
Testing Strategies
Unit Tests for Tools
Tools should work independently:
def test_schedule_appointment():
    result = schedule_tool(date="2026-01-20", time="10:00")
    assert result.success
    assert result.appointment_id is not None
Integration Tests for Agent Flows
Test complete scenarios:
def test_booking_flow():
    agent = TestAgent()
    response = agent.process("Book an appointment for Monday")
    assert "available times" in response.lower()
    response = agent.process("10am please")
    assert "confirmed" in response.lower()
Evaluation Sets
Build test datasets with expected outputs. Run regularly to catch regressions.
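A minimal runner, assuming a JSONL file where each case has an input and an expected substring; real evaluation sets often use an LLM judge or semantic similarity instead of substring matching:

import json

def run_evals(agent, path: str = "evals.jsonl") -> float:
    passed = total = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)  # {"input": ..., "expected": ...}
            total += 1
            response = agent.process(case["input"])
            if case["expected"].lower() in response.lower():
                passed += 1
    score = passed / total if total else 0.0
    print(f"{passed}/{total} passed ({score:.0%})")  # track this over time
    return score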
Scaling Considerations
Horizontal Scaling
Design agents to be stateless (state lives elsewhere). Then you scale by adding instances.
Queue-Based Processing
For async workloads:
Input Queue → Worker Pool → Output Queue
                   ↓
             Agent Instances
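A sketch of that shape with in-process asyncio queues; in production the queues would be external (SQS, RabbitMQ, Redis streams) so work survives restarts. process_request() is a placeholder agent call:

import asyncio

async def process_request(request: str) -> str:
    await asyncio.sleep(0)                 # placeholder for an agent call
    return f"handled: {request}"

async def worker(inq: asyncio.Queue, outq: asyncio.Queue) -> None:
    while True:
        request = await inq.get()
        try:
            outq.put_nowait(await process_request(request))
        finally:
            inq.task_done()

async def run_pool(requests: list, workers: int = 4) -> list:
    inq, outq = asyncio.Queue(), asyncio.Queue()
    for req in requests:
        inq.put_nowait(req)
    tasks = [asyncio.create_task(worker(inq, outq)) for _ in range(workers)]
    await inq.join()                       # wait until every request is done
    for t in tasks:
        t.cancel()
    return [outq.get_nowait() for _ in range(outq.qsize())]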
Cost Management
LLM calls cost money. Control it:
- Caching for identical queries
- Smaller models for simple tasks
- Rate limiting per user/client
- Cost alerts
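As one example, caching can be a hash-keyed lookup in front of the model; the in-memory dict below stands in for Redis with a TTL, and call_llm() is a placeholder:

import hashlib

_cache: dict = {}  # in production: Redis with a TTL, not process memory

def call_llm(prompt: str, model: str) -> str:
    return f"({model}) response"  # placeholder model call

def cached_call(prompt: str, model: str = "small") -> str:
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key in _cache:
        return _cache[key]        # identical query: zero LLM spend
    _cache[key] = call_llm(prompt, model)
    return _cache[key]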
Our Approach
When we build AI agents, architecture decisions depend on:
- Complexity: Simple use cases get simple architecture
- Reliability requirements: Critical systems get redundancy
- Scale expectations: High volume gets queue-based processing
- Integration needs: Deep integration drives architectural choices
Start simple. Add complexity when requirements demand it.
We work with Australian businesses on agent architecture. Get in touch to discuss your use case.