Multi-Agent Systems with Microsoft AI Agent Framework
Single agents hit a ceiling. When a workflow requires different types of reasoning, access to different systems, or coordination across specialised tasks, one agent trying to do everything becomes unreliable. That's where multi-agent systems come in - and Microsoft's framework handles this better than most teams expect.
We've built multi-agent systems for document processing pipelines, IT service management, procurement workflows, and compliance automation. Here's how to do it properly using the Microsoft stack.
When You Actually Need Multiple Agents
Before designing a multi-agent system, make sure you actually need one. In our experience, about 60% of the projects that come to us asking for multi-agent systems would be better served by a single agent with well-designed tools.
You need multiple agents when:
- The workflow has distinct phases that require different reasoning approaches (e.g., extraction vs analysis vs generation)
- Different parts of the workflow need different security permissions (e.g., one agent reads sensitive data, another only sees anonymised summaries)
- The workflow benefits from specialist "experts" that are each optimised for a specific task
- You need to scale different parts of the workflow independently
- Failure in one part shouldn't bring down the entire workflow
A single agent is probably enough when:
- The workflow is sequential and straightforward
- All tasks require similar reasoning capabilities
- Security permissions are uniform across the workflow
- The total context (system prompt + tools + conversation) fits comfortably within the model's context window
Start with a single agent and split it into multiple agents when you hit specific limitations. That's more reliable than starting with a multi-agent design and dealing with the coordination complexity from day one.
Multi-Agent Patterns in Semantic Kernel
Microsoft's Semantic Kernel provides three main patterns for multi-agent orchestration. Each suits different workflow shapes.
Pattern 1 - Sequential Pipeline
Agents execute in a fixed order. Each agent's output becomes the next agent's input.
Best for: Document processing, content production, staged analysis.
How it works:
Document -> [Extractor Agent] -> structured data -> [Analyst Agent] -> findings -> [Reporter Agent] -> final report
Each agent has a focused role:
- The Extractor Agent reads raw documents and produces structured data
- The Analyst Agent compares structured data against rules and finds anomalies
- The Reporter Agent generates human-readable reports from the analysis
In Semantic Kernel: You implement this using Agent Group Chat with a sequential selection strategy. Each agent takes a turn in order, processing the accumulated conversation context.
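A minimal sketch of the pattern, with plain Python functions standing in for the Semantic Kernel agents so the data flow is visible. The extraction and analysis logic here is a toy assumption; in a real system each `run` call would invoke a model-backed agent.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PipelineAgent:
    name: str
    run: Callable[[str], str]  # input text -> output text

def extractor(document: str) -> str:
    # Toy extraction: pull "key: value" lines into a structured summary.
    fields = [line for line in document.splitlines() if ":" in line]
    return "; ".join(fields)

def analyst(structured: str) -> str:
    # Toy analysis: flag any field whose value mentions "high" severity.
    flagged = [f for f in structured.split("; ") if "high" in f.lower()]
    return f"{len(flagged)} finding(s) flagged: {flagged}"

def reporter(findings: str) -> str:
    return f"REPORT\n{findings}"

def run_pipeline(document: str) -> str:
    pipeline = [
        PipelineAgent("extractor", extractor),
        PipelineAgent("analyst", analyst),
        PipelineAgent("reporter", reporter),
    ]
    data = document
    for agent in pipeline:
        data = agent.run(data)  # each agent's output becomes the next agent's input
    return data

print(run_pipeline("site: depot 3\nseverity: HIGH\nnote: scaffolding"))
```

Because each stage is a plain function of its input, you can test the Analyst with a hand-written structured record and retry a single failed step without re-running the whole pipeline.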
Advantages:
- Easy to understand and debug
- Each agent can be tested independently
- Failure is isolated - if the Analyst fails, you can retry just that step
- Clear data flow makes audit logging straightforward
Disadvantages:
- Not flexible - the order is fixed
- Bottlenecked on the slowest agent
- Adding a new step means modifying the pipeline
Real example: We built a compliance document processing pipeline for a construction company. The Extractor Agent pulls key data from safety inspection reports (dates, locations, findings, severity ratings). The Analyst Agent compares findings against the company's compliance standards and flags non-conformances. The Reporter Agent generates a summary for the safety manager with action items prioritised by severity. Processing time dropped from 4 hours per batch to 20 minutes.
Pattern 2 - Supervisor with Specialists
One agent (the supervisor) receives all incoming requests and delegates to specialist agents based on the request type.
Best for: Helpdesk systems, customer service, request routing.
How it works:
User request -> [Supervisor Agent] -> routes to appropriate specialist
                        |
                        |-> [Password Reset Agent]
                        |-> [Software Provisioning Agent]
                        |-> [Network Troubleshooting Agent]
                        |-> [Hardware Request Agent]
                        |-> [Human Escalation]
The Supervisor Agent understands what each specialist can do and routes accordingly. It doesn't try to handle anything itself - it classifies and delegates.
In Semantic Kernel: Use Agent Group Chat with a custom selection strategy. The selection strategy is a function that takes the current conversation context and returns which agent should respond next. The supervisor agent's turn always comes first, and its output includes a routing decision that your selection strategy uses.
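A hedged sketch of that selection strategy. The `ROUTE:<name>` convention, the keyword table, and the agent names are assumptions for illustration, not Semantic Kernel API; in SK, you'd plug the equivalent of `select_next_agent` into the group chat's selection strategy, and the supervisor would be a model-backed classifier rather than keyword matching.

```python
# Keyword routing stands in for the supervisor model's classification.
SPECIALISTS = {
    "password_reset": ["password", "locked out", "reset"],
    "software_provisioning": ["install", "license", "software"],
    "network": ["vpn", "wifi", "network"],
    "hardware": ["laptop", "monitor", "keyboard"],
}

def supervisor_turn(request: str) -> str:
    """The supervisor classifies the request and emits a routing decision."""
    text = request.lower()
    for name, keywords in SPECIALISTS.items():
        if any(k in text for k in keywords):
            return f"ROUTE:{name}"
    return "ROUTE:human_escalation"  # fallback when no specialist matches

def select_next_agent(supervisor_output: str) -> str:
    """Selection strategy: parse the supervisor's output into an agent name."""
    if supervisor_output.startswith("ROUTE:"):
        return supervisor_output.removeprefix("ROUTE:")
    return "human_escalation"  # malformed output also escalates

print(select_next_agent(supervisor_turn("I'm locked out of my account")))
```

Note the two fallbacks: an unrecognised request and a malformed supervisor output both route to human escalation, which is the mitigation for the single-point-of-failure risk below.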
Advantages:
- Mirrors how human teams work - specialists with a coordinator
- Each specialist agent is focused and highly optimised for its domain
- Adding new specialists is straightforward - define the agent and register it with the supervisor
- Easy to A/B test different specialist agents
Disadvantages:
- The supervisor is a single point of failure (mitigate with fallback logic)
- Routing accuracy depends on the supervisor's understanding of specialist capabilities
- Can have higher latency due to the routing step
Real example: We deployed a supervisor-specialist system for a professional services firm's internal IT helpdesk. The supervisor correctly routes 94% of requests to the right specialist on the first attempt. The specialist agents resolve 72% of requests without human intervention. The remaining 28% are escalated to human IT staff with full context, so the human doesn't start from scratch.
Pattern 3 - Collaborative Group
Multiple agents work together on the same task, each contributing their expertise in a conversation-like flow.
Best for: Complex analysis requiring multiple perspectives, review and refinement workflows.
How it works:
Task -> [Research Agent] (gathers facts) <-> [Analysis Agent] (draws conclusions) <-> [Review Agent] (checks quality)
Agents take turns contributing to the task. The Research Agent gathers relevant information. The Analysis Agent draws conclusions from that information. The Review Agent checks the analysis for errors or gaps. They may go back and forth - the Review Agent might ask the Research Agent for additional data, or challenge the Analysis Agent's conclusions.
In Semantic Kernel: Use Agent Group Chat with either a round-robin selection strategy or a custom one that allows agents to dynamically request turns. Each agent sees the full conversation history and contributes based on its role.
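A minimal round-robin group chat sketch with assumed interfaces (not the SK API). Each agent sees the full history, and the loop ends when the Review Agent signals approval or a turn budget runs out - the termination logic this pattern requires.

```python
# Stub agents: each takes the shared history and returns a message.
def research(history):
    return "facts: revenue up 12%, churn flat"

def analyse(history):
    return "conclusion: growth is healthy"

def review(history):
    # Approve only once both facts and a conclusion are present.
    joined = " ".join(history)
    if "facts:" in joined and "conclusion:" in joined:
        return "APPROVED"
    return "needs more detail"

def group_chat(task, agents, max_turns=9):
    history = [task]
    for turn in range(max_turns):
        agent = agents[turn % len(agents)]  # round-robin selection
        message = agent(history)
        history.append(message)
        if message == "APPROVED":  # termination condition
            break
    return history

history = group_chat("Assess Q3 performance", [research, analyse, review])
print(history[-1])
```

The `max_turns` budget is the important part: without it, a reviewer that never approves loops forever, which is the "when is the group done?" problem listed under disadvantages.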
Advantages:
- Produces higher quality results through iteration and peer review
- Models complex decision-making processes well
- Catches errors that a single agent would miss
Disadvantages:
- More expensive (multiple model calls per task)
- Harder to predict execution time
- Requires careful termination logic (when is the group "done"?)
- Debugging is more complex - you need to trace the conversation between agents
Real example: We built a collaborative group for proposal generation at a consulting firm. A Research Agent gathers relevant case studies and capabilities. A Writer Agent produces the initial draft. A Reviewer Agent checks for accuracy, completeness, and tone. The Writer can revise based on the Reviewer's feedback. Average output quality (measured by human reviewers) is 40% higher than single-agent generation, and 85% of proposals need only minor human editing.
Architecture Decisions That Matter
Shared vs Isolated Memory
Shared memory: All agents see the same conversation context. Simple to implement but means every agent processes the full context, increasing token usage and potentially confusing specialist agents with information they don't need.
Isolated memory with structured handoffs: Each agent has its own context and receives only the information it needs from the previous step. More complex to implement but more efficient and more secure (you can control what each agent sees).
Our recommendation: Use isolated memory with structured handoffs for production systems. The additional implementation effort pays off in better performance, lower costs, and clearer security boundaries.
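A sketch of a structured handoff (the field names are assumptions). Instead of sharing the full conversation, each stage passes a typed record containing only what the next agent needs - this record is the data contract between agents.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionResult:
    location: str
    severity: str  # e.g. "low" | "medium" | "high"

@dataclass(frozen=True)
class AnalysisResult:
    non_conformance: bool
    summary: str

def analyst(extraction: ExtractionResult) -> AnalysisResult:
    # The analyst never sees the raw document - only the contract fields.
    flagged = extraction.severity == "high"
    return AnalysisResult(
        non_conformance=flagged,
        summary=f"{extraction.location}: severity {extraction.severity}",
    )

result = analyst(ExtractionResult(location="depot 3", severity="high"))
print(result.non_conformance)
```

The security benefit falls out of the types: if the raw document contains sensitive detail, the Analyst physically cannot see it, because it only receives an `ExtractionResult`.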
Error Handling and Recovery
Multi-agent systems need explicit error handling at every level:
Agent-level: Each agent should handle its own errors - model API timeouts, tool failures, unexpected inputs. Return structured error information rather than letting exceptions propagate.
Pipeline-level: If one agent in a sequence fails, what happens? Options include:
- Retry the failed agent (usually the first thing to try)
- Skip the failed agent and continue with a degraded workflow
- Roll back the entire pipeline and alert a human
- Route to a fallback agent with simpler logic
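Two of the options above - retry first, then degrade to a fallback agent - can be sketched as a wrapper (the interfaces are assumed for illustration):

```python
def run_step(step, fallback, payload, retries=2):
    """Try `step` up to retries+1 times, then degrade to `fallback`."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return step(payload)
        except Exception as exc:  # in production, catch specific error types
            last_error = exc
    # Retries exhausted: degrade to the simpler fallback agent.
    try:
        return fallback(payload)
    except Exception:
        raise RuntimeError(f"step and fallback both failed: {last_error}")

def flaky_analyst(data):
    raise TimeoutError("model API timeout")  # simulated persistent failure

def simple_analyst(data):
    return f"basic analysis of {data}"

print(run_step(flaky_analyst, simple_analyst, "doc-17"))
```

The structured-error advice at the agent level pays off here: when agents raise typed errors instead of arbitrary exceptions, the pipeline can distinguish "retry this" (timeout) from "don't bother" (invalid input).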
System-level: If the entire multi-agent system is unavailable, what's the fallback? For customer-facing systems, this is usually "route all requests to human agents." Build and test this fallback before you go live.
Cost Management
Multi-agent systems multiply your model API costs. Every agent call consumes tokens. A three-agent pipeline processing one request might make three model calls, each with its own token consumption.
Cost optimisation strategies:
| Strategy | Savings | Trade-off |
|---|---|---|
| Use GPT-4o-mini for routing/classification agents | 50-80% on those agents | Slightly lower routing accuracy |
| Cache repeated intermediate results | 20-40% overall | Added complexity, stale cache risk |
| Minimise context passed between agents | 10-30% per agent | Need explicit data contracts |
| Batch processing for non-real-time workflows | 20-30% via Reserved Capacity pricing | Higher latency |
| Short-circuit pipelines when early agents produce high-confidence results | 15-25% overall | Need confidence scoring |
Example cost calculation: A three-agent document processing pipeline handling 500 documents per day:
- Agent 1 (Extractor, GPT-4o-mini): ~$15 AUD/month
- Agent 2 (Analyst, GPT-4o): ~$120 AUD/month
- Agent 3 (Reporter, GPT-4o): ~$80 AUD/month
- Total model costs: ~$215 AUD/month
That's significantly lower than most people expect. The compute and storage infrastructure typically costs more than the model API calls.
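A back-of-envelope version of that calculation. The per-million-token prices and per-document token counts below are placeholders, not current Azure OpenAI pricing, so the result won't match the figures above exactly - the structure of the calculation is the point.

```python
def monthly_cost(docs_per_day, in_tokens, out_tokens,
                 price_in_per_m, price_out_per_m, days=30):
    """Model API cost for one agent: tokens per document times price per token."""
    per_doc = (in_tokens * price_in_per_m + out_tokens * price_out_per_m) / 1_000_000
    return docs_per_day * days * per_doc

# Assumed token footprints and prices per pipeline stage (placeholders).
extractor = monthly_cost(500, 3000, 800, 0.15, 0.60)   # small routing-class model
analyst   = monthly_cost(500, 2000, 600, 2.50, 10.00)  # frontier model
reporter  = monthly_cost(500, 1000, 700, 2.50, 10.00)  # frontier model

print(f"total = ${extractor + analyst + reporter:.0f}/month")
```

Running the numbers this way also makes the first optimisation in the table obvious: the extractor handles the largest inputs, so putting it on the cheapest model is where the routing/classification saving comes from.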
Observability
You need to trace requests through the entire multi-agent system. When something goes wrong, you need to see which agent failed, what input it received, and what it produced.
What to log:
- Request ID that follows the request through all agents
- Each agent's input, output, and execution time
- Tool calls made by each agent and their results
- Routing decisions (for supervisor patterns)
- Token usage per agent per request
In Azure: Application Insights with custom telemetry. Create a distributed trace that connects all agent calls for a single request. Use Azure Monitor workbooks to visualise agent performance and identify bottlenecks.
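A stdlib-only sketch of the per-agent trace record (the field names and the in-memory sink are assumptions). In Azure, each record would be emitted as custom telemetry to Application Insights instead of appended to a list.

```python
import time
import uuid

TRACE: list[dict] = []  # stand-in for your telemetry sink

def traced(agent_name, fn, request_id, payload):
    """Run one agent call and record it against the request ID."""
    start = time.perf_counter()
    status, output = "error", None
    try:
        output = fn(payload)
        status = "ok"
        return output
    finally:
        TRACE.append({
            "request_id": request_id,    # follows the request through all agents
            "agent": agent_name,
            "input": payload,
            "output": output,
            "status": status,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

rid = str(uuid.uuid4())
summary = traced("extractor", lambda doc: doc.upper(), rid, "safety report")
print([(e["agent"], e["status"]) for e in TRACE if e["request_id"] == rid])
```

Filtering the sink by `request_id` is exactly the debugging move you need later: when a pipeline produces a wrong answer, you replay that one request's records to see which agent received what and produced what.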
Building a Multi-Agent System - Step by Step
Here's the approach we use in our Microsoft AI consulting engagements:
Week 1-2: Design
- Map the workflow and identify where agent boundaries should be
- Define each agent's role, tools, and security permissions
- Design the orchestration pattern (sequential, supervisor, or collaborative)
- Define the data contracts between agents
- Plan error handling and fallback strategies
Week 3-4: Build individual agents
- Build and test each agent independently
- Each agent should work correctly in isolation before you connect them
- Write evaluation test sets for each agent
Week 5-6: Build orchestration
- Implement the orchestration pattern
- Connect agents with structured handoffs
- Build error handling at the pipeline level
- Implement observability and distributed tracing
Week 7-8: Integration testing
- Test end-to-end with real data
- Load test to understand performance characteristics
- Red team testing for security
- Fine-tune system prompts based on test results
Week 9-10: Deployment
- Deploy to staging with monitoring
- Run parallel with existing processes (if applicable)
- Gradual traffic migration
- Monitor and iterate
When Multi-Agent Gets Complex
Some honest observations from our multi-agent deployments:
Debugging is harder than you think. When a three-agent pipeline produces a wrong answer, figuring out which agent was responsible requires tracing through all three. Invest heavily in observability from day one.
Agent coordination has overhead. Every handoff between agents adds latency and token costs. A three-agent pipeline that takes 15 seconds per request might take 5 seconds if you could do it with one agent. Make sure the quality improvement justifies the overhead.
Testing scales non-linearly. With three agents and three possible failures each, you have 27 failure combinations. With five agents, it's 243. Plan your test strategy carefully and focus on the most likely failure modes.
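The combinatorics above in two lines: if each agent can be in one of three failure modes, n agents give 3**n combinations of failure states.

```python
from itertools import product

# Three illustrative failure modes per agent (names are assumptions).
modes = ["timeout", "bad_output", "tool_failure"]

print(len(list(product(modes, repeat=3))))  # 3 agents
print(len(list(product(modes, repeat=5))))  # 5 agents
```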
Start simple, add agents when you have evidence. The best multi-agent systems we've built started as single agents and evolved into multi-agent systems when specific limitations became clear. The worst multi-agent systems we've seen were designed as multi-agent from day one based on theoretical architecture rather than practical experience.
Getting Started
If you're considering a multi-agent system, start with a clear understanding of the workflow and where the agent boundaries should be. We can help with architecture design, technology selection, and implementation.
Reach out to our team to discuss your multi-agent project. You can also explore our AI agent development services or read about building enterprise AI agents with Microsoft tools for more context on the Microsoft AI stack.