Multi-Agent Systems with Microsoft AI Agent Framework
Single agents hit a ceiling. When a workflow requires different types of reasoning, access to different systems, or coordination across specialised tasks, one agent trying to do everything becomes unreliable. That's where multi-agent systems come in - and Microsoft's framework handles this better than most teams expect.
We've built multi-agent systems for document processing pipelines, IT service management, procurement workflows, and compliance automation. Here's how to do it properly using the Microsoft stack.
When You Actually Need Multiple Agents
Before designing a multi-agent system, make sure you actually need one. In our experience, about 60% of the projects that come to us asking for multi-agent systems would be better served by a single agent with well-designed tools.
You need multiple agents when:
- The workflow has distinct phases that require different reasoning approaches (e.g., extraction vs analysis vs generation)
- Different parts of the workflow need different security permissions (e.g., one agent reads sensitive data, another only sees anonymised summaries)
- The workflow benefits from specialist "experts" that are each optimised for a specific task
- You need to scale different parts of the workflow independently
- Failure in one part shouldn't bring down the entire workflow
A single agent is probably enough when:
- The workflow is sequential and straightforward
- All tasks require similar reasoning capabilities
- Security permissions are uniform across the workflow
- The total context (system prompt + tools + conversation) fits comfortably within the model's context window
Start with a single agent and split it into multiple agents when you hit specific limitations. That's more reliable than starting with a multi-agent design and dealing with the coordination complexity from day one.
Multi-Agent Patterns in Semantic Kernel
Microsoft's Semantic Kernel provides three main patterns for multi-agent orchestration. Each suits different workflow shapes.
Pattern 1 - Sequential Pipeline
Agents execute in a fixed order. Each agent's output becomes the next agent's input.
Best for: Document processing, content production, staged analysis.
How it works:
Document -> [Extractor Agent] -> structured data -> [Analyst Agent] -> findings -> [Reporter Agent] -> final report
Each agent has a focused role:
- The Extractor Agent reads raw documents and produces structured data
- The Analyst Agent compares structured data against rules and finds anomalies
- The Reporter Agent generates human-readable reports from the analysis
In Semantic Kernel: You implement this using Agent Group Chat with a sequential selection strategy. Each agent takes a turn in order, processing the accumulated conversation context.
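A minimal sketch of the pattern, with plain Python functions standing in for the Semantic Kernel agents so the data flow is visible. The extraction and analysis logic here is a toy assumption; in a real system each `run` call would invoke a model-backed agent.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PipelineAgent:
    name: str
    run: Callable[[str], str]  # input text -> output text

def extractor(document: str) -> str:
    # Toy extraction: pull "key: value" lines into a structured summary.
    fields = [line for line in document.splitlines() if ":" in line]
    return "; ".join(fields)

def analyst(structured: str) -> str:
    # Toy analysis: flag any field whose value mentions "high" severity.
    flagged = [f for f in structured.split("; ") if "high" in f.lower()]
    return f"{len(flagged)} finding(s) flagged: {flagged}"

def reporter(findings: str) -> str:
    return f"REPORT\n{findings}"

def run_pipeline(document: str) -> str:
    pipeline = [
        PipelineAgent("extractor", extractor),
        PipelineAgent("analyst", analyst),
        PipelineAgent("reporter", reporter),
    ]
    data = document
    for agent in pipeline:
        data = agent.run(data)  # each agent's output becomes the next agent's input
    return data

print(run_pipeline("site: depot 3\nseverity: HIGH\nnote: scaffolding"))
```

Because each stage is a plain function of its input, you can test the Analyst with a hand-written structured record and retry a single failed step without re-running the whole pipeline.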
Advantages:
- Easy to understand and debug
- Each agent can be tested independently
- Failure is isolated - if the Analyst fails, you can retry just that step
- Clear data flow makes audit logging straightforward
Disadvantages:
- Not flexible - the order is fixed
- Bottlenecked on the slowest agent
- Adding a new step means modifying the pipeline
Real example: We built a compliance document processing pipeline for a construction company. The Extractor Agent pulls key data from safety inspection reports (dates, locations, findings, severity ratings). The Analyst Agent compares findings against the company's compliance standards and flags non-conformances. The Reporter Agent generates a summary for the safety manager with action items prioritised by severity. Processing time dropped from 4 hours per batch to 20 minutes.
Pattern 2 - Supervisor with Specialists
One agent (the supervisor) receives all incoming requests and delegates to specialist agents based on the request type.
Best for: Helpdesk systems, customer service, request routing.
How it works:
User request -> [Supervisor Agent] -> routes to appropriate specialist
                        |
                        |-> [Password Reset Agent]
                        |-> [Software Provisioning Agent]
                        |-> [Network Troubleshooting Agent]
                        |-> [Hardware Request Agent]
                        |-> [Human Escalation]
The Supervisor Agent understands what each specialist can do and routes accordingly. It doesn't try to handle anything itself - it classifies and delegates.
In Semantic Kernel: Use Agent Group Chat with a custom selection strategy. The selection strategy is a function that takes the current conversation context and returns which agent should respond next. The supervisor agent's turn always comes first, and its output includes a routing decision that your selection strategy uses.
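A hedged sketch of that selection strategy. The `ROUTE:<name>` convention, the keyword table, and the agent names are assumptions for illustration, not Semantic Kernel API; in SK, you'd plug the equivalent of `select_next_agent` into the group chat's selection strategy, and the supervisor would be a model-backed classifier rather than keyword matching.

```python
# Keyword routing stands in for the supervisor model's classification.
SPECIALISTS = {
    "password_reset": ["password", "locked out", "reset"],
    "software_provisioning": ["install", "license", "software"],
    "network": ["vpn", "wifi", "network"],
    "hardware": ["laptop", "monitor", "keyboard"],
}

def supervisor_turn(request: str) -> str:
    """The supervisor classifies the request and emits a routing decision."""
    text = request.lower()
    for name, keywords in SPECIALISTS.items():
        if any(k in text for k in keywords):
            return f"ROUTE:{name}"
    return "ROUTE:human_escalation"  # fallback when no specialist matches

def select_next_agent(supervisor_output: str) -> str:
    """Selection strategy: parse the supervisor's output into an agent name."""
    if supervisor_output.startswith("ROUTE:"):
        return supervisor_output.removeprefix("ROUTE:")
    return "human_escalation"  # malformed output also escalates

print(select_next_agent(supervisor_turn("I'm locked out of my account")))
```

Note the two fallbacks: an unrecognised request and a malformed supervisor output both route to human escalation, which is the mitigation for the single-point-of-failure risk below.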
Advantages:
- Mirrors how human teams work - specialists with a coordinator
- Each specialist agent is focused and highly optimised for its domain
- Adding new specialists is straightforward - define the agent and register it with the supervisor
- Easy to A/B test different specialist agents
Disadvantages:
- The supervisor is a single point of failure (mitigate with fallback logic)
- Routing accuracy depends on the supervisor's understanding of specialist capabilities
- Can have higher latency due to the routing step
Real example: We deployed a supervisor-specialist system for a professional services firm's internal IT helpdesk. The supervisor correctly routes 94% of requests to the right specialist on the first attempt. The specialist agents resolve 72% of requests without human intervention. The remaining 28% are escalated to human IT staff with full context, so the human doesn't start from scratch.
Pattern 3 - Collaborative Group
Multiple agents work together on the same task, each contributing their expertise in a conversation-like flow.
Best for: Complex analysis requiring multiple perspectives, review and refinement workflows.
How it works:
Task -> [Research Agent] (gathers facts) <-> [Analysis Agent] (draws conclusions) <-> [Review Agent] (checks quality)
Agents take turns contributing to the task. The Research Agent gathers relevant information. The Analysis Agent draws conclusions from that information. The Review Agent checks the analysis for errors or gaps. They may go back and forth - the Review Agent might ask the Research Agent for additional data, or challenge the Analysis Agent's conclusions.
In Semantic Kernel: Use Agent Group Chat with either a round-robin selection strategy or a custom one that allows agents to dynamically request turns. Each agent sees the full conversation history and contributes based on its role.
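A minimal round-robin group chat sketch with assumed interfaces (not the SK API). Each agent sees the full history, and the loop ends when the Review Agent signals approval or a turn budget runs out - the termination logic this pattern requires.

```python
# Stub agents: each takes the shared history and returns a message.
def research(history):
    return "facts: revenue up 12%, churn flat"

def analyse(history):
    return "conclusion: growth is healthy"

def review(history):
    # Approve only once both facts and a conclusion are present.
    joined = " ".join(history)
    if "facts:" in joined and "conclusion:" in joined:
        return "APPROVED"
    return "needs more detail"

def group_chat(task, agents, max_turns=9):
    history = [task]
    for turn in range(max_turns):
        agent = agents[turn % len(agents)]  # round-robin selection
        message = agent(history)
        history.append(message)
        if message == "APPROVED":  # termination condition
            break
    return history

history = group_chat("Assess Q3 performance", [research, analyse, review])
print(history[-1])
```

The `max_turns` budget is the important part: without it, a reviewer that never approves loops forever, which is the "when is the group done?" problem listed under disadvantages.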
Advantages:
- Produces higher quality results through iteration and peer review
- Models complex decision-making processes well
- Catches errors that a single agent would miss
Disadvantages:
- More expensive (multiple model calls per task)
- Harder to predict execution time
- Requires careful termination logic (when is the group "done"?)
- Debugging is more complex - you need to trace the conversation between agents
Real example: We built a collaborative group for proposal generation at a consulting firm. A Research Agent gathers relevant case studies and capabilities. A Writer Agent produces the initial draft. A Reviewer Agent checks for accuracy, completeness, and tone. The Writer can revise based on the Reviewer's feedback. Average output quality (measured by human reviewers) is 40% higher than single-agent generation, and 85% of proposals need only minor human editing.
Architecture Decisions That Matter
Shared vs Isolated Memory
Shared memory: All agents see the same conversation context. Simple to implement but means every agent processes the full context, increasing token usage and potentially confusing specialist agents with information they don't need.
Isolated memory with structured handoffs: Each agent has its own context and receives only the information it needs from the previous step. More complex to implement but more efficient and more secure (you can control what each agent sees).
Our recommendation: Use isolated memory with structured handoffs for production systems. The additional implementation effort pays off in better performance, lower costs, and clearer security boundaries.
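A sketch of a structured handoff (the field names are assumptions). Instead of sharing the full conversation, each stage passes a typed record containing only what the next agent needs - this record is the data contract between agents.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionResult:
    location: str
    severity: str  # e.g. "low" | "medium" | "high"

@dataclass(frozen=True)
class AnalysisResult:
    non_conformance: bool
    summary: str

def analyst(extraction: ExtractionResult) -> AnalysisResult:
    # The analyst never sees the raw document - only the contract fields.
    flagged = extraction.severity == "high"
    return AnalysisResult(
        non_conformance=flagged,
        summary=f"{extraction.location}: severity {extraction.severity}",
    )

result = analyst(ExtractionResult(location="depot 3", severity="high"))
print(result.non_conformance)
```

The security benefit falls out of the types: if the raw document contains sensitive detail, the Analyst physically cannot see it, because it only receives an `ExtractionResult`.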
Error Handling and Recovery
Multi-agent systems need explicit error handling at every level:
Agent-level: Each agent should handle its own errors - model API timeouts, tool failures, unexpected inputs. Return structured error information rather than letting exceptions propagate.
Pipeline-level: If one agent in a sequence fails, what happens? Options include:
- Retry the failed agent (usually the first thing to try)
- Skip the failed agent and continue with a degraded workflow
- Roll back the entire pipeline and alert a human
- Route to a fallback agent with simpler logic
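Two of the options above - retry first, then degrade to a fallback agent - can be sketched as a wrapper (the interfaces are assumed for illustration):

```python
def run_step(step, fallback, payload, retries=2):
    """Try `step` up to retries+1 times, then degrade to `fallback`."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return step(payload)
        except Exception as exc:  # in production, catch specific error types
            last_error = exc
    # Retries exhausted: degrade to the simpler fallback agent.
    try:
        return fallback(payload)
    except Exception:
        raise RuntimeError(f"step and fallback both failed: {last_error}")

def flaky_analyst(data):
    raise TimeoutError("model API timeout")  # simulated persistent failure

def simple_analyst(data):
    return f"basic analysis of {data}"

print(run_step(flaky_analyst, simple_analyst, "doc-17"))
```

The structured-error advice at the agent level pays off here: when agents raise typed errors instead of arbitrary exceptions, the pipeline can distinguish "retry this" (timeout) from "don't bother" (invalid input).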
System-level: If the entire multi-agent system is unavailable, what's the fallback? For customer-facing systems, this is usually "route all requests to human agents." Build and test this fallback before you go live.
Cost Management
Multi-agent systems multiply your model API costs. Every agent call consumes tokens. A three-agent pipeline processing one request might make three model calls, each with its own token consumption.
Cost optimisation strategies:
| Strategy | Savings | Trade-off |
|---|---|---|
| Use GPT-4o-mini for routing/classification agents | 50-80% on those agents | Slightly lower routing accuracy |
| Cache repeated intermediate results | 20-40% overall | Added complexity, stale cache risk |
| Minimise context passed between agents | 10-30% per agent | Need explicit data contracts |
| Batch processing for non-real-time workflows | 20-30% via Reserved Capacity pricing | Higher latency |
| Short-circuit pipelines when early agents produce high-confidence results | 15-25% overall | Need confidence scoring |
Example cost calculation: A three-agent document processing pipeline handling 500 documents per day:
- Agent 1 (Extractor, GPT-4o-mini): ~$15 AUD/month
- Agent 2 (Analyst, GPT-4o): ~$120 AUD/month
- Agent 3 (Reporter, GPT-4o): ~$80 AUD/month
- Total model costs: ~$215 AUD/month
That's significantly lower than most people expect. The compute and storage infrastructure typically costs more than the model API calls.
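A back-of-envelope version of that calculation. The per-million-token prices and per-document token counts below are placeholders, not current Azure OpenAI pricing, so the result won't match the figures above exactly - the structure of the calculation is the point.

```python
def monthly_cost(docs_per_day, in_tokens, out_tokens,
                 price_in_per_m, price_out_per_m, days=30):
    """Model API cost for one agent: tokens per document times price per token."""
    per_doc = (in_tokens * price_in_per_m + out_tokens * price_out_per_m) / 1_000_000
    return docs_per_day * days * per_doc

# Assumed token footprints and prices per pipeline stage (placeholders).
extractor = monthly_cost(500, 3000, 800, 0.15, 0.60)   # small routing-class model
analyst   = monthly_cost(500, 2000, 600, 2.50, 10.00)  # frontier model
reporter  = monthly_cost(500, 1000, 700, 2.50, 10.00)  # frontier model

print(f"total = ${extractor + analyst + reporter:.0f}/month")
```

Running the numbers this way also makes the first optimisation in the table obvious: the extractor handles the largest inputs, so putting it on the cheapest model is where the routing/classification saving comes from.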
Observability
You need to trace requests through the entire multi-agent system. When something goes wrong, you need to see which agent failed, what input it received, and what it produced.
What to log:
- Request ID that follows the request through all agents
- Each agent's input, output, and execution time
- Tool calls made by each agent and their results
- Routing decisions (for supervisor patterns)
- Token usage per agent per request
In Azure: Application Insights with custom telemetry. Create a distributed trace that connects all agent calls for a single request. Use Azure Monitor workbooks to visualise agent performance and identify bottlenecks.
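A stdlib-only sketch of the per-agent trace record (the field names and the in-memory sink are assumptions). In Azure, each record would be emitted as custom telemetry to Application Insights instead of appended to a list.

```python
import time
import uuid

TRACE: list[dict] = []  # stand-in for your telemetry sink

def traced(agent_name, fn, request_id, payload):
    """Run one agent call and record it against the request ID."""
    start = time.perf_counter()
    status, output = "error", None
    try:
        output = fn(payload)
        status = "ok"
        return output
    finally:
        TRACE.append({
            "request_id": request_id,    # follows the request through all agents
            "agent": agent_name,
            "input": payload,
            "output": output,
            "status": status,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

rid = str(uuid.uuid4())
summary = traced("extractor", lambda doc: doc.upper(), rid, "safety report")
print([(e["agent"], e["status"]) for e in TRACE if e["request_id"] == rid])
```

Filtering the sink by `request_id` is exactly the debugging move you need later: when a pipeline produces a wrong answer, you replay that one request's records to see which agent received what and produced what.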
Building a Multi-Agent System - Step by Step
Here's the approach we use in our Microsoft AI consulting engagements:
Week 1-2: Design
- Map the workflow and identify where agent boundaries should be
- Define each agent's role, tools, and security permissions
- Design the orchestration pattern (sequential, supervisor, or collaborative)
- Define the data contracts between agents
- Plan error handling and fallback strategies
Week 3-4: Build individual agents
- Build and test each agent independently
- Each agent should work correctly in isolation before you connect them
- Write evaluation test sets for each agent
Week 5-6: Build orchestration
- Implement the orchestration pattern
- Connect agents with structured handoffs
- Build error handling at the pipeline level
- Implement observability and distributed tracing
Week 7-8: Integration testing
- Test end-to-end with real data
- Load test to understand performance characteristics
- Red team testing for security
- Fine-tune system prompts based on test results
Week 9-10: Deployment
- Deploy to staging with monitoring
- Run parallel with existing processes (if applicable)
- Gradual traffic migration
- Monitor and iterate
When Multi-Agent Gets Complex
Some honest observations from our multi-agent deployments:
Debugging is harder than you think. When a three-agent pipeline produces a wrong answer, figuring out which agent was responsible requires tracing through all three. Invest heavily in observability from day one.
Agent coordination has overhead. Every handoff between agents adds latency and token costs. A three-agent pipeline that takes 15 seconds per request might take 5 seconds if you could do it with one agent. Make sure the quality improvement justifies the overhead.
Testing scales non-linearly. With three agents and three possible failures each, you have 27 failure combinations. With five agents, it's 243. Plan your test strategy carefully and focus on the most likely failure modes.
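The combinatorics above in two lines: if each agent can be in one of three failure modes, n agents give 3**n combinations of failure states.

```python
from itertools import product

# Three illustrative failure modes per agent (names are assumptions).
modes = ["timeout", "bad_output", "tool_failure"]

print(len(list(product(modes, repeat=3))))  # 3 agents
print(len(list(product(modes, repeat=5))))  # 5 agents
```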
Start simple, add agents when you have evidence. The best multi-agent systems we've built started as single agents and evolved into multi-agent systems when specific limitations became clear. The worst multi-agent systems we've seen were designed as multi-agent from day one based on theoretical architecture rather than practical experience.
Getting Started
If you're considering a multi-agent system, start with a clear understanding of the workflow and where the agent boundaries should be. We can help with architecture design, technology selection, and implementation.
Reach out to our team to discuss your multi-agent project. You can also explore our AI agent development services or read about building enterprise AI agents with Microsoft tools for more context on the Microsoft AI stack.