Multi-Agent Systems with Microsoft AI Agent Framework - Production Patterns
Microsoft consolidated Semantic Kernel and AutoGen into the AI Agent Framework about a year ago. The dust has settled. We've shipped enough production systems on it now to have strong opinions about what works and what's still painful.
This post is for engineering leaders and architects who are past the "should we build agents" question and have moved to "how do we build them properly". If you're still evaluating frameworks, our earlier comparison piece covers that ground. This one is about the actual production work.
What the framework gives you that matters
The AI Agent Framework is essentially Microsoft's bet that most enterprises will want agents inside .NET shops, with strong Azure integration, observability through OpenTelemetry, and orchestration patterns you don't have to build yourself. That last point is the one that actually matters.
Most teams I talk to about multi-agent systems initially think they're going to roll their own orchestration. Some sort of message bus, agents subscribe to topics, a coordinator decides who runs next. It looks reasonable on a whiteboard. It is almost always a mistake. The orchestration patterns built into the framework cover roughly 90% of what production systems need, and rolling your own gives you debugging headaches that take months to surface.
The patterns worth knowing are sequential, concurrent, handoff, group chat, and magentic. Each one fits a different shape of problem. Most teams pick the wrong one for their use case the first time around.
Sequential is boring and underrated
A sequential orchestration runs agents one after another, passing the previous agent's output to the next. It's the simplest pattern and most teams skip past it looking for something more interesting. Don't.
About 40% of our production agent work uses sequential orchestration. Document processing pipelines where one agent extracts, the next classifies, the third summarises. Customer onboarding where identity verification feeds into KYC checks, which feeds into account setup. The pattern matches the workflow naturally and you get linear cost behaviour that finance teams can actually budget for.
The trap with sequential is putting too many agents in the chain. Each step adds latency, adds failure surface, and adds cost. We've reviewed designs with twelve agents in a sequential pipeline where four would have done the same job. The temptation is to give each agent one narrow responsibility because that feels "clean". In practice, an agent with three or four well-defined tools is more reliable than three agents each doing one thing.
Group chat works for diverse expertise
Group chat patterns put multiple agents in a shared conversation with a chat manager deciding who speaks next. This is where the framework genuinely earns its keep, because building this from scratch is fiddly.
We use group chat when the workflow needs back-and-forth between specialists. One client has a contract review system with a legal agent, a commercial terms agent, and a risk agent. They all read the same contract, raise concerns, debate them, and produce a consolidated review. The chat manager prompts each agent based on what's still unresolved. The output is meaningfully better than running them sequentially because the agents can challenge each other.
The cost surprise with group chat is real. A three-agent group chat with five rounds of conversation can easily consume 50,000 tokens per contract. At Australian enterprise contract volumes, this adds up fast. We had one client doing 800 contracts a month, and the initial implementation was running about $4,800 AUD per month just on the group chat tokens. Caching the legal precedent context across calls and pruning the conversation history aggressively brought that down to around $1,200. The point being, group chat costs need active management.
Handoff is what people actually want
The handoff pattern is the most natural fit for the workflows people describe when they ask for multi-agent systems. One agent handles a customer query. When the conversation moves into a domain that needs specialist knowledge, it hands off to another agent. The new agent picks up with full context and either resolves it or hands off again.
If you're building anything resembling a customer service or sales assistant, handoff is probably what you want. The Microsoft AI Agent Framework's handoff implementation is solid. You define which agents can hand off to which, the conditions under which handoff is appropriate, and the framework handles the context transfer.
The thing that catches teams out is handoff loops. Agent A hands off to B because the query is about billing. Agent B looks at it, decides it's actually a technical issue, hands back to A. A re-reads it, hands back to B. We've seen this kill production deployments. The fix is to cap handoffs per conversation, log every handoff with the reason, and treat any conversation that exceeds three handoffs as a flag for human review.
Magentic for complex reasoning
Magentic is Microsoft's newer orchestrator inspired by their research on open-ended problem solving. A lead agent plans the work, dispatches sub-agents to handle pieces, evaluates results, and iterates. It's powerful but expensive and slow.
Use magentic when you genuinely don't know what steps the problem requires in advance. Research tasks, complex troubleshooting, exploratory data analysis. Don't use it for workflows where the steps are knowable, because you'll pay a fortune in tokens for the planning overhead.
We have one client running magentic for incident response triage in a managed services context. The agent plans investigation steps based on the incident signature, dispatches sub-agents to query logs, check related systems, and gather context. It works well because each incident is genuinely different and a fixed sequential pipeline would miss the right diagnostic path 30% of the time. But the cost per incident is around $0.80 AUD, which would be ridiculous for a high-volume use case.
Choosing the pattern
Here's the rough decision framework we use with clients:
| Pattern | Use when | Avoid when |
|---|---|---|
| Sequential | Steps are known and ordered | Workflow has loops or branches |
| Concurrent | Independent subtasks can run in parallel | Steps depend on each other's output |
| Handoff | Specialist routing based on query type | Specialists need to collaborate |
| Group chat | Multiple specialists must collaborate | Cost or latency is critical |
| Magentic | Steps cannot be predetermined | The workflow is repeatable |
Most production systems we ship combine two or three of these. A customer service system might use handoff at the top level to route to the right specialist, then sequential within each specialist for the resolution steps.
State management is where it gets ugly
The framework gives you conversation memory and thread state out of the box, but production multi-agent systems need more than that. You need durable state that survives process restarts, state that can be queried for analytics, and state that gives you audit trails for compliance.
The Microsoft pattern is to use Cosmos DB or Azure SQL for durable agent state, with the framework's checkpoint mechanism to write state at meaningful boundaries. This works but it's not turnkey. You have to decide what state to persist, when to write, and how to handle the recovery case where the agent comes back with stale tools or model versions.
I'd recommend writing state at every handoff boundary, every tool call completion that mutates external systems, and every major decision the agent makes. Don't try to persist token-level state because the cost and complexity isn't worth it. If a long-running agent process dies mid-thought, the right behaviour is usually to resume from the last completed step, not the exact word it was generating.
For Australian deployments with data residency requirements, keep this state in Australian Azure regions. We've seen audit findings where state was being written to default storage accounts in East US because nobody set the region explicitly. Fix that early.
Observability is non-negotiable
If you can't see what your agents are doing, you can't run them in production. Full stop.
The framework integrates with OpenTelemetry and the standard approach is to push traces to Azure Monitor or another OTel-compatible backend. Every agent invocation should produce a span. Every tool call should produce a span. The parent-child relationship between agents in a multi-agent workflow should be visible in the trace.
What this gives you is the ability to look at a specific request that failed, see exactly which agents ran, what they decided, what tools they called, what the model returned, and where things went wrong. Without this, multi-agent debugging is a nightmare. The agents are conversational and non-deterministic. The same input can produce different paths. If you don't have full traces, you'll burn weeks chasing intermittent issues.
We typically also push key business metrics as custom OTel attributes. For an agent processing insurance claims, things like claim type, amount, fraud risk score, and time-to-decision. This gives the business view alongside the technical view in the same dashboard. The Azure AI Foundry consultants team at Team 400 sets this up by default on every engagement.
Testing multi-agent systems
This is where most teams give up and ship something they can't really verify. Multi-agent systems are hard to test because the behaviour is non-deterministic, the failure modes are emergent, and the cost of running tests is real.
Here's the testing approach that's worked for us:
Unit-test the tools. Every tool an agent can call should have proper tests. These are deterministic and cheap.
Test individual agents with frozen prompts. Pin the system prompt, the model version, and a set of representative inputs. Verify the agent calls the right tools in the right order. Use snapshot testing for the outputs, accepting that some variation is expected.
Integration-test the orchestration with a deterministic model. For the orchestration logic itself, you can substitute a stub model that returns canned responses. This lets you verify handoffs, group chat dynamics, and termination conditions without paying for real model calls.
Run scenario tests against a real model in a separate environment. Pick 20 to 50 scenarios covering the main happy paths, edge cases, and adversarial cases. Run these on every meaningful change. Accept that some flakiness is normal but track flake rate and act when it exceeds about 5%.
Production monitoring as a test loop. Treat early production traffic as continuous testing. Sample traces, review them with the team weekly, and feed issues back into the scenario test suite.
Cost modelling for multi-agent systems
Multi-agent costs are dominated by tokens and they scale in non-obvious ways. A sequential three-agent pipeline costs roughly 3x a single agent. A group chat with three agents and five rounds is more like 15x. A magentic system can be 30x or more depending on the complexity of the planning.
The right way to budget is to model the cost per business transaction, not per token. Pick the typical conversation or workflow shape, count the agent invocations, estimate tokens per invocation, multiply through. Then add 30% because real production is messier than estimates.
For typical Australian mid-market deployments, we see total monthly inference costs ranging from $500 to $15,000 AUD depending on volume and complexity. Anything above that needs serious cost optimisation work. Anything below probably isn't doing enough to matter.
Prompt caching is the single biggest lever. Azure OpenAI's prompt caching is genuinely useful and often gets cost down by 40 to 70% on systems with shared context across calls. Smaller models for routing decisions, larger models only for the work that needs reasoning. Pruning conversation history aggressively in long-running threads.
When not to use Microsoft AI Agent Framework
I should be honest about when this isn't the right choice. If your team is Python-first and has zero .NET footprint, the cognitive overhead of working in C# isn't worth it. LangGraph or the Python AI Agent Framework SDK might fit better. If you're not committed to Azure, you'll find some of the integration benefits don't apply.
If you're building research prototypes that may never see production, smaller frameworks are easier to iterate with. The framework's value is in the production patterns and observability, which matter less when you're throwing things away.
If you're a startup at the prototype stage, just use whatever lets you ship fastest. You can rebuild on the AI Agent Framework when scale forces the question.
Where to start
For Australian organisations that want to get this right, the practical sequence is:
- Pick one real workflow that's well-understood and ships meaningful value
- Build it on the AI Agent Framework using whichever orchestration pattern fits
- Wire up OpenTelemetry from day one
- Set cost budgets and monitor against them
- Iterate based on production traces, not on speculation about what users want
Most teams that struggle with multi-agent systems struggle because they tried to build a generic platform first. Don't. Build one valuable thing, learn from it, then think about platform.
We work with Australian organisations across Sydney, Melbourne, and Brisbane on these projects. If you're sizing up a multi-agent build and want to talk through the design choices before committing, get in touch through our contact page. We're happy to do a one-hour design review at no cost, because frankly, watching teams ship the wrong pattern and rebuild six months later is painful for everyone.
For deeper technical guidance, see our work as Microsoft AI Agent Framework consultants and the broader AI agent developers practice.