
OpenAI Agent Orchestration - Handoffs vs Agents as Tools

April 11, 2026 · 8 min read · Michael Ridland

When you start building AI agents that do real work, you quickly hit a design question: should one agent handle everything, or should you split the work across multiple specialists? And if you split, how do those specialists talk to each other?

OpenAI's Agents SDK gives you two distinct patterns for this, and picking the wrong one creates problems that are annoying to untangle later. After building multi-agent systems for several Australian clients at Team 400, I've developed opinions on when each pattern works and when it doesn't.

The Two Patterns

OpenAI's orchestration documentation lays out two approaches for multi-agent workflows. They sound similar on the surface but behave very differently in practice.

Handoffs transfer control from one agent to another. The triage agent decides that a billing question should go to the billing specialist, and it hands off the conversation. The billing specialist now owns the response. The triage agent is done.

Agents as tools keep one manager agent in control. The manager calls a specialist to do a specific task - summarise this text, classify this document, extract these fields - and then uses the result to craft its own response. The specialist never talks to the user directly.

The distinction is about ownership. Who's responsible for the final answer that the user sees?

Handoffs in Practice

Here's what a handoff looks like in the TypeScript SDK:

import { Agent, handoff } from "@openai/agents";

const billingAgent = new Agent({ name: "Billing agent" });
const refundAgent = new Agent({ name: "Refund agent" });

const triageAgent = Agent.create({
  name: "Triage agent",
  handoffs: [billingAgent, handoff(refundAgent)],
});

The Python version is nearly identical:

from agents import Agent, handoff

billing_agent = Agent(name="Billing agent")
refund_agent = Agent(name="Refund agent")

triage_agent = Agent(
    name="Triage agent",
    handoffs=[billing_agent, handoff(refund_agent)],
)

The triage agent's job is routing, not answering. When it identifies that a conversation is about billing, it hands off to the billing agent. When it's about refunds, it hands off to the refund agent. Each specialist has its own instructions, tools, and policy constraints.

This pattern works well for customer service scenarios. Think about how a phone tree works - you call in, the system figures out what you need, and it routes you to the right department. Handoffs do the same thing, but with AI agents instead of hold music.
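As a toy illustration of that phone-tree control flow - plain Python, not the SDK; in the real system the model makes this choice, and the keyword rules and agent names here are made up:

```python
# Hypothetical sketch of the routing decision a triage agent makes.
# Once a specialist is chosen, control transfers entirely to it.
def route(query: str) -> str:
    q = query.lower()
    if "refund" in q or "money back" in q:
        return "Refund agent"
    if "invoice" in q or "charge" in q or "billing" in q:
        return "Billing agent"
    return "Triage agent"  # no handoff: triage answers itself

print(route("I was double charged on my invoice"))  # Billing agent
```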

We've used this pattern for a client in financial services who needed different compliance policies for different types of enquiries. The triage agent classified the request, and each specialist agent had the specific regulatory guidelines and tools relevant to its domain baked into its system prompt. Trying to cram all that policy into a single agent made it confused and unreliable.

Agents as Tools in Practice

The alternative keeps a single manager agent in charge:

import { Agent } from "@openai/agents";

const summarizer = new Agent({
  name: "Summarizer",
  instructions: "Generate a concise summary of the supplied text.",
});

const mainAgent = new Agent({
  name: "Research assistant",
  tools: [
    summarizer.asTool({
      toolName: "summarize_text",
      toolDescription: "Generate a concise summary of the supplied text.",
    }),
  ],
});

Here, the research assistant calls the summariser like any other tool. It sends text in, gets a summary back, and decides what to do with it. The summariser never interacts with the user.

This is the right fit when the specialist is doing a bounded, well-defined task. Summarisation. Classification. Entity extraction. Translation. Sentiment analysis. Things where you want a specialist to process something and return a result, but the main agent needs to synthesise the final answer.

I think of it like a manager asking an analyst for a report. The analyst does the work, hands it back, and the manager decides what to present to the client. The analyst doesn't talk to the client directly.
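Stripped of the SDK, this control flow is just function composition: the specialist returns a value, and the manager owns the final reply. A minimal sketch with hypothetical names and a deliberately naive "summariser":

```python
def summarize_text(text: str) -> str:
    # Stand-in for a specialist agent call: return the first sentence.
    return text.split(".")[0] + "."

def research_assistant(question: str, source: str) -> str:
    # Tool call: the result comes back to the manager, which then
    # synthesises the answer the user actually sees.
    summary = summarize_text(source)
    return f"Based on the source ({summary}) here is my answer to: {question}"
```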

How to Choose

Here's my honest framework, based on building these systems rather than theorising about them.

Use handoffs when:

  • The specialist needs to own the conversation going forward
  • Different branches need fundamentally different instructions, tools, or safety policies
  • The routing decision is clear-cut (billing vs refund vs tech support)
  • You want clean trace logs showing which agent handled which conversation

Use agents-as-tools when:

  • The manager needs to combine results from multiple specialists
  • The specialist does one bounded thing and returns structured output
  • You want a stable, predictable outer workflow with nested specialist calls
  • The user should see one coherent response, not be passed between agents

There's a grey area in the middle where either pattern could work. When I'm in that grey area, I default to agents-as-tools because it's easier to debug. With handoffs, tracing why a conversation ended up at the wrong specialist requires understanding the routing logic. With agents-as-tools, the manager's decision process is all in one place.

The One-Agent-First Rule

OpenAI's documentation makes a point that I strongly agree with: start with one agent whenever you can. Add specialists only when they genuinely improve the system.

I've seen teams split their agent into five specialists in the first week because it seemed architecturally clean. Then they spent months debugging routing failures, duplicated instructions, and inconsistent behaviour across specialists. A single agent with well-structured tools and good instructions handles a surprising amount of complexity.

Split when you hit one of these signals:

  • The system prompt is getting so long that the model loses focus on parts of it
  • Different branches need genuinely different tools (a billing agent needs payment APIs that a tech support agent shouldn't have access to)
  • Compliance or policy requires isolation between domains
  • Your traces show the agent getting confused about which "mode" it should be in

If none of those apply, keep it as one agent. Splitting too early creates more prompts, more traces, more testing surface, and more failure modes without necessarily making anything better.

Practical Architecture for Australian Businesses

The multi-agent patterns from OpenAI's SDK map well onto real business scenarios we encounter across our AI consulting work.

Customer service triage is the classic handoff use case. A front-door agent classifies inbound queries and routes them to specialists for billing, technical support, returns, or general enquiries. Each specialist has different knowledge bases and different escalation paths. The handoff model keeps each specialist focused.

Document processing pipelines work better as agents-as-tools. A main agent receives a document, calls a classifier to determine the document type, calls an extractor to pull out key fields, calls a validator to check the results, and then assembles the final output. No single specialist needs to own the conversation because there is no conversation - it's a processing pipeline.
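A stdlib-only sketch of that pipeline shape, with hypothetical stub functions standing in for the specialist agent calls:

```python
# Each function below is a stand-in for a bounded specialist call.
def classify(doc: str) -> str:
    return "invoice" if "invoice" in doc.lower() else "unknown"

def extract(doc: str, doc_type: str) -> dict:
    fields = {"type": doc_type}
    if doc_type == "invoice":
        # Naive extraction: grab the amount after the dollar sign.
        fields["total"] = doc.split("$")[-1].split()[0]
    return fields

def validate(fields: dict) -> bool:
    return fields.get("type") != "unknown"

def process(doc: str) -> dict:
    doc_type = classify(doc)            # specialist 1: classifier
    fields = extract(doc, doc_type)     # specialist 2: extractor
    fields["valid"] = validate(fields)  # specialist 3: validator
    return fields                       # manager assembles the output
```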

Internal knowledge assistants usually start as single agents and evolve. An assistant that answers questions about company policies, HR processes, and IT procedures can work as one agent with good retrieval. But if the HR answers need different guardrails than the IT answers, splitting into specialists with handoffs starts making sense.

Sales and quoting workflows often use a hybrid. The main agent handles the conversation, but calls specialist tools for pricing calculations, inventory checks, and compliance verification. The customer talks to one coherent agent, but behind the scenes it's coordinating with purpose-built specialists.

What Gets Tricky

A few things about multi-agent orchestration that the documentation doesn't emphasise enough.

Context passing between agents. When you do a handoff, how much conversation history does the specialist receive? By default, the SDK hands over the full history, and you can attach an input filter to trim it. But figuring out what to filter and what to keep is a real design problem. Pass too much and the specialist gets confused by irrelevant context. Pass too little and it asks the user to repeat themselves.
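One common filtering policy is to keep only the last few user turns. A sketch, assuming a simple role/content message shape rather than the SDK's actual types:

```python
# Keep everything from the Nth-most-recent user message onwards.
# The message shape {"role": ..., "content": ...} is an assumption
# for illustration, not the SDK's history type.
def filter_history(messages: list[dict], keep_turns: int = 3) -> list[dict]:
    user_idx = [i for i, m in enumerate(messages) if m["role"] == "user"]
    if len(user_idx) <= keep_turns:
        return messages
    return messages[user_idx[-keep_turns]:]
```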

Error handling across agents. If the billing specialist fails mid-conversation, what happens? Does the user get an error? Does it fall back to the triage agent? The SDK gives you the building blocks, but you need to design the error flows yourself.

Testing multi-agent systems. Testing a single agent is already harder than testing regular software. Testing a system where agents hand off to each other, with all the possible routing paths and context combinations, is exponentially harder. Build good test harnesses early. We've found that recording real conversations and replaying them through the agent pipeline is the most practical approach.
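A minimal replay harness can be this small - `route` is whatever routing function you're testing, and the cases are recorded query/expected-specialist pairs:

```python
# Replay recorded conversations through a routing function and
# collect mismatches instead of failing on the first one.
def replay(cases: list[tuple[str, str]], route) -> list[str]:
    failures = []
    for query, expected in cases:
        got = route(query)
        if got != expected:
            failures.append(f"{query!r}: expected {expected}, got {got}")
    return failures
```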

Cost management. Each agent call is a separate API call with its own token usage. A handoff means at least two model calls (triage plus specialist). Agents-as-tools means the manager's context includes the full conversation plus tool results. Multi-agent systems use more tokens than single agents. Track your costs from day one.
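A back-of-envelope comparison makes the point - the prices and token counts below are assumed placeholders, not real rates:

```python
IN_PRICE, OUT_PRICE = 2.50, 10.00  # assumed $ per million tokens

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * IN_PRICE + output_tokens / 1e6 * OUT_PRICE

# Handoff: triage call plus specialist call, history sent twice.
handoff_cost = call_cost(1_200, 30) + call_cost(1_400, 300)
# A single agent answering the same query once.
single_cost = call_cost(1_200, 300)
```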

My Recommendation

For most projects, here's the path I'd suggest.

Start with a single agent. Give it good instructions and appropriate tools. See how far it gets. You'll be surprised - a well-prompted single agent handles a lot.

When you hit the limits of a single agent (prompt too long, tools too many, policy conflicts), split into specialists using the pattern that matches your ownership model. If specialists should own conversations, use handoffs. If they're doing bounded tasks, use agents-as-tools.

Keep the routing surface small and legible. Three to five specialists is usually enough. If you're designing a system with twenty specialist agents, something has gone wrong with your abstraction layer.

And test the routing logic heavily. The biggest failures we've seen in multi-agent systems aren't within individual agents - they're in the routing decisions that send conversations to the wrong specialist, or worse, into loops.
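One cheap guard against loops is a handoff counter that degrades to human escalation instead of ping-ponging forever - `MAX_HANDOFFS` is an assumed policy value, not an SDK setting:

```python
MAX_HANDOFFS = 3  # assumed policy: cap transfers per conversation

def next_agent(current: str, proposed: str, count: int) -> tuple[str, int]:
    # Once the cap is hit, stop routing and escalate to a person.
    if count >= MAX_HANDOFFS:
        return "Human escalation", count
    return proposed, count + 1
```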

If you're building multi-agent systems and want help with the architecture, our AI agent development team has worked through these patterns across agentic automation projects for Australian businesses. The patterns are well-understood. The hard part is fitting them to your specific use case, and that's where experience with real deployments makes the difference.