OpenAI Agents SDK - MCP Integrations and Observability in Practice

April 13, 2026 · 8 min read · Michael Ridland

You've got an agent running. It can answer questions, maybe call a function or two. Now you need to connect it to real systems and figure out what it's actually doing at runtime. That's where MCP integrations and observability come in, and they're the parts of the OpenAI Agents SDK that turn a toy demo into something you could actually put in front of users.

OpenAI's documentation on integrations and observability covers the API surface. I want to talk about the practical decisions - when to use hosted MCP vs local, how tracing actually helps (and when it doesn't), and what we've learned wiring these up in production for Australian organisations.

MCP - Two Flavours, Very Different Use Cases

MCP (Model Context Protocol) is the standard for connecting AI agents to external tools. The OpenAI Agents SDK supports two approaches, and picking the wrong one causes headaches that are annoying to fix later.

Hosted MCP - Let the Platform Handle It

Hosted MCP tools run through OpenAI's infrastructure. You point the SDK at a remote MCP server URL, and the model calls it directly through the platform.

import { Agent, hostedMcpTool } from "@openai/agents";

const agent = new Agent({
  name: "MCP assistant",
  instructions: "Use the MCP tools to answer questions.",
  tools: [
    hostedMcpTool({
      serverLabel: "gitmcp",
      serverUrl: "https://gitmcp.io/openai/codex",
    }),
  ],
});

The Python version looks similar:

from agents import Agent, HostedMCPTool

agent = Agent(
    name="MCP assistant",
    instructions="Use the MCP tools to answer questions.",
    tools=[
        HostedMCPTool(
            tool_config={
                "type": "mcp",
                "server_label": "gitmcp",
                "server_url": "https://gitmcp.io/openai/codex",
                "require_approval": "never",
            }
        )
    ],
)

Hosted MCP is the right choice when you're connecting to a public, trusted MCP server and you don't need to inspect or filter the requests. It's the simplest path - no connection management, no server lifecycle to handle.

The limitation is trust. Your requests go through OpenAI's infrastructure to the MCP server. If the MCP server is internal to your network, or if you need to apply custom approval logic before tools execute, hosted MCP won't work.

Local MCP - You Own the Connection

Local MCP servers run in your application's process space. You manage the connection, the lifecycle, and any filtering or approval logic.

import { Agent, MCPServerStdio, run } from "@openai/agents";

const server = new MCPServerStdio({
  name: "Filesystem MCP Server",
  fullCommand: "npx -y @modelcontextprotocol/server-filesystem ./sample_files",
});

await server.connect();

try {
  const agent = new Agent({
    name: "Filesystem assistant",
    instructions: "Read files with the MCP tools before answering.",
    mcpServers: [server],
  });

  const result = await run(agent, "Read the files and list them.");
  console.log(result.finalOutput);
} finally {
  await server.close();
}

The Python equivalent uses an async context manager, which handles cleanup automatically:

async with MCPServerStdio(
    name="Filesystem MCP Server",
    params={
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "./sample_files"],
    },
) as server:
    agent = Agent(
        name="Filesystem assistant",
        instructions="Read files with the MCP tools before answering.",
        mcp_servers=[server],
    )
    result = await Runner.run(agent, "Read the files and list them.")
    print(result.final_output)

Local MCP is what we use for most client deployments. The reasons are practical: we want to control which tools the agent can call, we need to log every tool invocation for audit purposes, and the MCP servers are often on internal networks.

The trade-off is that you're responsible for connection management. That try/finally block in the TypeScript example isn't optional - if you don't close the server connection, you'll leak child processes. The Python async context manager is cleaner, which is one reason we often prototype agent systems in Python even when the production system will be TypeScript.
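When an application talks to several local MCP servers at once, an AsyncExitStack keeps the lifecycle honest: if the third server fails to connect, the first two still get closed. Here's a minimal sketch of the pattern using a stand-in server class (the real one would be something like MCPServerStdio - the FakeMCPServer here is purely illustrative):

```python
import asyncio
import contextlib

class FakeMCPServer:
    """Stand-in for a real MCP server class. Entering the context
    connects; exiting closes - mirroring the SDK's lifecycle."""

    def __init__(self, name):
        self.name = name
        self.connected = False

    async def __aenter__(self):
        self.connected = True
        return self

    async def __aexit__(self, *exc):
        self.connected = False

async def open_servers(names):
    """Connect several servers under one AsyncExitStack, so a failure
    partway through still closes everything already opened."""
    stack = contextlib.AsyncExitStack()
    servers = []
    for name in names:
        servers.append(await stack.enter_async_context(FakeMCPServer(name)))
    return stack, servers

async def main():
    stack, servers = await open_servers(["filesystem", "search"])
    async with stack:  # all servers close here, even on exceptions
        assert all(s.connected for s in servers)
    assert not any(s.connected for s in servers)

asyncio.run(main())
```

The same shape works with the real SDK classes - swap FakeMCPServer for your server instances and you get the try/finally guarantees without writing a nested block per server.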

Which One Should You Pick?

Here's our decision framework from building agents for enterprise AI deployments:

Pick hosted MCP when:

  • The MCP server is publicly accessible
  • You trust the server operator
  • You don't need custom approval flows
  • You want the simplest possible integration

Pick local MCP when:

  • The MCP server is on your internal network
  • You need to filter or approve tool calls before execution
  • You need audit logging of every tool invocation
  • You need to control the server lifecycle

In practice, most production systems end up using local MCP. The control it gives you over security boundaries is worth the extra code.

Tracing - The Debugging Story

Tracing is built into the Agents SDK and enabled by default. Every agent run emits a structured record of what happened - model calls, tool invocations, handoffs, guardrail checks. You can inspect these in the OpenAI Traces dashboard.

This is one of the better parts of the SDK, honestly. Most agent frameworks treat observability as an afterthought. The fact that you get structured traces out of the box, without installing a separate library or configuring a collector, is genuinely useful.

A default trace captures:

  • The overall run
  • Each model call with its prompt and response
  • Tool calls and their outputs
  • Handoffs between agents
  • Guardrail evaluations
  • Custom spans you define yourself

That last point is worth expanding on. You can wrap multiple agent runs in a single trace to see how they relate:

import { Agent, run, withTrace } from "@openai/agents";

const agent = new Agent({
  name: "Joke generator",
  instructions: "Tell funny jokes.",
});

await withTrace("Joke workflow", async () => {
  const first = await run(agent, "Tell me a joke");
  const second = await run(agent, `Rate this joke: ${first.finalOutput}`);
  console.log(first.finalOutput);
  console.log(second.finalOutput);
});
The Python version:

from agents import Agent, Runner, trace

agent = Agent(
    name="Joke generator",
    instructions="Tell funny jokes.",
)

async def main():
    with trace("Joke workflow"):
        first = await Runner.run(agent, "Tell me a joke")
        second = await Runner.run(
            agent,
            f"Rate this joke: {first.final_output}",
        )
        print(first.final_output)
        print(second.final_output)

The withTrace / trace wrapper creates a parent span that groups both runs together. In the dashboard, you see the full workflow as a single unit instead of two disconnected runs. This matters when you're debugging multi-step agent workflows where the output of one run feeds into the next.

What Tracing is Good For

Debugging individual runs. When an agent does something unexpected, the trace shows you exactly what happened - which tools it called, what data it received, where it made a decision. This is dramatically faster than adding print statements and re-running.

Understanding tool usage patterns. Over time, traces reveal which tools your agent calls most, which ones fail frequently, and where latency spikes happen. We've used this data to optimise tool implementations for clients - sometimes a tool that returns too much data is causing the agent to make poor decisions, and you only see that pattern by looking at traces in aggregate.
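If you're exporting traces to your own systems, that aggregate view is a small fold over the tool-call spans. A sketch, assuming a simple exported span shape (the name / duration_ms / error fields here are an assumption about your export format, not the SDK's schema):

```python
from collections import defaultdict

def summarise_tool_spans(spans):
    """Roll exported tool-call spans up into per-tool call counts,
    failure counts, and worst-case latency."""
    stats = defaultdict(lambda: {"calls": 0, "failures": 0, "max_ms": 0.0})
    for span in spans:
        s = stats[span["name"]]
        s["calls"] += 1
        if span.get("error"):
            s["failures"] += 1
        s["max_ms"] = max(s["max_ms"], span["duration_ms"])
    return dict(stats)

spans = [
    {"name": "search_docs", "duration_ms": 120.0},
    {"name": "search_docs", "duration_ms": 2300.0, "error": True},
    {"name": "read_file", "duration_ms": 45.0},
]
summary = summarise_tool_spans(spans)
# search_docs: 2 calls, 1 failure, slowest 2300 ms
```

A report like this, run weekly, is usually enough to spot the tool that's failing often or dragging latency before users complain about it.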

Feeding evaluations. Once you have traces from real usage, you can extract examples and use them to build evaluation datasets. The OpenAI docs talk about this as a pipeline from tracing into agent evaluation, and it's a solid approach. Real traces are a better signal than synthetic test cases.

What Tracing Isn't Good For

Tracing doesn't replace proper application logging. Traces are structured around the agent's perspective - what the model saw and did. They don't capture application-level concerns like user session state, business rule violations, or infrastructure metrics. You still need your regular logging and monitoring stack.

Also, traces go to OpenAI's infrastructure by default. If you're in a regulated industry or have data residency requirements, check whether the trace data includes any content that shouldn't leave your environment. You can dial down tracing granularity at the SDK level or disable it entirely per run.
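If you do export trace data and some fields can't leave your environment, redact them before they go anywhere. A minimal sketch - the key names (input, output, prompt) are examples; match them to whatever your payloads actually contain:

```python
def redact_trace_payload(payload, sensitive_keys=("input", "output", "prompt")):
    """Recursively replace sensitive fields in a trace payload with a
    placeholder before export. Works on nested dicts and lists."""
    if isinstance(payload, dict):
        return {
            k: "[REDACTED]" if k in sensitive_keys
            else redact_trace_payload(v, sensitive_keys)
            for k, v in payload.items()
        }
    if isinstance(payload, list):
        return [redact_trace_payload(item, sensitive_keys) for item in payload]
    return payload
```

You keep the structural information - which tools ran, how long they took - while the actual prompts and outputs stay inside your boundary.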

Putting It Together

The typical setup we deploy for clients combines local MCP with full tracing:

  1. Agent connects to internal MCP servers for business-specific tools (database queries, API calls, document retrieval)
  2. Tracing captures every interaction for debugging and audit
  3. Custom approval logic wraps sensitive tools - the agent can read data freely but needs approval before writing
  4. Traces feed into a weekly review process where we check for unexpected behaviours or degraded performance

This pattern works across industries - we've deployed variations of it for professional services firms, retail operations, and government agencies. The specific tools change, but the architecture stays consistent.

Things to Watch Out For

MCP server startup time. Local MCP servers that use npx to install packages on first run add several seconds of latency. In a web application where the agent runs per-request, that's too slow. Pre-install your MCP server dependencies or use a persistent server process instead of spawning one per request.

Trace data volume. With tracing enabled by default, a busy agent system generates a lot of trace data. OpenAI's dashboard handles this fine, but if you're exporting traces to your own systems, plan for the volume. We've seen agent workflows that generate 50+ spans per user interaction.

Connection cleanup. I mentioned this earlier but it bears repeating: if you use local MCP servers, always close them when you're done. Leaked child processes will eat memory and eventually crash your application. The Python async context manager pattern is the safest approach. In TypeScript, use try/finally religiously.

Tool approval latency. If you add approval workflows to your MCP tools, think about the user experience. An agent that pauses for human approval on every tool call is frustrating to use. Group approvals by sensitivity level - let read operations through automatically, require approval for writes and deletes.
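A simple way to implement that grouping is to classify tools by name before deciding whether to interrupt the user. The prefixes below are hypothetical conventions, not anything the SDK defines - the point is the tiering, defaulting unknown tools to the safe side:

```python
# Illustrative naming conventions - adjust to your actual tool names.
READ_PREFIXES = ("get_", "list_", "read_", "search_")
WRITE_PREFIXES = ("create_", "update_", "write_", "delete_", "drop_")

def needs_approval(tool_name: str) -> bool:
    """Auto-approve read operations; require a human for writes,
    deletes, and anything we can't classify."""
    if tool_name.startswith(READ_PREFIXES):
        return False
    if tool_name.startswith(WRITE_PREFIXES):
        return True
    return True  # unknown tools default to requiring approval

assert needs_approval("read_invoice") is False
assert needs_approval("delete_record") is True
```

Wire a check like this into whatever approval hook your MCP integration exposes, and the agent only pauses when it's about to do something irreversible.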

If you're building agent systems on the OpenAI Agents SDK and want help with the MCP integration architecture or production observability setup, our AI consulting team has been through these deployments. The patterns are settling down, and there's no reason to learn the gotchas the hard way.