
Guardrails and Human Review in OpenAI Agents - Getting Safety Right

April 10, 2026 · 9 min read · Michael Ridland

Every AI agent demo looks the same. The agent receives a request, makes a decision, calls a tool, returns a result. Clean, fast, impressive. Then you put it in production and someone asks the agent to cancel every order in the system, or the agent starts processing requests it was never designed to handle, or a tool call fires off an email to a client before anyone reviewed it.

Safety controls for AI agents aren't optional once you move past the demo stage. The OpenAI Agents SDK has two main mechanisms for this - guardrails for automatic checks and human-in-the-loop approvals for decisions that need a person. They work at different points in the agent's execution and solve different problems, but together they define when a run should continue, pause, or stop entirely. The official documentation covers the API in detail, so I'll focus on design decisions and what we've learned deploying these in practice.

Input Guardrails - Catching Bad Requests Early

An input guardrail runs before the main agent starts processing. Its job is to validate the incoming request and block it if something is wrong. The concept is simple, but the implementation has some design choices worth thinking through.

In the OpenAI Agents SDK, an input guardrail is itself an agent. You create a small, focused agent whose only job is to classify the incoming input and determine whether it should be allowed through. Here's the pattern in TypeScript:

import { Agent, run } from "@openai/agents";
import { z } from "zod";

const guardrailAgent = new Agent({
  name: "Homework check",
  instructions: "Detect whether the user is asking for math homework help.",
  outputType: z.object({
    isMathHomework: z.boolean(),
    reasoning: z.string(),
  }),
});

const agent = new Agent({
  name: "Customer support",
  instructions: "Help customers with support questions.",
  inputGuardrails: [
    {
      name: "Math homework guardrail",
      runInParallel: false,
      async execute({ input, context }) {
        const result = await run(guardrailAgent, input, { context });
        return {
          outputInfo: result.finalOutput,
          tripwireTriggered: result.finalOutput?.isMathHomework === true,
        };
      },
    },
  ],
});

The guardrail agent runs, analyses the input, and returns a structured output; the tripwire either blocks or allows the request. If blocked, the SDK throws an InputGuardrailTripwireTriggered exception, which you catch and handle in your application code.

The runInParallel flag is a meaningful choice. Setting it to false means the guardrail runs first, and only if it passes does the main agent start. Setting it to true means both start simultaneously, and if the guardrail trips while the main agent is already running, the run is interrupted. Parallel execution gives lower latency because you're not waiting for the guardrail before starting the main work. But it means you might waste compute on the main agent for requests that get blocked.

Our recommendation for most production deployments - use blocking execution (parallel set to false) for guardrails that protect against high-cost or dangerous actions. Use parallel execution for guardrails that filter out low-risk noise where the occasional wasted main agent run is acceptable. If the main agent's first action is calling an external API or modifying data, blocking is almost always the right choice.

Output Guardrails - Checking Before Delivery

Output guardrails run after the agent produces a response but before that response reaches the end user. They validate or redact the final output.

This is useful for things like PII detection (blocking responses that accidentally include personal information), compliance checking (making sure the agent's response aligns with regulatory requirements), and content policy enforcement (filtering responses that violate your organisation's guidelines).

The pattern is similar to input guardrails, just positioned at the other end of the pipeline. An output guardrail agent reviews the response and either allows it through or trips the wire.
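The check backing an output guardrail doesn't have to be another agent. For something like PII detection, a deterministic function can do the first pass. Here's a minimal sketch - the regexes, the GuardrailVerdict shape, and the checkOutputForPii helper are my own illustration, not part of the SDK:

```typescript
// A deterministic PII check that could back an output guardrail.
// The regexes here are illustrative, not exhaustive - real PII
// detection usually combines patterns with a classifier model.
const PII_PATTERNS: RegExp[] = [
  /\b[\w.+-]+@[\w-]+\.[\w.]+\b/, // email addresses
  /\b\d{3}-\d{2}-\d{4}\b/,       // US SSN format
  /\b(?:\d[ -]?){13,16}\b/,      // card-number-like digit runs
];

interface GuardrailVerdict {
  tripwireTriggered: boolean;
  outputInfo: { matchedPattern: string | null };
}

function checkOutputForPii(output: string): GuardrailVerdict {
  for (const pattern of PII_PATTERNS) {
    if (pattern.test(output)) {
      return {
        tripwireTriggered: true,
        outputInfo: { matchedPattern: pattern.source },
      };
    }
  }
  return { tripwireTriggered: false, outputInfo: { matchedPattern: null } };
}
```

Wired into an agent's output guardrail, the execute function would call checkOutputForPii on the final output and return the verdict, with an agent-based check layered on top for anything the patterns miss.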

One thing to watch for - output guardrails only run on the agent that produces the final output. If Agent A hands off to Agent B and Agent B produces the final response, Agent B's output guardrail runs, not Agent A's. You need to know where in your agent chain the final output is actually generated and put your guardrail there.

Tool Guardrails - Validation at the Action Layer

Tool guardrails sit on individual tools and validate arguments or results around function calls. This is the most granular level of control.

Say you have a tool that sends emails. A tool guardrail can check that the recipient is on an approved list, that the content doesn't contain sensitive information, or that the sending rate hasn't exceeded a limit. The validation happens right at the point where the agent is about to take a real-world action.

This is where I think the OpenAI SDK's approach shines compared to some alternatives. Putting validation next to the tool that creates the side effect is more reliable than trying to catch everything at the agent level. An agent-level input guardrail can't predict every possible tool call the agent might make during a complex multi-turn interaction. But a tool-level guardrail catches every invocation of that specific tool, regardless of how the agent arrived at the decision to call it.

For any tool that creates external side effects - sending messages, modifying records, triggering processes - I'd recommend having at least a basic tool guardrail, even if it's just logging.

Human-in-the-Loop Approvals

Guardrails are automatic. They run without human intervention and make binary allow/block decisions. Approvals are different - they pause the entire agent run and wait for a person to decide.

The SDK uses a needsApproval flag on tools. When a tool with this flag is called by the agent, instead of executing, the SDK records an interruption and returns it to your application:

const cancelOrder = tool({
  name: "cancel_order",
  description: "Cancel a customer order.",
  parameters: z.object({ orderId: z.number() }),
  needsApproval: true,
  async execute({ orderId }) {
    return `Cancelled order ${orderId}`;
  },
});

When the agent decides to call cancel_order, the run pauses. Your application receives a result with interruptions and a state object. You present the pending action to a human reviewer, they approve or reject, and you resume the run from the saved state.

let result = await run(agent, "Cancel order 123.");

if (result.interruptions?.length) {
  const state = result.state;
  for (const interruption of result.interruptions) {
    state.approve(interruption);
  }
  result = await run(agent, state);
}

The important thing about this pattern is that it's the same run being resumed, not a new conversation. The agent's context, the conversation history, the decision-making chain that led to the tool call - all of that is preserved in the state. The agent doesn't have to re-reason about whether to cancel the order after the approval comes back. It already decided to, it just needed permission.

Designing Your Approval Workflow

The state object is serialisable. This matters a lot for real-world approval workflows where the reviewer might not be available immediately. You can save the state to a database, send the approval request to a Slack channel or email queue, and resume the run hours or even days later when the approval comes in.
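The persistence layer around this is yours to design. Here's a sketch of one possible record shape for a paused run - the PendingApproval interface and the toDbRow/fromDbRow helpers are hypothetical, and the serialised state is treated as an opaque string produced by whatever serialisation call the SDK docs describe:

```typescript
// Hypothetical shape for persisting a paused run awaiting approval.
// serialisedState comes from the SDK's state serialisation and is
// stored verbatim so the run can be rehydrated later.
interface PendingApproval {
  runId: string;
  toolName: string;
  toolArguments: Record<string, unknown>;
  serialisedState: string; // opaque SDK state, stored as-is
  requestedAt: string;     // ISO timestamp
}

function toDbRow(approval: PendingApproval): string {
  // A real system would write this to a database; JSON keeps the
  // opaque state intact for later resumption.
  return JSON.stringify(approval);
}

function fromDbRow(row: string): PendingApproval {
  return JSON.parse(row) as PendingApproval;
}
```

When the reviewer signs off hours later, you load the row, rehydrate the SDK state from serialisedState, mark the interruption approved, and resume the run.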

For organisations with formal approval processes, this maps nicely onto existing patterns. An agent generates a purchase order, the run pauses, the PO goes through the normal approval chain, and when it's signed off, the agent run resumes and completes the order.

The key design question is which tools need approvals. Too few and you have agents taking actions nobody reviewed. Too many and you create an approval bottleneck that defeats the purpose of automation.

We generally recommend requiring approvals for actions that are hard to reverse (deletions, cancellations, financial transactions), actions that are visible to external parties (sending emails, posting to social media, filing reports), and actions that exceed defined thresholds (orders above a certain dollar value, modifications affecting more than N records).
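These criteria can be collapsed into a single predicate. This is a sketch - the dollar and record thresholds are made-up policy numbers, and the ActionRequest shape is my own; check the SDK docs for how to make a tool's approval requirement conditional rather than a fixed boolean:

```typescript
// Illustrative approval policy. Thresholds are made-up numbers -
// set them from your organisation's actual risk tolerances.
const APPROVAL_DOLLAR_THRESHOLD = 500;
const APPROVAL_RECORD_THRESHOLD = 100;

interface ActionRequest {
  reversible: boolean;        // can this action be undone cheaply?
  externallyVisible: boolean; // does a third party see the result?
  dollarValue?: number;
  recordsAffected?: number;
}

function requiresApproval(action: ActionRequest): boolean {
  if (!action.reversible) return true;
  if (action.externallyVisible) return true;
  if ((action.dollarValue ?? 0) > APPROVAL_DOLLAR_THRESHOLD) return true;
  if ((action.recordsAffected ?? 0) > APPROVAL_RECORD_THRESHOLD) return true;
  return false;
}
```

Keeping the policy in one function like this also gives you a single place to audit and tune as you learn which approvals reviewers actually reject.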

Actions that are internal, low-risk, and easily reversible usually don't need human approval. Reading data, generating drafts, doing calculations - let the agent do these without pausing.

Where Guardrails Don't Run

This is a common source of confusion and worth stating clearly.

Input guardrails only run for the first agent in a chain. If Agent A hands off to Agent B, Agent A's input guardrails check the original user input. Agent B's input guardrails don't run on the handoff - they only run if Agent B is the entry point for a new request.

Output guardrails only run for the agent that produces the final output. In a chain of agents, only the last agent's output guardrail matters for the final response.

Tool guardrails run wherever the tool is called, regardless of where in the chain it happens. This is why tool-level validation is the most reliable place for safety checks on specific actions.

If you have a manager-style agent that delegates work to specialist agents, and those specialists have tools with side effects, put your guardrails and approvals on the tools themselves. Don't rely on the manager agent's input guardrail to catch everything that a downstream specialist might do.

Streaming and Approvals

Streaming doesn't change the approval model. If an agent run is being streamed and it hits a tool that needs approval, the stream pauses. The same interruption and state model applies. Wait for the stream to settle, check for interruptions, resolve approvals, and resume.

This is well-designed. Some frameworks treat streaming as a fundamentally different execution mode with its own set of behaviours. The OpenAI SDK keeps it consistent - same state model, same approval flow, same resumption pattern. One less thing to worry about in production.

Getting Started

If you're building agents with the OpenAI SDK and thinking about safety controls, start with tool-level approvals on your highest-risk tools. That gives you immediate protection on the actions that matter most, without requiring you to design a complete guardrail architecture upfront. Add input guardrails next for request-level filtering, then output guardrails for response validation.

For teams building production AI agents and needing help with safety architecture, our AI agent builders work with organisations on agent design and deployment. We also help with broader agentic automation strategy, including deciding which framework fits your requirements and how to structure agent safety controls across your organisation. And if you're working with the OpenAI ecosystem specifically, our team has hands-on experience with the Agents SDK and can help you move from prototype to production with the right guardrails in place - reach out through our AI consulting practice.