OpenAI Agent Evals - How to Actually Test Your AI Agents
OpenAI Agent Evals - How to Actually Test Your AI Agents
Here's an uncomfortable truth about most AI agent deployments I've seen: nobody is properly testing them. Teams build an agent, try a few queries manually, decide it "seems to work," and ship it. Then they're surprised when the agent makes a bizarre recommendation to a customer at 2am on a Saturday.
Testing AI agents is hard. I'm not going to pretend otherwise. Traditional software testing doesn't apply directly because the outputs are non-deterministic. You can't write a unit test that says "the agent's response must equal this exact string." But the fact that it's hard doesn't mean you can skip it. And OpenAI's agent evals framework gives you a structured way to approach it.
I've been working through their evaluation tooling on a couple of recent projects, and it's changed how I think about agent quality. Not because it's perfect - it's not - but because it forces you to define what "good" actually means for your agent before you ship.
Why Agent Evals Are Different
Let me explain why regular model evals don't cut it for agents. When you evaluate a language model, you're testing input-output pairs. Given this prompt, does the model produce a response that meets some criteria? That's manageable.
Agents are different. An agent takes multiple steps. It decides which tools to call, in what order, with what parameters. It might hand off to another agent. It might retry a failed tool call. The "output" isn't just text - it's a sequence of decisions and actions that eventually produce a result.
Testing an agent means you need to evaluate the whole trajectory, not just the final answer. Did the agent call the right tools? Did it use the right parameters? Did it handle errors correctly? Did it know when to ask the user for clarification instead of guessing? Did it stop when it should have stopped?
This is a fundamentally different testing problem, and it requires different tooling.
The OpenAI Evals Approach
OpenAI's agent evals framework lets you define test cases for your agents and evaluate them automatically. The basic structure involves three things:
Test cases - These are scenarios you want your agent to handle. Each test case has an input (what the user says or asks), expected behaviour (what the agent should do), and evaluation criteria (how to judge whether it did the right thing).
Evaluators - These score the agent's performance on each test case. You can use model-based evaluators (where another AI model judges the response), rule-based evaluators (exact matches, regex patterns, tool call checks), or custom evaluators that run your own logic.
Runs - You execute your test cases against your agent and collect results. Each run produces scores that you can track over time.
The framework isn't prescriptive about how you define "good." You bring your own criteria. For a customer support agent, "good" might mean accurately identifying the customer's issue and providing the correct resolution steps. For a data analysis agent, "good" might mean calling the right API endpoints with the right parameters and producing a correct summary.
Building Your First Eval Suite
Let me walk through how I set this up in practice.
Step 1 - Collect Real Conversations
The worst mistake you can make is writing synthetic test cases that reflect how you think users will interact with the agent. Real users don't behave the way developers expect. They misspell things. They ask ambiguous questions. They change their mind mid-conversation. They paste in random context and expect the agent to figure it out.
Start by collecting real conversations. If your agent is already in production, pull a sample of actual user interactions. If it's new, do a few rounds of internal testing where people genuinely try to use the agent for its intended purpose - and record the conversations.
From those real conversations, extract the patterns. What questions come up most? What tool calls should the agent make? Where does it struggle? These become your test cases.
Step 2 - Define Your Evaluation Criteria
This is where most teams struggle because it forces you to be specific about what "good" means. Some criteria I use regularly:
Correctness - Did the agent give the right answer or take the right action? This is table stakes.
Tool usage - Did the agent call the appropriate tools? Did it pass the right parameters? Did it call tools in a sensible order? A common failure mode is agents calling tools unnecessarily or with wrong parameters that happen to produce plausible-looking results.
Boundary respect - Did the agent stay within its defined scope? If someone asks a customer support agent for medical advice, it should decline. If someone asks a data agent to delete records when it only has read access, it should say so.
Conversation quality - Was the response clear, appropriately detailed, and at the right level for the user? An agent that dumps a raw JSON response when the user asked "how many orders did we get yesterday?" is technically correct but functionally useless.
Error handling - When a tool call fails or returns unexpected data, does the agent handle it gracefully? Does it retry appropriately? Does it explain the situation to the user instead of silently failing?
Step 3 - Build Your Test Cases
A test case looks roughly like this:
test_case = {
"input": "What's the status of order #12345?",
"expected_tool_calls": [
{"tool": "lookup_order", "params": {"order_id": "12345"}}
],
"expected_response_contains": ["shipped", "tracking"],
"expected_response_excludes": ["I don't know", "I can't help"],
"criteria": {
"correctness": "Response must include the actual order status",
"tone": "Professional and helpful, not overly verbose",
}
}
For each test case, you define what you expect to happen and how to evaluate whether it happened. The evaluation criteria can be as simple or as nuanced as you need.
I usually build three categories of test cases:
Happy path - the agent does what it's supposed to do. Order lookups, answer questions, perform actions. These confirm the basics work.
Edge cases - invalid inputs, missing data, ambiguous requests, multiple possible interpretations. These find the rough edges.
Adversarial cases - prompt injection attempts, out-of-scope requests, attempts to make the agent do something it shouldn't. These test your guardrails.
Step 4 - Run and Iterate
Run your eval suite, review the results, and fix what's broken. Then run it again. This is iterative work, not a one-time setup.
What I've found is that the first round of evals usually reveals things you didn't think of. The agent handles the obvious cases fine but falls apart on edge cases. Or it technically produces correct answers but the phrasing confuses users. Or it calls the right tools but in an inefficient order that makes responses slow.
Track your scores over time. When you change your agent's system prompt, update its tools, or modify its configuration, run the eval suite again and compare. This is how you catch regressions early instead of finding them in production.
Evaluator Types and When to Use Each
Model-based evaluators use another AI model to judge the agent's output. These are good for subjective criteria like tone, helpfulness, and response quality. They're flexible but can be inconsistent - the evaluator model might judge the same response differently on different runs. Use them for qualitative assessment, but don't rely on them exclusively.
Rule-based evaluators check for specific patterns. Did the response contain certain keywords? Did the agent call a specific tool? Did the response stay under a word limit? These are deterministic and reliable. Use them for objective criteria.
Custom evaluators run your own code. You might query your database to verify the agent returned the correct data, or check that a tool call actually performed the expected action. These take more effort to build but give you the most confidence.
My recommendation: use rule-based evaluators for objective criteria (tool calls, response format, boundary checks) and model-based evaluators for subjective criteria (tone, helpfulness, explanation quality). Layer them. Don't pick one type and ignore the others.
What I've Learned Running Agent Evals
Start with 20-30 test cases, not 200. A small, well-designed eval suite that you actually run is infinitely more valuable than a massive suite that you never finish building. You can always add more test cases later as you discover new failure modes.
Test the failure modes, not just the happy path. Most of the value from evals comes from edge case and adversarial testing. Anyone can verify that the agent answers simple questions correctly. The interesting work is finding out what happens when things go wrong.
Run evals in CI/CD. Agent evals should run automatically when you change the agent's configuration. If you only run them manually, you won't run them often enough. Automate the eval suite and set up alerts when scores drop below thresholds.
Human review is still necessary. Automated evals catch a lot, but they don't catch everything. Schedule regular human review sessions where someone reads through actual agent conversations and flags issues. The insights from human review should feed back into new test cases.
Version your eval suite alongside your agent. When you change your agent, you often need to update your eval criteria too. If you add a new tool, add test cases for it. If you change the system prompt, update the expected behaviour. Keep them in sync.
Don't over-optimise for eval scores. This is a real risk. If you tune your agent specifically to score well on your eval suite, you might be overfitting to your test cases rather than actually improving the agent. Keep your eval suite representative of real usage, and keep refreshing it with new real conversations.
How This Fits Into Our Work
At Team 400, we build and deploy AI agents for Australian organisations across multiple frameworks and platforms. Evaluation is part of every engagement, not an afterthought. Whether we're building agents on OpenAI, Azure AI, or other platforms, the principles of proper evaluation apply.
For organisations that are just starting with AI agents and want to understand what a proper development and testing workflow looks like, our AI consulting team can help you set up evaluation pipelines alongside your agent development.
If you're already running agents in production and want to improve their reliability, we offer AI managed services that include ongoing evaluation, monitoring, and tuning.
The OpenAI agent evals documentation is available at Agent evals.