OpenAI Structured Outputs - Getting Reliable JSON from LLMs
Anyone who's built a production application on top of an LLM has hit this problem. You ask the model for JSON. Sometimes you get perfect JSON. Sometimes you get JSON wrapped in markdown code fences. Sometimes you get a friendly explanation followed by JSON. And occasionally you get something that looks like JSON but has a trailing comma that breaks your parser at 3am on a Saturday.
OpenAI's Structured Outputs feature fixes this properly. Instead of hoping the model follows your formatting instructions, you define a JSON Schema and the model is guaranteed to produce output that matches it. Not "usually matches" or "matches most of the time" - guaranteed.
We've been using this across our AI development projects since it became available, and it's one of those features that sounds simple but changes how you build things.
Why This Matters
Before Structured Outputs, the standard approach was to include detailed formatting instructions in your prompt ("Return ONLY valid JSON with the following structure..."), then wrap the response in a try/catch, then retry on parse failure, then add even more emphatic instructions, then add a regex-based extraction fallback for when the model wraps its JSON in prose.
Everyone has built some version of this retry-and-parse dance. It works most of the time. But "most of the time" isn't good enough when you're processing thousands of requests per hour and each failure means either a retry (more latency, more cost) or a dropped request (angry user).
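The dance usually looks something like this - a sketch of the old best-effort extraction, not code from any particular codebase (the `extract_json` helper is illustrative):

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Best-effort JSON extraction from an LLM reply - the old, fragile way."""
    # First attempt: maybe the model returned clean JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fallback: ignore prose and code fences, grab the first {...} span.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        return json.loads(match.group(0))  # may still raise, triggering a retry
    raise ValueError("No JSON found in model output")
```

Every branch here is a failure mode you have to monitor, and the regex fallback still breaks on trailing commas and nested prose.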
Structured Outputs eliminates this entire category of problems. You define the schema. The model follows it. Your parsing code gets simpler. Your error handling gets simpler. Your 3am Saturday nights get quieter.
How It Works
There are two ways to use Structured Outputs in the OpenAI API:
- Through function calling - when you're building tool-using agents that need structured parameters
- Through response format - when you want the model's reply to the user itself to follow a structured format

Function calling is the right choice when you're connecting the model to external systems - databases, APIs, internal tools. The model generates structured function arguments that your code can parse and execute reliably.
Response format is better when you want to control how the model structures its reply to the user. Think of a math tutoring app that needs step-by-step solutions in a specific format, or a data extraction pipeline that needs structured records from unstructured text.
The distinction matters because it affects how you architect your application. Don't use response format when you should be using function calling, and vice versa.
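On the function-calling side, Structured Outputs is enabled by setting `"strict": True` on the tool definition, which guarantees the generated arguments match the parameters schema. A sketch with an illustrative `lookup_order` tool (not from a specific project):

```python
# A strict tool definition: with "strict": True the model's arguments are
# guaranteed to match the parameters schema exactly.
lookup_order_tool = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Look up an order by its ID.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
            },
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}
# Passed via: client.chat.completions.create(..., tools=[lookup_order_tool])
```

Note that strict mode requires `additionalProperties: false` and all properties listed in `required` - the API rejects schemas that don't meet those constraints.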
Practical Example - Python with Pydantic
Here's what this looks like in practice using Python. You define your schema as Pydantic models:
from pydantic import BaseModel

class Step(BaseModel):
    explanation: str
    output: str

class MathReasoning(BaseModel):
    steps: list[Step]
    final_answer: str
Then pass the model to the API call:
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are a helpful math tutor. Guide the user through the solution step by step."},
        {"role": "user", "content": "how can I solve 8x + 7 = -23"},
    ],
    response_format=MathReasoning,
)

result = completion.choices[0].message.parsed
That result is a proper MathReasoning object. Not a string you need to parse. Not JSON you need to validate. A typed Python object with .steps and .final_answer attributes, ready to use.
No try/catch for JSON parsing. No retry logic. No "please respond only in JSON" prompt engineering.
JavaScript with Zod
The JavaScript SDK uses Zod for schema definition:
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();

const Step = z.object({
  explanation: z.string(),
  output: z.string(),
});

const MathReasoning = z.object({
  steps: z.array(Step),
  final_answer: z.string(),
});

const completion = await openai.chat.completions.parse({
  model: "gpt-4o-2024-08-06",
  messages: [
    { role: "system", content: "You are a helpful math tutor." },
    { role: "user", content: "how can I solve 8x + 7 = -23" },
  ],
  response_format: zodResponseFormat(MathReasoning, "math_reasoning"),
});

const result = completion.choices[0].message.parsed;
Same idea - define the schema using your language's native type system, pass it to the API, get typed output back. The SDK handles the JSON Schema translation behind the scenes.
Handling Refusals
There's a subtlety here that's worth understanding. Sometimes the model refuses to generate a response - typically for safety reasons. With Structured Outputs, refusals are now programmatically detectable rather than being buried in the text output.
In Python:
message = completion.choices[0].message

if message.refusal:
    print(f"Model refused: {message.refusal}")
else:
    print(message.parsed)
In the Responses API, refusals come as a specific content type:
for output in response.output:
    for item in output.content:
        if item.type == "refusal":
            print(f"Refused: {item.refusal}")
        elif item.parsed:
            print(item.parsed)
This is a big improvement over the old approach where a refusal might come back as a free-text response that your JSON parser would choke on. Now you can handle it as a proper code path.
The Responses API vs Chat Completions
OpenAI has two API surfaces that support Structured Outputs. The Chat Completions API uses response_format with .parse(), while the newer Responses API uses text_format with .parse().
The Responses API is the newer approach and is where OpenAI is putting their development effort. If you're starting a new project, use it. If you have existing code on Chat Completions, there's no urgent reason to migrate - both work fine and will continue to.
The practical difference is mostly in the response structure. The Responses API returns outputs with a content array, while Chat Completions returns choices with messages. The typing and schema definition is identical.
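For comparison, here's what the same math-tutor call might look like against the Responses API - a sketch using the SDK's parse helper, with the `Step` and `MathReasoning` models repeated so the snippet stands alone (the `solve` wrapper is illustrative):

```python
from pydantic import BaseModel

class Step(BaseModel):
    explanation: str
    output: str

class MathReasoning(BaseModel):
    steps: list[Step]
    final_answer: str

def solve(question: str) -> MathReasoning:
    # Responses API: text_format replaces response_format, and the
    # parsed object comes back on response.output_parsed.
    from openai import OpenAI
    client = OpenAI()
    response = client.responses.parse(
        model="gpt-4o-2024-08-06",
        input=[
            {"role": "system", "content": "You are a helpful math tutor."},
            {"role": "user", "content": question},
        ],
        text_format=MathReasoning,
    )
    return response.output_parsed
```

The schema definition is untouched - only the call site and the attribute you read the result from change.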
Where We Use This in Production
Let me share a few real patterns from our projects.
Data extraction from documents. A client needed to pull structured information from thousands of supplier invoices - ABNs, line items, totals, payment terms. We defined a Pydantic model matching the expected invoice structure and ran each document through GPT-4o with Structured Outputs. The extraction accuracy was high enough that they only needed human review on edge cases rather than every single document.
API response generation. For an internal tool, we needed the LLM to generate responses matching a specific API contract. Without Structured Outputs, we had a 2-3% failure rate on malformed responses that required retries. With Structured Outputs, that dropped to zero. The model cannot, by construction, produce output that doesn't match the schema.
Multi-step reasoning in agents. When building agents that need to plan and execute steps, Structured Outputs ensures each step has the right fields - action type, parameters, reasoning, expected outcome. The agent framework doesn't need defensive parsing logic because the structure is guaranteed.
Classification pipelines. Categorising support tickets, tagging content, sentiment analysis - any task where you need the model to pick from a defined set of options and return additional metadata. JSON Schema's enum support means the model can only return valid categories.
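In Pydantic, a `Literal` field compiles down to a JSON Schema enum. A hypothetical ticket-classification schema (the categories here are illustrative, not from a client project):

```python
from typing import Literal
from pydantic import BaseModel

class TicketClassification(BaseModel):
    # Literal becomes a JSON Schema enum, so the model can only
    # emit one of these exact values - no "Billing!" or "misc".
    category: Literal["billing", "technical", "account", "other"]
    sentiment: Literal["positive", "neutral", "negative"]
    summary: str

# Passed as response_format=TicketClassification in the .parse() call;
# the SDK translates the Literals into enum constraints for you.
```

The same validation the API enforces at generation time also protects you locally - Pydantic rejects any value outside the enum.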
Schema Design Tips
A few things we've learned about designing schemas for Structured Outputs:
Keep schemas focused. Don't try to capture every possible output in one schema. If you have different output types for different scenarios, use separate schemas. A single bloated schema confuses the model and produces worse results.
Use enums for constrained fields. If a field should only contain one of a few values, define it as an enum in your schema. The model will only produce valid values, which eliminates an entire class of validation logic.
Nest sensibly. Structured Outputs supports nested objects and arrays, but deeply nested schemas are harder for the model to fill consistently. Two or three levels of nesting works well. More than that, consider flattening.
Description fields help. Both Pydantic and Zod let you add descriptions to fields. These act like mini-prompts that guide the model on what each field should contain. Use them, especially for fields where the name alone might be ambiguous.
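In Pydantic that means `Field(description=...)`, and the descriptions flow into the JSON Schema the model sees. A sketch with a hypothetical invoice line-item schema:

```python
from pydantic import BaseModel, Field

class InvoiceLine(BaseModel):
    sku: str = Field(description="Supplier's product code, exactly as printed")
    quantity: int = Field(description="Number of units, not the line total")
    unit_price: float = Field(description="Price per unit excluding tax")

# The descriptions land in the generated JSON Schema, acting as
# per-field mini-prompts for the model.
schema = InvoiceLine.model_json_schema()
```

A field like `quantity` is exactly where this pays off - without the description, the model might plausibly return the line total instead of the unit count.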
Supported Models
Structured Outputs works with GPT-4o and newer models. Older models like GPT-4 Turbo don't support it - for those, you'd need to fall back to JSON Mode, which guarantees valid JSON but doesn't enforce a specific schema.
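If you do end up on JSON Mode (requested with `response_format={"type": "json_object"}` on a plain `create()` call), schema validation stays on your side. A minimal sketch of that validation half, with a hypothetical `Answer` schema:

```python
import json
from pydantic import BaseModel

class Answer(BaseModel):
    answer: str
    confidence: float

def parse_json_mode_reply(raw: str) -> Answer:
    # JSON Mode guarantees the string is syntactically valid JSON,
    # but matching YOUR schema is still your job - hence the
    # Pydantic validation step.
    return Answer.model_validate(json.loads(raw))
```

In other words, JSON Mode removes the parse failures but keeps the validation failures; Structured Outputs removes both.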
If you're starting a new project, there's no reason to use anything older than GPT-4o for structured output tasks. The schema adherence and output quality are significantly better.
What This Means for Your Architecture
Structured Outputs changes how you should think about LLM integration architecture. Code that used to handle parsing, validation, retry, and fallback can be simplified dramatically. Error handling shifts from "what if the format is wrong?" to "what if the model refuses?" which is a much smaller surface area.
For applications processing high volumes - data extraction, classification, content generation - the removal of retry overhead alone makes a measurable difference in throughput and cost.
The OpenAI Structured Outputs documentation covers the full schema specification and additional examples.
How We Can Help
We build AI applications using OpenAI, Azure OpenAI, and other model providers through our AI development practice. Structured Outputs is one of the patterns we use regularly in production systems - from document processing pipelines to conversational agents.
If you're building something that needs reliable structured data from an LLM, or if you've got an existing integration that's fighting with parsing and retry logic, talk to us. We can help you design the right schema, pick the right API surface, and get to production without the usual false starts. Our AI automation consulting team has seen these patterns across enough projects to know what works and what doesn't.