Using the OpenAI Image Generation Tool in AI Agents
OpenAI's Responses API includes an image generation tool that lets AI agents create and edit images as part of a conversation. Instead of calling a separate image generation endpoint, you give the model access to the image_generation tool and it decides when and how to generate images based on the conversation context.
This is a meaningful shift from how image generation worked before. Previously, you'd make explicit API calls to DALL-E or similar models with carefully crafted prompts. Now, the mainline GPT model handles prompt refinement automatically and calls the image generation tool when it makes sense in the conversation flow. For agents that need to produce visual content - marketing materials, mockups, data visualisations, product images - this makes the workflow much more natural.
How It Works
You include {"type": "image_generation"} in the tools array when creating a response. The model then has the option to generate images whenever the conversation calls for it.
Here's the basic pattern in Python:
from openai import OpenAI
import base64

client = OpenAI()

response = client.responses.create(
    model="gpt-5.4",
    input="Generate an image of a modern office workspace with natural lighting",
    tools=[{"type": "image_generation"}],
)

image_data = [
    output.result
    for output in response.output
    if output.type == "image_generation_call"
]

if image_data:
    with open("workspace.png", "wb") as f:
        f.write(base64.b64decode(image_data[0]))
And the JavaScript equivalent:
import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.responses.create({
  model: "gpt-5.4",
  input: "Generate an image of a modern office workspace with natural lighting",
  tools: [{ type: "image_generation" }],
});

const imageData = response.output
  .filter((output) => output.type === "image_generation_call")
  .map((output) => output.result);

if (imageData.length > 0) {
  const fs = await import("fs");
  fs.writeFileSync("workspace.png", Buffer.from(imageData[0], "base64"));
}
The response includes the generated image as base64-encoded data in the output array. You filter for items with type: "image_generation_call" and decode the result.
Available Models
The tool works with several GPT Image models:
- gpt-image-2 - the latest model with flexible resolution support
- gpt-image-1.5 - solid mid-range option
- gpt-image-1 - the original, still available
- gpt-image-1-mini - faster and cheaper for simpler requests
The model selection for image generation is separate from the conversational model. You specify your conversational model (like gpt-5.4) in the main request, and the image generation tool uses the GPT Image models internally.
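As a sketch of what that separation looks like in practice, the tool entry can carry its own model choice, assuming the tool config accepts a model field (verify the exact field name against the current API reference):

```python
# Hypothetical sketch: the conversational model goes in the main request,
# while the image model is pinned inside the tool config itself.
# The "model" key on the tool entry is an assumption; check the API reference.
tools = [
    {
        "type": "image_generation",
        "model": "gpt-image-1",  # image model, independent of gpt-5.4
    }
]
```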
Prompt Refinement Happens Automatically
One of the genuinely useful features here is automatic prompt refinement. When you provide a simple prompt like "a cat wearing a hat," the mainline model rewrites it into something more detailed before passing it to the image generator. The revised prompt is included in the response:
{
  "id": "ig_123",
  "type": "image_generation_call",
  "status": "completed",
  "revised_prompt": "A fluffy gray cat wearing a small red beret, sitting upright with a dignified expression. The background is soft and warm-toned.",
  "result": "..."
}
This is helpful for two reasons. First, you get better images without having to write detailed prompts yourself. The model knows what kind of detail produces good results and adds it. Second, you can see exactly what prompt was used, so if the image isn't quite right, you know what to adjust.
For agentic workflows, this is particularly valuable. Your agent can give a natural-language description and let the prompt refinement handle the specifics. Less prompt engineering, better results.
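A small helper makes it easy to surface those refined prompts for logging or debugging. This sketch assumes each output item is shaped like the JSON example above:

```python
def revised_prompts(output_items):
    """Collect refined prompts from image_generation_call items.

    Each item is assumed to be a dict shaped like the example JSON
    above (with "type" and "revised_prompt" keys).
    """
    return [
        item["revised_prompt"]
        for item in output_items
        if item.get("type") == "image_generation_call" and "revised_prompt" in item
    ]
```

Logging these alongside the images your agent produces gives you a record of what the model actually asked for, which is invaluable when an output surprises you.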
Configuring Output Options
The tool accepts several configuration parameters:
- Size - Image dimensions like 1024x1024 or 1024x1536. Supports auto to let the model choose.
- Quality - Low, medium, or high. Higher quality takes longer and costs more.
- Format - Output file format (PNG, JPEG, WebP).
- Compression - For JPEG and WebP, controls compression level from 0-100%.
- Background - Transparent or opaque. Note that gpt-image-2 doesn't support transparent backgrounds yet.
- Action - Whether to auto-detect, force generate, or force edit mode.
Most of these support an auto option where the model picks the best setting based on your prompt. For most use cases, letting the model decide is fine. Override when you have specific requirements, like needing a transparent PNG for a logo or a specific aspect ratio for social media.
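Overriding looks like adding the relevant fields to the tool entry. A minimal sketch, with the caveat that the exact parameter names (particularly for format and compression) should be checked against the API reference:

```python
# Hypothetical tool config with explicit overrides instead of auto.
# Field names like "output_format" are assumptions; confirm against the docs.
tools = [
    {
        "type": "image_generation",
        "size": "1024x1536",       # portrait aspect ratio
        "quality": "high",         # slower and more expensive than medium/low
        "output_format": "png",    # PNG needed for transparency
        "background": "transparent",
    }
]
```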
Multi-Turn Editing
This is where the tool gets really interesting for agent workflows. You can iteratively edit images across conversation turns by referencing previous responses.
# First generation
response = client.responses.create(
    model="gpt-5.4",
    input="Generate a product mockup showing a coffee mug with our logo",
    tools=[{"type": "image_generation"}],
)

# Save the first version
image_data = [
    output.result
    for output in response.output
    if output.type == "image_generation_call"
]

if image_data:
    with open("mug_v1.png", "wb") as f:
        f.write(base64.b64decode(image_data[0]))

# Refine it
response_v2 = client.responses.create(
    model="gpt-5.4",
    previous_response_id=response.id,
    input="Make the background a kitchen counter instead of plain white",
    tools=[{"type": "image_generation"}],
)
The previous_response_id parameter carries context from the first generation into the second. The model understands it's editing an existing image, not creating something from scratch. This is much better than starting over each time.
For agent workflows that involve design iteration - "move the text to the left," "change the colour scheme," "add a border" - this multi-turn capability means the agent can refine images through natural conversation without losing context.
You can also use the action parameter to explicitly control whether the model generates a new image or edits an existing one. Setting action: "edit" forces edit mode, while action: "generate" forces new generation. The default auto lets the model decide based on context, which usually works well.
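Forcing edit mode on a follow-up turn looks like this sketch, where the response ID is a placeholder and the "action" field name follows the option list above (verify it against the API reference):

```python
# Hypothetical follow-up request forcing edit mode rather than letting
# the model auto-detect. "resp_abc123" is a placeholder id from turn one.
request_kwargs = {
    "model": "gpt-5.4",
    "previous_response_id": "resp_abc123",
    "input": "Swap the plain background for a kitchen counter",
    "tools": [{"type": "image_generation", "action": "edit"}],
}
```

Forcing "edit" is useful when the user's phrasing is ambiguous but your agent knows from its own state that an image already exists and should be modified, not replaced.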
Practical Prompting Tips
A few things I've noticed from working with the image generation tool in agent setups:
Use "draw" or "edit" language. The model responds better to action-oriented prompts. "Draw a diagram showing the network architecture" works better than "network architecture diagram."
For combining images, say "edit" not "merge." If you're providing input images and want elements combined, phrasing like "edit the first image by adding this element from the second image" produces better results than "combine these two images."
Be specific about what you want to keep vs change. In multi-turn editing, clearly state what should stay the same. "Keep the same coffee mug but change only the background" gives better results than "change the background."
Quality settings matter for cost. High quality looks noticeably better for detailed images but costs more. For quick drafts or prototypes, low or medium quality saves money and is often good enough to evaluate the concept.
Where This Fits in Agent Architecture
The image generation tool works best as one capability among many in a multi-tool agent. Consider a marketing agent that can:
- Research trending topics (web search tool)
- Draft copy (built-in language capability)
- Generate accompanying images (image generation tool)
- Format the content for different platforms (code execution tool)
The image generation becomes part of a larger workflow rather than a standalone feature. The agent decides when an image would help and generates one without the user having to explicitly request it or switch to a different tool.
For Australian businesses running content marketing or e-commerce operations, this kind of integrated image generation can speed up content production significantly. Instead of a workflow where someone writes copy, sends a brief to a designer, waits for assets, and then assembles the final piece, an agent can produce a draft with images in a single pass. The human reviews and refines rather than orchestrating the entire process.
Limitations to Know About
A few practical limitations worth flagging:
- gpt-image-2 doesn't support transparent backgrounds. If you need transparency, use gpt-image-1 or gpt-image-1.5.
- Images come back as base64. For large images at high quality, the response payloads can be substantial. Factor this into your API timeout and memory planning.
- Prompt refinement changes your input. The revised prompt is usually better, but sometimes the model interprets your intent differently than you expected. Check the revised_prompt field if the output surprises you.
- Rate limits apply. Image generation is slower and more resource-intensive than text generation. Build appropriate delays into agent loops that generate multiple images.
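For that last point, a simple pacing wrapper is often enough. In this sketch, generate_fn stands in for whatever function issues the Responses API call, and the default delay is a guess to tune against your actual rate limits, not an official figure:

```python
import time

def generate_with_pacing(prompts, generate_fn, delay_s=2.0):
    """Run generate_fn for each prompt, pausing between calls.

    generate_fn is a placeholder for your API-calling function;
    delay_s is an assumed pacing interval, not a documented limit.
    """
    results = []
    for prompt in prompts:
        results.append(generate_fn(prompt))
        time.sleep(delay_s)
    return results
```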
Getting Started
If you're already using the OpenAI Responses API for agent workflows, adding image generation is straightforward - it's just another tool in the array. Start with simple single-turn generation to get a feel for the output quality and prompt refinement, then move to multi-turn editing workflows once you're comfortable.
For teams building AI agent systems that need visual content capabilities, the image generation tool is worth integrating early. It's one of those features that opens up use cases you might not have considered - automated report generation with charts, product catalogue image creation, personalised marketing materials.
The full API reference is in OpenAI's image generation documentation, which covers additional options around file ID inputs and advanced configuration.
If you're working on agentic automations that include visual content and want help designing the architecture, our team has built several agent workflows that incorporate image generation as part of larger content pipelines. It's a practical capability that, when integrated well, removes real bottlenecks from content-heavy business processes.