How to Cut Claude API Costs With Prompt Caching for Tool Definitions
If you've been building production agents on the Claude API and looking at your bill at the end of the month, you've probably noticed something. The tool definitions are eating you alive. Every turn of a long conversation re-sends the entire tool catalogue. For an agent with 30 or 40 tool definitions, that's tens of thousands of tokens of repeated payload, every single call.
Prompt caching solves this, but the documentation around tool caching specifically is thin and the rules around invalidation are non-obvious. We've shipped enough agents on the Claude API over the last year to have learned this the hard way. Here's the practical version of how to use it.
Why Tool Definitions Are the Worst Offender
A typical enterprise agent has more tool tokens than system prompt tokens. We see ratios of 5 to 1 or higher on agents that do real work. The tool definitions are verbose by necessity - every parameter needs a description, every enum needs documented values, every nested object needs schema definitions.
Without caching, those tokens are re-sent and billed at the full input rate on every turn. A 20-turn conversation with 8000 tokens of tool definitions costs you 160,000 input tokens just for the tool catalogue, before any actual prompt content.
With caching, you pay slightly more than full price on the first turn (writing to the cache carries a small premium) and get something like a 90% discount on those tokens on every subsequent turn until the cache expires. For a long-running agent, that's the difference between a project that's economically viable and one that isn't.
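To make that arithmetic concrete, here is a rough Python sketch of the input cost of the tool catalogue alone. The multipliers are assumptions, not figures from this article: a cache write is taken to cost 1.25 times the base input price and a cache read 0.10 times (the roughly 90% discount mentioned above). It also assumes the cache never expires mid-conversation, so check current pricing and your own turn timing before relying on the numbers.

```python
def tool_token_cost(turns, tool_tokens, write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached, cached) cost in base-price input-token equivalents.

    write_multiplier and read_multiplier are assumed pricing ratios.
    """
    if turns < 1:
        return 0, 0.0
    # Without caching, the full catalogue is billed on every turn.
    uncached = turns * tool_tokens
    # With caching, the first turn writes the cache and later turns read it.
    cached = tool_tokens * write_multiplier + (turns - 1) * tool_tokens * read_multiplier
    return uncached, cached


uncached, cached = tool_token_cost(turns=20, tool_tokens=8000)
print(f"uncached: {uncached:,.0f} token-equivalents")
print(f"cached:   {cached:,.0f} token-equivalents")
print(f"saving:   {1 - cached / uncached:.0%}")
```

Under those assumptions, the 20-turn, 8000-token example comes out at roughly 25,200 token-equivalents instead of 160,000.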
This matters more in Australia than it does in the US because of the AUD-USD exchange rate. A Claude API project that's marginal in San Francisco is often unworkable in Sydney unless you're aggressive about cost optimisation. Prompt caching is the first lever we pull. Our AI consulting team does a lot of this kind of cost engineering on production deployments.
The Basic Pattern - cache_control on Your Last Tool
The caching API for tools is simple. You put a cache_control block on the last tool definition in your tools array, and Claude caches everything from the first tool through that breakpoint.
{
  "tools": [
    {
      "name": "get_weather",
      "description": "Get the current weather in a given location",
      "input_schema": { "..." }
    },
    {
      "name": "get_time",
      "description": "Get the current time in a given time zone",
      "input_schema": { "..." },
      "cache_control": { "type": "ephemeral" }
    }
  ]
}
That's it. The breakpoint applies to the entire prefix from the first tool through the marked tool. You don't need to mark every tool individually. In fact you shouldn't, because you only get a small number of cache breakpoints per request (four at the time of writing).
The thing to remember is that the cache is order-sensitive. If you reorder your tools between calls, you've invalidated the cache. We've seen teams manage their tools in a set-like data structure whose iteration order isn't guaranteed, and that silently kills caching. Use a list or array and don't reorder it.
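One way to rule out accidental reordering is to build the tools array through a single helper. The Python sketch below is illustrative only: build_cached_tools is a hypothetical helper name, and sorting by tool name is just one way to get a stable order (a fixed, hand-maintained list works equally well). It places a single cache_control breakpoint on the last tool, as described above.

```python
import copy


def build_cached_tools(tool_defs):
    """Return tools in a stable order with one cache breakpoint on the last tool."""
    # Sort by name so the order never depends on set or dict iteration order.
    tools = sorted((copy.deepcopy(t) for t in tool_defs), key=lambda t: t["name"])
    for tool in tools:
        # Drop stray breakpoints so only the final tool carries one.
        tool.pop("cache_control", None)
    if tools:
        tools[-1]["cache_control"] = {"type": "ephemeral"}
    return tools


# Placeholder schemas for illustration only.
weather = {
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "input_schema": {"type": "object", "properties": {}},
}
clock = {
    "name": "get_time",
    "description": "Get the current time in a given time zone",
    "input_schema": {"type": "object", "properties": {}},
}

# The same array comes back regardless of the order the tools arrive in.
print(build_cached_tools([weather, clock]) == build_cached_tools([clock, weather]))
```

Because the helper copies its input, the original definitions stay untouched and can be reused across conversations.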
How defer_loading Keeps Cache Alive Across Tool Discovery
This is the part that took me a while to internalise but it's the most important pattern for large agents.
Most production agents have more tools than you want to load up front. A finance agent might have access to 200 different APIs. Sending all 200 tool definitions on every call is wasteful and also confuses the model (more on that in a minute).
The standard pattern is to use tool search and defer_loading. You start the conversation with a small set of always-loaded tools - maybe 5 to 10 core capabilities. When the model needs something else, it calls the tool search tool to discover what's available, and the relevant tool definitions are loaded inline as tool_reference blocks in the conversation history.
Here's the key insight: deferred tools are not part of the system prefix. So when one gets discovered and loaded mid-conversation, the prefix doesn't change. Your cache is preserved. You can start a 20-turn conversation with a cached small toolset, let the model discover 15 additional tools along the way, and still get cache hits on every turn.
This is genuinely clever and we've built a few of our enterprise AI agents around this pattern. It's the way to scale agents to hundreds of tools without paying for the full catalogue on every call.
The other benefit, by the way, is that smaller tool catalogues produce better model behaviour. A model with 200 tools available chooses badly. A model with 8 tools available and the ability to discover more chooses well. Tool search is good for performance and good for cost at the same time.
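As a rough sketch of how such a tools array might be assembled, the Python below splits tools into an always-loaded core and a deferred remainder. Several things here are assumptions: build_tools_with_deferral is a hypothetical helper, the search tool definition is passed in by the caller exactly as the official documentation specifies it (the type string below is a placeholder, not a real value), and placing the breakpoint on the last always-loaded tool ahead of the deferred entries follows the advice in this article rather than a verified API requirement.

```python
import copy


def build_tools_with_deferral(core_tools, deferred_tools, search_tool):
    """Assemble a tools array: search tool and core tools up front, the rest deferred."""
    tools = [copy.deepcopy(search_tool)] + [copy.deepcopy(t) for t in core_tools]
    # One breakpoint on the last always-loaded tool.
    tools[-1]["cache_control"] = {"type": "ephemeral"}
    for tool in deferred_tools:
        deferred = copy.deepcopy(tool)
        # Deferred tools are discovered later through tool search.
        deferred["defer_loading"] = True
        tools.append(deferred)
    return tools


# Placeholder definitions; take the real search tool definition from the docs.
search_tool = {"type": "PLACEHOLDER_TOOL_SEARCH_TYPE", "name": "tool_search"}
core = [{"name": "get_balance", "description": "Core tool", "input_schema": {"type": "object"}}]
extra = [{"name": "export_ledger", "description": "Rarely used", "input_schema": {"type": "object"}}]

for tool in build_tools_with_deferral(core, extra, search_tool):
    print(tool["name"], tool.get("defer_loading", False), "cache_control" in tool)
```

Keeping the core set small and stable is what makes the cached prefix worth having; the deferred list can grow without touching it.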
What Actually Invalidates Your Cache
This is where most teams get burned. The cache follows a prefix hierarchy: tools, then system, then messages. A change at any level invalidates that level and everything after it.
The full list, with my notes from production debugging:
Modifying tool definitions - invalidates everything. Even adding a parameter description to one tool wipes the cache. This is why your tool definitions should be stable. If you're constantly tweaking tool docs in production, you're not getting cache hits.
Toggling web search or citations - invalidates system and messages cache. We had a fun bug where an A/B test toggled web search per request and the team couldn't figure out why caching wasn't working. The web search toggle was the cause.
Changing tool_choice - invalidates messages cache. If your agent dynamically switches between "auto," "any," and specific tool choices, you're paying for it.
Changing disable_parallel_tool_use - invalidates messages cache. Same pattern.
Toggling images present or absent - invalidates messages cache. This one bites multimodal agents. If you sometimes attach screenshots and sometimes don't, you lose cache continuity.
Changing thinking parameters - invalidates messages cache. Toggling extended thinking on and off mid-conversation defeats caching.
The takeaway is that your request shape needs to be stable across turns of a conversation. Pick your settings up front and stick with them. Don't change tool_choice between turns. Don't toggle features on a per-request basis. If you need to vary something, do it at the conversation level, not the turn level.
For one client we ended up with a deliberately stripped-down request schema. We removed every parameter that could vary and ended up with cache hit rates above 95%. Their API bill dropped by 60%.
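The invalidation rules listed above can be encoded as a small check that compares two consecutive requests and reports which cache levels the change would wipe. This is a debugging sketch, not part of any SDK: the request is modelled as a simplified dict, and keys such as web_search, citations, and has_images are hypothetical stand-ins for however your own code represents those settings.

```python
import hashlib
import json

# Which cache levels each kind of change invalidates, per the list above.
INVALIDATES = {
    "tools": ["tools", "system", "messages"],
    "web_search": ["system", "messages"],
    "citations": ["system", "messages"],
    "tool_choice": ["messages"],
    "disable_parallel_tool_use": ["messages"],
    "has_images": ["messages"],
    "thinking": ["messages"],
}


def digest(value):
    """Hash a JSON-serialisable value; list order is preserved, so reordering counts."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def invalidated_levels(prev, curr):
    """Return the cache levels invalidated by moving from prev to curr."""
    hit = set()
    for key, levels in INVALIDATES.items():
        if digest(prev.get(key)) != digest(curr.get(key)):
            hit.update(levels)
    return [level for level in ("tools", "system", "messages") if level in hit]


turn_1 = {"tools": [{"name": "get_time"}], "tool_choice": {"type": "auto"}}
turn_2 = {"tools": [{"name": "get_time"}], "tool_choice": {"type": "any"}}
print(invalidated_levels(turn_1, turn_2))
```

Logging the result alongside each request makes it much easier to trace a drop in cache hits to the setting that caused it.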
How Tool Search Interacts With Strict Mode
If you're using strict mode (the constrained-decoding feature that guarantees tool calls match your schema), there's a useful detail. defer_loading is independent of grammar construction. The grammar is built from your full toolset regardless of which tools are deferred.
This means you get both benefits at once - strict mode's reliability and tool search's caching efficiency. You don't have to choose. The model still produces valid JSON for any tool it might call, including deferred ones, because the grammar already knows about them.
In practice this means you can build large strict-mode agents without paying for the full tool catalogue on every call. We've shipped a couple of agent automation systems that use this combination for production reliability and cost control.
Per-Tool Caching Quirks Worth Knowing
A few specific tools have caching quirks worth knowing:
Web search and web fetch - any toggle of these invalidates system and messages caches. Don't enable them conditionally. Enable them for the whole conversation or not at all.
Code execution - the container state is separate from the prompt cache. The container persists across calls but the cache works normally. Worth knowing if you're using code execution and seeing unexpected costs.
Tool search - discovered tools load as tool_reference blocks, which preserves prefix cache. This is the entire pattern we've been talking about.
Computer use - screenshot presence affects messages cache. If you alternate between turns with and without screenshots, you'll get spotty caching.
Text editor, bash, memory - all standard client tools with no special caching interactions. Use them normally.
A Practical Caching Strategy for New Agents
If you're building a new agent and want to set up caching properly from day one, here's the order we recommend:
Pin your request shape early. Decide on tool_choice, parallel tool use, thinking, and image handling at architecture time. Don't change them per request.
Stabilise your tool definitions. Write them carefully and don't tweak them in production. Treat them as a versioned contract.
Put cache_control on the last tool in your always-loaded set. Watch your cache hit rate via the API response.
If you have more than 15 to 20 tools, set up tool search with defer_loading from the start. Even if you don't need it for cost reasons yet, you'll need it for model performance reasons soon.
Monitor cache hit rates in production. The usage block in the API response tells you how many cached tokens you read (cache_read_input_tokens) versus how many you wrote (cache_creation_input_tokens). A healthy agent should be reading cached tokens at 10 times the rate it's writing them, often more.
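For the monitoring step, a minimal Python sketch of the bookkeeping might look like this. It assumes each response's usage block has been converted to a plain dict with the input_tokens, cache_read_input_tokens, and cache_creation_input_tokens fields; cache_stats is a hypothetical helper name, and the sample numbers are made up for illustration.

```python
def cache_stats(usages):
    """Summarise cache reads versus writes across the usage blocks of several responses."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    written = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    uncached = sum(u.get("input_tokens") or 0 for u in usages)
    total = read + written + uncached
    return {
        "read": read,
        "written": written,
        # None when nothing was written, to avoid dividing by zero.
        "read_to_write": read / written if written else None,
        # Share of all input tokens that were served from the cache.
        "hit_rate": read / total if total else 0.0,
    }


# Illustrative usage blocks: one cache write, then two cache reads.
sample = [
    {"input_tokens": 200, "cache_creation_input_tokens": 8000, "cache_read_input_tokens": 0},
    {"input_tokens": 300, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 8000},
    {"input_tokens": 300, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 8000},
]
print(cache_stats(sample))
```

Tracking the read-to-write ratio per conversation, rather than only in aggregate, makes it easier to spot the individual conversations where something in the request shape changed.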
The teams that get this right run their agents at a fraction of the cost of teams that don't. The teams that don't get it right usually realise it three months in when the bill hits a number their CFO wants to talk about.
If you're running into these problems on your own Claude builds, our AI agent developers and Claude training practice work through it with clients regularly. The cost optimisation alone usually pays for the engagement.
For the official Claude documentation on this, see Tool use with prompt caching.