
Programmatic Tool Calling in Claude - Why It Matters for AI Agent Performance

March 29, 2026 · 8 min read · Michael Ridland

There's a bottleneck in most AI agent architectures that people don't talk about enough. Every time an agent needs to call a tool - query a database, check an API, read a file - it has to do a full round trip back to the model. The model generates a tool call, your code executes it, the result goes back to the model, the model processes it, and maybe generates another tool call. Repeat.

For a simple task that needs one or two tool calls, this is fine. For an agent that needs to check 20 employee expense reports, query three different APIs per employee, and then summarise the results? You're looking at 60+ round trips. Each one burns tokens, adds latency, and inflates costs.

Anthropic's programmatic tool calling feature fixes this, and after spending time building with it, I think it's one of the most practically useful features for production AI agents right now.

The Problem With Traditional Tool Calling

Let me paint a concrete picture. Say you're building an agent that audits budget compliance. It needs to look up expenses for each person on a team, compare them against their budget, and flag anyone who's over.

With traditional tool calling, here's what happens:

  1. Claude reads the prompt and decides to call get_expenses for Employee 1
  2. Your code runs the query, returns the result
  3. Claude processes the result, decides to call get_expenses for Employee 2
  4. Your code runs the query, returns the result
  5. Repeat for employees 3 through 20
  6. Claude now has all the data in context and writes the summary

That's 20 round trips minimum. Each trip includes API latency, token costs for the growing context (every previous result stays in the conversation), and processing time. For 20 employees with detailed expense data, you might be pushing hundreds of thousands of tokens through the model just to collect the data before any analysis happens.

The context bloat is the sneaky part. By the time Claude has fetched all 20 expense reports, those raw results are sitting in the conversation context. The model has to process all of that on every subsequent turn, even though it only needs a summary.
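Some rough back-of-envelope arithmetic makes the scale of the problem concrete. The per-result and per-prompt token counts below are illustrative assumptions, not measured figures - real numbers depend entirely on your data:

```python
# Illustrative arithmetic only: assumes ~2,000 tokens per raw expense
# report and a ~500-token prompt. Real numbers depend on your data.
PROMPT_TOKENS = 500
TOKENS_PER_RESULT = 2_000
EMPLOYEES = 20

# Traditional flow: every turn re-processes the full growing context,
# so input tokens accumulate across all 20 round trips.
traditional_input_tokens = sum(
    PROMPT_TOKENS + TOKENS_PER_RESULT * n for n in range(EMPLOYEES)
)

# Programmatic flow: one round trip; the model only sees the prompt
# plus a short filtered summary (assume ~300 tokens).
SUMMARY_TOKENS = 300
programmatic_input_tokens = PROMPT_TOKENS + SUMMARY_TOKENS

print(traditional_input_tokens)   # prints 390000
print(programmatic_input_tokens)  # prints 800
```

Even with generous assumptions, the traditional flow processes hundreds of times more input tokens, because each round trip re-reads everything that came before it.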

How Programmatic Tool Calling Works

Programmatic tool calling lets Claude write code that calls your tools directly within a code execution container. Instead of 20 separate round trips, Claude writes a Python script that loops through all 20 employees, calls the expense API for each one, filters the results, and returns only the employees who exceeded their budget.

One round trip. The code runs in a sandbox, makes the tool calls programmatically, processes the results, and returns a concise summary. Claude's context only sees the filtered output, not the raw data from every single query.

The setup is straightforward. You enable the code execution tool and add allowed_callers to your tool definitions to specify which tools can be called from code:

tools = [
    {"type": "code_execution_20260120", "name": "code_execution"},
    {
        "name": "query_database",
        "description": "Execute a SQL query against the sales database.",
        "input_schema": {
            "type": "object",
            "properties": {
                "sql": {"type": "string", "description": "SQL query to execute"}
            },
            "required": ["sql"]
        },
        "allowed_callers": ["code_execution_20260120"]
    }
]

That allowed_callers field is the key. It tells Claude that this tool can be invoked from within the code execution environment, not just through the normal tool-calling flow. Claude can then write Python that calls query_database multiple times, aggregates the results, and returns just what matters.
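To make that concrete, here's a sketch of the kind of script Claude might generate inside the container. The `get_expenses` function and the sample figures are hypothetical stand-ins - in the real sandbox, tool calls are bridged to your actual tools rather than stubbed:

```python
# Hypothetical sketch of code Claude might write in the sandbox.
# In the real container, get_expenses would be bridged to your tool;
# here it's stubbed with sample data so the shape of the logic is clear.
def get_expenses(employee_id):
    sample = {1: 4_800, 2: 6_200, 3: 3_100, 4: 7_500}
    return sample.get(employee_id, 0)

BUDGET = 5_000
employee_ids = [1, 2, 3, 4]

# Loop over every employee, call the tool, keep only the over-budget ones.
over_budget = [
    {"employee_id": eid, "spent": spent, "over_by": spent - BUDGET}
    for eid in employee_ids
    if (spent := get_expenses(eid)) > BUDGET
]

# Only this filtered summary re-enters Claude's context.
print(over_budget)
```

The loop, the tool calls, and the filtering all happen inside the container; the model's context only ever sees the short `over_budget` list.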

Where the Performance Gains Show Up

Anthropic tested this on agentic search benchmarks - BrowseComp and DeepSearchQA - which test multi-step web research and complex information retrieval. Adding programmatic tool calling on top of basic search tools was, in their words, "the key factor that fully unlocked agent performance."

That matches what we've seen in practice. The gains compound in three areas:

Latency. Fewer round trips means faster execution. For an agent that would normally make 15-20 tool calls sequentially, you're collapsing that into one code execution. The wall-clock time improvement can be dramatic - from minutes to seconds.

Token costs. This is the big one. In the traditional approach, every tool result gets added to the context, and the model processes the entire growing context on each turn. With programmatic tool calling, the code filters and aggregates before the results hit the model's context. If 20 database queries return 5,000 rows total but only 3 employees are over budget, the model only sees those 3 employees. That's potentially a 100x reduction in tokens.

Output quality. Less noise in the context means better reasoning. When the model doesn't have to wade through thousands of irrelevant expense line items to find the three that matter, its analysis is sharper. There's less chance of hallucination or confusion from context overload.

Practical Use Cases

Here's where we're seeing programmatic tool calling make the biggest difference:

Data aggregation agents. Any workflow where an agent needs to collect data from multiple sources, compare it, and produce a summary. Financial reporting across divisions, compliance checking across teams, inventory audits across warehouses. These are all patterns where the traditional approach generates massive context bloat.

Search and research agents. An agent that needs to search across multiple databases or APIs, cross-reference results, and synthesise findings. Instead of searching one source, processing it, searching another, processing it, and so on, the agent writes code that queries all sources, deduplicates results, ranks relevance, and returns only the top findings.

Batch processing. Anything that involves doing the same operation across a list of items. Send personalised emails to 50 contacts, update records for 100 accounts, validate data across 200 rows. The code handles the loop, the model handles the intelligence for each item, and the context stays clean.

Multi-step validation. Check if a proposed change meets multiple criteria by querying different systems. Instead of round-tripping for each validation check, write code that runs all checks and returns a pass/fail summary.
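The validation pattern can be sketched in a few lines. The individual check functions below are hypothetical stand-ins for queries against different systems; the point is the shape - run everything in one pass, return only the pass/fail map:

```python
# Hypothetical validation sketch: run every check, return one summary.
# Each check stands in for a query against a different system.
def within_budget(change):
    return change["cost"] <= change["budget"]

def approver_assigned(change):
    return bool(change["approver"])

def no_schedule_conflict(change):
    return change["start_date"] not in change["blackout_dates"]

CHECKS = {
    "within_budget": within_budget,
    "approver_assigned": approver_assigned,
    "no_schedule_conflict": no_schedule_conflict,
}

def validate(change):
    # One pass over all checks; only this summary goes back to the model.
    results = {name: check(change) for name, check in CHECKS.items()}
    return {"passed": all(results.values()), "checks": results}

change = {
    "cost": 9_000, "budget": 10_000, "approver": "dana",
    "start_date": "2026-04-01", "blackout_dates": ["2026-04-15"],
}
print(validate(change))
```

One code execution replaces what would otherwise be a separate round trip per check.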

What to Watch Out For

This feature requires the code execution tool to be enabled, which means your API calls use the code execution container. There are a few practical considerations.

Not eligible for Zero Data Retention. If your organisation requires ZDR for compliance, programmatic tool calling isn't available to you right now. Data is retained according to the standard retention policy. For regulated industries, this might be a dealbreaker.

Tool design matters more. When tools are called from code, their input/output schemas need to be clean and predictable. The code Claude writes expects structured responses. If your tools return inconsistent formats or have ambiguous error handling, the code execution will struggle. Spend time on your tool schemas.

Debugging is different. When something goes wrong in a traditional tool-calling flow, you can trace each step - here's the tool call, here's the result, here's the model's next decision. With programmatic tool calling, the logic lives in generated code. You need to inspect the code execution output to understand what happened. Build in logging.
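One way to build in that logging is to wrap each tool in a decorator on your side of the bridge, so every call and result is recorded regardless of what code Claude generates. A minimal sketch, with a stubbed tool standing in for a real one:

```python
import functools
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool-calls")

def logged_tool(fn):
    """Wrap a tool so every call and result is recorded for later inspection."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        log.info("call %s args=%s kwargs=%s", fn.__name__, args, kwargs)
        result = fn(*args, **kwargs)
        log.info("result %s -> %s", fn.__name__, json.dumps(result, default=str))
        return result
    return wrapper

@logged_tool
def get_expenses(employee_id):
    # Stubbed result; a real tool would query your expense system here.
    return {"employee_id": employee_id, "total": 4_200}

get_expenses(7)
```

With every tool invocation logged, you can reconstruct what the generated code actually did even when the code itself is opaque.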

Model compatibility varies. Not all models support this feature. Check the tool reference documentation for current compatibility. As of now, it works with Claude Opus 4.6, and the feature is available through both the Claude API and Microsoft Foundry.

How We're Using It

At Team 400, we've been incorporating programmatic tool calling into our AI agent projects for clients who need agents that interact with multiple data sources. The pattern that comes up most often is the "collect, filter, summarise" workflow - gather data from several places, apply business rules to filter it down, and present actionable results.

For one client, we rebuilt a reporting agent that was making 30+ sequential API calls per report. With programmatic tool calling, the same report generates in about a third of the time and costs roughly a quarter of the tokens. The output quality is also better because the model isn't trying to reason over a massive pile of intermediate results.

We're also finding it useful for agents built on Azure AI Foundry, particularly since Claude is now available there. You can design agents that use programmatic tool calling within the Azure governance framework, which matters for enterprise clients.

Getting Started

If you're building AI agents that make more than a handful of tool calls per task, programmatic tool calling is worth adopting. The implementation pattern is:

  1. Enable the code execution tool in your API calls
  2. Add allowed_callers to tools that should be callable from code
  3. Test with prompts that require multiple tool invocations
  4. Compare latency, token usage, and output quality against your traditional approach
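Steps 1 and 2 amount to a request body like the sketch below, shown as a plain dict rather than an SDK call. The model identifier is a placeholder and the type strings mirror the tools snippet earlier in the post - verify both against the current docs before relying on them:

```python
# Sketch of the request body shape (plain dict, not an SDK call).
# Type strings mirror the tools snippet above; the model name is a
# placeholder - check the current docs for exact identifiers.
request = {
    "model": "claude-opus-4-6",  # hypothetical identifier; verify in docs
    "max_tokens": 2048,
    "tools": [
        {"type": "code_execution_20260120", "name": "code_execution"},
        {
            "name": "query_database",
            "description": "Execute a SQL query against the sales database.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "sql": {"type": "string", "description": "SQL query to execute"}
                },
                "required": ["sql"],
            },
            "allowed_callers": ["code_execution_20260120"],
        },
    ],
    "messages": [
        {"role": "user", "content": "Which employees are over budget this quarter?"}
    ],
}

# Sanity-check: the database tool is callable from the code execution container.
assert "code_execution_20260120" in request["tools"][1]["allowed_callers"]
```

From there, step 4 is just running the same prompt with and without `allowed_callers` set and comparing the usage numbers the API returns.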

The full documentation is available in the Claude API docs. Anthropic has also published a useful deep dive on advanced tool use patterns on its engineering blog that's worth reading.

If you're working on agent architectures and want help designing efficient tool-calling patterns, our AI development team has been building these systems across industries. The difference between a well-architected agent and a naive implementation is often 5-10x in cost and latency - and programmatic tool calling is one of the biggest levers available right now.