Building a Tool-Using Claude Agent - A Consulting Walkthrough

April 26, 2026•9 min read•Michael Ridland

The leap from "a chatbot that answers questions" to "an agent that does things" is bigger than most people think. The vendor demos make it look easy. The reality, when you start building these for actual Australian businesses, is that the conceptual model takes a week to fully click and another month to feel comfortable with at scale.

Anthropic recently published a clean tutorial on building a tool-using agent. It is a good reference for engineers and I encourage you to read it. What I want to do here is share what we have learned building these agents for real clients, and where the gap between the tutorial and production usually shows up.

The core loop in plain English

A tool-using agent has a deceptively simple control flow. You give the model a user message and a list of tools it is allowed to call. The model responds with either a final answer or a request to call a tool. If it requests a tool, your code executes that tool, sends the result back, and the model continues. This continues until the model decides it has finished.

That is it. Everything else - memory, multi-step planning, parallel tool calls, error recovery - is a refinement of this basic loop.

The reason this matters is that once you understand the loop, you stop thinking of agents as magic and start thinking of them as a programming pattern. The model is making decisions about what to call next. Your job is to give it useful tools, useful descriptions, and a sensible environment to operate in.

What tool descriptions actually do

The single biggest lever in agent quality, by a wide margin, is the quality of your tool descriptions. This is the thing that surprises engineers most. You can have great prompts and bad tools and your agent will be useless. You can have mediocre prompts and great tools and your agent will be quite good.

A tool description is not documentation. It is instructions to a model. The model reads the description and decides whether the tool is relevant to the current task. If the description is vague, it will not be called when it should be. If the description is misleading, it will be called when it should not be.

We had an early project where we built an agent for a logistics client. One of the tools was supposed to return shipment status for a tracking number. The original description said something like "Get shipment information." The agent kept calling it for irrelevant questions, like asking general questions about delivery times. We rewrote the description to say "Returns the current status, location, and estimated delivery time for a specific shipment given its tracking number. Only use this when the user has provided a tracking number or is asking about a specific shipment they have referenced earlier." Behaviour improved immediately.

Take time on this. We probably spend 30 to 40 percent of our development time on tool descriptions and schemas when building agents for clients. It feels like over-investment until you see the quality difference. Our AI agent developers team has internal standards for this that we apply on every build.

The schema design problem

Tool schemas are the other major source of agent quality issues. The schema defines what arguments the model should pass when calling the tool. If the schema is loose, the model will pass loose arguments. If the schema is strict and well-designed, the model will pass clean structured data.

The Anthropic tutorial uses a calendar event example with nested objects, arrays, and optional fields. This is deliberate. Real tools have realistic input shapes, not just a single string. If you only test your agent with toy single-string tools, you are not testing the thing that matters.

A few specific patterns we use:

Use enums for constrained values. If a field should only contain "daily", "weekly", or "monthly", say so in the schema. Do not accept any string. The model will respect the constraint.

Make required fields actually required. If your tool needs a customer ID to function, mark it required. The model will then ask the user for it rather than guessing.

Add format hints. "format": "email" or "format": "date-time" tells the model what shape to produce. It is not enforced by JSON Schema in the usual sense, but the model uses it as guidance.

Avoid catch-all fields. A field called "additional_info" that accepts arbitrary JSON is a recipe for inconsistent inputs. Be specific about what fields exist.

We do a lot of this kind of design work in our enterprise AI agents practice, where the schemas often have to match existing enterprise systems. Getting this right is one of the things that separates a demo from a production system.

Why parallel tool calls are tricky

The newer versions of the Claude API support parallel tool calls. The model can decide to call multiple tools at once if they are independent. This is great for performance. It is also a frequent source of bugs.

The thing that breaks teams is assuming that tools the model calls in parallel are actually independent. In practice, many tools have hidden coupling. Maybe they both write to the same database table. Maybe one rate-limits the other. Maybe the results need to be combined in a specific order to make sense.

We usually start agents in serial mode for the first iteration, then enable parallel tool use only after we have characterised which tools can safely run in parallel. The Anthropic tutorial actually shows disable_parallel_tool_use: true in its first example, which is the right default. Enable parallelism deliberately, not accidentally.

The agentic loop, by hand and by SDK

The tutorial walks through building the loop by hand first, then replacing it with the SDK abstraction. This is a good pedagogical choice. You should do this once.

Once you understand the loop, the SDK saves you significant effort. It handles message accumulation, tool routing, error handling, retries, and several other concerns that are easy to get wrong. We use it for almost every production agent we build through our Claude consultants practice and our broader AI agent builders engagements.

That said, there are scenarios where the SDK abstraction gets in the way. If you need very specific control over how the loop terminates, or you need to inject custom logic between tool calls (for instance, to log specific events, run policy checks, or apply rate limiting), you sometimes need to drop back to a hand-rolled loop. This is not a knock on the SDK. It is recognising that production systems have constraints the SDK cannot anticipate.

A rule of thumb we use: start with the SDK. Move to a custom loop only when you have a specific reason and you can articulate it clearly. "I want more control" is not a good reason. "I need to enforce a per-tenant rate limit on outbound tool calls" is.

The thing that always surprises new teams

Tools are not just for actions. Tools are also for retrieval, validation, computation, and structured output.

The most common pattern in real agents is not "call the API to do the thing." It is "look up information, then make a decision, then call another tool to do the thing." Retrieval tools - search the database, query the knowledge base, fetch the document - are often the majority of an agent's toolbelt.

This matters because teams new to agents tend to design tool sets that are all action verbs. Create. Send. Update. Delete. The agent then has no way to know what state the world is in before acting, so it asks the user a lot of questions or makes assumptions. The agents that work well have a balanced toolbelt with retrieval tools alongside action tools.

We work through this when designing agent architectures for clients in our AI agency engagements. The toolbelt design conversation is one of the longer parts of the engagement and arguably one of the most important.

Where the demos lie

The example agents in vendor tutorials are always tidy. The schemas are clean. The tools are well-behaved. The model makes sensible decisions. Real agents are messier.

A few things that come up in production that the tutorial does not address:

Tool errors. Real tools fail. APIs go down. Databases time out. Permissions get denied. Your agent needs to handle this. The tool_result block can contain an error indication, which the model will respect. But you need to actually return useful error messages, not just "something went wrong."

Cost control. A tool-using agent can rack up significant cost if it gets into a loop. We have seen agents make 40 or 50 tool calls trying to figure out something a smarter prompt could have answered in one. You need monitoring, you need limits, and you need to design the agent to know when to give up.

Latency. Each tool call adds round-trip time. An agent that makes 8 tool calls before answering takes longer than an agent that makes 2. For interactive use cases, this matters. Sometimes the right answer is to give the agent fewer but more powerful tools rather than many small ones.

Conversation state. What does the agent remember between sessions? How do you give it context about a user without retransmitting everything? These are real engineering problems and the SDK only solves them partially.

The AI agent builders and forward deployed engineers work we do is mostly about these production concerns. The "make it work in the tutorial" part is the easy part. The "make it work in production for 5000 users across three time zones" part is where the real engineering happens.

What to actually do if you are starting

If you are an engineer reading this and you have not built a tool-using agent yet, the practical path is roughly:

Build the Ring 1 example from the Anthropic tutorial. Type it out. Run it. Understand it.

Build the agentic loop by hand for a slightly more complex use case. Pick something boring and useful, like a meeting room booking agent or an expense categorisation agent. Make it work end-to-end.

Replace your hand-rolled loop with the SDK. Notice what gets easier.

Now think hard about your tool descriptions and schemas. Iterate. Run the agent on real scenarios and watch where it makes wrong choices. Almost all of those wrong choices trace back to ambiguous descriptions or sloppy schemas.

After that, you are ready to think about production. Cost, latency, monitoring, error handling. None of this is glamorous and all of it is necessary.

If you are a business leader rather than an engineer, the takeaway is simpler. Tool-using agents are real and they work. They are also harder to build well than the demos suggest. Plan for several iterations between the first demo and a system you actually trust in production. We help clients with this through Business AI and the agentic automations practice.

Reference: Anthropic's tutorial on building a tool-using agent is the official walkthrough and worth reading in full.