Testing Microsoft 365 Copilot Agents with the Agents Toolkit in VS Code

June 23, 2026 • 8 min read • Michael Ridland

There's a particular kind of pain that comes from building a Copilot agent entirely in the browser. You make a change, you publish, you wait, you switch tabs, you ask a question, the answer is wrong, and you start the whole cycle again. By the third hour your patience is gone and you've stopped thinking carefully about each change because the loop is too slow to think in.

That slowness is the single biggest reason agent projects drag. So when a developer on our team is doing serious work on a declarative agent, they're not living in the web builder. They're in Visual Studio Code with the Agents Toolkit, where the build-test loop is fast enough to actually reason about what you're doing. If you write code for a living, this is the environment that feels like home, and it changes how productive you can be on Copilot work.

Let me walk through how we use it, what it does well, and the bits that will catch you out.

Why a code-first loop matters for agents

Most people first meet Copilot agents through the agent builder in the browser. It's fine for getting started. You describe what you want, you point it at some knowledge, you test it in a panel. For a simple agent that one person owns, that's enough.

The trouble starts when the agent gets real. You've got a manifest with proper instructions, multiple knowledge sources, maybe an API plugin or two, and you want this thing in version control so the whole team can work on it. The browser stops being a help and starts being a bottleneck. You can't diff a change. You can't roll back cleanly. You can't see the actual files that define your agent, because they're hidden behind a UI.

The Agents Toolkit flips that around. Your agent becomes a project on disk: a manifest, instruction files, plugin definitions, all sitting in a folder you can open, edit, commit, and review like any other codebase. That alone is worth the switch. Treating an agent as code rather than a configuration you poke at in a web form is the difference between a hobby project and something you can maintain.
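Once the agent is a folder of files, ordinary tooling applies to it. As a minimal sketch of that idea, the Python check below flags missing or empty text fields in a manifest before you bother launching. The field names (`name`, `description`, `instructions`) are assumptions for illustration, not something taken from the toolkit, so check them against the manifest schema your scaffolded project actually uses.

```python
import json
from pathlib import Path

# Assumed field names for illustration; confirm against your project's manifest schema.
REQUIRED_FIELDS = ("name", "description", "instructions")


def check_manifest(manifest: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    for field_name in REQUIRED_FIELDS:
        value = manifest.get(field_name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field_name}")
    return problems


def check_manifest_file(path: Path) -> list[str]:
    """Load a manifest from disk and run the same checks on it."""
    try:
        manifest = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(manifest, dict):
        return ["manifest root is not a JSON object"]
    return check_manifest(manifest)


if __name__ == "__main__":
    # Hypothetical manifest with an empty instructions field.
    sample = {
        "name": "Leave Policy Helper",
        "description": "Answers questions about leave policy",
        "instructions": "",
    }
    for problem in check_manifest(sample):
        print(problem)
```

A check like this could run as a pre-commit step, so an empty instruction block never reaches a review.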

We push for this on most of our Copilot Studio and extensibility work, because the agents that survive contact with a real organisation are the ones built like software, not the ones assembled by clicking around.

What the Agents Toolkit actually gives you

The toolkit is an extension you install into VS Code. Once it's in, it does a few things that matter.

It scaffolds a project for you. Rather than starting from a blank manifest and looking up the schema every five minutes, you get a working agent structure with the files in the right places and sensible defaults. For anyone who has hand-written a Copilot manifest from scratch, this is a relief.

It handles the provisioning and registration plumbing. Getting an agent registered against your Microsoft 365 tenant so you can actually run it involves a fair bit of setup, and the toolkit takes most of that off your plate. You sign in, it sorts out the app registration side, and you're testing against your real tenant rather than some sandbox that behaves differently.

And the part that matters most day to day: it lets you launch your agent straight into Microsoft 365 Copilot from inside VS Code. You hit run, it provisions what it needs, and it opens the agent in the actual Copilot interface against your tenant. You're testing the real thing, with your real knowledge connections, not a simulation.

That last point is the whole game. The agent you debug here behaves the way it will behave in production because it is running in production conditions. The orchestrator is the real orchestrator. The knowledge grounding is the real grounding. What you see is what your users will get.

How we run the loop

Here's roughly how a session goes when one of our developers is working on an agent.

Open the project in VS Code. Edit the instructions or the manifest directly in the file, where you can see the whole thing at once rather than through a cramped web panel. Hit run. The toolkit provisions and launches the agent into Copilot. Ask it the questions it's supposed to handle, and a few it shouldn't. Watch what comes back. Go back to the file, change a sentence of instruction, run again.

Because the files are right there, you can be surgical. You're not hunting through a UI for the field you need. You change the exact line, you see the effect, you move on. When something works, you commit it, so you've got a record of what changed and why. When something breaks, you can diff against the last good version and see precisely what you touched.
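Git does this diffing for you day to day, but the idea is easy to see in a few lines. The sketch below uses Python's standard `difflib` to compare two versions of an instruction file and keep only the lines that changed. The instruction text and the function names are made up for the example.

```python
import difflib


def instruction_diff(old: str, new: str) -> list[str]:
    """Return unified-diff lines between two versions of an instruction file."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="instructions (last good)",
            tofile="instructions (current)",
            lineterm="",
        )
    )


def changed_lines(old: str, new: str) -> list[str]:
    """Keep only added and removed lines, dropping file headers and context."""
    body = instruction_diff(old, new)[2:]  # the first two lines are the file headers
    return [line for line in body if line.startswith(("+", "-"))]


if __name__ == "__main__":
    last_good = "You answer HR questions.\nAlways cite the policy.\n"
    current = "You answer HR questions.\nAlways cite the policy document by name.\n"
    for line in changed_lines(last_good, current):
        print(line)
```

When an agent starts answering differently, a view like this narrows the search to the exact sentence you touched.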

The other thing this unlocks is proper collaboration. Two developers can work on the same agent through Git, review each other's changes, and merge them, instead of taking turns in a shared browser session and overwriting each other's work. For any agent that more than one person owns, this is not optional. We've inherited agents that were built browser-only by a team that had no idea who changed what, and untangling them is genuinely miserable.

This is the kind of engineering discipline we bring to building custom AI agents generally. The model is only part of the work. The rest is treating the thing like software you'll have to live with.

Where it's still rough

I'm not going to oversell it. The toolkit is good, but there are sharp edges.

The first run is the slow one. Provisioning an agent against your tenant the first time takes a while and occasionally fails in ways that aren't obvious. App registration permissions, tenant policies, the odd consent prompt that didn't appear. When it works it works, but the initial setup can eat an afternoon if your tenant is locked down, which most enterprise tenants are. Budget for that. Don't promise a client a working agent by lunchtime on day one if you've never provisioned in their environment before.

The second thing is that the error messages aren't always friendly. When something in the provisioning or launch step goes wrong, you sometimes get a message that says it failed without saying why in plain terms. You learn the common causes with experience, but a newcomer can lose an hour to something that turns out to be a single missing permission. This is where having done it a few dozen times earns its keep.

And like all agent testing, the non-determinism is still there. Launching from VS Code doesn't make the model deterministic. The same question can come back differently between runs. A fast loop helps you ask things many times quickly, which is exactly what you want, but don't mistake one good answer for a passing test. Ask it five times. Look for consistency, not a single lucky response.
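To make "ask it five times" concrete, here is a rough sketch that scores how often repeated answers contain the phrases you expect. The `ask` callable is hypothetical: it stands in for however you collect answers, even pasting them in by hand. The `scripted_agent` helper just replays canned answers so the example runs on its own.

```python
from itertools import cycle
from typing import Callable, Iterable


def consistency_rate(
    ask: Callable[[str], str],
    question: str,
    must_contain: list[str],
    runs: int = 5,
) -> float:
    """Ask the same question `runs` times and return the fraction of answers
    that contain every expected phrase (case-insensitive)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    passes = 0
    for _ in range(runs):
        answer = ask(question).lower()
        if all(phrase.lower() in answer for phrase in must_contain):
            passes += 1
    return passes / runs


def scripted_agent(answers: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real agent: replays canned answers in order."""
    replay = cycle(answers)
    return lambda _question: next(replay)


if __name__ == "__main__":
    ask = scripted_agent([
        "You get 20 days of annual leave.",
        "Annual leave is 20 days per year.",
        "Please check with HR.",
        "20 days.",
        "Staff receive 20 days.",
    ])
    rate = consistency_rate(ask, "How much annual leave do I get?", ["20 days"])
    print(f"consistent in {rate:.0%} of runs")
```

Substring matching is a crude check, but it is enough to separate an answer that is reliably right from one that got lucky once.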

One more honest note: the toolkit is a build-loop tool, not a scale-testing tool. It's brilliant for tight iteration while you're developing. It's not how you prove an agent handles two hundred different intents reliably. For that you want structured evaluation against datasets of test cases, which is a separate discipline and a separate set of tools. The toolkit gets you to "this works when I try it." Evaluation gets you to "this works for everyone, and still works after the next change."
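For a sense of what that separate discipline looks like, here is a deliberately small sketch of a dataset-driven check, not a substitute for a real evaluation tool. Everything in it is an assumption for illustration: the `EvalCase` shape, the phrase-matching rule, and the `ask` callable that stands in for however you get answers out of your agent.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    """One test case: a question plus phrases the answer must or must not contain."""
    question: str
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def case_passes(answer: str, case: EvalCase) -> bool:
    """Case-insensitive phrase check against a single answer."""
    text = answer.lower()
    has_required = all(p.lower() in text for p in case.must_contain)
    has_forbidden = any(p.lower() in text for p in case.must_not_contain)
    return has_required and not has_forbidden


def evaluate(
    ask: Callable[[str], str], cases: list[EvalCase]
) -> tuple[float, list[str]]:
    """Return (pass rate, questions that failed) for a dataset of cases."""
    if not cases:
        raise ValueError("no test cases supplied")
    failed = [c.question for c in cases if not case_passes(ask(c.question), c)]
    return (len(cases) - len(failed)) / len(cases), failed


if __name__ == "__main__":
    # Canned answers standing in for a real agent.
    canned = {
        "How much annual leave do I get?": "You get 20 days.",
        "What is my colleague's salary?": "Their salary is 90,000.",
    }
    cases = [
        EvalCase("How much annual leave do I get?", must_contain=["20 days"]),
        EvalCase("What is my colleague's salary?", must_not_contain=["salary is"]),
    ]
    rate, failed = evaluate(lambda q: canned[q], cases)
    print(f"pass rate {rate:.0%}, failed: {failed}")
```

The point of keeping cases as data is that the same file can be rerun after every change, which is what turns "works when I try it" into something you can track over time.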

Toolkit versus the browser: which to use when

People ask which they should use, and the honest answer is both, at different stages.

If you're sketching an idea, exploring whether an agent is even the right tool, or building something small that one person owns, the browser builder is quicker to start with. There's no setup tax. You're testing in minutes.

The moment the agent gets serious (multiple knowledge sources, plugins, more than one person working on it, anything that needs to live in version control), move to the toolkit. The upfront cost of provisioning pays for itself quickly once you're iterating properly and want a fast, repeatable, reviewable loop. The agents we ship to production are almost all built this way, because production agents need the engineering rigour the file-based approach gives you.

For organisations rolling Copilot out across a lot of staff, this distinction matters more than it sounds. The difference between an agent that someone clicked together and one that was built, version-controlled, and tested properly shows up the first week real people start using it. We build that rigour into our Microsoft AI consulting work rather than leaving it to whoever happened to assemble the agent.

The takeaways

If you're doing real development on Copilot agents, get into VS Code with the Agents Toolkit. The fast local loop is worth the setup cost, and treating your agent as files you can commit is the single best habit you can build.

Expect the first provisioning run to be the painful one, and don't schedule anything tight around it, especially in a locked-down tenant you've never worked in before.

Use the speed of the loop to ask questions many times, not to skip thinking. Fast iteration is only an advantage if you're still paying attention to what the agent does.

And remember the toolkit gets you to working, not to proven. When the agent is past the experiment stage and headed for real users, graduate to structured evaluation. The toolkit is how you build. Evaluation is how you ship something people depend on.

If you want a hand setting up a proper agent development workflow, or you've got a browser-built agent that's become a tangle and needs putting on a sane footing, that's everyday work for us. Get in touch and we can talk through where your Copilot work is and what it'd take to make it maintainable.

Reference: Test agents using Agents Toolkit, Microsoft 365 Copilot extensibility documentation.