
How to Test Microsoft 365 Copilot Agents in Copilot Studio Before You Ship Them

June 22, 2026 · 8 min read · Michael Ridland

A client in Brisbane shipped a Copilot agent to about 200 staff last year without really testing it. It looked fine in the builder. The instructions read well. Then on day one, half the questions came back with the agent confidently citing a SharePoint document that had been archived eighteen months earlier. The other half got polite refusals because the agent had no idea which knowledge source to reach for.

None of that was visible until real people started asking real questions. And that is exactly the gap the test experience in Copilot Studio is meant to close. If you are building declarative agents for Microsoft 365 Copilot, the testing step is the bit most teams skip and the bit that bites them hardest.

So let me walk through how we actually test these things, what the tooling does well, and where you still need to do the work yourself.

Why testing Copilot agents is different

If you have built normal software, you are used to deterministic behaviour. Same input, same output. You write a test, it passes or fails, you move on.

Agents do not work like that. The same question phrased two slightly different ways can produce two different answers, pull from two different knowledge sources, or trigger two different actions. The orchestrator behind Microsoft 365 Copilot decides at runtime what to do with the user's request, and that decision depends on your instructions, the knowledge you have wired in, and the way the underlying model interprets all of it.

This means you cannot test a Copilot agent the way you test an API. You are not checking "does function X return value Y." You are checking "does this agent behave sensibly across the messy range of things people will actually type." That is a judgement call, not a pass/fail assertion, and it is why the test loop matters so much.

We see this constantly in our Copilot Studio consulting work. The agents that fail in production are almost never the ones that broke a rule. They are the ones nobody pushed on before launch.

The test panel in Copilot Studio

When you open your agent in Copilot Studio, there is a test panel sitting right next to the configuration. This is your first line of defence and you should be living in it while you build.

The basic loop is simple. You change an instruction or add a knowledge source on the left, then you ask the agent something on the right and watch what it does. The feedback is immediate, so you are not waiting on a publish cycle to find out whether your change helped or made things worse.

What makes this genuinely useful is that you can see more than just the final answer. You can watch which knowledge sources the agent reached for, whether it triggered an action or plugin, and how it strung the response together. When an answer comes back wrong, that trace tells you why. Was it a bad instruction? Did it pick the wrong source? Did it not have access to the data it needed? You learn a lot more from the path than the destination.

The thing I like about testing inside Copilot Studio specifically, as opposed to the lighter-weight builder, is that you are closer to the real orchestration behaviour. The agent gets exercised against the actual knowledge connections and actions you have configured, so what you see in the panel is much closer to what your users will get.

What we actually test for

Over enough projects you build a mental checklist. Here is roughly what we run an agent through before anyone signs off.

The questions it is supposed to answer. Obvious, but worth being deliberate about. Write down the ten or fifteen things this agent exists to handle, then ask all of them. Not once. Phrase each one a few different ways, because your users will. "What's our leave policy" and "how many days off do I get" should land in the same place.

The questions it should refuse. An HR policy agent should not be answering questions about the company's financial results. Test the boundaries. A surprising number of agents happily wander outside their lane because the instructions never told them not to, and you only find out when someone screenshots an awkward answer.

The grounding. When the agent cites a document, click through. Is it the right document? Is it current? This is where that Brisbane client came unstuck. The agent was technically working, it was just pointing at stale content. The test panel shows you the citations, so use them.

The empty and the ambiguous. What happens when someone types "help"? Or asks something half-formed? A good agent asks a clarifying question. A bad one guesses, and guesses badly. These vague inputs are where agents reveal whether the instructions have any real backbone.
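If you want that checklist in a form you can hand to a colleague, here is a rough sketch of how it could be written down as data. This is not anything Copilot Studio provides. The `Scenario` and `Observation` types, the `problems` helper, the file name "Leave Policy.docx", the wording of the out-of-scope question, and the one-year staleness threshold are all assumptions made up for illustration. The idea is that a tester records what the test panel showed for each phrasing, and the helper lists what was wrong with it.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Scenario:
    """One behaviour to check, phrased the several ways users might type it."""
    name: str
    phrasings: list[str]
    expected: str  # "answer", "refuse", or "clarify"
    allowed_sources: set[str] = field(default_factory=set)


@dataclass
class Observation:
    """What a tester recorded from the test panel for one phrasing (hypothetical shape)."""
    kind: str  # "answer", "refuse", or "clarify"
    cited: dict[str, date] = field(default_factory=dict)  # source title -> last modified


def problems(scenario: Scenario, seen: Observation, today: date,
             max_age_days: int = 365) -> list[str]:
    """List everything wrong with one observed response."""
    found = []
    if seen.kind != scenario.expected:
        found.append(f"expected {scenario.expected}, got {seen.kind}")
    for source, modified in seen.cited.items():
        # Only police sources when the scenario names the ones it expects.
        if scenario.allowed_sources and source not in scenario.allowed_sources:
            found.append(f"unexpected source: {source}")
        if (today - modified).days > max_age_days:
            found.append(f"stale source: {source}")
    return found


# Illustrative suite: one in-scope question, one refusal, one vague input.
SUITE = [
    Scenario("leave policy",
             ["What's our leave policy", "how many days off do I get"],
             expected="answer", allowed_sources={"Leave Policy.docx"}),
    Scenario("out of scope", ["What were last quarter's financial results?"],
             expected="refuse"),
    Scenario("vague input", ["help"], expected="clarify"),
]
```

Writing scenarios down like this forces the team to decide up front what "right" looks like for each one, including which documents the agent is allowed to lean on.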

Where the tooling is still rough

I am not going to pretend the test experience is perfect, because it is not.

The biggest limitation is that manual testing does not scale. Sitting in the panel typing questions is fine when you have fifteen scenarios. It falls apart when you have a serious agent that needs to handle hundreds of intents across multiple knowledge sources. You cannot manually re-test all of that every time you tweak an instruction, and small instruction changes can have surprisingly wide effects.

Microsoft has been building out evaluation tooling to deal with this, including a way to run agents against datasets of test cases and score the results, which is a much better fit once your agent gets past the toy stage. If you are serious about this, that is the direction to head. The test panel is for the build loop. Structured evaluations are for proving the thing actually works at scale and for catching regressions when you change something.
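To make the idea of dataset-based evaluation concrete, here is a minimal sketch of the general pattern. It is not Microsoft's evaluation tooling and does not call any Copilot Studio API. The `score_dataset` function, the `must_mention` field, and the `fake_agent` stand-in are all hypothetical, and keyword matching is a deliberately crude scoring rule chosen to keep the example short.

```python
from typing import Callable


def score_dataset(cases: list[dict], ask_agent: Callable[[str], str]) -> dict:
    """Run every case through the agent and score it by required keywords."""
    failures = []
    for case in cases:
        reply = ask_agent(case["question"]).lower()
        missing = [kw for kw in case["must_mention"] if kw.lower() not in reply]
        if missing:
            failures.append({"question": case["question"], "missing": missing})
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}


# Stand-in for a real agent call so the sketch runs on its own (assumption).
def fake_agent(question: str) -> str:
    canned = {
        "What's our leave policy": "Full-time staff accrue annual leave each year.",
    }
    return canned.get(question, "Sorry, I can't help with that.")


CASES = [
    {"question": "What's our leave policy", "must_mention": ["annual leave"]},
    {"question": "how many days off do I get", "must_mention": ["annual leave"]},
]

report = score_dataset(CASES, fake_agent)
print(report["pass_rate"])
for failure in report["failures"]:
    print(failure["question"], "is missing", failure["missing"])
```

In this toy run the second phrasing fails because the stand-in agent only recognises one wording, which is exactly the kind of gap a dataset surfaces and a handful of manual questions can miss.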

The other rough edge is reproducibility. Because responses vary, a scenario that passes once might wobble the next time. You learn to ask the same question several times rather than trusting a single good answer. This trips up people coming from traditional QA, who expect a green tick to mean something permanent.
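One way to put a number on that wobble, sketched under some assumptions: ask the same question several times, bucket each reply, and report how often the most common outcome occurred. The `consistency` function, the `classify` rule, and the `flaky_agent` stand-in below are hypothetical. In practice the replies would come from whatever route you use to query your agent, and the classification might well be a human judgement.

```python
import itertools
from collections import Counter
from typing import Callable


def consistency(question: str, ask_agent: Callable[[str], str],
                classify: Callable[[str], str], runs: int = 5) -> tuple[str, float]:
    """Ask the same question `runs` times; return the commonest outcome and its share."""
    outcomes = Counter(classify(ask_agent(question)) for _ in range(runs))
    label, count = outcomes.most_common(1)[0]
    return label, count / runs


# Stand-in agent that gives a different reply every third time (assumption).
_replies = itertools.cycle([
    "Annual leave is covered in the leave policy.",
    "Annual leave is covered in the leave policy.",
    "I'm not sure which document covers that.",
])


def flaky_agent(question: str) -> str:
    return next(_replies)


def classify(reply: str) -> str:
    # Crude bucketing rule, purely for illustration.
    return "answer" if "annual leave" in reply.lower() else "other"


label, share = consistency("What's our leave policy", flaky_agent, classify, runs=6)
print(label, round(share, 2))
```

A share well below 1.0 on a question the agent is supposed to own is a signal to revisit the instructions before trusting any single good answer.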

And honestly, the trace view, while good, still asks you to interpret a fair bit yourself. It will show you the agent reached for the wrong knowledge source, but working out whether that is an instruction problem, a knowledge description problem, or just the model being the model takes experience. This is the part of the work where someone who has debugged a few dozen of these earns their keep.

How this fits a real delivery process

Testing is not a phase that happens at the end. On our projects it runs the whole way through, and it shapes how we build custom AI agents in the first place.

Early on, testing is exploratory. You are building instructions and immediately checking whether they do what you meant. The test panel is open the entire time. You write a sentence of instruction, you test it, you refine it. This tight loop is where most of the quality actually comes from.

As the agent matures, testing gets more structured. You move from "let me check this works" to "let me prove this still works after every change." That is when datasets and evaluation scoring come in, and when you start treating agent quality like something you can measure rather than something you eyeball.
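"Prove this still works after every change" can be as simple as comparing per-scenario pass rates before and after an edit. The sketch below assumes you already have those numbers from some evaluation run. The `regressions` function, the scenario names, and the 0.1 tolerance are illustrative choices, not part of any Microsoft tooling.

```python
def regressions(baseline: dict[str, float], current: dict[str, float],
                tolerance: float = 0.1) -> list[str]:
    """Scenarios whose pass rate fell by more than `tolerance`, or that went missing."""
    flagged = []
    for name, before in sorted(baseline.items()):
        after = current.get(name)
        # A scenario absent from the new run is treated as a regression too.
        if after is None or before - after > tolerance:
            flagged.append(name)
    return flagged


# Hypothetical pass rates from two evaluation runs.
before_change = {"leave policy": 1.0, "out of scope": 0.9, "vague input": 0.8}
after_change = {"leave policy": 0.6, "out of scope": 0.9}

print(regressions(before_change, after_change))
```

The tolerance is there because responses vary from run to run, so a small dip should not be treated the same as a scenario that has genuinely broken.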

Before any agent goes live, we run a proper user-acceptance pass with the people who will actually use it. Not the project team, who know the magic words, but the staff who will type whatever comes into their head. This catches the gap between "works for the person who built it" and "works for everyone," which is usually wider than anyone expects.

For organisations rolling Copilot out across a lot of staff, this discipline matters even more, and it is something we build into our Microsoft AI consulting engagements rather than leave to chance.

A few honest takeaways

Test early and test often. The cost of finding a problem in the test panel is a few minutes. The cost of finding it after 200 people have lost trust in your agent is a lot more than a few minutes.

Do not trust a single good answer. Variability is real. Ask things several ways and look for consistency, not just correctness.

Watch the citations, not just the words. An agent that sounds right while pointing at the wrong source is more dangerous than one that obviously fails, because nobody questions it.

And once your agent is past the experiment stage, graduate from manual testing to structured evaluation. Typing questions by hand is a fine way to build, but it is not a serious way to ship something hundreds of people depend on.

If you want a hand getting a Copilot agent properly tested and ready for production, that is the kind of work we do every week. Get in touch and we can talk through where your agents are and what it would take to trust them in front of real users.

Reference: Test agents using Copilot Studio, Microsoft 365 Copilot extensibility documentation.