How to Optimise Content Retrieval in Microsoft 365 Copilot Extensibility
Most of the Copilot projects that land on my desk in a bad state have the same root cause. The agent is fine. The instructions are fine. The licences are paid for. But the answers are vague, or wrong, or they quote a document from 2023 when the real answer is sitting in a SharePoint library that nobody connected properly. The model is doing its job. The retrieval layer is letting it down.
This is the part of Copilot extensibility that gets the least attention and causes the most pain. When you build a declarative agent or wire up a Copilot connector, you are effectively building a search engine that an LLM reads from. If that search returns rubbish, the model writes rubbish back, very confidently. Garbage in, confident garbage out.
So let me walk through how content retrieval actually works inside Microsoft 365 Copilot, and the things we do on real engagements to make it return the right material. This is based on the Microsoft guidance for optimising content retrieval, plus a few years of cleaning up retrieval problems for Australian organisations.
What "retrieval" means here, in plain terms
When a user asks Copilot a question, the orchestrator does not just hand the prompt to the model and hope. It first fetches relevant content - from Microsoft Graph, from connected external systems, from whatever knowledge sources your agent is grounded on. That fetched content gets stuffed into the model's context window alongside the user's question. Only then does the model write an answer.
That fetch step is retrieval. And the quality of what comes back is governed by three things: how your content is indexed, how it is described with metadata, and how well the user's question maps to the words in your documents. Get those right and Copilot feels like magic. Get them wrong and it feels like a slightly drunk intern who skim-read one wrong file.
The thing people miss is that Copilot is doing semantic search, not keyword matching from 2008. It understands that "annual leave" and "holiday entitlement" are related. But semantic search is not telepathy. If your HR policy never uses either phrase and instead says "absence provisions," retrieval will struggle, and no amount of clever prompt instructions will save it.
Metadata is doing more work than your content
If I had to pick one lever that moves retrieval quality the most, it is metadata. Titles, descriptions, modified dates, authors, content types. The orchestrator leans heavily on these when it decides what is relevant and what is stale.
We had a professional services client whose Copilot kept surfacing superseded contract templates. The current templates and the old ones lived in the same library, all with near-identical body text. Semantically they were indistinguishable. The fix had nothing to do with Copilot itself. We added proper metadata - a status field, a clean version label, a last-reviewed date - and configured the connector to weight recency. Suddenly Copilot stopped quoting the 2022 version. The model never changed. The signal it was reading changed.
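To make that concrete, here is a minimal sketch of why the metadata mattered. The records, field names, and dates are made up for illustration, and the selection function is not how Copilot ranks internally. It only shows that two items with identical text become distinguishable once a status and a review date exist.

```python
from datetime import date

# Made-up records: identical titles and body text, different metadata.
TEMPLATES = [
    {"title": "Services agreement template", "status": "superseded",
     "version": "2022", "last_reviewed": date(2022, 5, 1)},
    {"title": "Services agreement template", "status": "current",
     "version": "2024", "last_reviewed": date(2024, 5, 1)},
]


def pick_template(candidates):
    """Prefer items marked current, then the most recently reviewed."""
    current = [c for c in candidates if c["status"] == "current"]
    pool = current or candidates  # fall back if nothing is marked current
    return max(pool, key=lambda c: c["last_reviewed"])
```

Without the status and last_reviewed fields there is nothing for any ranking step, human or machine, to separate the two records on.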
For Copilot connectors specifically, the schema you define matters enormously. Mark fields as queryable, searchable, and retrievable deliberately rather than ticking every box. A field marked searchable is full-text indexed, so a match on its contents can surface the item. A field marked queryable can be targeted by name in a query or filter. A field marked retrievable comes back in results for the model to read. If you make everything searchable, you dilute the ranking signal and your most important fields stop standing out. Be picky.
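As a rough sketch of what "being picky" looks like, here is a schema payload written as a Python dictionary. The attribute names (isSearchable, isQueryable, isRetrievable, labels) follow the Microsoft Graph connector schema as I understand it. The property names and the choices made for each one are assumptions for a hypothetical contract-template connector, so check them against the current documentation before using them.

```python
# Hypothetical schema for a contract-template connector.
# Only the fields users search by are searchable. Filter fields are
# queryable. Fields the model should read are retrievable.
SCHEMA = {
    "baseType": "microsoft.graph.externalItem",
    "properties": [
        {"name": "title", "type": "String", "isSearchable": True,
         "isQueryable": True, "isRetrievable": True, "labels": ["title"]},
        {"name": "summary", "type": "String", "isSearchable": True,
         "isRetrievable": True},
        {"name": "status", "type": "String", "isQueryable": True,
         "isRetrievable": True},
        {"name": "lastReviewed", "type": "DateTime", "isQueryable": True,
         "isRetrievable": True},
        # Internal bookkeeping: stored, but neither searched nor returned.
        {"name": "internalRef", "type": "String"},
    ],
}


def flagged(schema, flag):
    """Return the names of properties that have the given flag set."""
    return [p["name"] for p in schema["properties"] if p.get(flag, False)]
```

Calling flagged(SCHEMA, "isSearchable") gives a quick review list. If that list is nearly as long as the whole schema, the ranking signal is probably being diluted.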
One honest warning: setting up connector schema properly is fiddly and the tooling is not friendly. The first connector your team builds will take three times longer than you expect. Budget for that. This is exactly the kind of unglamorous setup work our Microsoft AI consultants end up doing on most Copilot rollouts, because it is the difference between a demo that wows the executive team and a tool people actually trust six months later.
Keep the index honest about freshness
Stale content is the silent killer. Copilot does not know that a document is out of date unless something tells it. If your connector does a full crawl once and then never updates, you are serving frozen answers against a moving business.
Set up incremental crawls so changed items get re-indexed promptly. For external connectors, this means your connector needs to track changes and push updates, not just dump a snapshot. For SharePoint and Graph content, the indexing is handled for you, but permissions and labelling still affect what comes back.
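The change-tracking idea can be sketched in a few lines. This assumes your source system exposes a stable ID and a modified timestamp per item, and that you keep a record of what you last indexed. The names here are hypothetical, and a real connector would then push the resulting upserts and deletes through the connector API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceItem:
    item_id: str
    modified: str  # ISO 8601 timestamp reported by the source system


def plan_incremental_crawl(indexed: dict[str, str], source: list[SourceItem]):
    """Compare the last indexed state with the source and plan the delta.

    indexed maps item_id to the modified timestamp seen at last index time.
    Returns (upserts, deletes) as sorted lists of item IDs.
    """
    current = {item.item_id: item.modified for item in source}
    # New items and items whose timestamp changed need re-indexing.
    upserts = sorted(i for i, m in current.items() if indexed.get(i) != m)
    # Items that vanished from the source must be removed from the index.
    deletes = sorted(i for i in indexed if i not in current)
    return upserts, deletes
```

The delete list is the part people forget. An item removed from the source but left in the index is exactly the kind of frozen answer described above.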
We generally tell clients to think about content the way they think about a physical filing room. If half the folders are unlabelled, three years out of date, and duplicated across four cabinets, hiring a brilliant new assistant does not fix it. Copilot is that assistant. It can only retrieve what you have made findable.
Access control changes the answers, not just the security
Here is something that surprises people. Copilot respects every permission the user already has. It will never show someone content they could not otherwise open. That is the correct behaviour and a genuine strength of the Microsoft approach to grounding.
But it has a sneaky consequence for retrieval quality. Two different users can ask Copilot the exact same question and get different answers, because they have access to different content. If your finance lead gets a great answer and your project manager gets a vague one, the model is not broken. The project manager simply cannot see the source document, so retrieval came back thinner.
When you are testing an agent, test it as several different user personas, not just as the admin who can see everything. The admin always gets the best answers, which is precisely why admin testing gives you a false sense of how good the thing is. I have watched a flagship internal Copilot sail through admin testing and then faceplant on day one because real staff did not have access to half the grounding content.
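One way to catch this before launch is to check, per persona, whether the grounding documents an agent depends on are even visible. The sketch below simulates that with made-up documents, groups, and personas. It is a planning aid under those assumptions, not a call into any Microsoft permission API.

```python
# Made-up access map: document name -> groups allowed to open it.
DOCS = {
    "budget-2025.xlsx": {"finance"},
    "project-handbook.docx": {"finance", "projects"},
}

# Made-up test personas: persona name -> group memberships.
PERSONAS = {
    "finance_lead": {"finance"},
    "project_manager": {"projects"},
}


def visible_sources(groups, docs):
    """Documents a persona can open, given its group memberships."""
    return sorted(name for name, allowed in docs.items() if allowed & groups)


def coverage_gaps(personas, docs, required):
    """For each persona, list the required sources it cannot see."""
    return {
        persona: sorted(set(required) - set(visible_sources(groups, docs)))
        for persona, groups in personas.items()
    }
```

Running coverage_gaps(PERSONAS, DOCS, list(DOCS)) on this data shows the project manager missing the budget file, which is the thin-retrieval case described above.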
Write content the way people ask questions
This one feels too obvious to mention until you see the data. The closer your source documents are, in plain language, to the way your people actually phrase questions, the better retrieval works.
A practical move we use: pull the real questions. Look at what people type into the search bar, what they ask the service desk, the wording in support tickets. Then make sure your knowledge content uses that same vocabulary somewhere, even if only in a summary line or a heading. You are building a bridge between how staff talk and how your documents are written.
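A crude version of that comparison fits in a short script. The sketch below counts terms that appear in real questions but in none of your headings. The stopword list and the sample data are assumptions, and plain word matching is much blunter than semantic search (it treats "expense" and "expenses" as different words), so treat the output as a prompt for review rather than a score.

```python
import re
from collections import Counter

# A small, assumed stopword list. Extend it for your own data.
STOPWORDS = {"a", "an", "the", "how", "do", "i", "where", "for", "to",
             "is", "what", "my", "can", "of", "in"}


def terms(text):
    """Lowercase alphabetic words in the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def missing_vocabulary(questions, headings):
    """Terms users ask with that no heading contains, most frequent first."""
    heading_terms = set()
    for heading in headings:
        heading_terms |= terms(heading)
    counts = Counter()
    for question in questions:
        counts.update(terms(question) - heading_terms)
    # Sort by frequency, then alphabetically, so the order is stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Feed it service desk questions and your document headings, and the top of the list is the vocabulary your content is not using.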
For declarative agents grounded on a specific set of documents, this matters even more because the retrieval pool is smaller. With a narrow knowledge source, a single badly-titled document can mean a whole topic is effectively invisible. Short, descriptive headings beat clever ones. "How to submit an expense claim" retrieves better than "Reimbursement framework overview," even though the second sounds more corporate.
Scope the knowledge source tightly
There is a real temptation to connect everything. Point Copilot at the entire tenant, every library, every connector, and let it sort things out. Resist that.
Broader is not better for retrieval. A bigger pool means more near-matches competing for the same relevance slots, which means more chances for the orchestrator to pull something tangentially related instead of the bullseye answer. Some of the sharpest agents we have built are grounded on a deliberately small, curated set of sources. A policy agent that only reads the current policy library will outperform one that reads the whole intranet, every time.
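For a declarative agent, that scoping lives in the knowledge section of the agent manifest. The fragment below is a sketch written as a Python dictionary. The capability name and the items_by_url key reflect my reading of the declarative agent manifest and should be verified against the schema version you target. The site URL is hypothetical.

```python
# Sketch of a tightly scoped knowledge configuration: one curated library,
# not the whole tenant. The URL is a made-up example.
AGENT_KNOWLEDGE = {
    "capabilities": [
        {
            "name": "OneDriveAndSharePoint",
            "items_by_url": [
                {"url": "https://contoso.sharepoint.com/sites/Policies/Current"}
            ],
        }
    ]
}


def knowledge_urls(manifest):
    """List every URL the agent is grounded on, for a quick scope review."""
    return [
        item["url"]
        for capability in manifest.get("capabilities", [])
        for item in capability.get("items_by_url", [])
    ]
```

If knowledge_urls returns more entries than you can justify one by one, the scope is probably too wide.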
When we scope an agent, we start narrow and widen only when we see real questions failing for lack of coverage. That order matters. Starting wide and trying to narrow later is much harder, because you have already trained your users to expect the agent to know everything. This kind of scoping is a core part of how our AI agent builders approach Copilot work, and it is usually where the quiet wins come from.
Test retrieval separately from the model
The single most useful habit I can pass on: evaluate retrieval on its own, before you blame the model.
When an answer is bad, do not immediately rewrite the agent instructions. First ask: did the right content even come back? Look at the citations Copilot provides. If the correct document is not in the citations, this is a retrieval problem, full stop. No instruction tweak will fix it. You need better metadata, better indexing, or better content.
If the right document is cited but the answer is still wrong, now you have a model or instructions problem, and that is a different fix. Separating these two failure modes will save your team weeks. I have seen people spend a fortnight tuning prompts when the actual issue was a connector that had not crawled since March.
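The triage rule is simple enough to write down as a function. This is a sketch of the decision described above, with hypothetical argument names. You supply the expected source, the citations Copilot showed, and your own judgment on whether the answer was correct.

```python
def triage(expected_source, cited_sources, answer_correct):
    """Classify a test result by where the failure most likely sits."""
    if answer_correct:
        return "ok"
    if expected_source not in cited_sources:
        # The right content never reached the model.
        return "retrieval"
    # The right content was there and the answer is still wrong.
    return "model_or_instructions"
```

Tagging every bad answer this way before anyone touches the agent instructions keeps the team from tuning prompts against an indexing problem.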
We build a small set of known questions with known correct sources, then check whether retrieval surfaces those sources. It is unglamorous regression testing, and it catches problems before your users do. If you want a structured way to run this, our business AI managed services team does exactly this kind of ongoing evaluation as part of keeping Copilot deployments healthy over time, because retrieval quality drifts as your content changes.
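A minimal sketch of that regression check follows. The retrieve argument is a stand-in for however you collect cited sources for a question, whether that is a manual export or a script. It is an assumption, not a Copilot API.

```python
from typing import Callable, Iterable


def run_retrieval_checks(
    golden: dict[str, str],
    retrieve: Callable[[str], Iterable[str]],
):
    """Check that each known question surfaces its known correct source.

    golden maps a question to the source that should be cited for it.
    Returns (hit_rate, misses), where misses lists the failing questions.
    """
    misses = []
    for question, expected in golden.items():
        if expected not in set(retrieve(question)):
            misses.append(question)
    hit_rate = 1 - len(misses) / len(golden) if golden else 1.0
    return hit_rate, misses
```

Run it on a schedule and watch the hit rate over time. A drop with no change to the agent usually means the content or the index moved.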
What is still rough
I will be honest about the weak spots, because the marketing will not be.
The connector tooling is still more painful than it should be. Schema configuration, change tracking, and permission mapping for external systems take real engineering effort. This is not a no-code afternoon.
Observability into why a particular document did or did not get retrieved is thinner than I would like. You can see citations, but you cannot always see the full ranking logic that decided one document beat another. You end up doing a bit of educated guessing.
And freshness on external connectors depends entirely on you building the update path properly. Microsoft gives you the hooks, but it will not catch a stale index for you.
None of this is a reason to avoid Copilot extensibility. It is a reason to treat retrieval as a first-class engineering concern rather than an afterthought. The organisations getting real value from Copilot are the ones who understood early that the model was never the hard part.
If you are wrestling with a Copilot deployment that gives confident but unreliable answers, the retrieval layer is the first place to look. We help Australian organisations get this right, and you are welcome to get in touch if you want a second set of eyes on yours.
For the original Microsoft guidance this post draws on, see "Optimize content retrieval" in the Microsoft 365 Copilot extensibility documentation.