How to Build an AI Knowledge Base from Your Company Documents
Your company's knowledge is scattered across SharePoint sites, shared drives, Confluence pages, email threads, and the heads of people who've been around for 15 years. How do you make all of that searchable and accessible through an AI assistant that actually gives accurate answers?
This is one of the most common projects we deliver at Team 400. Building an AI knowledge base - the technique is called Retrieval-Augmented Generation, or RAG - is one of the highest-value AI implementations for the enterprise. When done well, it dramatically reduces the time people spend searching for information.
I'm Michael Ridland, founder of Team 400, and here's exactly how we approach these projects.
What Is an AI Knowledge Base?
An AI knowledge base combines your company documents with a large language model. When someone asks a question, the system finds the most relevant documents, feeds them to the LLM as context, and generates an answer grounded in your actual content.
It's different from a traditional search engine. Instead of returning a list of documents that might contain the answer, it gives you the answer directly - with references to the source documents so you can verify.
User: "What is our return policy for enterprise customers?"
Traditional search: Returns 12 documents mentioning "return policy"
AI Knowledge Base: "Enterprise customers can return products within 60 days
of purchase for a full refund, provided the product is in original condition.
Custom-configured items are subject to a 15% restocking fee.
[Source: Enterprise Sales Policy v4.2, Section 3.1]"
The second response saves minutes of reading and eliminates guesswork about which document is current.
Why RAG Instead of Fine-Tuning?
We get asked this a lot. For knowledge bases, RAG beats fine-tuning almost every time.
RAG advantages for knowledge bases:
- Documents update without retraining the model
- Sources are traceable - you can verify every answer
- No expensive training step
- Works with any document format
- Scales to millions of documents
- Respects access controls (users only see answers from documents they're authorised to access)
Fine-tuning bakes knowledge into model weights. It goes stale the moment your documents change, and you can't trace where an answer came from. For a knowledge base, that's a deal-breaker.
The Architecture
Here's what a production AI knowledge base looks like.
Document Ingestion Pipeline
Documents flow through a pipeline that prepares them for search.
Source Documents → Extraction → Chunking → Embedding → Vector Store
                       ↓                                    ↓
                   Metadata                            Search Index
Document extraction converts files into plain text. PDFs, Word documents, PowerPoint slides, HTML pages, emails - each format needs its own extractor.
Chunking splits documents into smaller pieces (typically 500-1,500 tokens). The LLM's context window is limited, and smaller chunks mean more precise retrieval.
Embedding converts each chunk into a vector - a numerical representation of its meaning. Similar content produces similar vectors, which is how semantic search works.
Vector storage saves these embeddings in a database optimised for similarity search.
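The similarity idea can be illustrated in plain Python - a minimal sketch using toy 3-dimensional vectors in place of real embeddings (text-embedding-3-large produces 3,072 dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 for vectors pointing the same way,
    # close to 0.0 for unrelated directions.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy vectors standing in for real embeddings of three pieces of text.
refund_policy = [0.9, 0.1, 0.2]
return_question = [0.85, 0.15, 0.25]
holiday_schedule = [0.1, 0.9, 0.3]

print(cosine_similarity(return_question, refund_policy))    # high: related content
print(cosine_similarity(return_question, holiday_schedule)) # low: unrelated content
```

A question about returns lands near the refund policy in vector space even though the words differ - that is the whole trick behind semantic search.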
Query Pipeline
When a user asks a question, the system runs a retrieval and generation pipeline.
User Question → Embedding → Vector Search → Top Chunks → LLM → Answer
                                 ↓
                    Optional: Reranking
                    Optional: Hybrid Search
Embedding converts the question into a vector using the same model that embedded the documents.
Vector search finds the chunks in the vector store most similar to the question.
Reranking (optional) uses a second model to re-score the retrieved chunks for relevance, which measurably improves accuracy.
Generation sends the question plus retrieved chunks to the LLM, which produces an answer grounded in the provided context.
Step-by-Step Implementation
Step 1 - Audit Your Document Sources
Before building anything, catalogue what you're working with.
Questions to answer:
- Where do documents live? (SharePoint, shared drives, Confluence, databases, email)
- How many documents? What's the total size?
- What formats? (PDF, DOCX, PPTX, HTML, markdown, email)
- How often are documents updated?
- Are there access controls that need to be respected?
- What's the quality like? (Are there outdated documents that should be excluded?)
This audit shapes every technical decision that follows. A knowledge base with 500 well-maintained policy documents is a very different project from one with 200,000 documents across six systems.
Step 2 - Choose Your Tech Stack
Here's our recommended stack for Australian organisations.
Vector database: Azure AI Search. It provides both vector search and traditional keyword search (hybrid search), it runs in Australian regions, and it integrates natively with Azure OpenAI. For smaller projects, PostgreSQL with the pgvector extension is a capable and cost-effective alternative.
Embedding model: Azure OpenAI text-embedding-3-large. Best accuracy for the cost. Runs within your Azure tenant.
LLM: Azure OpenAI GPT-4o for generation. For high-volume, lower-complexity queries, GPT-4o-mini reduces cost significantly.
Orchestration: We typically use LangChain or Semantic Kernel for the RAG pipeline, deployed as an Azure Function or Container App.
Document connectors: Microsoft Graph API for SharePoint and OneDrive. Custom connectors for other sources.
Frontend: A chat interface built into your existing intranet, Teams app, or standalone web application.
Step 3 - Build the Ingestion Pipeline
This is the backbone of your knowledge base.
Document extraction:
For SharePoint and OneDrive, use Microsoft Graph API to access files. For PDFs, Azure AI Document Intelligence gives you clean text extraction including tables and structure. For Office documents, standard libraries handle DOCX and PPTX well.
Watch out for:
- Scanned PDFs (need OCR, not just text extraction)
- Documents with complex tables or charts
- Password-protected files
- Very large documents (need special chunking strategies)
Chunking strategy:
How you split documents matters more than most people expect. Bad chunking leads to bad retrieval leads to bad answers.
Our recommended approach:
- Use semantic chunking where possible - split at natural boundaries (sections, paragraphs, headings)
- Keep chunks between 500 and 1,500 tokens
- Include overlap between chunks (100-200 tokens) so context isn't lost at boundaries
- Preserve document metadata with each chunk (title, section, date, source URL)
- For structured documents, keep tables and lists as complete units
# Example chunk metadata
{
    "content": "Enterprise customers can return products within 60 days...",
    "metadata": {
        "source": "Enterprise Sales Policy v4.2",
        "section": "3. Returns and Refunds",
        "last_updated": "2026-03-15",
        "document_url": "https://sharepoint.example.com/policies/sales-policy-v4.2",
        "access_groups": ["sales-team"]
    }
}
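To make the overlap guidance concrete, here's a minimal chunking sketch. It counts words as a rough stand-in for tokens - a production pipeline would count real tokens with a tokenizer and split at semantic boundaries first:

```python
def chunk_text(text, chunk_size=300, overlap=50):
    # Sizes are in words as a rough proxy for tokens; a real pipeline
    # would use a tokenizer and prefer section/paragraph boundaries.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(700))
chunks = chunk_text(doc)
print(len(chunks))  # → 3
# The last 50 words of chunk 0 are repeated as the first 50 of chunk 1,
# so a sentence straddling the boundary survives intact in one chunk.
```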
Embedding and indexing:
Generate embeddings for each chunk and store them in your vector database along with the chunk text and metadata. Azure AI Search makes this straightforward with its built-in vector indexing.
Set up incremental indexing - only re-process documents that have changed since the last run. For SharePoint, the delta query API tracks changes efficiently.
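Incremental indexing can be sketched with a simple content-hash check. The in-memory dictionaries here are illustrative only - in production you would persist the state, or lean on SharePoint's delta query instead of hashing:

```python
import hashlib

def docs_to_reindex(documents, index_state):
    """Return only documents that changed since the last run.
    documents: doc_id -> text; index_state: doc_id -> last content hash."""
    changed = {}
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index_state.get(doc_id) != digest:
            changed[doc_id] = text
            index_state[doc_id] = digest  # remember for the next run
    return changed

state = {}
first = docs_to_reindex({"policy": "v1 text", "faq": "faq text"}, state)
second = docs_to_reindex({"policy": "v2 text", "faq": "faq text"}, state)
# first run re-processes everything; second run only the changed policy
```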
Step 4 - Build the Query Pipeline
Basic retrieval:
Start with straightforward vector similarity search. Embed the user's question, find the top 5-10 most similar chunks, pass them to the LLM.
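In memory, the core of basic retrieval is just cosine ranking - a sketch with toy vectors (a real system delegates this to the vector database, which indexes for approximate search at scale):

```python
import math

def top_k(query_vec, chunk_vecs, k=5):
    """chunk_vecs: list of (chunk_id, vector) pairs; returns the ids of
    the k chunks most similar to the query embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    ranked = sorted(chunk_vecs, key=lambda cv: cos(query_vec, cv[1]), reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

hits = top_k([1.0, 0.0], [("returns", [0.9, 0.1]), ("leave", [0.1, 0.9])], k=1)
print(hits)  # → ['returns']
```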
Hybrid search:
Combine vector search with keyword search. Vector search is great for semantic meaning but can miss exact terms. Keyword search catches specific names, codes, and identifiers. Azure AI Search supports hybrid search natively.
In our testing, hybrid search consistently outperforms pure vector search by 10-15% on accuracy benchmarks.
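Azure AI Search handles the fusion for you, but the underlying idea - Reciprocal Rank Fusion, the method it documents for hybrid queries - is simple enough to sketch:

```python
def rrf_fuse(vector_ranking, keyword_ranking, k=60):
    """Fuse two rankings with Reciprocal Rank Fusion. Each ranking is an
    ordered list of chunk ids; k=60 is the conventional smoothing constant.
    A chunk ranked well by either method gets a strong fused score."""
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["c3", "c1", "c7"], ["c1", "c9", "c3"])
print(fused)  # → ['c1', 'c3', 'c9', 'c7']
```

`c1` wins because it appears high in both rankings, which is exactly the behaviour you want from hybrid search.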
Reranking:
After retrieving 20-30 candidate chunks, use a reranking model to score them for relevance to the specific question. This pushes the most relevant chunks to the top. Azure AI Search includes a semantic ranker, or you can use a dedicated reranking model like Cohere Rerank.
Prompt construction:
The prompt you send to the LLM is critical. Here's a template that works well:
You are a helpful assistant that answers questions based on company documents.
Use ONLY the provided context to answer the question.
If the context doesn't contain enough information to answer, say so.
Always cite your sources.
Context:
[Retrieved chunks with source information]
Question: [User's question]
Answer:
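Assembled in code, the template looks like this - a sketch assuming each retrieved chunk is a dict with `content` and a metadata `source`, matching the ingestion example in Step 3:

```python
def build_prompt(question, chunks):
    """Assemble the grounding prompt from retrieved chunks."""
    context = "\n\n".join(
        f"[Source: {c['metadata']['source']}]\n{c['content']}" for c in chunks
    )
    return (
        "You are a helpful assistant that answers questions based on company documents.\n"
        "Use ONLY the provided context to answer the question.\n"
        "If the context doesn't contain enough information to answer, say so.\n"
        "Always cite your sources.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\nAnswer:"
    )

prompt = build_prompt(
    "What is our return policy for enterprise customers?",
    [{"content": "Returns within 60 days.",
      "metadata": {"source": "Enterprise Sales Policy v4.2"}}],
)
```

Labelling each chunk with its source inside the context is what lets the model produce the citations the instructions demand.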
The instruction to only use provided context is essential. Without it, the model will fill gaps with its general knowledge, which may be wrong for your specific organisation.
Step 5 - Handle Access Controls
This is where many RAG implementations fall short. If your documents have access controls, your AI knowledge base must respect them.
Document-level filtering: Tag each chunk with its access groups during ingestion. At query time, filter the search to only include chunks the current user has access to.
Azure AI Search security filters support this natively. You can add a security field to each document and filter at search time.
# At query time, only search documents the user can access
search_filter = f"access_groups/any(g: g eq '{user_group}')"
This ensures a junior employee never receives answers drawn from executive-only documents.
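The same check is worth repeating in application code as defence in depth - a sketch assuming each chunk's metadata carries an `access_groups` list:

```python
def accessible_chunks(chunks, user_groups):
    # Defence in depth: the primary filter belongs in the search query,
    # but re-checking before building the prompt catches indexing mistakes.
    allowed = set(user_groups)
    return [
        c for c in chunks
        if allowed & set(c["metadata"].get("access_groups", []))
    ]

chunks = [
    {"content": "Return policy...", "metadata": {"access_groups": ["sales-team"]}},
    {"content": "Board minutes...", "metadata": {"access_groups": ["executives"]}},
]
visible = accessible_chunks(chunks, ["sales-team"])  # only the sales chunk survives
```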
Step 6 - Evaluate and Improve
Before going live, measure your knowledge base's performance.
Build an evaluation dataset: Create 50-100 question-answer pairs from your documents. Include straightforward questions, multi-document questions, and questions where the answer isn't in the documents.
Metrics to track:
- Retrieval accuracy - Are the correct documents being retrieved?
- Answer accuracy - Is the generated answer correct?
- Source citation accuracy - Are sources correctly attributed?
- Hallucination rate - How often does the model generate unsupported information?
Iterate on:
- Chunking strategy (size, overlap, boundaries)
- Number of retrieved chunks
- Prompt instructions
- Reranking configuration
- Hybrid search weighting
Small changes in these parameters can produce significant accuracy improvements.
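Retrieval accuracy can be measured with a small harness like the one below, where `retrieve` is whatever function wraps your query pipeline (the stub retriever here is purely illustrative):

```python
def retrieval_accuracy(eval_set, retrieve):
    """Fraction of questions whose expected source document appears
    among the retrieved chunks."""
    hits = sum(
        1 for item in eval_set
        if item["expected_source"]
        in {c["metadata"]["source"] for c in retrieve(item["question"])}
    )
    return hits / len(eval_set)

def fake_retrieve(question):
    # Stub for illustration; swap in the real query pipeline.
    return [{"metadata": {"source": "Enterprise Sales Policy v4.2"}}]

score = retrieval_accuracy(
    [
        {"question": "Return policy?", "expected_source": "Enterprise Sales Policy v4.2"},
        {"question": "Leave policy?", "expected_source": "HR Handbook"},
    ],
    fake_retrieve,
)
print(score)  # → 0.5
```

Run this after every chunking or retrieval change so you know whether the tweak actually helped.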
Common Pitfalls
Ignoring document quality. Garbage in, garbage out. If your SharePoint has 10 versions of the same policy and 8 are outdated, the AI will sometimes cite the wrong one. Clean up your document sources before building the knowledge base.
Chunks that are too small or too large. Too small and you lose context. Too large and retrieval becomes imprecise. Test different sizes with your actual documents.
Not testing with real users. Your evaluation dataset won't cover all the ways real people ask questions. Run a pilot with a small group and track what questions they ask and whether answers are correct.
Skipping access controls. This is a security risk, not a nice-to-have. Implement access filtering from day one.
Over-engineering the first version. Start with basic vector search and a good prompt. Add hybrid search, reranking, and advanced features based on measured gaps in performance.
What Does It Cost?
For a mid-sized organisation (10,000-100,000 documents), typical monthly costs:
| Component | Monthly Cost (AUD) |
|---|---|
| Azure AI Search (Standard tier) | $350-700 |
| Azure OpenAI (embeddings) | $50-200 |
| Azure OpenAI (generation) | $200-2,000 |
| Compute (Functions/Container Apps) | $100-300 |
| Storage | $20-50 |
| Total | $720-3,250 |
Generation cost scales with usage. More users asking more questions means higher LLM spend. Using GPT-4o-mini for straightforward queries and GPT-4o only for complex ones helps manage this.
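One way to implement that split is a simple router in front of the generation call. The thresholds below are illustrative, not tuned values - calibrate them against your own traffic:

```python
def pick_model(question, retrieved_chunks):
    # Illustrative heuristic: short questions answerable from a single
    # source document go to the cheaper model; everything else gets GPT-4o.
    sources = {c["metadata"]["source"] for c in retrieved_chunks}
    if len(question.split()) <= 15 and len(sources) == 1:
        return "gpt-4o-mini"
    return "gpt-4o"
```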
Getting Started
The best AI knowledge bases start small and expand. Pick one document collection that your team frequently searches - policy documents, technical documentation, product information - and build a knowledge base around it. Once that's working well, add more sources.
If you want to build an AI knowledge base for your organisation, talk to our team. We design and build production RAG systems for Australian businesses. Learn more about our AI development services and AI agent capabilities to see how a knowledge base fits into a broader AI strategy.