How to Build a RAG Application with LangChain and Azure
Retrieval Augmented Generation (RAG) is the most common pattern for building AI applications that work with your organisation's data. Instead of fine-tuning a model on your documents, you retrieve relevant information at query time and include it in the prompt. The result is an AI application that can answer questions about your data accurately and with citations.
LangChain and Azure form one of the strongest combinations for building production RAG applications. LangChain provides the orchestration framework. Azure OpenAI Service provides the LLM. Azure AI Search provides the vector store and retrieval engine. Together, they give you a stack that's production-ready, enterprise-grade, and keeps your data within Azure's compliance boundaries.
We've built dozens of RAG applications on this stack for Australian businesses. Here's how to do it right.
The Architecture
A production RAG application on LangChain and Azure has these core components:
```
Documents  --> Ingestion Pipeline  --> Azure AI Search (vector store)
                                              |
User Query --> LangChain Retriever --> Retrieved Chunks
                                              |
               LangChain Chain     --> Azure OpenAI --> Response
```
Document ingestion: Your documents (PDFs, Word files, web pages, database records) are processed, split into chunks, embedded using an embedding model, and stored in Azure AI Search.
Retrieval: When a user asks a question, the query is embedded and used to find the most relevant document chunks in Azure AI Search.
Generation: The retrieved chunks are injected into a prompt alongside the user's question, and Azure OpenAI generates an answer grounded in your data.
Each of these stages has design decisions that significantly affect quality, cost, and performance.
Step 1 - Set Up Azure Services
You need three Azure services:
Azure OpenAI Service
Deploy two models:
- A chat model (GPT-4o or GPT-4o-mini) for generating responses
- An embedding model (text-embedding-3-large or text-embedding-3-small) for converting text to vectors
GPT-4o-mini is our default recommendation for most RAG applications. It's fast, cost-effective, and handles retrieval-grounded responses well. Use GPT-4o when you need stronger reasoning over complex documents.
For embeddings, text-embedding-3-large gives better retrieval quality. text-embedding-3-small is cheaper and faster if you're processing millions of documents.
Australian data residency: Azure OpenAI Service is available in the Australia East region. If your data must stay in Australia, deploy here. Performance is good and latency is low for Australian users.
Azure AI Search
Create a search service and configure it for vector search. Azure AI Search supports hybrid search (combining keyword and vector search), which we recommend as the default retrieval strategy. Hybrid search consistently outperforms pure vector search in our testing.
Choose your pricing tier based on document volume:
- Basic: Up to 2 GB of data, good for proof of concepts
- Standard S1: Up to 25 GB, suitable for most production applications
- Standard S2/S3: For larger document collections
Azure Blob Storage
Store your source documents in Blob Storage. This gives you a clean separation between raw documents and the processed index, making it easy to re-index when you change your chunking strategy.
Step 2 - Build the Ingestion Pipeline
The ingestion pipeline is where most RAG projects succeed or fail. Getting retrieval right matters more than getting generation right.
Document Loading
LangChain has document loaders for most file types. For a typical enterprise deployment:
```python
from langchain_community.document_loaders import (
    PyPDFLoader,
    Docx2txtLoader,
    UnstructuredExcelLoader,
)

# Load different document types
pdf_loader = PyPDFLoader("document.pdf")
docx_loader = Docx2txtLoader("document.docx")
xlsx_loader = UnstructuredExcelLoader("document.xlsx")
```
For production systems, we build a document processing service that watches Blob Storage for new files, processes them automatically, and updates the index. This keeps the index current without manual intervention.
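The core of that watcher is a dedupe-and-process loop. A minimal sketch is below; `process_document` and the blob listing are hypothetical placeholders for your load-chunk-embed-index pipeline and the Azure SDK's container listing call:

```python
# Sketch of an incremental ingestion pass. The all_blobs list stands in for an
# Azure SDK listing call (e.g. ContainerClient.list_blobs); process_document
# stands in for load -> chunk -> embed -> index. Both are placeholders.

def run_ingestion_pass(all_blobs, processed, process_document):
    """Process only blobs not seen before; return the newly processed names."""
    new = [name for name in all_blobs if name not in processed]
    for name in new:
        process_document(name)   # load, chunk, embed, and index this file
        processed.add(name)      # record it so the next pass skips it
    return new
```

In a real deployment the `processed` set would live in durable storage (or be derived from the index itself) so the service survives restarts, and the loop would run on a timer or be triggered by Blob Storage events.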
Chunking Strategy
How you split documents into chunks has the biggest single impact on retrieval quality. Get this wrong and no amount of prompt engineering will save you.
Our recommended defaults:
- Chunk size: 800-1200 tokens. Smaller chunks improve retrieval precision. Larger chunks provide more context per retrieval.
- Chunk overlap: 200 tokens. Overlap prevents information from being lost at chunk boundaries.
- Splitting method: Use recursive character splitting with awareness of document structure (headings, paragraphs, tables).
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default;
# pass a token-counting length_function if you want token-based limits
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_documents(documents)
```
What we've learned from production deployments:
Tables are the hardest content type to handle. Standard text splitting destroys table structure. For documents with important tables, consider extracting tables separately and storing them as complete units.
Metadata matters. Attach source document name, page number, section heading, and document date to every chunk. This metadata enables filtering at query time and powers citation features.
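The shape of that metadata, and how it enables query-time filtering, can be sketched with plain dicts (with LangChain you would set the same fields on `Document.metadata` during ingestion; the field names here are illustrative):

```python
# Attach metadata to every chunk at ingestion time, then filter on it at
# query time. The dict shape is illustrative; in LangChain the same fields
# would live on Document.metadata.

def make_chunk(text, source, page, section, doc_date):
    return {
        "text": text,
        "metadata": {"source": source, "page": page,
                     "section": section, "date": doc_date},
    }

def filter_chunks(chunks, **criteria):
    """Keep only chunks whose metadata matches every given criterion."""
    return [c for c in chunks
            if all(c["metadata"].get(k) == v for k, v in criteria.items())]
```

With Azure AI Search, the same filtering happens server-side via OData filter expressions on indexed metadata fields, so only matching chunks are ever scored.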
Embedding and Indexing
```python
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import AzureSearch

# The API key and API version are read from the AZURE_OPENAI_API_KEY and
# OPENAI_API_VERSION environment variables if not passed explicitly
embeddings = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-3-large",
    azure_endpoint="https://your-resource.openai.azure.com/",
)

vector_store = AzureSearch(
    azure_search_endpoint="https://your-search.search.windows.net",
    azure_search_key="your-key",
    index_name="your-index",
    embedding_function=embeddings.embed_query,
)

vector_store.add_documents(chunks)
```
For production, batch your embedding calls and implement retry logic. Azure OpenAI has rate limits, and a large document set can take hours to embed if you're not batching efficiently.
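A minimal sketch of that batching-plus-retry pattern is below. `embed_batch` is a placeholder for the real call (e.g. `AzureOpenAIEmbeddings.embed_documents`), and the batch size and retry counts are illustrative defaults, not Azure limits:

```python
import time

# Batch embedding with retry and exponential backoff. embed_batch is a
# placeholder for the real embedding call; batch_size and max_retries are
# illustrative defaults you should tune against your quota.

def embed_all(texts, embed_batch, batch_size=16, max_retries=5):
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_batch(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise                  # give up after the last retry
                time.sleep(2 ** attempt)   # 1s, 2s, 4s, ... backoff
    return vectors
```

In production you would narrow the `except` to rate-limit and transient server errors, and persist progress so a crash mid-run doesn't force re-embedding everything.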
Step 3 - Build the Retrieval Chain
Basic RAG Chain
```python
from langchain_openai import AzureChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

llm = AzureChatOpenAI(
    azure_deployment="gpt-4o-mini",
    azure_endpoint="https://your-resource.openai.azure.com/",
    temperature=0,
)

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)

template = """Answer the question based on the following context.
If you cannot find the answer in the context, say so clearly.
Always cite which document and section your answer comes from.

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"],
)

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)
```
Hybrid Search
We strongly recommend hybrid search (combining vector similarity with keyword matching) over pure vector search. Azure AI Search supports this natively.
```python
retriever = vector_store.as_retriever(
    search_type="hybrid",
    search_kwargs={"k": 5},
)
```
In our testing, hybrid search improves answer accuracy by 15-25% compared to pure vector search, particularly for queries containing specific terms, product names, or technical jargon.
Adding a Re-ranker
For higher-quality retrieval, add a re-ranking step. Retrieve more candidates (say 20), then use a cross-encoder model to re-rank them and keep the top 5. Azure AI Search offers a built-in semantic ranker that works well for this.
```python
retriever = vector_store.as_retriever(
    search_type="semantic_hybrid",
    search_kwargs={"k": 5, "fetch_k": 20},
)
```
This adds latency (typically 200-400ms) but meaningfully improves answer quality for complex queries.
Step 4 - Production Hardening
Getting a RAG chain working is the easy part. Making it production-ready is where the real work begins.
Evaluation Framework
You need a way to measure whether your RAG application is giving good answers. Build an evaluation dataset of at least 50-100 question-answer pairs covering your key use cases.
Metrics to track:
- Retrieval precision: Are the retrieved chunks relevant to the question?
- Answer correctness: Is the generated answer factually correct based on the source documents?
- Answer faithfulness: Does the answer stay grounded in the retrieved context (no hallucination)?
- Citation accuracy: Do the citations point to the correct source documents?
Run this evaluation after every change to your chunking strategy, prompts, or model configuration.
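Retrieval precision, the first of those metrics, is straightforward to compute once you have hand-labelled relevant chunks for each question. A minimal sketch, where `retrieve` is a placeholder for your real retriever:

```python
# Retrieval precision over an evaluation set: for each question, what fraction
# of the top-k retrieved chunk IDs appear in the hand-labelled relevant set?
# retrieve() is a placeholder for your real retriever.

def retrieval_precision(eval_set, retrieve, k=5):
    scores = []
    for item in eval_set:
        retrieved = retrieve(item["question"])[:k]
        relevant = set(item["relevant_chunk_ids"])
        hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
        scores.append(hits / len(retrieved) if retrieved else 0.0)
    return sum(scores) / len(scores)
```

Answer correctness and faithfulness are harder to score automatically; teams commonly use an LLM-as-judge approach for those, spot-checked by humans.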
Error Handling
Production RAG applications need to handle:
- LLM API failures: Azure OpenAI can return 429 (rate limit), 500 (server error), or timeout. Implement retry with exponential backoff.
- Empty retrieval: Sometimes no relevant documents are found. Your prompt should instruct the model to say "I don't have information on this" rather than hallucinating.
- Token limit exceeded: Long retrieved contexts can exceed the model's context window. Implement truncation or summarisation for cases where retrieved content is too large.
- Malformed documents: Your ingestion pipeline will encounter corrupted PDFs, password-protected files, and scanned images. Handle these gracefully.
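For the token-limit case above, a simple truncation strategy is to keep chunks in retrieval order until a token budget is spent. This sketch uses a rough 4-characters-per-token estimate; in production you would count with the model's actual tokenizer (e.g. tiktoken):

```python
# Truncate retrieved context to a token budget before building the prompt.
# The 4-chars-per-token ratio is a crude estimate for English text; use the
# model's tokenizer for exact counts in production.

def fit_to_budget(chunks, max_tokens):
    """Keep chunks in retrieval order until the estimated budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        estimated = max(1, len(chunk) // 4)   # crude chars -> tokens estimate
        if used + estimated > max_tokens:
            break                             # drop this and all later chunks
        kept.append(chunk)
        used += estimated
    return kept
```

Because chunks arrive ranked by relevance, truncating from the tail discards the least relevant material first.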
Observability
Every production RAG application needs tracing. You should be able to see, for any user query:
- What the user asked
- What chunks were retrieved (and their relevance scores)
- What prompt was sent to the LLM
- What the LLM returned
- How long each step took
- How much it cost (in tokens)
LangSmith is the most common observability tool for LangChain applications. Azure Application Insights can also work if you prefer to keep everything in Azure.
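Even without a dedicated tool, the shape of a useful trace record is simple. This sketch captures the items listed above; the field names are illustrative, and `retrieve`/`generate` are placeholders for your real pipeline steps:

```python
import time

# A minimal per-query trace record: question, retrieved chunks with scores,
# answer, token usage, and per-step timings. Field names are illustrative;
# LangSmith or Application Insights would capture equivalent data.

def trace_query(question, retrieve, generate):
    trace = {"question": question, "steps": []}

    t0 = time.perf_counter()
    chunks = retrieve(question)
    trace["steps"].append({"step": "retrieve",
                           "chunks": [c["id"] for c in chunks],
                           "scores": [c["score"] for c in chunks],
                           "seconds": time.perf_counter() - t0})

    t1 = time.perf_counter()
    answer, tokens = generate(question, chunks)
    trace["steps"].append({"step": "generate", "tokens": tokens,
                           "seconds": time.perf_counter() - t1})

    trace["answer"] = answer
    return trace
```

Emitting one such record per query (to structured logs or a tracing backend) is usually enough to debug bad answers: you can immediately see whether the failure was in retrieval or in generation.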
Cost Management
RAG application costs come from four main sources:
| Cost Component | Typical Range (AUD/month) |
|---|---|
| Azure OpenAI (GPT-4o-mini) | $200-$5,000 |
| Azure OpenAI (embeddings) | $50-$500 |
| Azure AI Search (Standard S1) | $350-$400 |
| Azure Blob Storage | $10-$50 |
| Total | $610-$5,950 |
The biggest variable is LLM usage. To manage costs:
- Use GPT-4o-mini as default, only escalate to GPT-4o for complex queries
- Cache frequent queries and their responses
- Limit the number of retrieved chunks sent to the LLM
- Implement usage limits per user or department
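A response cache for frequent queries can be as simple as the sketch below. The key normalisation is illustrative; a production cache would also include the index version and any metadata filters in the key, and expire entries when documents change:

```python
# A simple response cache keyed on normalised question text. This is a
# sketch: a production cache would key on (question, filters, index version)
# and invalidate entries when the underlying documents are re-indexed.

class QueryCache:
    def __init__(self):
        self._store = {}
        self.hits = 0

    def get_or_compute(self, question, compute):
        key = " ".join(question.lower().split())   # normalise case/whitespace
        if key in self._store:
            self.hits += 1
            return self._store[key]
        answer = compute(question)                 # only call the LLM on a miss
        self._store[key] = answer
        return answer
```

Even a modest hit rate pays off quickly, since every cache hit avoids both the embedding call and the chat completion.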
Common Mistakes We See
Not testing chunking strategies
Most teams pick a chunk size, index their documents, and never test alternatives. We've seen RAG quality improve by 30% just by changing from 500-token chunks to 1000-token chunks with 200-token overlap. Test at least three chunking strategies before settling on one.
Retrieving too few or too many chunks
Retrieving 3 chunks often misses relevant information. Retrieving 20 chunks floods the prompt with noise. Start with 5-7 chunks and adjust based on your evaluation results.
Ignoring document preprocessing
Raw PDFs often contain headers, footers, page numbers, and formatting artifacts that pollute your chunks. Clean your documents before chunking. Remove headers/footers, handle page breaks, and preserve table structure.
Skipping metadata filtering
If your document collection spans multiple departments, products, or time periods, add metadata filters to your retrieval. A user asking about "2025 leave policy" shouldn't get chunks from the 2023 policy document.
Not planning for document updates
Documents change. Policies get updated, product information changes, staff directories evolve. Your ingestion pipeline needs to handle updates, not just initial loads. Design for incremental re-indexing from the start.
When to Go Beyond Basic RAG
Basic RAG works well for straightforward question-answering over a document collection. Consider more advanced patterns when:
- Multi-step reasoning: The user's question requires synthesising information from multiple documents. Look at LangGraph for multi-step retrieval agents.
- Conversational context: The user is having a multi-turn conversation. Add conversation memory and query reformulation.
- Structured data: You need to query databases or APIs alongside documents. Consider adding tool-calling capabilities.
- Large document collections: When you have millions of chunks, retrieval quality degrades. Consider hierarchical indexing or document-level routing.
These advanced patterns add complexity and cost. Start simple and add sophistication based on real user needs, not anticipated ones.
How Team 400 Can Help
We've built production RAG applications on LangChain and Azure for businesses across Australia - from internal knowledge bases for professional services firms to customer-facing support systems for financial services companies.
Our LangChain consulting engagements cover architecture design, implementation, and production deployment. We also work extensively with Azure AI services and can help with the full stack from document ingestion to user interface.
If you're planning a RAG application and want to get the architecture right from the start, talk to our team. We offer a two-week proof of concept that uses your actual documents and data, so you can see real results before committing to a full build. Check out our full range of AI services or learn more about our AI agent development capabilities.