How to Set Up AI Monitoring and Observability
You've built an AI system and deployed it to production. It's working well. Then three weeks later, accuracy starts dropping. Users complain about strange responses. By the time someone notices, the damage is done.
This scenario plays out more often than it should. AI systems aren't like traditional software - they degrade silently. The code doesn't throw errors. The API still returns 200 OK. But the quality of outputs quietly deteriorates as the world changes around your model.
I'm Michael Ridland, founder of Team 400, and we've learned the hard way that monitoring is not optional for production AI. Here's how we set it up for every system we deploy.
Why AI Systems Need Different Monitoring
Traditional application monitoring tracks uptime, response times, error rates, and resource usage. That's necessary for AI systems too, but it's nowhere near sufficient.
AI systems have failure modes that don't exist in conventional software:
Model drift - The world changes but the model doesn't. Customer language evolves, product catalogues update, regulations shift. A model trained on 2025 data slowly becomes less accurate in 2026.
Data drift - The input data starts looking different from what the model was trained on. A document extraction model trained on PDF invoices starts receiving scanned handwritten forms. It doesn't crash - it just produces wrong answers confidently.
Prompt injection and adversarial inputs - Users (intentionally or accidentally) send inputs that cause unexpected behaviour.
Hallucination - The model generates plausible-sounding but incorrect information. This can increase over time as the gap between training data and reality grows.
Cost blowouts - A change in usage patterns or a bug in the retrieval pipeline sends token counts through the roof.
Latency creep - Response times gradually increase as context windows fill up or retrieval queries become more complex.
You need monitoring that catches all of these.
The AI Observability Stack
Here's the monitoring architecture we implement for production AI systems.
Layer 1 - Infrastructure Monitoring
This is your standard application monitoring. Track it because if the infrastructure is down, nothing else matters.
Metrics:
- API availability and uptime
- Response time (p50, p95, p99)
- Error rates (HTTP errors, timeout rates)
- CPU, memory, and GPU utilisation
- Queue depths (for async processing)
- Storage usage
Tools: Azure Monitor and Application Insights are our defaults for Azure-hosted systems. Datadog and Grafana are solid alternatives.
Alerts:
- Availability drops below 99.5%
- p95 latency exceeds your SLA (e.g., 5 seconds for chat, 30 seconds for document processing)
- Error rate exceeds 1%
- Resource utilisation exceeds 80%
This layer should be set up before going live. No exceptions.
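As a sketch of how those thresholds translate into checks, here is a minimal alert evaluator. The metric names and threshold constants are illustrative; in practice Azure Monitor or Grafana would evaluate these rules for you.

```python
import math

# Illustrative thresholds matching the alerts above; tune to your own SLA.
P95_LATENCY_MS = 5000   # chat example from the text
MAX_ERROR_RATE = 0.01   # 1%

def p95(latencies_ms):
    """Nearest-rank 95th percentile of a window of latency samples."""
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def layer1_alerts(latencies_ms, error_count, request_count):
    """Names of the alerts that should fire for this monitoring window."""
    alerts = []
    if p95(latencies_ms) > P95_LATENCY_MS:
        alerts.append("p95_latency_exceeds_sla")
    if request_count and error_count / request_count > MAX_ERROR_RATE:
        alerts.append("error_rate_exceeds_1pct")
    return alerts
```

Run this over a sliding window (say, five minutes of requests) and route any returned alert names to your paging tool.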
Layer 2 - LLM-Specific Monitoring
This is where AI monitoring diverges from traditional monitoring.
Token usage and cost tracking:
Track every API call:
- Input tokens (prompt + context)
- Output tokens (completion)
- Model used (GPT-4o vs GPT-4o-mini)
- Cost per request
- Cost per user/department/use case
Set up daily and monthly cost alerts. We've seen a single bug in a retrieval pipeline cause token usage to spike 10x overnight because it was stuffing the entire document collection into every prompt.
An example log entry for each LLM call:

```json
{
  "timestamp": "2026-04-22T10:15:30Z",
  "model": "gpt-4o",
  "input_tokens": 3200,
  "output_tokens": 450,
  "cost_aud": 0.12,
  "latency_ms": 2100,
  "user_id": "user_123",
  "use_case": "knowledge_base_query",
  "session_id": "sess_456"
}
```
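A small helper can produce entries in that shape. This is a sketch with placeholder AUD rates per 1K tokens; look up your provider's current pricing before relying on the numbers.

```python
import datetime

# Placeholder AUD rates per 1K tokens - check your provider's current pricing.
PRICE_AUD_PER_1K = {
    "gpt-4o": {"input": 0.0075, "output": 0.0225},
}

def llm_call_log(model, input_tokens, output_tokens, latency_ms,
                 user_id, use_case, session_id):
    """Build a structured log entry in the shape shown above."""
    price = PRICE_AUD_PER_1K[model]
    cost = (input_tokens / 1000) * price["input"] \
         + (output_tokens / 1000) * price["output"]
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_aud": round(cost, 4),
        "latency_ms": latency_ms,
        "user_id": user_id,
        "use_case": use_case,
        "session_id": session_id,
    }
```

Emit one of these per call to your logging pipeline, then aggregate cost by `user_id` and `use_case` for the per-department view.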
Retrieval quality (for RAG systems):
If you're running a RAG knowledge base, monitor the retrieval step separately from generation:
- Number of chunks retrieved per query
- Relevance scores of retrieved chunks
- Proportion of queries with no relevant chunks found
- Source document distribution (are answers coming from a balanced set of sources or always the same few documents?)
Poor retrieval is the most common cause of poor answers in RAG systems. If your retrieval metrics degrade, you know to look at the indexing pipeline rather than the LLM.
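A minimal sketch of these retrieval metrics, computed over a batch of queries. The input shape and the `top_source_share` name are assumptions for illustration.

```python
from collections import Counter

def retrieval_metrics(query_results):
    """query_results: one list of (source_doc, relevance_score) chunks per query."""
    total_queries = len(query_results)
    chunks = [c for result in query_results for c in result]
    sources = Counter(source for source, _ in chunks)
    return {
        "avg_chunks_per_query": len(chunks) / total_queries,
        "no_result_rate": sum(1 for r in query_results if not r) / total_queries,
        "avg_relevance": sum(score for _, score in chunks) / len(chunks) if chunks else 0.0,
        # Near 1.0 means answers are dominated by a single document
        "top_source_share": max(sources.values()) / len(chunks) if chunks else 0.0,
    }
```

Track these daily; a rising `no_result_rate` or falling `avg_relevance` points at the indexing pipeline before answer quality visibly drops.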
Conversation and interaction tracking:
Log every interaction (with appropriate privacy controls):
- User input
- Retrieved context (or a hash/summary for privacy)
- Model output
- Any tool calls or function calls made
- User feedback (if collected)
This data is essential for debugging issues and improving the system over time.
Layer 3 - Quality Monitoring
This is the layer most teams skip, and it's the most important for AI systems.
Automated quality evaluation:
Set up periodic evaluation runs against a benchmark dataset. This is your canary in the coal mine.
- Maintain a set of 50-200 question-answer pairs that represent your key use cases
- Run these through your system daily or weekly
- Compare outputs against expected answers
- Track accuracy over time
If accuracy drops from 92% to 85% over a month, you catch it before users notice.
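The daily run can be as simple as the following sketch. It uses exact match after normalisation, which suits short factual answers, and assumes `answer_fn` wraps a call to your deployed system.

```python
def normalise(text):
    """Lowercase and collapse whitespace so trivial differences don't fail a case."""
    return " ".join(text.lower().split())

def benchmark_accuracy(cases, answer_fn):
    """cases: list of (question, expected_answer) pairs.
    answer_fn: calls your deployed system and returns its answer.
    Exact match works for short factual answers; use an LLM judge for free-form text."""
    correct = sum(1 for question, expected in cases
                  if normalise(answer_fn(question)) == normalise(expected))
    return correct / len(cases)
```

Log the returned accuracy with a timestamp each run, and alert on the trend rather than any single day's number.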
LLM-as-judge evaluation:
For generated text where exact matching doesn't work, use a second LLM to evaluate quality. Feed it the question, context, expected answer, and generated answer, then ask it to rate accuracy, completeness, and relevance.
```python
evaluation_prompt = """
Rate the following AI-generated answer on a scale of 1-5 for:
1. Accuracy - Is the information factually correct based on the context?
2. Completeness - Does it fully answer the question?
3. Relevance - Does it stay on topic?

Question: {question}
Context provided: {context}
Expected answer: {expected}
Generated answer: {generated}

Provide ratings and brief justification for each.
"""
```
This isn't perfect, but it catches significant quality drops reliably.
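To make the judge's output machine-readable, one option is to also instruct it to emit one `Metric: N` line per criterion and parse those lines. A sketch; the exact reply format is an assumption you would enforce in the judge prompt.

```python
import re

METRICS = ("Accuracy", "Completeness", "Relevance")

def parse_judge_ratings(judge_reply):
    """Pull 1-5 ratings out of a judge reply containing lines such as
    'Accuracy: 4'. Returns None when any rating is missing or out of range,
    so the evaluation run can retry or flag the case."""
    ratings = {}
    for metric in METRICS:
        match = re.search(rf"{metric}\s*[:\-]\s*([1-5])\b", judge_reply, re.IGNORECASE)
        if match is None:
            return None
        ratings[metric.lower()] = int(match.group(1))
    return ratings
```

Returning None for malformed replies (rather than guessing) keeps a flaky judge from silently corrupting your quality trend.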
Hallucination detection:
Monitor for answers that aren't grounded in the provided context. Techniques include:
- Comparing generated claims against source documents
- Tracking the proportion of answers where the model says "I don't have enough information" (a sudden drop might mean it's hallucinating instead of admitting uncertainty)
- Flagging answers that contain specific entities or numbers not present in the retrieved context
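The third technique, flagging ungrounded numbers, can be sketched in a few lines. It is a cheap signal for a review queue, not a definitive hallucination test.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def ungrounded_numbers(answer, context):
    """Numbers present in the generated answer but absent from the retrieved
    context - a cheap review-queue signal, not proof of hallucination."""
    grounded = set(NUMBER.findall(context))
    return [n for n in NUMBER.findall(answer) if n not in grounded]
```

Extending the same idea to named entities needs an NER pass, but the number check alone catches a surprising share of fabricated figures.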
User feedback loops:
The simplest and most valuable quality signal. Add thumbs up/down buttons to your AI interface. Track:
- Overall satisfaction rate
- Satisfaction by query type
- Satisfaction trends over time
- Specific queries that receive negative feedback
A satisfaction rate dropping from 85% to 70% over a month is a clear signal something has changed.
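A sketch of that trend check, assuming feedback is batched by week and using a hypothetical 10-point drop threshold:

```python
def satisfaction_rate(feedback):
    """feedback: list of booleans, True = thumbs up."""
    return sum(feedback) / len(feedback)

def satisfaction_trend(weekly_feedback, drop_threshold=0.10):
    """Alert when the latest week's rate falls more than `drop_threshold`
    below the average of the preceding weeks."""
    rates = [satisfaction_rate(week) for week in weekly_feedback]
    baseline = sum(rates[:-1]) / len(rates[:-1])
    return {"rates": rates, "alert": baseline - rates[-1] > drop_threshold}
```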
Layer 4 - Data and Model Drift Detection
Input distribution monitoring:
Track statistical properties of incoming data:
- Average input length (sudden changes suggest different use patterns)
- Topic distribution (are people asking about new subjects?)
- Language complexity
- Volume patterns (time of day, day of week)
Significant changes in input patterns should trigger investigation. The model may not handle the new patterns well.
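One common way to quantify such shifts is the Population Stability Index. A minimal sketch over, say, input lengths; the bucket count and the 0.2 rule of thumb are conventional choices, not requirements.

```python
import math

def psi(expected, observed, buckets=10):
    """Population Stability Index between a baseline sample and a recent one
    (e.g. input lengths last month vs this week). Rule of thumb: > 0.2 is a
    significant shift worth investigating."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / buckets or 1  # guard against a constant baseline

    def shares(sample):
        counts = [0] * buckets
        for x in sample:
            idx = min(int((x - lo) / width), buckets - 1)
            counts[max(idx, 0)] += 1  # clamp values below the baseline range
        # Laplace smoothing so empty buckets don't divide by zero
        return [(c + 1) / (len(sample) + buckets) for c in counts]

    return sum((o - e) * math.log(o / e)
               for e, o in zip(shares(expected), shares(observed)))
```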
Output distribution monitoring:
Similarly, track outputs:
- Average response length
- Confidence score distribution
- Classification distribution (if applicable)
- Proportion of "I don't know" responses
If your classification model suddenly starts putting 40% of inputs into one category when it used to be 15%, something has changed - either the data or the model's performance.
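That kind of shift can be caught by comparing category shares between a baseline window and the current window. A sketch; the count-dict input shape is an assumption.

```python
def category_shift(baseline_counts, current_counts):
    """Return (largest_share_change, category) between two {category: count}
    dicts, e.g. last month's classification counts vs this week's."""
    categories = set(baseline_counts) | set(current_counts)
    base_total = sum(baseline_counts.values())
    curr_total = sum(current_counts.values())
    return max(
        (abs(current_counts.get(c, 0) / curr_total
             - baseline_counts.get(c, 0) / base_total), c)
        for c in categories
    )
```

With the 15%-to-40% example above, this returns a 0.25 share change for that category, well past any sensible alert threshold.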
Feature drift for ML models:
If you're running traditional ML models alongside LLMs (which is common for classification and scoring tasks), monitor feature distributions:
- Mean and standard deviation of numerical features
- Category distribution of categorical features
- Missing value rates
Tools like Azure Machine Learning have built-in data drift monitoring that alerts you when distributions shift significantly.
Building Your Monitoring Dashboard
A good AI monitoring dashboard has three views.
Executive view:
- System availability
- Total queries processed
- User satisfaction rate
- Monthly cost
- Quality score trend
Operations view:
- Real-time request volume and latency
- Error rates by type
- Token usage and cost (hourly/daily)
- Active alerts
- Queue depths
Quality view:
- Accuracy trends from automated evaluation
- User feedback rates and trends
- Hallucination rate
- Retrieval quality metrics
- Drift indicators
We typically build these in Azure Dashboards or Grafana, pulling from Application Insights, custom metrics, and evaluation pipelines.
Recommended Tools
Here's our current preferred stack for AI monitoring.
Infrastructure monitoring: Azure Monitor + Application Insights. Native integration with Azure services, good alerting, reasonable cost.
LLM observability: LangSmith (from LangChain), Helicone, or Azure AI Studio tracing. These tools are purpose-built for tracking LLM interactions - token usage, latency, prompt/completion logging, and cost tracking.
Evaluation: Custom evaluation pipelines using Azure Functions on a schedule. We run benchmark evaluations daily and full evaluation suites weekly.
Drift detection: Azure Machine Learning for ML model drift. Custom statistical monitoring for LLM input/output distributions.
Dashboards: Grafana or Azure Dashboards. The choice usually depends on what your ops team already uses.
Alerting: Azure Monitor alerts for infrastructure. Custom alerts via Azure Functions for quality metrics. PagerDuty or Opsgenie for on-call routing.
Implementation Approach
Don't try to build everything at once. Here's the order we recommend.
Before go-live:
- Infrastructure monitoring (Layer 1)
- Token usage and cost tracking (Layer 2)
- Basic interaction logging (Layer 2)
First month in production:
- User feedback collection (Layer 3)
- Automated evaluation pipeline (Layer 3)
- Cost alerting
First quarter:
- Hallucination detection (Layer 3)
- Input/output distribution monitoring (Layer 4)
- Quality dashboard
Ongoing:
- Expand evaluation datasets based on real user queries
- Tune alert thresholds based on observed patterns
- Add new metrics as you learn what matters for your system
Common Mistakes
Monitoring only infrastructure. Your system can have 100% uptime and fast response times while giving completely wrong answers. Infrastructure monitoring is necessary but not sufficient.
Not logging interactions. You can't improve what you don't measure. Log every interaction (with appropriate privacy controls) from day one. You'll need this data for debugging, evaluation, and fine-tuning.
Setting alert thresholds too loose. By the time a 10% accuracy drop triggers an alert, hundreds of users have received bad answers. Set tight thresholds and investigate early.
Not budgeting for monitoring. AI monitoring typically costs 10-15% of your AI infrastructure spend. Build this into the project budget from the start.
Treating monitoring as a one-time setup. Your monitoring needs to evolve as your system evolves. New features, new data sources, and new use cases all need new monitoring.
Getting Started
If you're running AI in production without proper monitoring, the best time to fix that was before launch. The second best time is now.
Start with interaction logging and cost tracking - these take a day or two to set up and immediately provide value. Then build out automated evaluation and quality monitoring over the following weeks.
Need help setting up monitoring for your AI systems? Get in touch with our team. We help Australian organisations build and operate AI systems that stay reliable in production. Learn more about our AI consulting services and AI agent development capabilities.