Running Azure AI Services on Kubernetes - A Practical Guide to AKS for AI Workloads
Most enterprise AI conversations start with "which model should we use?" and end with "where does this actually run?" That second question is where things get interesting, and where Azure Kubernetes Service (AKS) has become one of the most important pieces of the Microsoft AI stack.
We've been deploying AI workloads on AKS for Australian enterprises for a while now, and the pattern keeps repeating. An organisation starts with Azure OpenAI in the cloud, things work well, then someone from compliance or security asks hard questions about data residency, network isolation, or inference latency. That's when Kubernetes enters the picture.
Why Kubernetes for AI Workloads?
The short answer: control. AKS gives you fine-grained control over where your data is processed, how your models are served, and what your infrastructure looks like. For a lot of Australian organisations - especially those in regulated industries - that control isn't optional.
Azure AI Services provides Docker containers that let you run the same AI capabilities you get from the cloud APIs on your own infrastructure. Speech to text, OCR, document intelligence, language understanding - all of it can run as containers inside your AKS cluster. The models are the same ones Microsoft runs in the cloud. You're just choosing where they execute.
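To make that concrete, here's roughly what running one of these containers looks like. The image tag, billing endpoint, and key below are placeholders - check the Azure AI containers documentation for the current image path for your service:

```shell
# Illustrative only: run an Azure AI Vision Read (OCR) container locally.
# Substitute your own resource endpoint and key.
docker run --rm -p 5000:5000 \
  mcr.microsoft.com/azure-cognitive-services/vision/read:3.2 \
  Eula=accept \
  Billing="https://<your-resource>.cognitiveservices.azure.com/" \
  ApiKey="<your-api-key>"
```

The Billing and ApiKey parameters exist for usage metering only - the documents you send to the container are processed locally.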
This matters for a few specific reasons:
Data never leaves your network. If you're processing medical records, legal documents, or classified information, some compliance frameworks require that data stays within your own boundary. Running AI containers on AKS means your data goes from your application to your cluster and back. It never hits an external endpoint.
Latency you can predict. When you're building real-time applications - think manufacturing quality inspection or live speech transcription - round-tripping to a cloud API adds latency you might not be able to afford. Local inference on AKS keeps response times tight and consistent.
Cost at scale. If you're processing millions of documents or hours of audio, the per-call pricing of cloud APIs can add up fast. Running containers on reserved AKS infrastructure with GPU nodes can work out significantly cheaper at high volume.
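A rough way to think about the break-even point, with placeholder prices rather than Azure list pricing:

```python
# Back-of-envelope break-even between per-call cloud pricing and a
# reserved GPU node. All prices here are illustrative placeholders -
# substitute your own rates.

def breakeven_calls_per_month(price_per_call: float,
                              gpu_node_cost_per_month: float) -> float:
    """Monthly call volume at which a dedicated GPU node costs the same
    as paying per call."""
    return gpu_node_cost_per_month / price_per_call

# Example: a hypothetical $0.002 per call vs a hypothetical
# $2,500/month GPU node - the node wins above 1.25M calls/month
# on raw unit cost (ignoring ops overhead, the real hidden cost).
calls = breakeven_calls_per_month(0.002, 2500)
```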
Microsoft's tutorial on preparing AI applications for AKS walks through the container setup, but I want to focus on the bigger picture - what works, what's tricky, and what we've learned deploying this in practice.
KAITO - This Is the Interesting Part
The Kubernetes AI Toolchain Operator (KAITO) is where AKS for AI really starts to shine. If you've ever tried to deploy an open-source LLM on Kubernetes manually, you know how painful it can be. GPU scheduling, model downloading, serving infrastructure, memory management - it's a full-time job just getting the thing running, let alone running well.
KAITO abstracts most of that away. You describe what you want - which model, how many replicas, what kind of GPU - and KAITO handles the provisioning. It integrates with vLLM, a high-throughput inference engine, which means you get proper batching, paged attention, and all the performance optimisations that matter when you're serving models at scale.
Here's what a KAITO workspace definition looks like for deploying a model:
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: llama-workspace
resource:
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: llama
inference:
  preset:
    name: "llama-3-8b-instruct"
That's it. KAITO provisions the GPU node, downloads the model, sets up the serving infrastructure, and gives you an endpoint. Compare that to the dozens of Kubernetes manifests, Dockerfiles, and configuration scripts you'd need to do it manually.
The AI Toolchain Operator add-on for AKS makes KAITO a first-class citizen in Azure. You enable it on your cluster, and you get access to a growing catalog of supported models. We've deployed Llama, Mistral, and Phi models through KAITO, and the experience is significantly better than rolling your own serving stack.
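Enabling the add-on is a single CLI step - something like the following, though exact flag availability varies by CLI version, and the add-on also expects the cluster's OIDC issuer to be enabled:

```shell
# Illustrative - verify flags against your az CLI version.
az aks update \
  --resource-group my-rg \
  --name my-cluster \
  --enable-oidc-issuer \
  --enable-ai-toolchain-operator
```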
That said, I should be honest about where KAITO is right now. The preset model list is growing but still limited compared to what you can run manually. If you need a specific model variant or a custom fine-tuned model, you might still need to do some manual work. The tooling is also still maturing - we've hit edge cases around GPU memory allocation and node scaling that required workarounds. It's getting better with every release, but don't expect it to handle every scenario without some hands-on tuning.
Connecting AKS to Azure OpenAI
Not every workload needs to run locally. A common pattern we see is a hybrid setup: some AI processing happens on the cluster (for latency-sensitive or compliance-constrained workloads), while other processing calls out to Azure OpenAI for tasks where cloud inference is fine.
AKS supports this cleanly through Service Connector and Workload Identity. The setup looks like this:
- Enable Workload Identity on your AKS cluster
- Create a managed identity and federate it with your Kubernetes service account
- Use Service Connector to bind your Azure OpenAI resource to your application
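On the Kubernetes side, the federation shows up as an annotated service account. A minimal sketch, with placeholder values:

```yaml
# Sketch of the Kubernetes side of Workload Identity. The client-id is
# the managed identity you federated - substitute your own values.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: openai-caller
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
---
# Pods opt in by referencing the service account and setting the label:
#   metadata.labels:
#     azure.workload.identity/use: "true"
#   spec.serviceAccountName: openai-caller
```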
The result is that your pods can call Azure OpenAI without managing any secrets or connection strings. Authentication flows through Entra ID, which means your existing RBAC policies apply and you get a full audit trail of which workload called which model.
env:
- name: AZURE_OPENAI_ENDPOINT
  valueFrom:
    secretKeyRef:
      name: azure-openai-binding
      key: endpoint
This is a much cleaner pattern than embedding API keys in your pod specs or mounting secrets from Key Vault (though Key Vault still has its place for other credentials). Workload Identity is the right way to do this in 2026.
Building Intelligent Apps on AKS
Where this gets practical is when you combine AI containers with your existing application workloads. One of the projects we delivered recently was a document processing pipeline for a financial services client. The architecture looked like this:
- Application pods receive document uploads via an internal API
- Azure AI Vision containers (running on AKS) handle OCR extraction
- A custom classification service (also on AKS) categorises the documents
- Results get written to a Cosmos DB for downstream processing
- Azure OpenAI (via Service Connector) handles summarisation for documents that don't contain restricted data
Everything runs in the same cluster. The compliance team was satisfied because restricted documents never left the cluster boundary. The development team was happy because they could deploy and update everything through their existing CI/CD pipelines. And the business was happy because document processing time dropped from minutes to seconds.
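The key design decision in that pipeline was the routing rule that keeps restricted documents in-cluster. A minimal sketch of that logic, with illustrative names rather than real client code:

```python
# Sketch of the routing decision from the pipeline above: restricted
# documents stay on in-cluster services; everything else may use the
# cloud summarisation path. Names here are illustrative.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    category: str
    restricted: bool

def summarisation_target(doc: Document) -> str:
    """Decide where a document's summarisation runs."""
    if doc.restricted:
        # Compliance rule: restricted data never leaves the cluster.
        return "in-cluster"
    return "azure-openai"
```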
AKS being certified for Kubernetes AI Conformance matters here. It means the GPU scheduling, device plugins, and resource management features work as expected. We haven't had to fight the platform to get AI workloads running properly, which wasn't always the case a couple of years ago.
What to Watch Out For
I'd be doing you a disservice if I didn't mention the rough edges.
GPU node costs are real. A100 and H100 instances on Azure aren't cheap, and if you're not careful with node pool scaling, you'll be paying for expensive GPU nodes sitting idle. Set up cluster autoscaler with appropriate min/max counts, and consider using spot instances for development and testing workloads. We typically configure production GPU node pools with a minimum of zero and let KAITO's node provisioner handle scaling based on workspace demand.
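As a sketch, that node pool setup looks something like this - resource names are placeholders, and flags should be verified against your az CLI version:

```shell
# Production GPU pool: scale-to-zero via cluster autoscaler.
az aks nodepool add \
  --resource-group my-rg --cluster-name my-cluster \
  --name gpuprod --node-vm-size Standard_NC24ads_A100_v4 \
  --enable-cluster-autoscaler --min-count 0 --max-count 2

# Dev/test GPU pool on spot capacity to cut idle cost.
az aks nodepool add \
  --resource-group my-rg --cluster-name my-cluster \
  --name gpuspot --node-vm-size Standard_NC24ads_A100_v4 \
  --priority Spot --eviction-policy Delete --spot-max-price -1 \
  --enable-cluster-autoscaler --min-count 0 --max-count 2
```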
Model updates require planning. When you're running models locally, you don't get automatic updates. New model versions mean new container images, new deployments, and potentially new GPU requirements. Build this into your operational runbooks from day one. We schedule quarterly model reviews with clients to evaluate whether newer versions are worth the upgrade effort.
Networking can get complicated. If you're running AI containers in a private AKS cluster with no public egress, even pulling the container images requires careful network configuration. Private endpoints for Azure Container Registry, DNS configuration for private link zones, and firewall rules for initial model downloads all need to be sorted out before you can deploy anything. Get your network architecture right first.
Monitoring is different. Traditional application monitoring doesn't capture what matters for AI workloads. You need to track inference latency (p50, p95, p99), GPU utilisation, model throughput (tokens per second), and queue depth. We set up custom Prometheus metrics and Grafana dashboards for every AI workload deployment. Azure Monitor for containers helps, but you'll likely need to supplement it.
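If you're computing percentiles yourself before the Prometheus histograms are wired up, the nearest-rank method is enough for a first pass. A minimal sketch with stand-in data:

```python
# Nearest-rank percentiles over raw latency samples. In production you'd
# emit these as Prometheus histogram metrics rather than compute by hand.
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0-100) of a list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies_ms = [float(x) for x in range(1, 101)]  # stand-in data
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
```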
When AKS Is and Isn't the Right Choice
AKS for AI workloads makes sense when you have genuine requirements around data residency, low latency, or cost optimisation at scale. It also makes sense when you're already running application workloads on AKS and want to co-locate your AI inference for simplicity.
It doesn't make sense if you're just getting started with AI. If you're still figuring out which models work for your use case, start with Azure OpenAI or Azure AI Foundry in the cloud. The iteration speed is much faster when you don't have to manage infrastructure. Once you've validated your approach and understand your production requirements, then evaluate whether AKS deployment is worth the operational overhead.
And it's probably overkill if you have a single, straightforward AI use case. Running one document processing pipeline doesn't justify a full AKS cluster with GPU nodes. The break-even point, in our experience, comes when you're running multiple AI workloads or when compliance requirements genuinely prevent cloud processing.
Getting This Right
The gap between "I deployed a model on Kubernetes" and "I have a production AI platform on AKS" is significant. It's the difference between a weekend project and a system that handles real business workloads reliably.
We help Australian organisations design and deploy AI workloads on Azure infrastructure through our Azure AI consulting practice. Whether you're evaluating AKS for compliance reasons, looking at KAITO for model serving, or trying to figure out the right architecture for a hybrid cloud-and-edge AI deployment - we've done it before and can help you avoid the pitfalls.
If you're running .NET applications and want to integrate AI capabilities into your existing Kubernetes workloads, our .NET consulting team works closely with the AI practice to make sure the application layer and the AI layer fit together properly.
The tooling for running AI on Kubernetes has improved dramatically over the past year. KAITO in particular has taken what used to be a week-long infrastructure project and compressed it into something you can stand up in an afternoon. But "can deploy" and "should deploy" are different questions, and the architecture decisions you make early on will determine whether your AI platform scales with your business or becomes another piece of infrastructure to maintain.
Get in touch if you want to talk through whether AKS is the right fit for your AI workloads. We'll give you an honest assessment, not a sales pitch.