Azure AI Foundry Pricing and Cost Management Tips
Almost every Azure AI Foundry project we run starts with the same conversation. The client has built a working prototype, the finance team has seen the first monthly bill, and somebody senior is asking why a "small AI experiment" cost $14,000 last month.
The reason is almost never that Foundry is overpriced. It is that Foundry has half a dozen separate pricing meters, most of which are off by default, and a single configuration choice can change your monthly cost by an order of magnitude. This piece walks through what you actually pay for, where the surprises hide, and the cost control patterns we use on Australian engagements.
I will keep this opinionated. Some of the official guidance is technically correct but practically useless. I will tell you what we actually do.
What you are actually paying for in Foundry
Azure AI Foundry is a workspace that sits on top of several Azure services. When you "use Foundry", you are usually paying for some combination of these:
- Model inference - the actual calls to GPT-4o, GPT-5, Claude, Llama, Mistral and so on
- Compute - if you fine-tune, run open source models, or host custom endpoints
- Storage - your training data, evaluation datasets, indexed documents
- Azure AI Search - if you are doing RAG (and you almost certainly are)
- Networking - private endpoints, data egress, hybrid connectivity
- Supporting services - Application Insights, Key Vault, Container Registry
Most prototypes only touch the first one or two. Most production systems touch all six. The jump from prototype bill to production bill is usually 5x to 15x, and that catches people out.
Model inference pricing - the part you have probably already seen
Foundry exposes a catalogue of models. Pricing is per million input and output tokens, billed in USD but appearing in AUD on your Australian Azure invoice. As of April 2026, the rough rate card looks like this (always check the live pricing page, because models reprice every few months):
| Model | Input (AUD per 1M tokens, approx.) | Output (AUD per 1M tokens, approx.) |
|---|---|---|
| GPT-5 | $4.50 | $18.00 |
| GPT-4o | $3.80 | $15.50 |
| GPT-4o mini | $0.25 | $1.00 |
| GPT-4.1 | $3.00 | $12.00 |
| GPT-4.1 mini | $0.18 | $0.72 |
| o4-mini | $2.00 | $8.00 |
| Claude 4.6 Sonnet (Foundry) | $4.80 | $24.00 |
| Llama 3.3 70B (Serverless) | $1.10 | $1.65 |
| Phi-4 (Serverless) | $0.40 | $0.80 |
Take these as ballpark figures, not exact. Currency conversion, regional pricing differences (Sydney vs East US) and Microsoft's quarterly adjustments will all move them.
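The thing nobody tells you: on the OpenAI and Anthropic models above, output tokens cost four to five times as much as input tokens (the serverless open models in the table are closer to 1.5x to 2x). This single fact drives most of the cost optimisation work we do.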
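To make the input/output gap concrete, here is a minimal sketch that prices a single call from the approximate AUD figures in the table above. The model keys are illustrative labels rather than real deployment names, and the rates are the ballpark numbers from this article, not live pricing.

```python
# Approximate AUD prices per 1M tokens (input, output), copied from the table above.
# The dictionary keys are illustrative labels, not deployment names.
PRICES_AUD_PER_1M = {
    "gpt-5": (4.50, 18.00),
    "gpt-4o": (3.80, 15.50),
    "gpt-4o-mini": (0.25, 1.00),
    "llama-3.3-70b": (1.10, 1.65),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the approximate AUD cost of one call."""
    input_rate, output_rate = PRICES_AUD_PER_1M[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def output_to_input_ratio(model: str) -> float:
    """Return how many times more an output token costs than an input token."""
    input_rate, output_rate = PRICES_AUD_PER_1M[model]
    return output_rate / input_rate
```

With these figures, 1,000 input tokens plus 1,000 output tokens on GPT-4o comes to roughly $0.019, and about 80 percent of that is the output side.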
The cost surprises (in order of how often they bite people)
1. The system prompt is sent every single call.
If your assistant has a 3,000 token system prompt and a 200 token user query, you are paying for 3,200 input tokens every call, not 200. Multiply by call volume and watch the bill move.
Fix: prompt caching. GPT-4o, GPT-4.1 and GPT-5 all support cached prompt prefixes at roughly 50 percent of the standard input rate. If you are not using it, you are paying double on every long system prompt. Our Azure AI Foundry consultants page goes into the caching patterns we use in production.
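A rough sketch of the arithmetic, using the 3,000 token system prompt and 200 token query from the example above with the approximate GPT-4o input rate. The call volume is a made-up figure, and the sketch assumes every call hits the cache, which overstates the saving slightly because the first call and any call after cache expiry pay the full rate.

```python
GPT4O_INPUT_AUD_PER_1M = 3.80  # approximate figure from the pricing table
CACHED_DISCOUNT = 0.50         # "roughly 50 percent" of the standard input rate


def monthly_input_cost(calls: int, system_tokens: int, user_tokens: int, cached: bool) -> float:
    """Approximate monthly AUD input cost; assumes every call hits the cache when cached=True."""
    rate = GPT4O_INPUT_AUD_PER_1M / 1_000_000
    system_rate = rate * CACHED_DISCOUNT if cached else rate
    return calls * (system_tokens * system_rate + user_tokens * rate)


# Hypothetical volume: 100,000 calls per month.
uncached = monthly_input_cost(100_000, 3_000, 200, cached=False)
with_cache = monthly_input_cost(100_000, 3_000, 200, cached=True)
```

At that hypothetical volume the input bill drops from about $1,216 to about $646 per month, and the gap widens as the system prompt grows.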
2. Embeddings on every search query.
If you are running RAG, you are likely calling an embedding model every time a user searches. text-embedding-3-large costs around $0.20 per million tokens, which sounds trivial until you realise some teams accidentally re-embed the entire knowledge base on every deployment.
Fix: cache embeddings aggressively. Re-embed only on document change. Use text-embedding-3-small unless you have measured that text-embedding-3-large actually improves your retrieval.
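One way to get "re-embed only on document change" is to key the cache on a hash of the document text. The sketch below is an in-memory illustration only: `EmbeddingCache` and `embed_fn` are hypothetical names, and a real deployment would back the store with something persistent so it survives redeployments.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class EmbeddingCache:
    """Cache embeddings by content hash so unchanged text is never re-embedded."""

    def __init__(self, embed_fn: Callable[[str], Vector]) -> None:
        self._embed_fn = embed_fn            # hypothetical wrapper around your embedding call
        self._store: Dict[str, Vector] = {}  # swap for a persistent store in production
        self.misses = 0                      # number of paid embedding calls

    def get(self, text: str) -> Vector:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Because the key is derived from the content, editing a document changes its hash and triggers exactly one new embedding call, while a redeployment over unchanged documents triggers none.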
3. Azure AI Search can cost more than the model.
For a typical enterprise RAG system, AI Search ends up being 40 to 60 percent of the total monthly bill. An S1 tier costs around $380 per month per search unit in Sydney, where the number of search units is replicas multiplied by partitions. A small production setup with two replicas for redundancy starts at $760. Larger systems easily hit $3,000 to $8,000 per month just for Search.
Fix: choose the right tier for your actual data volume. We see clients on S2 or S3 when they would be fine on S1. Also: vector compression. Use scalar or binary quantisation if you can tolerate a small accuracy hit. It can cut your index size (and cost) by 4x or more.
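The sketch below shows the two levers in numbers: the replica and partition count drives the Search bill, and the storage type drives raw vector size. It uses the approximate S1 figure from the text. The byte counts cover raw vector data only and ignore index overhead and any original vectors kept for rescoring, so treat them as an upper bound on the compression benefit.

```python
S1_AUD_PER_UNIT_MONTH = 380  # approximate Sydney figure quoted above

# Bits stored per vector dimension for each storage type.
BITS_PER_DIMENSION = {"float32": 32, "int8": 8, "binary": 1}


def search_monthly_cost(replicas: int, partitions: int) -> int:
    """Approximate monthly AUD cost: billed units are replicas x partitions."""
    return replicas * partitions * S1_AUD_PER_UNIT_MONTH


def vector_bytes(num_vectors: int, dimensions: int, storage: str) -> int:
    """Raw vector payload in bytes, ignoring index overhead."""
    return num_vectors * dimensions * BITS_PER_DIMENSION[storage] // 8
```

For example, `search_monthly_cost(2, 1)` gives the $760 figure above, and scalar quantisation to `int8` cuts raw vector storage to a quarter of the `float32` size for any vector count and dimension.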
4. Provisioned Throughput Units (PTU) traps.
PTUs give you reserved capacity at a fixed monthly cost. They are the right answer for predictable high-volume workloads. They are the wrong answer for almost everything else.
A single PTU for GPT-4o is roughly $3,300 per month in AUD. Minimum purchase is usually 25 or 50 PTUs depending on the model. That is $82,500 to $165,000 per month committed. We have walked into engagements where someone bought PTUs to "save money" on a workload that was doing 500 calls per day. Their cost went up by 50x.
Fix: PTUs make sense when you need guaranteed latency under sustained load, or when your monthly token spend exceeds the PTU minimum on standard pricing. Otherwise stay on pay-as-you-go.
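A quick sanity check before committing is to compare the minimum PTU commitment with what the workload costs on pay-as-you-go today. This sketch uses the approximate per-PTU figure from the text and covers only the cost side of the decision, not the latency guarantee.

```python
PTU_AUD_PER_MONTH = 3_300  # approximate per-PTU GPT-4o figure quoted above


def ptu_commitment(min_ptus: int) -> int:
    """Monthly AUD commitment for the minimum PTU purchase."""
    return min_ptus * PTU_AUD_PER_MONTH


def ptu_cost_multiple(payg_monthly_spend: float, min_ptus: int) -> float:
    """How many times the current pay-as-you-go spend the PTU minimum would cost."""
    return ptu_commitment(min_ptus) / payg_monthly_spend
```

A multiple well above 1 means PTUs cost more than the workload does today. As a hypothetical, a workload spending $1,650 a month on pay-as-you-go would see a 50x increase at a 25 PTU minimum.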
5. Fine-tuning costs.
Fine-tuning a model in Foundry has three cost components: the training itself (priced per training token, usually $30 to $80 per training run for small datasets), the hosted endpoint (priced per hour the deployment exists, often $4 to $8 per hour AUD), and the inference (slightly higher per-token rates than base models).
The hosted endpoint is the one that catches people. We have seen $1,200 monthly bills for a fine-tuned model that gets called twice a week, because nobody realised the endpoint bills 24/7 once deployed.
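Dividing the hosting bill by actual traffic makes the problem obvious. A minimal sketch, using the $1,200 bill and twice-a-week usage from the example above; it covers hosting only, not training or per-token inference charges.

```python
WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a month


def cost_per_call(monthly_hosting_bill: float, calls_per_week: float) -> float:
    """Effective AUD hosting cost per call for an always-on endpoint."""
    return monthly_hosting_bill / (calls_per_week * WEEKS_PER_MONTH)
```

On those numbers each call carries roughly $138 of hosting cost before a single token is billed.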
Fix: only fine-tune when prompt engineering and RAG have provably hit their limit. Most of the "we need to fine-tune" requests we see are solvable with better retrieval or a better system prompt.
The cost control patterns we use in production
Here is what we actually do, in order of impact.
Routing - small model for the easy stuff.
We rarely send every query to GPT-4o or GPT-5. A small classifier (often GPT-4o mini or Phi-4) routes queries based on complexity. Maybe 60 to 80 percent of queries can be handled by the mini model at a tenth of the cost. The expensive model only gets called when needed.
This single pattern typically cuts model spend by 50 to 70 percent. It is the highest-leverage optimisation we make.
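As a sketch of the idea, assuming the small model costs about a tenth of the large one: `pick_model` chooses a deployment from a complexity check, and `routed_spend_fraction` estimates the resulting spend. The `is_complex` callable is a hypothetical stand-in for the small classifier model, the model names are illustrative labels, and the estimate ignores the cost of the classification call itself.

```python
from typing import Callable

SMALL_MODEL = "gpt-4o-mini"  # illustrative labels, not real deployment names
LARGE_MODEL = "gpt-4o"


def pick_model(query: str, is_complex: Callable[[str], bool]) -> str:
    """Send a query to the large model only when the classifier flags it as complex."""
    return LARGE_MODEL if is_complex(query) else SMALL_MODEL


def routed_spend_fraction(share_to_small: float, small_cost_ratio: float = 0.1) -> float:
    """Model spend after routing, as a fraction of sending everything to the large model."""
    return share_to_small * small_cost_ratio + (1 - share_to_small)


# Toy heuristic for illustration only: treat long queries as complex.
def long_query(query: str) -> bool:
    return len(query.split()) > 20
```

Routing 60 percent of traffic to the small model leaves about 46 percent of the original spend, and routing 80 percent leaves about 28 percent, which is in line with the 50 to 70 percent saving quoted above.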
Output token discipline.
Output is expensive. We constrain max_tokens aggressively, use structured outputs (JSON schemas) to prevent verbose responses, and explicitly tell the model to be concise where appropriate. A 200 token response instead of 800 saves real money at scale.
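A minimal sketch of what that looks like in request form, assuming an OpenAI-style chat completions API. The schema, the `support_answer` name and the `build_request` helper are hypothetical, and parameter names should be checked against the SDK and model version you deploy (some newer reasoning models expect `max_completion_tokens` rather than `max_tokens`).

```python
from typing import Any, Dict, List

# Hypothetical schema for a short, structured support answer.
ANSWER_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "needs_escalation": {"type": "boolean"},
    },
    "required": ["answer", "needs_escalation"],
    "additionalProperties": False,
}


def build_request(
    deployment: str, messages: List[Dict[str, str]], max_tokens: int = 200
) -> Dict[str, Any]:
    """Build keyword arguments for a chat completion with a hard output cap."""
    return {
        "model": deployment,
        "messages": messages,
        "max_tokens": max_tokens,  # hard ceiling on billed output tokens
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "support_answer",
                "schema": ANSWER_SCHEMA,
                "strict": True,
            },
        },
    }
```

The resulting dictionary is intended to be unpacked into the client's chat completion call. At the approximate GPT-4o output rate, trimming a response from 800 to 200 tokens saves about $0.0093 per call.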
Streaming where it matters, batching where it does not.
Streaming improves perceived latency but does not change cost. For background processing (document analysis, batch summarisation, async workflows), use the batch API at 50 percent of the standard rate. We move every async workload we can to batch. The savings are real and the effort is small.
Caching layers.
Three caches we run in most production systems:
- Prompt prefix cache (built into the model, just enable it)
- Semantic cache for repeated queries (use Redis with vector similarity)
- Embedding cache (re-embed only on change)
Done right, you avoid paying for the same work twice.
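The semantic cache is the least familiar of the three, so here is an in-memory sketch of the lookup logic. It is a stand-in for the Redis vector similarity setup mentioned above, not a Redis client: `SemanticCache`, `embed_fn` and the 0.95 threshold are all illustrative assumptions, and the threshold in particular needs tuning against real queries.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored answer when a new query is close enough to an earlier one."""

    def __init__(self, embed_fn: Callable[[str], Vector], threshold: float = 0.95) -> None:
        self._embed_fn = embed_fn  # hypothetical wrapper around your embedding call
        self._threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def store(self, query: str, answer: str) -> None:
        self._entries.append((self._embed_fn(query), answer))

    def lookup(self, query: str) -> Optional[str]:
        query_vector = self._embed_fn(query)
        best_score, best_answer = 0.0, None
        for vector, answer in self._entries:
            score = cosine(query_vector, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._threshold else None
```

A hit returns the stored answer without calling the chat model at all. A miss returns `None`, and the caller then pays for a normal completion and stores the result.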
Cost allocation tags.
Every Foundry resource gets tagged with cost-centre, application and environment. This sounds obvious but most clients have not done it. Without tagging you cannot answer "which app is driving the bill", which means you cannot have a useful cost conversation with finance.
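A tagging audit can be as simple as the sketch below, which assumes you have already exported resource names and their tags into a dictionary. The export step is not shown, and the function names are illustrative.

```python
from typing import Dict, List

REQUIRED_TAGS = ("cost-centre", "application", "environment")


def missing_tags(resource_tags: Dict[str, str]) -> List[str]:
    """Return the required tags that are absent or empty on one resource."""
    return [tag for tag in REQUIRED_TAGS if not resource_tags.get(tag)]


def untagged_resources(resources: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each non-compliant resource name to the tags it is missing."""
    report = {}
    for name, tags in resources.items():
        missing = missing_tags(tags)
        if missing:
            report[name] = missing
    return report
```

Running this against an inventory export gives a short list of resources to fix before the next cost conversation with finance.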
Budgets and alerts on every meter.
Set monthly budgets per resource group with alerts at 50, 80 and 100 percent. Do not rely on the default Azure cost alerts at the subscription level - by the time they fire, you have already overspent.
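The threshold logic itself is simple. This sketch only illustrates which alerts a given spend level would have triggered; the alerts themselves are configured in Azure, not in application code, and the function name is illustrative.

```python
from typing import List

ALERT_THRESHOLDS = (0.5, 0.8, 1.0)  # 50, 80 and 100 percent of budget


def crossed_thresholds(budget: float, spend_to_date: float) -> List[float]:
    """Return every alert threshold the month-to-date spend has reached."""
    used = spend_to_date / budget
    return [threshold for threshold in ALERT_THRESHOLDS if used >= threshold]
```

For a $6,000 resource group budget, $5,000 of spend has crossed the 50 and 80 percent alerts, which still leaves room to act before the budget is gone.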
A real example - what a mid-sized production Foundry deployment costs
Anonymised numbers from a real Australian client. Mid-market business, customer service AI assistant handling around 8,000 conversations per day, RAG over 50,000 internal documents, GPT-4o for the main responses, mini model for routing.
Monthly Azure bill, all components:
- GPT-4o inference (with caching and routing): $4,200
- GPT-4o mini for routing and classification: $380
- text-embedding-3-large (cached, re-embed on change only): $90
- Azure AI Search S1, 2 replicas, 1 partition: $760
- App Service hosting the API: $340
- Application Insights and logging: $180
- Storage and networking: $110
- Key Vault, Container Registry, misc: $60
Total: about $6,120 per month AUD.
Before optimisation (initial production setup), this client was running at $14,800 per month. Same workload. The differences were prompt caching, model routing, smaller AI Search tier with proper compression, and batch processing for the overnight document refresh.
That is a typical optimisation outcome. We aim for a 40 to 60 percent reduction on most engagements without losing functionality.
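For anyone checking the arithmetic, the line items and the reduction work out as follows. All figures are taken from the bill above; the dictionary keys are shortened labels.

```python
LINE_ITEMS_AUD = {
    "GPT-4o inference": 4_200,
    "GPT-4o mini routing": 380,
    "Embeddings": 90,
    "Azure AI Search": 760,
    "App Service": 340,
    "Application Insights and logging": 180,
    "Storage and networking": 110,
    "Key Vault, Container Registry, misc": 60,
}
BEFORE_AUD = 14_800  # monthly bill before optimisation

total = sum(LINE_ITEMS_AUD.values())
reduction = (BEFORE_AUD - total) / BEFORE_AUD

print(f"Total: ${total:,} per month")  # Total: $6,120 per month
print(f"Reduction: {reduction:.1%}")   # Reduction: 58.6%
```

That puts this engagement at the upper end of the 40 to 60 percent range.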
When to engage a consultant on cost
If any of these are true, getting outside help pays back fast:
- Your monthly Foundry bill is over $5,000 and you are not sure where the money is going
- You are considering PTUs (please talk to someone before committing to PTUs)
- You are about to fine-tune (same advice)
- Your bill has doubled month-on-month with no clear cause
- You are designing a new production deployment and want to avoid the common traps
A two to three day cost review typically costs $8,000 to $15,000 in consulting fees and finds savings of $3,000 to $20,000 per month in the cases where there is something to find. We have done a few of these where we did not find anything significant, and we tell the client that. Honest answers are cheaper than billable hours.
Practical cost checklist
Before your next Foundry production deployment, check:
- Prompt caching enabled on every model that supports it
- Routing layer using a small model for low-complexity queries
- max_tokens set on every model call
- AI Search tier sized to actual data volume, not the vendor recommendation
- Vector compression enabled where accuracy allows
- Batch API used for all async workloads
- Cost allocation tags on every resource
- Budget alerts at 50, 80, 100 percent on every resource group
- Embedding cache in place
- No PTUs unless you have validated the workload justifies them
- No hosted fine-tuned endpoints unless they are getting real traffic
If you can tick every box, you are doing better than 90 percent of the deployments we audit.
Where Team 400 fits in
We are an Australian AI consulting firm specialising in Azure-based deployments. Most of our work is with mid-market and enterprise businesses in Sydney, Brisbane and the Sunshine Coast, and remotely across the rest of the country. Our Foundry practice covers everything from initial architecture through to cost optimisation reviews on production deployments.
If you want a sanity check on your current Foundry spend or are planning a deployment and want the cost model done properly upfront, get in touch for a 30 minute call. We will either help you, point you to someone who can, or tell you to leave things alone for another quarter. All three are common outcomes.
For broader context on the Azure AI ecosystem, our Azure AI consulting service page covers the wider engagement model, and the Microsoft AI consultants page sets out how we work across the full Microsoft AI stack.