How to Build a Custom AI Model in Azure AI Foundry
The question I get most from Australian buyers exploring custom AI is some version of this - "We've tried GPT-4o with a good prompt and it gets us to about 78%. We need 95%. Do we fine-tune, do we train from scratch, or do we throw more retrieval at it?"
There is no honest answer that fits in a single sentence. But there is a process. After running custom model projects through Azure AI Foundry for clients in legal, mortgage, manufacturing and government over the last eighteen months, the path from "we want a custom model" to "we have a custom model earning its keep in production" has stopped being mysterious. It has steps, costs, and a couple of points where most projects either succeed or quietly stall.
This post is the version of that conversation I wish I could send people before our first call. It is opinionated. It will save you money if you are about to spend money in the wrong place.
When a custom model is the right answer (and when it is not)
Before any consultant takes your money to build a custom model, the honest question is whether you need one at all. We have walked away from at least three engagements where the client wanted custom training and we told them they did not need it. They were right that GPT-4o was not solving their problem. They were wrong about why.
A custom model is the right answer when:
- The task is repetitive and well-defined (classification, extraction, routing) and you have or can build several hundred labelled examples
- You need consistent output format and tone that prompting cannot reliably produce
- You have proprietary patterns the base model cannot infer from context (specific scoring rubrics, internal taxonomies, industry shorthand)
- Inference cost matters and you want a smaller, cheaper model behaving like a larger one
- Latency requirements rule out the largest frontier models
A custom model is the wrong answer when:
- The model needs new factual knowledge (use retrieval-augmented generation instead)
- You have fewer than 100 labelled examples and no realistic way to create more
- Your accuracy problem is actually a data quality problem upstream of the model
- The task is broad and creative rather than narrow and repeatable
- The base model genuinely cannot do the task at all (fine-tuning will not teach it to)
The hardest of these to accept is the fourth one. We had a financial services client convinced that a fine-tuned model would solve their "AI doesn't sound like our brand voice" problem. The actual problem was that no two people inside their business agreed on what the brand voice was. We spent three weeks on a style guide instead of three months on training data. The prompt-engineered solution shipped in six weeks.
The four custom-model paths in Azure AI Foundry
Foundry gives you four practical paths to a custom model. Picking the wrong one is one of the more expensive mistakes you can make this year.
Path 1 - Supervised fine-tuning of a hosted model
This is the default for most buyers. You pick a base model (GPT-4o, GPT-4o mini, GPT-4.1 mini, Llama 3.1 8B, Mistral 7B, or one of the Phi-4 family), upload a few hundred to a few thousand input/output examples, and Foundry handles the training. You get back a deployable endpoint that behaves like a tuned version of the base model.
This is the right path when you have between 200 and 5,000 good examples and your task fits a chat or completion shape. We use this for classification, extraction, structured generation, and tone matching. Most of our Foundry custom-model work this year has been here.
Path 2 - Distillation
You take a frontier model (GPT-4o, Claude Opus, GPT-4.1) and use it to generate high-quality outputs for a much larger set of inputs. Then you fine-tune a smaller, cheaper model on those outputs. The result is a small model that punches above its weight, at a fraction of the inference cost.
Distillation is underrated. It is the right path when your task is well-defined, your input volumes are high, and inference cost is the constraint. We took a client running 4 million GPT-4o calls per month down to a tuned GPT-4o mini doing 95% as well at roughly 30% of the cost. That was a $58,000 AUD per month saving. It paid for the build in about six weeks.
Path 3 - LoRA / adapter fine-tuning on an open-weight model
LoRA (Low-Rank Adaptation) lets you train small adapters on top of an open-weight model like Llama or Mistral rather than retraining the full model. Foundry supports this through the Models-as-a-Service deployments and through your own compute if you want full control.
LoRA is the right path when you need multiple specialised versions of the same base model (one per business unit, one per language, one per customer), or when you want to keep the underlying weights stable and only swap small adapters. The adapters are tiny, fast to train, and cheap to host.
Path 4 - Full custom training
Pre-training a model from scratch. Almost no buyer should be doing this. We have done it twice in the last two years, both times for clients with domain data that genuinely had no public equivalent (highly specialised scientific text in one case, a multilingual mining-industry corpus in the other). Each project was six months and seven figures. If you have to ask whether this is you, it is not.
A realistic cost breakdown in AUD
The Azure compute bill is rarely the biggest line item on a custom model project. Here is what real budgets look like for the projects we run.
Small fine-tuning (GPT-4o mini, 500-1000 examples)
| Item | Range AUD |
|---|---|
| Data preparation and labelling | $8,000 - $25,000 |
| Fine-tuning compute (3-6 iterations) | $200 - $1,500 |
| Evaluation framework setup | $4,000 - $10,000 |
| Deployment and integration | $6,000 - $15,000 |
| Hosted deployment per month | $300 - $1,200 |
| Total build cost | $18,000 - $51,500 |
Mid-size fine-tuning (GPT-4o, 2000-5000 examples)
| Item | Range AUD |
|---|---|
| Data preparation and labelling | $25,000 - $80,000 |
| Fine-tuning compute (5-10 iterations) | $1,500 - $8,000 |
| Evaluation framework setup | $8,000 - $20,000 |
| Deployment and integration | $10,000 - $30,000 |
| Hosted deployment per month | $1,500 - $6,000 |
| Total build cost | $44,500 - $138,000 |
Distillation (frontier to mini)
| Item | Range AUD |
|---|---|
| Synthetic dataset generation (frontier model calls) | $3,000 - $15,000 |
| Quality review of generated data | $10,000 - $30,000 |
| Tuning of student model | $500 - $3,000 |
| Evaluation and parallel running | $10,000 - $25,000 |
| Total build cost | $23,500 - $73,000 |
| Typical inference saving | 50 - 75% per million tokens |
The line item that surprises buyers every time is data preparation. If your data is not already labelled and clean, expect this to consume at least 40% of the budget. We have seen it consume 70% on projects with messy source data.
Step by step - what we actually do on a Foundry custom model project
This is the workflow we follow when a client engages us to build a custom model in Azure AI Foundry. Real timeline, real artefacts, in the order they happen.
Week 1 - Baseline establishment. Before anything else, we build a benchmark dataset of 100-300 real examples with verified correct outputs. We then measure how well GPT-4o, GPT-4o mini, and one cheaper open-weight model perform with good prompting. This number is the bar your custom model has to clear to be worth the investment.
Week 1-2 - Approach decision. Based on the baseline, the volume requirements, and the cost constraints, we pick one of the four paths above. We write a short decision memo with the reasoning so the buyer can see why we chose what we chose. This memo is what keeps everyone honest later when someone asks "should we have just used a bigger prompt?"
Week 2-4 - Data preparation. This is the heavy lifting. We work with the client's subject matter experts to build a labelled dataset. We do not let the consultants write the training data alone - the people who actually do the work need to validate every example. We use Foundry's data preparation tooling for format conversion and validation.
Week 3-5 - First training run. We do a small initial fine-tune (typically 100-200 examples, 1-2 epochs) to confirm the pipeline works and the data shape is right. This is a cheap sanity check. We have caught format errors here that would have wasted thousands of dollars in compute on a bigger run.
Week 4-6 - Full training and evaluation. Full fine-tune with the complete dataset. Evaluation against the benchmark from week 1. This is where you find out if your custom model is actually better than the base model. About 80% of the time it is. About 20% of the time it is not, and we go back to better prompting or RAG.
Week 6-8 - Production deployment. Endpoint deployment, integration with the client's application, monitoring setup. We deploy with the base model still available as a fallback so the team can A/B compare in production.
Week 8+ - Continuous evaluation. A custom model in production needs ongoing measurement. We set up automated evaluation runs that hit the model with a frozen test set weekly and alert if accuracy drifts.
The whole sequence usually takes six to ten weeks for a focused project. Projects that take longer are usually stuck on data, not technology.
What a good evaluation framework looks like
Most custom model projects that fail in production fail because the evaluation framework was weak. The buyer thought the model was working because the consultant said so, but no one ever measured it against a stable benchmark.
A real evaluation framework in Foundry has:
- A frozen benchmark dataset, version controlled, that does not change
- Automated runs that produce a metric score for every model version
- Per-category breakdowns, not just an overall accuracy number
- Cost-per-correct-output as a tracked metric, not just raw accuracy
- A human review sample of 50 outputs per release for tasks where quality is subjective
- Drift detection that flags when production inputs start to look different from the benchmark
Foundry's built-in evaluation tools handle most of this. The work is in defining the benchmark and the metrics that matter for your use case. We always insist on this being agreed before training starts, not after. Once a model is built, everyone's incentive is to declare victory.
Common buyer mistakes we see in 2026
The mistakes have not changed much in the last year. They have just become more expensive because more is being spent on Azure AI.
Buying training before buying baselining. If you do not know what the base model can do with a good prompt, you have no way of telling whether fine-tuning helped.
Outsourcing the labelling. Cheap labelling produces cheap data. Your subject matter experts have to be involved.
Optimising the wrong metric. A model that scores 92% on accuracy but routinely makes catastrophic errors on the 8% can be worse than a model at 85% with safer failure modes.
Treating the build as one-and-done. Models drift. Inputs change. Your custom model needs an owner and a budget for ongoing maintenance.
Picking the wrong base model. Fine-tuning GPT-4o when GPT-4o mini would have worked costs roughly 5x more per inference call forever. The base model choice is a thirty-thousand-dollar decision compressed into a dropdown.
When to bring in a consultant (and what to expect)
Honest assessment - you can run a small fine-tuning project in Foundry without external help if you have a strong machine learning engineer on staff and a clear use case. Microsoft's documentation has improved and the tooling is reasonable.
You probably want help when:
- This is your first production AI project and you need the evaluation discipline embedded
- Your data is messy, sensitive, or scattered across systems
- You need to operate inside compliance constraints (APRA CPS 230, healthcare, government)
- You are sizing the investment and need an honest read on whether to spend at all
- You have a deadline and cannot afford the time it takes to discover the failure modes yourself
A reasonable engagement to scope, build, and deploy a first Foundry custom model in Australia costs between $40,000 and $150,000 AUD depending on data volume and complexity. Anyone quoting you under $25,000 is either skipping evaluation or planning to use you as a learning exercise. Anyone quoting over $300,000 for a single model should be questioned hard on what is included.
If you want a conversation about whether your use case justifies a custom model and which path makes sense, that is the kind of work we do every week. Have a look at our Azure AI Foundry consulting, our broader custom AI development, or get in touch through contact. We work with teams across Sydney, Melbourne and Brisbane, and we will tell you honestly if the answer is no custom model at all.