How to Build a Custom AI Model in Azure AI Foundry
"Off-the-shelf models aren't accurate enough for our use case." We hear this from about half of our clients. Sometimes they're right - a custom model genuinely performs better. Other times, the issue is with the prompt or the data retrieval, not the model itself.
This guide covers when you actually need a custom model in Azure AI Foundry, how to build one, and how to know if it's working well enough for production.
Do You Actually Need a Custom Model?
Before you invest in fine-tuning or custom model training, rule out the cheaper alternatives first. Here's the decision tree we use with clients:
Step 1 - Try prompt engineering first. A well-written system prompt with clear instructions, examples, and output format specifications solves 60-70% of accuracy problems. We've seen clients go from 65% accuracy to 90% just by improving their prompt.
Step 2 - Try RAG (Retrieval Augmented Generation). If the model needs access to your specific knowledge, set up Azure AI Search to retrieve relevant documents before the model generates a response. This solves most "the model doesn't know about our business" problems without any model customisation.
Step 3 - Try few-shot examples. Include 3-5 examples of ideal input/output pairs in your prompt. This teaches the model your expected format and style without any training.
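As a chat request, the few-shot pattern looks like this - a minimal sketch for a hypothetical document-classification task, with made-up categories and examples:

```python
# Hypothetical few-shot prompt for document classification. The categories
# and example documents below are illustrative, not from a real dataset.
FEW_SHOT_EXAMPLES = [
    ("Please amend clause 4.2 effective 1 July.", "Category: Contract Amendment"),
    ("Invoice #1042 for services rendered in May.", "Category: Invoice"),
    ("We wish to terminate the agreement per clause 9.", "Category: Termination Notice"),
]

def build_messages(document_text: str) -> list:
    """Assemble a system prompt plus worked examples before the real input."""
    messages = [{"role": "system",
                 "content": "You are a document classifier. Reply with 'Category: <name>' only."}]
    for doc, label in FEW_SHOT_EXAMPLES:
        # Each example is a user/assistant pair showing the expected format
        messages.append({"role": "user", "content": f"Classify this document: {doc}"})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": f"Classify this document: {document_text}"})
    return messages
```

The examples double as format documentation: the model sees exactly how terse the expected answer is.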
Step 4 - Fine-tune. If Steps 1-3 don't get you to acceptable accuracy, fine-tuning is the next option. This adjusts the model's weights using your specific data.
Step 5 - Train from scratch. Rarely needed for most business use cases. Reserved for highly specialised domains with enough training data.
In our experience, about 30% of clients who think they need fine-tuning actually need better prompt engineering. Another 30% need RAG. The remaining 40% genuinely benefit from fine-tuning or custom training.
Understanding Fine-Tuning in Azure AI Foundry
Fine-tuning takes a pre-trained model and continues training it on your specific data. The result is a model that retains the general capabilities of the base model while performing better on your particular task.
What Fine-Tuning Does Well
- Learning your specific output format: If you need responses in a very particular structure (JSON schema, specific terminology, company voice)
- Improving accuracy on domain-specific tasks: Medical coding, legal document classification, technical support routing
- Reducing prompt length: Fine-tuned models need less instruction in the prompt because the training data teaches them what you expect
- Consistent behaviour: Less variation in outputs for similar inputs
What Fine-Tuning Does Not Do
- Teach new knowledge: Fine-tuning adjusts how the model responds, not what it knows. For new knowledge, use RAG.
- Fix fundamental model limitations: If GPT-4o can't do the task, fine-tuning GPT-4o probably won't fix it
- Guarantee perfect accuracy: You'll improve accuracy, but AI models remain probabilistic
- Eliminate the need for evaluation: You still need to test and monitor
Preparing Your Training Data
The quality of your training data determines the quality of your fine-tuned model. This is the step that most teams underestimate.
Data Format
Azure AI Foundry expects training data in JSONL format (one JSON object per line). For chat models, each line looks like:
{"messages": [{"role": "system", "content": "You are a document classifier for an Australian law firm."}, {"role": "user", "content": "Classify this document: [document text]"}, {"role": "assistant", "content": "Category: Contract Amendment\nConfidence: High\nKey terms: variation clause, effective date, counterparty consent"}]}
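Producing that format from Python is straightforward - an illustrative sketch with made-up content; in practice you would write `jsonl_blob` out to your training file (e.g. `train.jsonl`):

```python
import json

# Illustrative training examples in the chat format shown above;
# the content is made up.
examples = [
    {"messages": [
        {"role": "system",
         "content": "You are a document classifier for an Australian law firm."},
        {"role": "user", "content": "Classify this document: [document text]"},
        {"role": "assistant",
         "content": "Category: Contract Amendment\nConfidence: High"},
    ]},
]

# JSONL: one JSON object per line, no pretty-printing, newline-terminated
jsonl_blob = "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples) + "\n"
```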
How Much Data Do You Need?
| Goal | Minimum Examples | Recommended |
|---|---|---|
| Adjust output format/style | 50-100 | 200-500 |
| Improve domain accuracy | 200-500 | 500-2,000 |
| Specialised classification | 100-300 per category | 500+ per category |
| Complex domain-specific tasks | 500-1,000 | 2,000-5,000 |
More data generally means better results, but quality matters more than quantity: 200 carefully curated examples will outperform 2,000 noisy ones.
Data Quality Checklist
Before you start fine-tuning, verify your training data against these criteria:
- Accurate: Every example shows the correct output for the given input
- Representative: Examples cover the full range of inputs you expect in production
- Balanced: No category or pattern is massively over- or under-represented
- Consistent: Similar inputs have similar outputs (no contradictory examples)
- Clean: No formatting errors, encoding issues, or incomplete examples
- Properly anonymised: No personal data unless you have explicit consent and a legal basis
The most common data quality issue we see is training data that reflects what people think the right answer should be rather than what it actually is. If your training data has errors, your model will confidently reproduce those errors.
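Parts of this checklist can be automated as a pre-flight check over the JSONL file. A minimal sketch, assuming the assistant message starts with "Category: <name>" as in the earlier example - adjust the parsing and the balance heuristic to your own output format:

```python
import json
from collections import Counter

def validate_jsonl(lines: list) -> list:
    """Return a list of problem descriptions for a JSONL training set.

    Checks: valid JSON, required roles, non-empty content, contradictory
    duplicates, and a crude category-balance heuristic (assumes labels
    appear as 'Category: <name>' in the assistant message).
    """
    problems = []
    seen = {}                 # user content -> assistant content
    categories = Counter()
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {i}: not valid JSON")
            continue
        messages = record.get("messages", [])
        roles = [m.get("role") for m in messages]
        if "user" not in roles or "assistant" not in roles:
            problems.append(f"line {i}: missing user or assistant message")
            continue
        if any(not m.get("content", "").strip() for m in messages):
            problems.append(f"line {i}: empty message content")
        user = next((m.get("content", "") for m in messages if m.get("role") == "user"), "")
        assistant = next((m.get("content", "") for m in messages if m.get("role") == "assistant"), "")
        if user in seen and seen[user] != assistant:
            problems.append(f"line {i}: contradicts an earlier example with the same input")
        seen[user] = assistant
        if assistant.startswith("Category:"):
            categories[assistant.splitlines()[0]] += 1
    if categories and max(categories.values()) > 10 * min(categories.values()):
        problems.append("category balance: one label has >10x the examples of another")
    return problems
```

Run it before every fine-tuning job; a clean pass is cheap insurance against wasting a training run on a malformed file.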
Creating Training Data from Scratch
If you don't have existing labelled data, here are practical approaches:
Expert annotation: Have domain experts manually create input/output pairs. This is the most reliable method but also the most expensive. Budget 5-10 minutes per example for complex tasks.
Synthetic data generation: Use a more capable model (like GPT-4o) to generate training examples, then have domain experts review and correct them. This is faster than pure manual creation but requires careful quality control.
Bootstrapping from production data: If you already have a prompt-based system running, collect the best examples of correct outputs and use them as training data. Only include examples where the output was verified as correct.
We typically recommend a combination: generate an initial dataset synthetically, have domain experts review and correct it, then augment with real production examples over time.
Fine-Tuning Step by Step in Azure AI Foundry
Step 1 - Upload Your Training Data
- In your Azure AI Foundry project, go to "Data"
- Upload your JSONL training file
- Also upload a separate JSONL validation file (10-20% of your total examples, held out from training)
The validation set is important. It's how you measure whether the model is actually learning useful patterns versus memorising your training data.
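The held-out split itself is a few lines of Python - a sketch that shuffles once with a fixed seed so the split is reproducible across runs:

```python
import random

def split_dataset(lines: list, val_fraction: float = 0.15, seed: int = 42):
    """Shuffle the JSONL lines deterministically, then hold out a
    validation slice (~15% by default). Returns (train, validation)."""
    rng = random.Random(seed)
    shuffled = lines[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

The fixed seed matters: if the split changes between runs, you can't compare validation metrics across fine-tuning iterations.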
Step 2 - Create a Fine-Tuning Job
- Go to "Fine-tuning" in your project
- Select the base model you want to fine-tune
Which base model to fine-tune? Our recommendations:
| Scenario | Recommended Base Model |
|---|---|
| Best quality, cost is secondary | GPT-4o |
| Good quality, cost-conscious | GPT-4o mini |
| High volume, cost-sensitive | Llama 3.1 8B or Mistral 7B |
| Already using a specific model | Fine-tune the model you're already using |
- Select your training and validation datasets
- Configure hyperparameters:
- Number of epochs: Start with 3. Each epoch is one full pass through the training data; too many passes lead to overfitting.
- Batch size: Leave at default unless you have a specific reason to change it
- Learning rate multiplier: Start with the default (usually 1.0). Decrease if the model overfits; increase if it's not learning fast enough.
- Start the fine-tuning job
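The same job can be created programmatically with the openai Python SDK against an Azure OpenAI resource - a sketch with placeholder endpoint, key, API version, and file IDs; only the hyperparameter builder is executable as-is:

```python
def build_hyperparameters(n_epochs=3, batch_size="auto", learning_rate_multiplier=1.0):
    """Collect the fine-tuning settings discussed above into one dict,
    so they're easy to log, assert on, and reuse across runs."""
    return {"n_epochs": n_epochs,
            "batch_size": batch_size,
            "learning_rate_multiplier": learning_rate_multiplier}

# Illustrative job creation (placeholders throughout; not executed here):
# from openai import AzureOpenAI
# client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
#                      api_key="<key>", api_version="<api-version>")
# job = client.fine_tuning.jobs.create(
#     model="gpt-4o-mini",
#     training_file="<training-file-id>",
#     validation_file="<validation-file-id>",
#     hyperparameters=build_hyperparameters(),
# )
```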
Step 3 - Monitor Training
Watch the training and validation loss curves:
- Training loss should decrease: This means the model is learning from your data
- Validation loss should also decrease: This means the learning generalises beyond the training data
- If validation loss increases while training loss decreases: The model is overfitting. Reduce epochs or add more training data.
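The overfitting check above can be automated against the per-epoch loss values - a small helper, with illustrative numbers in the test below:

```python
def first_overfit_epoch(train_loss: list, val_loss: list):
    """Return the first (0-indexed) epoch where validation loss rises
    while training loss keeps falling - the divergence pattern that
    signals overfitting. Returns None if the curves never diverge."""
    for epoch in range(1, len(val_loss)):
        if val_loss[epoch] > val_loss[epoch - 1] and train_loss[epoch] < train_loss[epoch - 1]:
            return epoch
    return None
```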
Fine-tuning GPT-4o mini typically takes 15-60 minutes depending on dataset size. GPT-4o takes 30 minutes to several hours. Larger open-source models can take longer, especially with bigger datasets.
Step 4 - Deploy the Fine-Tuned Model
Once training completes:
- Review the training metrics
- Deploy the fine-tuned model to an endpoint
- The deployment process is the same as deploying a base model - you get an endpoint URL and API key
Important: Your fine-tuned model has a separate deployment cost from the base model. Fine-tuned GPT-4o mini, for example, costs more per token than the base model. Factor this into your cost projections.
Evaluating Your Custom Model
Deploying a fine-tuned model without proper evaluation is like shipping software without testing. Here's the evaluation framework we use.
Automated Evaluation
Azure AI Foundry includes built-in evaluation tools. Set up an evaluation flow that:
- Runs your validation dataset through the fine-tuned model
- Compares outputs to expected outputs
- Calculates metrics:
| Metric | What It Measures | Target |
|---|---|---|
| Accuracy | Exact match with expected output | > 85% for classification tasks |
| F1 Score | Balance of precision and recall | > 0.8 for classification |
| Groundedness | Response supported by source data | > 90% for RAG applications |
| Coherence | Response clarity and logic | > 4.0 on a 5-point scale |
| Similarity | Semantic similarity to expected output | > 0.85 for generation tasks |
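Accuracy and F1 - the first two rows of the table - can be computed without extra libraries. A sketch with hand-rolled macro-F1 (labels in the test are illustrative):

```python
def accuracy(expected: list, predicted: list) -> float:
    """Fraction of predictions that exactly match the expected output."""
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)

def macro_f1(expected: list, predicted: list) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = set(expected) | set(predicted)
    f1s = []
    for label in labels:
        tp = sum(e == label and p == label for e, p in zip(expected, predicted))
        fp = sum(e != label and p == label for e, p in zip(expected, predicted))
        fn = sum(e == label and p != label for e, p in zip(expected, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

In practice you would likely use scikit-learn or the built-in Azure AI Foundry evaluators, but knowing what the numbers mean makes the dashboards easier to trust.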
A/B Comparison
Always compare your fine-tuned model against the base model with good prompt engineering. Run the same evaluation dataset through both and compare:
- Does fine-tuning actually improve accuracy, or is it within the margin of error?
- How much does it improve? Is the improvement worth the additional cost and complexity?
- Are there specific categories where fine-tuning helps a lot and others where it makes no difference?
We've had cases where fine-tuning improved overall accuracy by 12% but actually made performance worse on a specific category of inputs. Without category-level evaluation, we wouldn't have caught that.
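Category-level comparison is straightforward to script - a sketch assuming each evaluation row carries a category label alongside the expected and predicted outputs:

```python
from collections import defaultdict

def per_category_accuracy(rows):
    """rows: iterable of (category, expected, predicted) tuples.
    Returns {category: accuracy}."""
    correct, total = defaultdict(int), defaultdict(int)
    for category, expected, predicted in rows:
        total[category] += 1
        correct[category] += (expected == predicted)
    return {c: correct[c] / total[c] for c in total}

def regressions(base_rows, tuned_rows):
    """Categories where the fine-tuned model scores below the base model,
    so an overall improvement can't hide a per-category regression."""
    base, tuned = per_category_accuracy(base_rows), per_category_accuracy(tuned_rows)
    return {c for c in base if tuned.get(c, 0.0) < base[c]}
```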
Human Evaluation
For tasks where automated metrics don't capture quality well (writing, summarisation, complex reasoning), include human evaluation:
- Show evaluators the model output without telling them which model produced it
- Ask them to rate on relevant criteria (accuracy, helpfulness, appropriateness)
- Compare ratings between the base model and fine-tuned model
This is time-consuming but important for high-stakes applications.
Production Monitoring
Evaluation doesn't end at deployment. Set up ongoing monitoring for:
- Accuracy drift: Are outputs becoming less accurate over time? This can happen as the types of inputs change.
- Latency: Is the model responding within acceptable timeframes?
- Cost per query: Are token counts in line with expectations?
- User feedback: If end users can rate responses, track satisfaction over time.
Advanced Approaches - When Fine-Tuning Isn't Enough
For some use cases, standard fine-tuning won't get you to the accuracy you need. Here are the next steps:
Distillation
Use a large, expensive model (GPT-4o) to generate high-quality outputs for your training data, then fine-tune a smaller, cheaper model (GPT-4o mini or Phi-3) on those outputs. The smaller model learns to mimic the larger model's behaviour at a fraction of the inference cost.
We've used this approach to reduce inference costs by 70% while maintaining 95% of the accuracy of the larger model. It works best when your task is well-defined and consistent.
Ensemble Approaches
Run the same input through multiple models and combine their outputs. For classification tasks, take the majority vote. For generation tasks, use a model to select the best output from several candidates.
This increases cost (you're running multiple models) but can improve accuracy and reliability for high-stakes decisions.
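The classification case can be sketched in a few lines - tie-breaking by deferring to the first model listed is one design choice among several (you could instead weight votes by each model's validation accuracy):

```python
from collections import Counter

def majority_vote(predictions: list) -> str:
    """Combine classification outputs from several models by majority vote.
    On a tie, defer to the first model's answer."""
    counts = Counter(predictions)
    top, top_count = counts.most_common(1)[0]
    tied = [label for label, c in counts.items() if c == top_count]
    return predictions[0] if len(tied) > 1 else top
```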
Continuous Fine-Tuning
Rather than fine-tuning once, set up a pipeline that periodically retrains the model on new, verified examples from production. This keeps the model current as your data and requirements evolve.
Azure AI Foundry supports this through scheduled fine-tuning jobs and model versioning. Each new version gets its own evaluation before replacing the previous one in production.
Cost Considerations for Custom Models
Custom model development has costs beyond the obvious Azure compute charges:
Data preparation: The biggest hidden cost. Expect 40-100 hours of work to create and validate a quality training dataset for a moderately complex task. At typical consulting rates, that's $8,000-$25,000 AUD.
Fine-tuning compute: For GPT-4o mini, expect $5-$50 AUD per fine-tuning run depending on dataset size. For larger models, $50-$500+ AUD. You'll typically run 3-10 iterations to optimise.
Increased inference cost: Fine-tuned models cost more per token than base models. For GPT-4o mini, roughly 2-3x the base model rate.
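A back-of-envelope comparison makes this concrete. The per-1K-token rates below are placeholders, not real Azure prices - substitute current pricing for your region and model before relying on the output:

```python
def monthly_inference_cost(queries_per_day, avg_tokens_per_query,
                           rate_per_1k_tokens, days=30):
    """Rough monthly inference cost: total tokens x per-1K-token rate."""
    tokens = queries_per_day * avg_tokens_per_query * days
    return tokens / 1000 * rate_per_1k_tokens

# Placeholder rates: fine-tuned at ~3x the base model rate, per the text
base = monthly_inference_cost(1000, 2000, rate_per_1k_tokens=0.001)
tuned = monthly_inference_cost(1000, 2000, rate_per_1k_tokens=0.003)
```

At 1,000 queries a day, even a small per-token premium compounds quickly, which is why the accuracy improvement needs to carry measurable business value.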
Ongoing maintenance: Models need periodic re-evaluation and potential retraining. Budget 10-20 hours per quarter for maintenance.
Total cost for a typical fine-tuning project: $15,000-$50,000 AUD including consulting, data preparation, training, and deployment. This is justified when the accuracy improvement translates to measurable business value - fewer errors, faster processing, reduced manual review.
When to Call in Help
Building custom AI models is technically accessible through Azure AI Foundry, but doing it well requires experience with data preparation, evaluation design, and production deployment patterns. The difference between a model that works in testing and one that works reliably in production is significant.
At Team 400, we've built custom models across industries including financial services, professional services, manufacturing, and government. We handle the full lifecycle from data preparation through to production monitoring.
If you're considering a custom model project, talk to us before you invest in data preparation. We can often tell within a single conversation whether fine-tuning is the right approach or whether better prompt engineering and RAG would get you the same result at lower cost.
Explore our Azure AI Foundry consulting or our broader AI consulting services to learn more about how we work. We're Microsoft AI consultants who build custom models in production, not just in demos.