How to Build a Custom AI Model in Azure AI Foundry
"Off-the-shelf models aren't accurate enough for our use case." We hear this from about half of our clients. Sometimes they're right - a custom model genuinely performs better. Other times, the issue is with the prompt or the data retrieval, not the model itself.
This guide covers when you actually need a custom model in Azure AI Foundry, how to build one, and how to know if it's working well enough for production.
Do You Actually Need a Custom Model?
Before you invest in fine-tuning or custom model training, rule out the cheaper alternatives first. Here's the decision tree we use with clients:
Step 1 - Try prompt engineering first. A well-written system prompt with clear instructions, examples, and output format specifications solves 60-70% of accuracy problems. We've seen clients go from 65% accuracy to 90% just by improving their prompt.
Step 2 - Try RAG (Retrieval Augmented Generation). If the model needs access to your specific knowledge, set up Azure AI Search to retrieve relevant documents before the model generates a response. This solves most "the model doesn't know about our business" problems without any model customisation.
Step 3 - Try few-shot examples. Include 3-5 examples of ideal input/output pairs in your prompt. This teaches the model your expected format and style without any training.
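As a chat request, the few-shot pattern looks like this - a minimal sketch for a hypothetical document-classification task, with made-up categories and examples:

```python
# Hypothetical few-shot prompt for document classification. The categories
# and example documents below are illustrative, not from a real dataset.
FEW_SHOT_EXAMPLES = [
    ("Please amend clause 4.2 effective 1 July.", "Category: Contract Amendment"),
    ("Invoice #1042 for services rendered in May.", "Category: Invoice"),
    ("We wish to terminate the agreement per clause 9.", "Category: Termination Notice"),
]

def build_messages(document_text: str) -> list:
    """Assemble a system prompt plus worked examples before the real input."""
    messages = [{"role": "system",
                 "content": "You are a document classifier. Reply with 'Category: <name>' only."}]
    for doc, label in FEW_SHOT_EXAMPLES:
        # Each example is a user/assistant pair showing the expected format
        messages.append({"role": "user", "content": f"Classify this document: {doc}"})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": f"Classify this document: {document_text}"})
    return messages
```

The examples double as format documentation: the model sees exactly how terse the expected answer is.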
Step 4 - Fine-tune. If Steps 1-3 don't get you to acceptable accuracy, fine-tuning is the next option. This adjusts the model's weights using your specific data.
Step 5 - Train from scratch. Rarely needed for most business use cases. Reserved for highly specialised domains with enough training data.
In our experience, about 30% of clients who think they need fine-tuning actually need better prompt engineering. Another 30% need RAG. The remaining 40% genuinely benefit from fine-tuning or custom training.
Understanding Fine-Tuning in Azure AI Foundry
Fine-tuning takes a pre-trained model and continues training it on your specific data. The result is a model that retains the general capabilities of the base model while performing better on your particular task.
What Fine-Tuning Does Well
- Learning your specific output format: If you need responses in a very particular structure (JSON schema, specific terminology, company voice)
- Improving accuracy on domain-specific tasks: Medical coding, legal document classification, technical support routing
- Reducing prompt length: Fine-tuned models need less instruction in the prompt because the training data teaches them what you expect
- Consistent behaviour: Less variation in outputs for similar inputs
What Fine-Tuning Does Not Do
- Teach new knowledge: Fine-tuning adjusts how the model responds, not what it knows. For new knowledge, use RAG.
- Fix fundamental model limitations: If GPT-4o can't do the task, fine-tuning GPT-4o probably won't fix it
- Guarantee perfect accuracy: You'll improve accuracy, but AI models remain probabilistic
- Eliminate the need for evaluation: You still need to test and monitor
Preparing Your Training Data
The quality of your training data determines the quality of your fine-tuned model. This is the step that most teams underestimate.
Data Format
Azure AI Foundry expects training data in JSONL format (one JSON object per line). For chat models, each line looks like:
{"messages": [{"role": "system", "content": "You are a document classifier for an Australian law firm."}, {"role": "user", "content": "Classify this document: [document text]"}, {"role": "assistant", "content": "Category: Contract Amendment\nConfidence: High\nKey terms: variation clause, effective date, counterparty consent"}]}
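Producing that format from Python is straightforward - an illustrative sketch with made-up content; in practice you would write `jsonl_blob` out to your training file (e.g. `train.jsonl`):

```python
import json

# Illustrative training examples in the chat format shown above;
# the content is made up.
examples = [
    {"messages": [
        {"role": "system",
         "content": "You are a document classifier for an Australian law firm."},
        {"role": "user", "content": "Classify this document: [document text]"},
        {"role": "assistant",
         "content": "Category: Contract Amendment\nConfidence: High"},
    ]},
]

# JSONL: one JSON object per line, no pretty-printing, newline-terminated
jsonl_blob = "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples) + "\n"
```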
How Much Data Do You Need?
| Goal | Minimum Examples | Recommended |
|---|---|---|
| Adjust output format/style | 50-100 | 200-500 |
| Improve domain accuracy | 200-500 | 500-2,000 |
| Specialised classification | 100-300 per category | 500+ per category |
| Complex domain-specific tasks | 500-1,000 | 2,000-5,000 |
More data generally means better results, but quality matters more than quantity: 200 carefully curated examples will outperform 2,000 noisy ones.
Data Quality Checklist
Before you start fine-tuning, verify your training data against these criteria:
- Accurate: Every example shows the correct output for the given input
- Representative: Examples cover the full range of inputs you expect in production
- Balanced: No category or pattern is massively over- or under-represented
- Consistent: Similar inputs have similar outputs (no contradictory examples)
- Clean: No formatting errors, encoding issues, or incomplete examples
- Properly anonymised: No personal data unless you have explicit consent and a legal basis
The most common data quality issue we see is training data that reflects what people think the right answer should be rather than what it actually is. If your training data has errors, your model will confidently reproduce those errors.
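Parts of this checklist can be automated as a pre-flight check over the JSONL file. A minimal sketch, assuming the assistant message starts with "Category: <name>" as in the earlier example - adjust the parsing and the balance heuristic to your own output format:

```python
import json
from collections import Counter

def validate_jsonl(lines: list) -> list:
    """Return a list of problem descriptions for a JSONL training set.

    Checks: valid JSON, required roles, non-empty content, contradictory
    duplicates, and a crude category-balance heuristic (assumes labels
    appear as 'Category: <name>' in the assistant message).
    """
    problems = []
    seen = {}                 # user content -> assistant content
    categories = Counter()
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {i}: not valid JSON")
            continue
        messages = record.get("messages", [])
        roles = [m.get("role") for m in messages]
        if "user" not in roles or "assistant" not in roles:
            problems.append(f"line {i}: missing user or assistant message")
            continue
        if any(not m.get("content", "").strip() for m in messages):
            problems.append(f"line {i}: empty message content")
        user = next((m.get("content", "") for m in messages if m.get("role") == "user"), "")
        assistant = next((m.get("content", "") for m in messages if m.get("role") == "assistant"), "")
        if user in seen and seen[user] != assistant:
            problems.append(f"line {i}: contradicts an earlier example with the same input")
        seen[user] = assistant
        if assistant.startswith("Category:"):
            categories[assistant.splitlines()[0]] += 1
    if categories and max(categories.values()) > 10 * min(categories.values()):
        problems.append("category balance: one label has >10x the examples of another")
    return problems
```

Run it before every fine-tuning job; a clean pass is cheap insurance against wasting a training run on a malformed file.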
Creating Training Data from Scratch
If you don't have existing labelled data, here are practical approaches:
Expert annotation: Have domain experts manually create input/output pairs. This is the most reliable method but also the most expensive. Budget 5-10 minutes per example for complex tasks.
Synthetic data generation: Use a more capable model (like GPT-4o) to generate training examples, then have domain experts review and correct them. This is faster than pure manual creation but requires careful quality control.
Bootstrapping from production data: If you already have a prompt-based system running, collect the best examples of correct outputs and use them as training data. Only include examples where the output was verified as correct.
We typically recommend a combination: generate an initial dataset synthetically, have domain experts review and correct it, then augment with real production examples over time.
Fine-Tuning Step by Step in Azure AI Foundry
Step 1 - Upload Your Training Data
- In your Azure AI Foundry project, go to "Data"
- Upload your JSONL training file
- Also upload a separate JSONL validation file (10-20% of your total examples, held out from training)
The validation set is important. It's how you measure whether the model is actually learning useful patterns versus memorising your training data.
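The held-out split itself is a few lines of Python - a sketch that shuffles once with a fixed seed so the split is reproducible across runs:

```python
import random

def split_dataset(lines: list, val_fraction: float = 0.15, seed: int = 42):
    """Shuffle the JSONL lines deterministically, then hold out a
    validation slice (~15% by default). Returns (train, validation)."""
    rng = random.Random(seed)
    shuffled = lines[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

The fixed seed matters: if the split changes between runs, you can't compare validation metrics across fine-tuning iterations.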
Step 2 - Create a Fine-Tuning Job
- Go to "Fine-tuning" in your project
- Select the base model you want to fine-tune
Which base model to fine-tune? Our recommendations:
| Scenario | Recommended Base Model |
|---|---|
| Best quality, cost is secondary | GPT-4o |
| Good quality, cost-conscious | GPT-4o mini |
| High volume, cost-sensitive | Llama 3.1 8B or Mistral 7B |
| Already using a specific model | Fine-tune the model you're already using |
- Select your training and validation datasets
- Configure hyperparameters:
- Number of epochs: Start with 3. Each epoch is one full pass through the training data; too many passes lead to overfitting.
- Batch size: Leave at default unless you have a specific reason to change it
- Learning rate multiplier: Start with the default (usually 1.0). Decrease if the model overfits; increase if it's not learning fast enough.
- Start the fine-tuning job
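The same job can be created programmatically with the openai Python SDK against an Azure OpenAI resource - a sketch with placeholder endpoint, key, API version, and file IDs; only the hyperparameter builder is executable as-is:

```python
def build_hyperparameters(n_epochs=3, batch_size="auto", learning_rate_multiplier=1.0):
    """Collect the fine-tuning settings discussed above into one dict,
    so they're easy to log, assert on, and reuse across runs."""
    return {"n_epochs": n_epochs,
            "batch_size": batch_size,
            "learning_rate_multiplier": learning_rate_multiplier}

# Illustrative job creation (placeholders throughout; not executed here):
# from openai import AzureOpenAI
# client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
#                      api_key="<key>", api_version="<api-version>")
# job = client.fine_tuning.jobs.create(
#     model="gpt-4o-mini",
#     training_file="<training-file-id>",
#     validation_file="<validation-file-id>",
#     hyperparameters=build_hyperparameters(),
# )
```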
Step 3 - Monitor Training
Watch the training and validation loss curves:
- Training loss should decrease: This means the model is learning from your data
- Validation loss should also decrease: This means the learning generalises beyond the training data
- If validation loss increases while training loss decreases: The model is overfitting. Reduce epochs or add more training data.
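The overfitting check above can be automated against the per-epoch loss values - a small helper, with illustrative numbers in the test below:

```python
def first_overfit_epoch(train_loss: list, val_loss: list):
    """Return the first (0-indexed) epoch where validation loss rises
    while training loss keeps falling - the divergence pattern that
    signals overfitting. Returns None if the curves never diverge."""
    for epoch in range(1, len(val_loss)):
        if val_loss[epoch] > val_loss[epoch - 1] and train_loss[epoch] < train_loss[epoch - 1]:
            return epoch
    return None
```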
Fine-tuning GPT-4o mini typically takes 15-60 minutes depending on dataset size. GPT-4o takes 30 minutes to several hours. Larger open-source models can take longer, especially with bigger datasets.
Step 4 - Deploy the Fine-Tuned Model
Once training completes:
- Review the training metrics
- Deploy the fine-tuned model to an endpoint
- The deployment process is the same as deploying a base model - you get an endpoint URL and API key
Important: Your fine-tuned model has a separate deployment cost from the base model. Fine-tuned GPT-4o mini, for example, costs more per token than the base model. Factor this into your cost projections.
Evaluating Your Custom Model
Deploying a fine-tuned model without proper evaluation is like shipping software without testing. Here's the evaluation framework we use.
Automated Evaluation
Azure AI Foundry includes built-in evaluation tools. Set up an evaluation flow that:
- Runs your validation dataset through the fine-tuned model
- Compares outputs to expected outputs
- Calculates metrics:
| Metric | What It Measures | Target |
|---|---|---|
| Accuracy | Exact match with expected output | > 85% for classification tasks |
| F1 Score | Balance of precision and recall | > 0.8 for classification |
| Groundedness | Response supported by source data | > 90% for RAG applications |
| Coherence | Response clarity and logic | > 4.0 on a 5-point scale |
| Similarity | Semantic similarity to expected output | > 0.85 for generation tasks |
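Accuracy and F1 - the first two rows of the table - can be computed without extra libraries. A sketch with hand-rolled macro-F1 (labels in the test are illustrative):

```python
def accuracy(expected: list, predicted: list) -> float:
    """Fraction of predictions that exactly match the expected output."""
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)

def macro_f1(expected: list, predicted: list) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = set(expected) | set(predicted)
    f1s = []
    for label in labels:
        tp = sum(e == label and p == label for e, p in zip(expected, predicted))
        fp = sum(e != label and p == label for e, p in zip(expected, predicted))
        fn = sum(e == label and p != label for e, p in zip(expected, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

In practice you would likely use scikit-learn or the built-in Azure AI Foundry evaluators, but knowing what the numbers mean makes the dashboards easier to trust.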
A/B Comparison
Always compare your fine-tuned model against the base model with good prompt engineering. Run the same evaluation dataset through both and compare:
- Does fine-tuning actually improve accuracy, or is it within the margin of error?
- How much does it improve? Is the improvement worth the additional cost and complexity?
- Are there specific categories where fine-tuning helps a lot and others where it makes no difference?
We've had cases where fine-tuning improved overall accuracy by 12% but actually made performance worse on a specific category of inputs. Without category-level evaluation, we wouldn't have caught that.
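Category-level comparison is straightforward to script - a sketch assuming each evaluation row carries a category label alongside the expected and predicted outputs:

```python
from collections import defaultdict

def per_category_accuracy(rows):
    """rows: iterable of (category, expected, predicted) tuples.
    Returns {category: accuracy}."""
    correct, total = defaultdict(int), defaultdict(int)
    for category, expected, predicted in rows:
        total[category] += 1
        correct[category] += (expected == predicted)
    return {c: correct[c] / total[c] for c in total}

def regressions(base_rows, tuned_rows):
    """Categories where the fine-tuned model scores below the base model,
    so an overall improvement can't hide a per-category regression."""
    base, tuned = per_category_accuracy(base_rows), per_category_accuracy(tuned_rows)
    return {c for c in base if tuned.get(c, 0.0) < base[c]}
```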
Human Evaluation
For tasks where automated metrics don't capture quality well (writing, summarisation, complex reasoning), include human evaluation:
- Show evaluators the model output without telling them which model produced it
- Ask them to rate on relevant criteria (accuracy, helpfulness, appropriateness)
- Compare ratings between the base model and fine-tuned model
This is time-consuming but important for high-stakes applications.
Production Monitoring
Evaluation doesn't end at deployment. Set up ongoing monitoring for:
- Accuracy drift: Are outputs becoming less accurate over time? This can happen as the types of inputs change.
- Latency: Is the model responding within acceptable timeframes?
- Cost per query: Are token counts in line with expectations?
- User feedback: If end users can rate responses, track satisfaction over time.
Advanced Approaches - When Fine-Tuning Isn't Enough
For some use cases, standard fine-tuning won't get you to the accuracy you need. Here are the next steps:
Distillation
Use a large, expensive model (GPT-4o) to generate high-quality outputs for your training data, then fine-tune a smaller, cheaper model (GPT-4o mini or Phi-3) on those outputs. The smaller model learns to mimic the larger model's behaviour at a fraction of the inference cost.
We've used this approach to reduce inference costs by 70% while maintaining 95% of the accuracy of the larger model. It works best when your task is well-defined and consistent.
Ensemble Approaches
Run the same input through multiple models and combine their outputs. For classification tasks, take the majority vote. For generation tasks, use a model to select the best output from several candidates.
This increases cost (you're running multiple models) but can improve accuracy and reliability for high-stakes decisions.
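The classification case can be sketched in a few lines - tie-breaking by deferring to the first model listed is one design choice among several (you could instead weight votes by each model's validation accuracy):

```python
from collections import Counter

def majority_vote(predictions: list) -> str:
    """Combine classification outputs from several models by majority vote.
    On a tie, defer to the first model's answer."""
    counts = Counter(predictions)
    top, top_count = counts.most_common(1)[0]
    tied = [label for label, c in counts.items() if c == top_count]
    return predictions[0] if len(tied) > 1 else top
```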
Continuous Fine-Tuning
Rather than fine-tuning once, set up a pipeline that periodically retrains the model on new, verified examples from production. This keeps the model current as your data and requirements evolve.
Azure AI Foundry supports this through scheduled fine-tuning jobs and model versioning. Each new version gets its own evaluation before replacing the previous one in production.
Cost Considerations for Custom Models
Custom model development has costs beyond the obvious Azure compute charges:
Data preparation: The biggest hidden cost. Expect 40-100 hours of work to create and validate a quality training dataset for a moderately complex task. At typical consulting rates, that's $8,000-$25,000 AUD.
Fine-tuning compute: For GPT-4o mini, expect $5-$50 AUD per fine-tuning run depending on dataset size. For larger models, $50-$500+ AUD. You'll typically run 3-10 iterations to optimise.
Increased inference cost: Fine-tuned models cost more per token than base models. For GPT-4o mini, roughly 2-3x the base model rate.
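A back-of-envelope comparison makes this concrete. The per-1K-token rates below are placeholders, not real Azure prices - substitute current pricing for your region and model before relying on the output:

```python
def monthly_inference_cost(queries_per_day, avg_tokens_per_query,
                           rate_per_1k_tokens, days=30):
    """Rough monthly inference cost: total tokens x per-1K-token rate."""
    tokens = queries_per_day * avg_tokens_per_query * days
    return tokens / 1000 * rate_per_1k_tokens

# Placeholder rates: fine-tuned at ~3x the base model rate, per the text
base = monthly_inference_cost(1000, 2000, rate_per_1k_tokens=0.001)
tuned = monthly_inference_cost(1000, 2000, rate_per_1k_tokens=0.003)
```

At 1,000 queries a day, even a small per-token premium compounds quickly, which is why the accuracy improvement needs to carry measurable business value.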
Ongoing maintenance: Models need periodic re-evaluation and potential retraining. Budget 10-20 hours per quarter for maintenance.
Total cost for a typical fine-tuning project: $15,000-$50,000 AUD including consulting, data preparation, training, and deployment. This is justified when the accuracy improvement translates to measurable business value - fewer errors, faster processing, reduced manual review.
When to Call in Help
Building custom AI models is technically accessible through Azure AI Foundry, but doing it well requires experience with data preparation, evaluation design, and production deployment patterns. The difference between a model that works in testing and one that works reliably in production is significant.
At Team 400, we've built custom models across industries including financial services, professional services, manufacturing, and government. We handle the full lifecycle from data preparation through to production monitoring.
If you're considering a custom model project, talk to us before you invest in data preparation. We can often tell within a single conversation whether fine-tuning is the right approach or whether better prompt engineering and RAG would get you the same result at lower cost.
Explore our Azure AI Foundry consulting or our broader AI consulting services to learn more about how we work. We're Microsoft AI consultants who build custom models in production, not just in demos.