How to Fine-Tune AI Models for Your Industry
Should you fine-tune an AI model for your business? And if so, how do you actually do it without wasting months and budget on a model that performs worse than the base version?
These are questions we get regularly at Team 400. The honest answer is that most organisations don't need fine-tuning - but when you do, the performance gains can be significant. Here's how to know which camp you fall into and how to do it right.
I'm Michael Ridland, founder of Team 400, and we've helped Australian businesses across financial services, manufacturing, healthcare, and professional services get more from AI. Let me walk through what fine-tuning actually involves.
What Is Fine-Tuning and Why Would You Do It?
Fine-tuning takes a pre-trained AI model (like GPT-4o, Claude, or Llama) and trains it further on your own data. The model keeps its general capabilities but gets better at your specific tasks.
Think of it like hiring an experienced professional versus training a graduate. The base model is the experienced professional - capable across many areas. Fine-tuning is like giving that professional six months of on-the-job training in your exact domain. They already know the fundamentals, but now they understand your terminology, your processes, and your edge cases.
Common reasons to fine-tune:
- Domain-specific language - Your industry has terminology, abbreviations, and conventions that general models struggle with
- Consistent output format - You need the model to always respond in a specific structure
- Improved accuracy on specialised tasks - Classification, extraction, or analysis in your specific domain
- Reducing prompt length - Fine-tuning can bake instructions into the model, reducing token usage at inference time
- Behaviour alignment - Making the model respond in a way that matches your brand or compliance requirements
When NOT to Fine-Tune
Before spending time and money on fine-tuning, rule out simpler approaches.
Try prompt engineering first. A well-written system prompt with clear instructions and examples (few-shot prompting) often gets you 80-90% of the way there. We've seen teams assume they need fine-tuning when a better prompt would have solved the problem.
Try RAG (Retrieval-Augmented Generation) next. If the model needs access to your company's knowledge - product details, policies, procedures - RAG is usually better than fine-tuning. RAG retrieves relevant information at query time and provides it as context. Fine-tuning bakes knowledge into model weights, which means it goes stale and is expensive to update.
Consider the volume. Fine-tuning requires hundreds to thousands of high-quality training examples. If you don't have this data, or can't create it, fine-tuning isn't viable.
Here's our decision framework:
| Situation | Best Approach |
|---|---|
| Model needs company-specific knowledge | RAG |
| Model needs to follow specific instructions | Prompt engineering |
| Model needs consistent output format | Fine-tuning or structured outputs |
| Model struggles with domain terminology | Fine-tuning |
| Model needs to classify into domain categories | Fine-tuning |
| Model needs better accuracy on narrow tasks | Fine-tuning |
| Model needs to match a specific tone/style | Fine-tuning |
In our experience, about 70% of the use cases that clients think require fine-tuning are actually better served by RAG or prompt engineering. But that remaining 30% - that's where fine-tuning really shines.
How to Fine-Tune - Step by Step
Step 1 - Define the Task Precisely
Vague goals like "make the model understand our business" lead to poor fine-tuning results. Be specific.
Good task definitions:
- "Classify customer support emails into these 15 categories with 95%+ accuracy"
- "Extract medication names, dosages, and frequencies from clinical notes"
- "Generate product descriptions in our brand voice given product specifications"
- "Identify clause types in Australian commercial leases"
The more specific the task, the less training data you need and the better the results.
Step 2 - Prepare Your Training Data
This is the most important step - it's where most projects succeed or fail.
Data format. Fine-tuning data for chat models is a set of example conversations in JSONL format - one JSON record per line, each containing the messages the model should learn to reproduce (shown here across multiple lines for readability):
```json
{"messages": [
  {"role": "system", "content": "You are a customer support classifier for an Australian telco."},
  {"role": "user", "content": "My internet has been dropping out every evening for the past week."},
  {"role": "assistant", "content": "Category: Network Connectivity\nSub-category: Intermittent Outage\nPriority: Medium\nSentiment: Frustrated"}
]}
```
Data quality matters more than quantity. 500 high-quality, carefully reviewed examples will outperform 5,000 messy ones. Every training example teaches the model a pattern - bad examples teach bad patterns.
Data preparation steps:
- Collect real examples from your business processes
- Have domain experts label/annotate them
- Review for consistency - are similar inputs getting similar outputs?
- Remove duplicates and near-duplicates
- Split into training (80%), validation (10%), and test (10%) sets
- Format into the required JSONL structure
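The preparation steps above can be sketched in Python. This is a minimal illustration, assuming your labelled examples arrive as (input, output) string pairs; the function name and system prompt are placeholders for your own pipeline.

```python
import json
import random

def build_dataset(examples, system_prompt, seed=42):
    """Dedupe, shuffle, and split labelled (input, output) pairs 80/10/10,
    then format each example as a chat-style JSONL record."""
    # Remove exact duplicates while preserving order
    seen, unique = set(), []
    for inp, out in examples:
        key = (inp.strip(), out.strip())
        if key not in seen:
            seen.add(key)
            unique.append((inp, out))

    # Fixed seed so the split is reproducible across retraining runs
    random.Random(seed).shuffle(unique)
    n = len(unique)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    splits = {
        "train": unique[:n_train],
        "validation": unique[n_train:n_train + n_val],
        "test": unique[n_train + n_val:],
    }

    # One JSON record per line, in the chat-messages format shown above
    files = {}
    for name, rows in splits.items():
        lines = [
            json.dumps({"messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": inp},
                {"role": "assistant", "content": out},
            ]})
            for inp, out in rows
        ]
        files[name] = "\n".join(lines)
    return files

# Toy data to show the shape of the output
examples = [(f"email {i}", "Category: Billing") for i in range(100)]
files = build_dataset(examples, "You are a support email classifier.")
```

Near-duplicate detection (step 4) is harder than exact matching - in practice we'd add fuzzy matching on normalised text, but that's beyond a sketch.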
How much data do you need? It varies, but as a rough guide:
| Task | Minimum Examples | Recommended |
|---|---|---|
| Text classification | 50-100 per category | 200-500 per category |
| Extraction | 200-500 | 500-1,000 |
| Style/tone matching | 100-200 | 300-500 |
| Domain adaptation | 500-1,000 | 1,000-5,000 |
Step 3 - Choose Your Base Model and Platform
Azure OpenAI fine-tuning is our default recommendation for Australian enterprise. You can fine-tune GPT-4o-mini and GPT-4o. Data stays within your Azure tenant, you get enterprise security, and the models deploy to your own endpoints.
OpenAI direct offers the same models with a simpler interface, but data goes to OpenAI's infrastructure. Not suitable for sensitive data without careful legal review.
Open-source models (Llama, Mistral) give you full control. You can fine-tune on your own infrastructure or use services like Azure Machine Learning. More work to set up but lower ongoing cost at scale and complete data control.
For most Australian organisations we work with, Azure OpenAI fine-tuning is the right balance of capability, control, and compliance.
Step 4 - Run the Fine-Tuning
The actual training process is relatively straightforward once your data is ready.
On Azure OpenAI:
- Upload your training and validation files
- Create a fine-tuning job specifying the base model and hyperparameters
- Monitor training progress and validation loss
- Deploy the fine-tuned model to an endpoint
Key hyperparameters to consider:
- Epochs - How many times the model sees each training example. Start with 3-4 for smaller datasets, 1-2 for larger ones.
- Batch size - Usually auto-selected, but can be tuned if training is unstable.
- Learning rate multiplier - Controls how aggressively the model updates. Start with the default and adjust if results are poor.
Training typically takes 30 minutes to a few hours depending on dataset size and model.
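The Azure steps can be sketched with the openai Python SDK. Everything here is illustrative - the endpoint, API key, API version, and model name are placeholders you'd replace with your own resource's values, and the epoch heuristic simply encodes the guidance above. The job-creation call is defined but not executed.

```python
def pick_epochs(n_examples: int) -> int:
    """Heuristic from the guidance above: more passes over small
    datasets, fewer over large ones."""
    if n_examples < 1000:
        return 3
    if n_examples < 5000:
        return 2
    return 1

def launch_finetune(train_path, val_path, n_examples):
    """Upload training files and create a fine-tuning job on Azure OpenAI.
    Endpoint, key, and model name are placeholders - not executed here."""
    from openai import AzureOpenAI  # pip install openai

    client = AzureOpenAI(
        azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
        api_key="YOUR-API-KEY",
        api_version="2024-08-01-preview",
    )
    train_file = client.files.create(file=open(train_path, "rb"),
                                     purpose="fine-tune")
    val_file = client.files.create(file=open(val_path, "rb"),
                                   purpose="fine-tune")
    # Training runs asynchronously; poll client.fine_tuning.jobs.retrieve()
    # with the returned job id to monitor progress and validation loss
    return client.fine_tuning.jobs.create(
        model="gpt-4o-mini",
        training_file=train_file.id,
        validation_file=val_file.id,
        hyperparameters={"n_epochs": pick_epochs(n_examples)},
    )
```

Leaving batch size and learning rate multiplier unset lets the service auto-select them, which is the right starting point for most jobs.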
On open-source models:
Tools like LoRA (Low-Rank Adaptation) and QLoRA make fine-tuning efficient. You don't update all model weights - just a small adapter layer. This dramatically reduces the compute needed.
```
Base Model (frozen) + LoRA Adapter (trained) = Fine-Tuned Model
```
A Llama 3 70B model can be fine-tuned with QLoRA on a single A100 GPU in a few hours. The adapter weights are typically just a few hundred megabytes versus the full model's 130GB+.
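The arithmetic behind those small adapters is worth seeing. For a d×d weight matrix, LoRA trains two low-rank matrices (d×r and r×d) instead of all d² weights. This is a rough illustration assuming four square projection matrices per layer - real architectures mix matrix shapes, so treat the numbers as order-of-magnitude only.

```python
def lora_param_counts(d_model: int, rank: int, n_layers: int,
                      mats_per_layer: int = 4):
    """Trainable parameters: full fine-tuning of square d×d projection
    matrices vs. a rank-r LoRA adapter (B @ A, shapes d×r and r×d)."""
    full = n_layers * mats_per_layer * d_model * d_model
    lora = n_layers * mats_per_layer * 2 * d_model * rank
    return full, lora

# Roughly Llama-3-70B-scale: hidden size 8192, 80 layers, rank 16
full, lora = lora_param_counts(d_model=8192, rank=16, n_layers=80)
ratio = full / lora  # reduction factor is d_model / (2 * rank)
```

At rank 16 the adapter trains about 1/256th of the weights it replaces, which is why adapter checkpoints are megabytes rather than the full model's 130GB+.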
Step 5 - Evaluate Rigorously
Don't just eyeball a few outputs. Use your held-out test set to measure performance properly.
Metrics to track:
- Accuracy (for classification tasks) - What percentage of predictions are correct?
- F1 score - Balances precision and recall. Better than accuracy for imbalanced datasets.
- Exact match rate (for extraction) - What percentage of extracted fields are exactly correct?
- Human evaluation (for generation) - Have domain experts rate output quality on a scale.
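Accuracy and macro-averaged F1 are straightforward to compute without any dependencies. A minimal sketch, with toy labels standing in for your test set:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the expected label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 averaged equally, so rare
    categories count as much as common ones - better than accuracy
    for imbalanced datasets."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: 5 test items, 3 categories
y_true = ["billing", "network", "network", "billing", "churn"]
y_pred = ["billing", "network", "billing", "billing", "churn"]
acc = accuracy(y_true, y_pred)
```

For real projects with many categories, a library like scikit-learn also gives you per-class breakdowns and confusion matrices, which tell you *where* the model fails, not just how often.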
Compare against baselines:
- Base model with your best prompt (no fine-tuning)
- Base model with few-shot examples
- Previous approach (manual process, rule-based system)
If the fine-tuned model doesn't meaningfully beat the prompted base model, fine-tuning wasn't worth it. Go back to your data and task definition.
Step 6 - Deploy and Monitor
Fine-tuned models need ongoing attention.
Monitor for drift. If your domain changes (new products, updated regulations, different customer language), model performance will degrade. Set up regular evaluation against fresh test data.
Version your models. Keep track of which training data produced which model version. You'll need this when performance drops and you need to retrain.
Plan for retraining. Budget for quarterly or semi-annual retraining as your data and domain evolve.
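A drift check can be as simple as re-scoring a fresh labelled sample against the accuracy you recorded at deployment. A sketch - the model version string and tolerance are illustrative, and you'd wire the report into whatever alerting you already run:

```python
def check_drift(model_version, baseline_accuracy, fresh_results,
                tolerance=0.03):
    """Compare accuracy on a fresh labelled sample against the accuracy
    recorded when this model version was deployed.
    fresh_results is a list of (expected, predicted) pairs."""
    correct = sum(expected == predicted
                  for expected, predicted in fresh_results)
    current = correct / len(fresh_results)
    # Flag retraining when performance drops below baseline by more
    # than the tolerance band
    drifted = current < baseline_accuracy - tolerance
    return {"model_version": model_version,
            "baseline": baseline_accuracy,
            "current": round(current, 3),
            "retrain_recommended": drifted}

# Hypothetical: a classifier deployed at 94% accuracy, re-scored on
# 100 fresh examples of which 85 are now correct
report = check_drift("support-classifier-v3", 0.94,
                     [("billing", "billing")] * 85
                     + [("network", "billing")] * 15)
```

Running this monthly against freshly labelled examples, keyed by model version, covers both the drift monitoring and the versioning discipline described above.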
Industry-Specific Examples
Financial Services
We fine-tuned a classification model for an Australian financial services firm to categorise customer complaints according to ASIC reporting categories. The base model achieved 78% accuracy with prompt engineering. After fine-tuning on 2,000 labelled examples, accuracy jumped to 94%.
The key was getting domain experts to label training data using the exact ASIC taxonomy. Consistency in labelling was more important than volume.
Manufacturing
For a manufacturing client, we fine-tuned an extraction model to pull specifications from supplier technical datasheets. These documents used inconsistent formats and industry-specific shorthand. Fine-tuning improved extraction accuracy from 72% (prompted base model) to 91%.
Professional Services
A law firm needed consistent clause classification across commercial contracts. We fine-tuned on 1,500 examples of Australian commercial contract clauses labelled by their legal team. The model now classifies clauses into 22 categories with 89% accuracy, saving hours per contract review.
Cost Considerations
Fine-tuning costs include:
- Data preparation - This is the biggest cost. Expect 40-80 hours of domain expert time for a typical project.
- Training compute - Azure OpenAI fine-tuning is priced per training token. A 1,000-example dataset on GPT-4o-mini might cost $50-200 to train.
- Inference - Fine-tuned models on Azure OpenAI cost more per token than base models. Factor this into your unit economics.
- Ongoing maintenance - Retraining, evaluation, and data updates. Budget 10-20% of initial effort annually.
For open-source models, training compute is your main cost. A fine-tuning run on a cloud GPU typically costs $10-100 depending on model size and dataset.
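Training compute is easy to estimate before you commit: billed training tokens are roughly your dataset's tokens multiplied by the number of epochs. The per-token price below is a placeholder - check your provider's current fine-tuning rates before budgeting.

```python
def training_cost_estimate(n_examples, avg_tokens_per_example, n_epochs,
                           price_per_million_tokens):
    """Back-of-envelope training cost. The model sees every example
    once per epoch, so billed tokens = dataset tokens × epochs."""
    training_tokens = n_examples * avg_tokens_per_example * n_epochs
    return training_tokens * price_per_million_tokens / 1_000_000

# 1,000 examples averaging 2,000 tokens each, 4 epochs, at a
# hypothetical $8 per million training tokens
cost = training_cost_estimate(1000, 2000, 4, 8.00)
```

The estimate scales linearly in every input, so halving epochs or trimming verbose examples cuts the bill proportionally - data preparation time remains the dominant cost either way.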
Getting Started with Fine-Tuning
Here's our recommended path:
- Start with prompt engineering. Optimise your prompts before considering fine-tuning.
- Collect and label data. Start building a labelled dataset from your real business processes now, even if you're not ready to fine-tune yet.
- Run a pilot. Fine-tune on a single, well-defined task with at least 500 examples. Measure results rigorously.
- Scale what works. If the pilot shows clear improvement, expand to more tasks.
If you're considering fine-tuning AI models for your industry, get in touch with our team. We help Australian organisations evaluate whether fine-tuning is the right approach and execute it properly when it is. Explore our AI development services and Azure AI consulting to learn more about how we work.