
Azure AI Rate Limits and Autoscale - What You Actually Need to Know

March 8, 2026 · 9 min read · Michael Ridland

Nothing kills an AI proof-of-concept faster than hitting a rate limit wall in production. You've spent weeks building a document processing pipeline that works beautifully in testing, your stakeholders are excited, and then you go live and half your requests start returning 429 errors. The support tickets pile up. Someone from the business asks "is AI actually ready for this?" And your carefully planned rollout turns into a firefighting exercise.

I've watched this play out at multiple Australian organisations. The cause is almost always the same - nobody thought about rate limits until they became a problem.

Azure AI services come with default rate limits, measured in transactions per second (TPS). Out of the box, most resources cap at 10 TPS. That sounds like a lot until you realise that a single batch processing job can chew through that in seconds, or that your customer-facing chatbot just got featured in an internal newsletter and suddenly 200 people are using it at once.

Microsoft has a feature called autoscale that helps with this. It's been available for a while now, but most teams we work with either don't know about it or haven't configured it properly. Let's fix that.

How Rate Limits Actually Work

Every Azure AI resource - whether it's Vision, Language, Document Intelligence, or anything else under the Foundry Tools umbrella - has a static rate limit. This is a cap on how many API calls per second your resource will accept.

Go over that limit and you get a 429 HTTP response. "Too Many Requests." Your application needs to handle this gracefully (back off and retry), or your users see errors.

The default is typically 10 TPS. For a development environment or a small internal tool, that's usually fine. For anything with real users or batch processing requirements, it's not.

Here's the thing that catches people out: the rate limit applies to your resource, not to your application. If you've got three different applications all hitting the same Azure AI resource, they're sharing that 10 TPS budget. Monday morning rolls around, all three kick off their processing jobs, and they're fighting each other for capacity.
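If several applications genuinely must share one resource, a client-side throttle at least keeps them collectively under the cap instead of burning it in bursts. Here's a minimal sketch of a token-bucket limiter - the class and the 10 TPS figure are illustrative, not anything Azure provides:

```python
import threading
import time

class TokenBucket:
    """Client-side throttle: allow at most `rate` calls per second overall."""

    def __init__(self, rate, capacity=None):
        self.rate = rate                  # tokens refilled per second
        self.capacity = capacity or rate  # maximum burst size
        self.tokens = float(self.capacity)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(1.0 / self.rate)

# One bucket shared by every caller of the same Azure AI resource keeps
# the combined request rate under the resource's cap.
shared_limit = TokenBucket(rate=10)
```

Every code path that calls the resource does `shared_limit.acquire()` first. It doesn't fix the underlying contention - Monday morning still means everyone queues - but blocked callers beat 429 storms.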

The Autoscale Feature

Microsoft's autoscale feature does what the name suggests - it automatically adjusts your rate limit based on demand. When your usage spikes, autoscale gradually increases your TPS cap. When demand drops, it scales back down.

The key word there is "gradually." This isn't instant. Here's how it actually works:

  1. Your application hits the current rate limit and starts getting 429 responses.
  2. Autoscale notices the spike and checks whether backend capacity is available.
  3. If capacity exists, it starts increasing your rate limit. This happens within about five minutes.
  4. If your application keeps pushing, the rate continues climbing over time.
  5. The maximum you can reach through autoscale is around 1,000 TPS, though this depends on available backend capacity.

What autoscale doesn't do: It doesn't eliminate 429 errors entirely. You'll still get them during the ramp-up period while autoscale catches up. And if backend capacity is constrained - which can happen during regional peak hours or when Microsoft is dealing with high demand - the increase might be slower or more limited.

This means your application still needs to handle 429 responses properly. Autoscale is a safety net, not a guarantee.

Turning It On

Autoscale is disabled by default. You need to opt in, and you can do it through either the Azure portal or the CLI.

Portal method: Go to your AI resource, click Overview, find the Autoscale line in the Essentials section, and click through to enable it.

CLI method: Run this command:

az resource update \
  --namespace Microsoft.CognitiveServices \
  --resource-type accounts \
  --set properties.dynamicThrottlingEnabled=true \
  --resource-group your-resource-group \
  --name your-resource-name

That's it. No other configuration. It either autoscales or it doesn't - there's no knob to set a target TPS or a maximum spend.

Which brings us to the part people don't think about until it's too late.

The Cost Trap

Autoscale doesn't change Azure AI pricing. You still pay per transaction. But higher rate limits mean more transactions get completed, which means higher bills.

On its own, that's fine - you're processing more work, so you pay more. Makes sense.

The danger is bugs. I've seen it happen twice now: a client enables autoscale, a bug in their application creates a retry loop, and suddenly they're making hundreds of calls per second. Without autoscale, the rate limit would have capped the damage at 10 TPS. With autoscale, the system happily processes as fast as it can, and the bill at the end of the month has an extra zero on it.

My strong recommendation: Don't enable autoscale until your application is stable and well-tested. Develop and test against a resource with the fixed default rate limit. Once you're confident there are no runaway loops or duplicate processing bugs, enable autoscale on your production resource. Set up Azure cost alerts so you get notified if spending exceeds expectations.

You can also disable autoscale at any time through the portal or CLI if you decide you'd rather have predictable costs than dynamic scaling. It takes about five minutes for the change to take effect.

Which Services Support Autoscale

Not every Azure AI service supports autoscale. As of now, it's available for:

  • Azure Vision (image analysis, OCR, etc.)
  • Language service (sentiment analysis, key phrase extraction, named entity recognition, Text Analytics for health)
  • Anomaly Detector
  • Content Moderator
  • Custom Vision (prediction endpoints only)
  • Immersive Reader
  • LUIS
  • Metrics Advisor
  • Personalizer
  • QnA Maker
  • Document Intelligence (get operations, list operations, and model management only)

Notably absent from this list: Azure OpenAI. If you're building with GPT-4, GPT-4o, or other OpenAI models through Azure, autoscale doesn't apply. Azure OpenAI has its own rate limiting and quota system based on tokens per minute, which works differently. That's a topic for another post entirely.

Also worth noting: autoscale is only available on paid tiers. Free tier resources don't get it.

What to Do When Default Limits Aren't Enough

Sometimes 10 TPS isn't enough even as a starting point, and waiting for autoscale to ramp up isn't acceptable.

In that case, you can request a higher default rate limit from Microsoft. Go to your resource in the Azure portal, open a support request, and include a business justification for why you need more capacity.

In our experience, Microsoft is generally reasonable about granting these requests, especially for production workloads with clear business use cases. Be specific about what you need and why - "we process 50,000 documents per day in batches of 1,000 and need a sustained 50 TPS" is much better than "we need more capacity."
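Arriving at a number like that is simple arithmetic. Using the figures from the example request above, with a batch-completion target as the one added assumption:

```python
# Worked example from the request above; the 20-second batch target is an assumption.
docs_per_day = 50_000
batch_size = 1_000
target_batch_seconds = 20       # how quickly each submitted batch should clear

required_tps = batch_size / target_batch_seconds
batches_per_day = docs_per_day / batch_size

print(required_tps)     # → 50.0, matching the "sustained 50 TPS" request
print(batches_per_day)  # → 50.0 batches spread across the day
```

Showing your working like this in the support request also makes it easy for Microsoft to sanity-check the number you're asking for.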

For organisations running AI workloads at serious scale, capacity planning becomes a real exercise. You need to understand your peak loads, your processing patterns, and your tolerance for latency. This is something we help clients with regularly through our Azure AI consulting work. Getting the capacity model right up front avoids the kind of production surprises that make everyone nervous about AI projects.

Architecture Patterns That Help

Beyond just configuring autoscale, there are a few architectural decisions that make rate limit issues less painful.

Queue-based processing. Instead of calling Azure AI services directly from your application in real time, push work items into a queue (Azure Service Bus, Storage Queues, whatever you prefer) and have a background processor that pulls items at a controlled rate. This decouples your user-facing application from API rate limits. Users submit work, it gets processed as capacity allows, and results are available when ready. For batch workloads, this is almost always the right pattern.
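A minimal sketch of the pattern, using the standard library's in-process queue as a stand-in for Azure Service Bus or Storage Queues - the pacing rate and function names are illustrative:

```python
import queue
import threading
import time

work_queue = queue.Queue()
results = {}
stop = threading.Event()

def process_item(item):
    """Stand-in for the real Azure AI service call."""
    return f"processed:{item}"

def worker(max_tps=10):
    """Drain the queue at a controlled rate so we never exceed the resource's cap."""
    interval = 1.0 / max_tps
    while not stop.is_set():
        try:
            item = work_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        results[item] = process_item(item)
        work_queue.task_done()
        time.sleep(interval)   # pace calls to at most max_tps

# Users submit work instantly; processing happens as capacity allows.
threading.Thread(target=worker, daemon=True).start()
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    work_queue.put(doc_id)
work_queue.join()              # block until everything has been processed
stop.set()
```

In production the queue lives outside the process, so submissions survive restarts and you can scale workers independently of the front end - but the shape is the same: submit fast, process at a rate you control.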

Separate resources for separate workloads. Remember the earlier point about multiple applications sharing a single resource's rate limit? Don't do that. Your real-time customer-facing chatbot should have its own resource, separate from your batch document processing pipeline, separate from your nightly analytics job. Each gets its own rate limit, and they stop fighting each other for capacity.

Retry with exponential backoff. Your application needs to handle 429 errors no matter what. Implement exponential backoff - wait a bit, try again, wait a bit longer, try again. Most Azure SDKs have this built in, but I've seen plenty of custom integrations where someone wrote a simple retry loop with no backoff, which just hammers the API repeatedly and makes things worse.
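For custom integrations, the pattern looks like this - a sketch where the exception type and limits are illustrative, since most Azure SDKs ship an equivalent retry policy built in:

```python
import random
import time

class TooManyRequests(Exception):
    """Stand-in for an HTTP 429 response from the service."""

def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry fn() on 429s, doubling the wait each attempt, with jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise                     # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids a thundering herd
```

If the 429 response carries a Retry-After header, honour it instead of your computed delay - the service is telling you exactly how long to wait.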

Circuit breakers. If you're getting persistent 429 errors, stop calling the service for a while rather than continuing to hammer it. The circuit breaker pattern lets your application degrade gracefully instead of queuing up thousands of failed requests.
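A minimal version of the pattern - thresholds, names, and the open-circuit exception are all illustrative:

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, skip calls for `cooldown_s` seconds."""

    def __init__(self, threshold=5, cooldown_s=60.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: skipping call")  # degrade gracefully
            self.opened_at = None      # cooldown elapsed: let one call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # any success resets the counter
        return result
```

The caller catches the open-circuit error and returns a fallback - a cached result, a "try again later" message - instead of piling thousands of doomed requests onto an already-throttled service.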

Monitoring and Alerting

You should absolutely set up monitoring for your Azure AI resources. At minimum:

  • Track 429 response rates. Azure Monitor can alert you when throttling exceeds a threshold. If you're getting 429s during normal business hours, something needs attention - either your rate limit is too low, or your usage patterns have changed.
  • Set cost alerts. Especially if you've enabled autoscale. Azure Cost Management lets you set budget alerts at whatever threshold makes sense for your organisation.
  • Monitor latency. Rate limits aren't the only thing that affects performance. If response times are climbing even without 429 errors, you might be approaching limits or hitting capacity constraints.

We build these monitoring dashboards as standard practice for clients running AI workloads. It's one of those things that feels like overhead during development but becomes essential in production.

Planning for Production AI

If there's one takeaway from all of this, it's that rate limits are an infrastructure concern, not an afterthought. Treat them with the same seriousness you'd give database connection pools or network bandwidth.

For organisations building their first production AI workloads on Azure, here's the conversation I'd want to have early:

  • What are your expected peak loads? Not average - peak.
  • How many concurrent users or processes will call AI services?
  • What's your tolerance for latency? Can users wait two seconds? Twenty seconds? Can processing be async?
  • What happens if the AI service is temporarily unavailable? Does the whole application stop, or can it degrade gracefully?

These questions shape your architecture, your capacity planning, and your autoscale decisions. Microsoft's documentation on the autoscale feature covers the technical details well, but the planning conversation is just as important.

If you're building AI solutions on Azure and want help getting the infrastructure right - rate limits, capacity planning, architecture patterns, monitoring - that's squarely in what our Azure AI team does. We've learned these lessons so you don't have to learn them the hard way.

The AI part of AI projects gets all the attention. But the boring infrastructure stuff - rate limits, queuing, monitoring, cost management - is what determines whether your AI project actually works in production or just demos well. Get it right early and save yourself the 2am incident call.