AI Incident Response - What to Do When Your AI System Gets It Wrong

April 19, 2026 · 11 min read · Michael Ridland

What do you do when your AI system makes a mistake?

Not if - when. Every AI system will produce incorrect, unexpected, or harmful outputs at some point. The question is whether you're prepared for it.

In our experience, the businesses that handle AI incidents well are the ones that planned for them before they happened. The ones that handle them poorly are the ones that assumed their AI would always work correctly.

Here's how to build and execute an AI incident response plan.

Why AI Incidents Are Different

AI incidents share characteristics with traditional IT incidents, but they have unique qualities that require specific handling:

Non-deterministic behaviour. Traditional software bugs produce the same wrong output every time for the same input. AI systems can produce different outputs for similar inputs, making incidents harder to reproduce and diagnose.

Gradual degradation. Traditional systems tend to fail suddenly - they work or they don't. AI systems can degrade gradually as data drifts, models age, or edge cases accumulate. By the time someone notices, the problem may have been affecting customers for weeks.

Scale of impact. A flawed AI system can make thousands of wrong decisions before the problem is detected. Unlike a human making one mistake at a time, AI errors can be systematic and widespread.

Root cause complexity. AI failures can stem from the model itself, the training data, the input data, the integration, the infrastructure, or the interaction between all of these. Diagnosis often requires multiple specialisms.

Reputational sensitivity. AI mistakes attract disproportionate media and public attention. "AI system discriminates against customers" is a much bigger story than "software bug causes processing error."

Types of AI Incidents

Understanding the types of incidents you might face helps you prepare for them:

Accuracy Failures

The AI produces incorrect outputs - wrong answers, incorrect classifications, flawed recommendations.

Examples:

  • A customer service AI gives wrong information about products or policies
  • A document processing AI misclassifies invoices, causing incorrect accounting
  • A recommendation system suggests unsuitable products

Bias and Fairness Incidents

The AI treats different groups of people differently in ways that are unfair or discriminatory.

Examples:

  • A lending AI approves applications at different rates for different demographic groups
  • A hiring AI systematically ranks candidates from certain backgrounds lower
  • A pricing AI charges different amounts based on factors that correlate with protected attributes

Data Incidents

Problems with the data flowing into or out of the AI system.

Examples:

  • Personal information leaked through AI outputs
  • Training data exposed through a security breach
  • AI system accessing data it shouldn't have
  • AI outputs containing data from one customer visible to another

Security Incidents

The AI system is compromised or manipulated.

Examples:

  • Prompt injection attacks causing the AI to behave maliciously
  • Adversarial inputs causing the AI to make wrong decisions
  • Unauthorised access to AI models or training data
  • AI system used as a vector for broader system compromise

Operational Incidents

The AI system fails to operate correctly.

Examples:

  • AI system goes offline, disrupting business processes
  • Performance degrades to unacceptable levels
  • AI system consumes excessive resources, affecting other systems
  • Integration failures between the AI and other systems

Reputational Incidents

The AI produces outputs that damage your organisation's reputation.

Examples:

  • AI generates offensive or inappropriate content
  • AI makes statements that contradict your organisation's values
  • AI interactions go viral on social media for the wrong reasons
  • AI decisions that appear unfair generate media coverage

Building an AI Incident Response Plan

Don't wait for an incident to figure out how you'll respond. Build the plan now.

1. Define What Constitutes an AI Incident

Not every AI mistake is an incident. Define thresholds:

Severity levels:

Critical: Immediate, significant harm to customers or the business. Examples: AI making discriminatory decisions at scale, personal data breach through AI, AI causing financial loss to customers. Response: Immediate, all-hands.

High: Significant impact on customers or operations, but contained. Examples: AI giving wrong advice to customers, noticeable bias detected, AI system offline affecting critical processes. Response: Within 1 hour, dedicated response team.

Medium: Moderate impact, limited in scope. Examples: AI accuracy drops below threshold, intermittent errors in AI outputs, performance degradation. Response: Within 4 hours, assigned responders.

Low: Minor issues, minimal customer impact. Examples: Occasional incorrect outputs caught by human review, minor performance issues, cosmetic problems. Response: Next business day, normal operational handling.
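Severity tiers only help if they're unambiguous at 2am. One way to remove ambiguity is to encode them directly in your incident tooling. The sketch below is illustrative only - the tier names come from the table above, but the structure and function names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityLevel:
    name: str
    response_within_hours: float  # maximum time before the response begins
    team: str                     # who gets activated

# Hypothetical encoding of the tiers described above.
# "Critical" uses 0.0 to mean an immediate response.
SEVERITY_LEVELS = {
    "critical": SeverityLevel("critical", 0.0, "all-hands"),
    "high":     SeverityLevel("high", 1.0, "dedicated response team"),
    "medium":   SeverityLevel("medium", 4.0, "assigned responders"),
    "low":      SeverityLevel("low", 24.0, "normal operational handling"),
}

def response_deadline_hours(severity: str) -> float:
    """Look up the response SLA for a reported severity."""
    return SEVERITY_LEVELS[severity].response_within_hours
```

Whatever form it takes, the point is that the mapping from severity to response time and team should exist in writing before the first incident, not be negotiated during one.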

2. Establish the Response Team

Define who is involved and their roles:

Incident Commander: Owns the overall response. Makes decisions about severity, resources, and communication. Usually a senior manager or director.

Technical Lead: Leads diagnosis and resolution. Coordinates the technical team. Usually the AI system's technical owner or a senior engineer.

Business Owner: Represents the business function affected. Assesses business impact. Decides on interim business processes. Usually the AI system's business owner.

Communications Lead: Manages internal and external communication. Coordinates with PR, legal, and customer service. Usually from communications or marketing.

Legal/Compliance: Assesses legal and regulatory obligations. Advises on disclosure requirements. Determines if regulators need to be notified. Usually from legal or compliance teams.

Customer Service Lead: Manages customer-facing response. Briefs service teams. Handles customer complaints related to the incident. Usually from customer service management.

Not every incident needs every role. Scale the team to the severity.

3. Define the Response Process

Phase 1 - Detection and Reporting

How incidents are detected:

  • Automated monitoring and alerting
  • Human review processes
  • Customer complaints
  • Employee reports
  • Regulatory enquiries
  • Media reports

How incidents are reported:

  • Clear reporting channels (email address, messaging channel, phone number)
  • Simple reporting form (what happened, when, who's affected, how severe)
  • Accessible to everyone - not just the technical team
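If reports come in through a form or ticketing system, it's worth capturing the same minimal fields every time so triage isn't chasing missing details. A hypothetical sketch of those fields (names are ours, not a standard):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentReport:
    """Minimal fields matching the reporting form above; illustrative only."""
    what_happened: str
    reported_by: str
    severity_guess: str           # the reporter's best guess; triage confirms it
    who_is_affected: str = "unknown"
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

Keeping the required fields this small matters: a form that demands detail the reporter doesn't have is a form that doesn't get filled in.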

Phase 2 - Triage and Assessment

Within the first 30-60 minutes of detection:

  1. Confirm the incident - Is this actually an AI problem? What's happening?
  2. Assess severity - How many people are affected? What's the potential harm?
  3. Classify the incident type - Accuracy, bias, data, security, operational, reputational?
  4. Activate the response team - Based on severity, assemble the right people
  5. Initial containment - Can we stop the harm while we investigate?

Phase 3 - Containment

Stop the bleeding. Options depend on the incident:

  • Disable the AI system - If the harm is ongoing and severe, turn it off
  • Revert to manual processes - Fall back to human handling
  • Roll back to a previous version - If a recent change caused the problem
  • Apply guardrails - Restrict the AI's scope or capabilities while investigating
  • Increase human review - Add human checks to AI outputs

The key question: Is it safer to keep the AI running or to shut it down? Both options have consequences. A credit decisioning AI that's occasionally making wrong decisions might still be better than stopping all credit decisions. An AI chatbot that's giving dangerous health advice should be shut down immediately.

Have pre-agreed criteria for when to disable AI systems. Don't leave this decision to the person on call at 2am.
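Those pre-agreed criteria can be written down as a simple decision function so the on-call engineer is applying policy rather than improvising. The thresholds and field names below are purely illustrative - your criteria will depend on the system:

```python
def should_disable(incident: dict) -> bool:
    """Illustrative pre-agreed criteria for taking an AI system offline.

    Disable immediately when harm is severe; for high-severity incidents,
    disable only if harm is ongoing and no manual fallback can absorb
    the load. Otherwise prefer softer containment (guardrails, extra
    human review).
    """
    if incident["severity"] == "critical":
        return True
    if incident["severity"] == "high" and incident.get("harm_ongoing", False):
        return not incident.get("manual_fallback_available", False)
    return False
```

The exact rules matter less than the fact that they were agreed in daylight, by people who understand both the harm of running and the harm of stopping.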

Phase 4 - Investigation

Find out what happened and why:

  • What outputs did the AI produce?
  • When did the problem start?
  • What changed? (Data, model, configuration, integration, environment)
  • How many people were affected?
  • What was the actual harm?
  • Is the root cause understood?

AI-specific investigation steps:

  • Review model inputs and outputs around the time of the incident
  • Check for data drift or data quality issues
  • Review recent model updates or configuration changes
  • Check integration points for failures or changes
  • Analyse model behaviour for the affected cases
  • Review monitoring data for early warning signs that were missed
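Checking for data drift doesn't have to be elaborate. One common technique - a sketch, not the only option - is the population stability index (PSI), which compares the input distribution during the incident window against a known-good baseline:

```python
import math

def population_stability_index(expected: list[float],
                               actual: list[float]) -> float:
    """Compare two binned distributions of the same feature.

    Both inputs are per-bin proportions summing to 1 (e.g. the baseline
    week vs the incident window). PSI above ~0.2 is commonly read as
    significant drift. A small floor avoids log(0) for empty bins.
    """
    eps = 1e-6
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi
```

Run this per input feature: a feature whose PSI jumped around the time the incident started is a strong lead for the root cause.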

Document everything. You'll need this for the post-incident review, regulatory reporting (if required), and potential legal proceedings.

Phase 5 - Resolution

Fix the problem:

  • Apply the fix (model update, data correction, configuration change, code fix)
  • Test the fix before deploying to production
  • Monitor closely after deploying
  • Confirm the incident is resolved

Phase 6 - Recovery

Address the consequences:

  • Identify all affected customers or stakeholders
  • Determine appropriate remediation (apology, correction, compensation)
  • Communicate the resolution
  • Restore normal operations
  • Confirm monitoring is in place to detect recurrence

Phase 7 - Post-Incident Review

Learn from the incident:

  • What happened and why?
  • How was it detected? Could it have been detected earlier?
  • Was the response effective? What could be improved?
  • What systemic changes are needed to prevent recurrence?
  • Do processes, monitoring, or governance need to be updated?

Conduct the review within 1-2 weeks of the incident while details are fresh. Document the findings and track improvement actions to completion.

4. Communication Protocols

Internal communication:

  • Who is informed and when (based on severity)?
  • What communication channels are used?
  • How frequently are updates provided?
  • Who approves external communications?

Customer communication:

  • When do customers need to be informed?
  • What channels are used (email, in-app, phone, website)?
  • What level of detail is appropriate?
  • Who handles customer enquiries?

Regulatory communication:

  • When must regulators be notified?
  • Under the Notifiable Data Breaches scheme, suspected breaches must be assessed within 30 days, and the OAIC and affected individuals notified as soon as practicable once a breach is assessed as eligible
  • APRA-regulated entities have additional reporting obligations
  • What information do regulators need?

Media communication:

  • Who handles media enquiries?
  • What's the approved messaging?
  • Who approves media statements?

General communication principles:

  • Be honest about what happened
  • Explain what you're doing about it
  • Acknowledge the impact on affected people
  • Don't speculate about causes before the investigation is complete
  • Follow up when you have more information

AI-Specific Monitoring

Good monitoring reduces the time between an AI problem starting and being detected. For AI systems, monitor:

Performance Metrics

  • Accuracy and error rates (overall and by segment)
  • Confidence score distributions
  • Response time and throughput
  • Error types and patterns

Data Quality

  • Input data distributions (detect drift)
  • Missing data rates
  • Data format anomalies
  • Volume anomalies

Fairness Metrics

  • Outcome rates by demographic group
  • Accuracy by demographic group
  • Complaint rates by group
  • Override rates by group
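Outcome rates by group are straightforward to compute from decision logs. A minimal sketch (function names ours; the "four-fifths" threshold is an informal rule of thumb, not a legal standard):

```python
def approval_rates(decisions: list[tuple[str, bool]]) -> dict[str, float]:
    """Approval rate per group from (group, approved) decision records."""
    totals: dict[str, int] = {}
    approved: dict[str, int] = {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + int(ok)
    return {g: approved[g] / totals[g] for g in totals}

def parity_ratio(rates: dict[str, float]) -> float:
    """Min/max ratio of group outcome rates. Values below ~0.8 (the
    informal 'four-fifths rule') often warrant investigation."""
    return min(rates.values()) / max(rates.values())
```

A low parity ratio isn't proof of unfairness on its own - groups can differ for legitimate reasons - but it's exactly the kind of signal that should trigger a human look before a complaint or a journalist does.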

Operational Metrics

  • System availability
  • Resource utilisation
  • Cost per query/decision
  • Queue depths and processing times

Business Metrics

  • Customer satisfaction scores
  • Complaint rates
  • Escalation rates
  • Override rates

Set alerts for each metric. Don't rely on someone remembering to check dashboards. Automated alerts that trigger when metrics cross thresholds are essential.
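The threshold check itself can be very simple. A sketch of the idea (in practice you'd wire this into whatever alerting platform you already run, rather than hand-rolling it):

```python
def check_thresholds(metrics: dict[str, float],
                     thresholds: dict[str, tuple[float, str]]) -> list[str]:
    """Return alert messages for metrics crossing their thresholds.

    `thresholds` maps metric name -> (limit, direction), where direction
    is "min" (alert when the value falls below the limit) or "max"
    (alert when it rises above). A missing metric is itself an alert -
    silence from a monitor is a failure mode, not a pass.
    """
    alerts = []
    for name, (limit, direction) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: metric missing")
        elif direction == "min" and value < limit:
            alerts.append(f"{name}: {value} below minimum {limit}")
        elif direction == "max" and value > limit:
            alerts.append(f"{name}: {value} above maximum {limit}")
    return alerts
```

Note the design choice: an absent metric fires an alert too. Many gradual-degradation incidents hide behind a monitor that quietly stopped reporting.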

Preparing Your Organisation

Training

Everyone involved in AI incident response needs to know their role:

  • Technical teams need to know how to diagnose AI-specific problems
  • Business teams need to know how to activate manual fallback processes
  • Customer service teams need to know how to handle AI-related complaints
  • Leadership needs to know how to make containment decisions

Exercises

Run tabletop exercises to test your incident response plan. Present realistic AI incident scenarios and walk through the response:

  • Who gets called?
  • What decisions need to be made?
  • What information is needed?
  • How is communication handled?
  • What are the gaps in the plan?

We recommend running exercises at least annually, and after any significant changes to AI systems.

Documentation

Maintain current documentation for every AI system:

  • System architecture and components
  • Known limitations and failure modes
  • Rollback procedures
  • Manual fallback processes
  • Monitoring and alerting setup
  • Key contacts and escalation paths

This documentation is useless if it's outdated. Review it quarterly.

Regulatory Reporting Obligations

Australian businesses have specific reporting obligations that may apply to AI incidents:

Notifiable Data Breaches (Privacy Act): If the AI incident involves unauthorised access to or disclosure of personal information, and is likely to result in serious harm, you must notify the OAIC and affected individuals.

APRA reporting: APRA-regulated entities must report material operational risk events, which may include significant AI incidents.

ASIC reporting: Material AI failures in financial services may trigger ASIC reporting obligations.

Consumer law: If an AI incident affects consumers, ACCC may need to be informed depending on the nature and scale.

Know your reporting obligations before an incident occurs. During an incident is not the time to figure out what you're required to report.

After the Incident - Building Resilience

Every incident is a learning opportunity. After each AI incident:

  1. Update your risk assessment for the affected AI system
  2. Improve monitoring based on how the incident was (or should have been) detected
  3. Update your incident response plan based on what worked and what didn't
  4. Share learnings across the organisation (without blame)
  5. Update AI governance if the incident reveals governance gaps
  6. Brief leadership on systemic issues and improvement plans

How Team 400 Helps

At Team 400, we build AI systems with incident response in mind from the start. That means monitoring, alerting, fallback procedures, and documentation are part of every AI project we deliver.

We also help organisations develop AI incident response plans, train their teams, and conduct tabletop exercises. Because the time to prepare for an AI incident is before it happens.

If you need help building AI incident response capability, or if you're dealing with an AI incident right now, contact us. We can help you respond effectively and build resilience for the future.