Data Factory Monitoring and Alerting Best Practices

April 21, 2026 · 10 min read · Michael Ridland

A data pipeline that fails at 2 AM and nobody notices until the CFO asks why the morning report has yesterday's numbers - we've seen this happen more often than we'd like to admit. And it's almost always preventable.

Monitoring and alerting for Data Factory isn't technically difficult. The tools are there. The problem is that most teams either don't set it up at all, or they set it up badly and drown in noise until everyone ignores the alerts. Both outcomes lead to the same place: missed failures and lost trust in your data platform.

Here's how we set up monitoring and alerting for Data Factory implementations that actually work in practice.

The Three Levels of Data Factory Monitoring

Think about monitoring at three levels:

  1. Pipeline level - Did this specific pipeline run succeed or fail?
  2. Platform level - Is the Data Factory service healthy overall?
  3. Data quality level - Did the pipeline run successfully but move wrong or incomplete data?

Most teams only implement level 1. The ones who get burned eventually implement all three.

Level 1 - Pipeline Monitoring

Azure Data Factory

ADF has built-in monitoring in the ADF Studio under the Monitor tab. It shows:

  • All pipeline runs with status (Succeeded, Failed, Cancelled, In Progress)
  • Activity-level details within each run
  • Trigger run history
  • Integration runtime status

This is useful for ad-hoc investigation, but it's not a monitoring solution. Nobody sits watching the Monitor tab all day.

Setting up proper pipeline alerts:

  1. In the Azure portal, go to your Data Factory resource
  2. Click Alerts then Create alert rule
  3. Set the condition:
    • Signal: Pipeline Failed Runs (metric)
    • Threshold: Greater than 0
    • Aggregation: Total
    • Period: 5 minutes
  4. Set the action group (email, SMS, Teams webhook, or Logic App)
  5. Name the alert and set severity

Recommended pipeline alerts:

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Pipeline failure | Failed runs > 0 in 5 min | Sev 2 (Warning) | Email + Teams |
| Critical pipeline failure | Failed runs on tagged pipelines | Sev 1 (Error) | Email + SMS + Teams |
| Pipeline running long | Duration > threshold | Sev 3 (Informational) | Email |
| Pipeline didn't run | No successful runs in expected window | Sev 2 (Warning) | Email + Teams |

The "didn't run" alert is the one most people miss. A pipeline that fails generates a failure event. A pipeline that never triggers generates nothing - and that silence is dangerous. Set up a separate check (Azure Logic App or Azure Function) that verifies expected pipelines ran within their scheduled windows.
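The core of that check is plain logic, whatever hosts it. Here's a minimal Python sketch (the schedule format and the grace period are illustrative, not an Azure API - in practice a Logic App or timer-triggered Function would feed this from the pipeline run history):

```python
from datetime import datetime, timedelta

def find_missed_runs(expected_schedules, successful_runs, now):
    """Return pipelines with no successful run inside their expected window.

    expected_schedules: pipeline name -> expected run time (datetime)
    successful_runs: pipeline name -> last successful run time (datetime)
    now: current time; a pipeline is only 'missed' once its window has passed.
    """
    grace = timedelta(minutes=30)  # illustrative tolerance for late starts
    missed = []
    for name, expected_at in expected_schedules.items():
        if now < expected_at + grace:
            continue  # window not closed yet - don't alert prematurely
        last = successful_runs.get(name)
        if last is None or last < expected_at:
            missed.append(name)
    return missed
```

Anything this returns should raise the same Sev 2 alert as an outright failure - silence and failure are equally actionable.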

Fabric Data Factory

Fabric's Monitoring hub shows pipeline run history within the Fabric workspace. The alerting model is different:

  • Fabric doesn't yet have the same granular metric-based alerting as Azure Monitor
  • You can set up alerts through Power BI dashboards that query the Fabric monitoring data
  • For critical alerting, many teams use a small Azure Function that polls the Fabric APIs and sends notifications

This is one area where Fabric is still catching up to standalone ADF. If pipeline monitoring is critical to your operations (and it should be), plan for additional work when using Fabric Data Factory.

Level 2 - Platform Monitoring

Platform monitoring answers the question: "Is our Data Factory healthy, or are we about to have problems?"

What to Track

Integration Runtime Health

If you're using a self-hosted integration runtime (for on-premises data sources), monitor:

  • Node status: Are all IR nodes online?
  • CPU and memory utilisation: High utilisation indicates the IR is underpowered or handling too many concurrent jobs
  • Concurrent job count: Are you hitting limits?
  • Version status: Is the IR up to date?

Set up alerts for:

  • Any IR node going offline (Sev 1 - immediate attention)
  • CPU utilisation above 80% sustained for 15 minutes (Sev 2 - capacity planning needed)
  • IR version more than 2 releases behind (Sev 3 - schedule an update)
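The "sustained for 15 minutes" part matters: a single CPU spike is normal, fifteen consecutive hot minutes is a capacity problem. A sketch of that distinction over per-minute samples (thresholds are the illustrative ones from the list above):

```python
def sustained_breach(samples, threshold=80.0, window=15):
    """True if `window` consecutive samples all exceed `threshold`.

    samples: utilisation percentages, one per minute, oldest first.
    A single spike resets nothing downstream; only an unbroken run alerts.
    """
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= window:
            return True
    return False
```

Azure Monitor metric alerts express the same idea declaratively (aggregation window plus threshold); this is the logic you'd replicate if you're polling a self-hosted IR yourself.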

Trigger Health

Triggers that stop firing are a silent killer. Monitor:

  • Trigger run success/failure
  • Gap between expected and actual trigger fires
  • Trigger state changes (Active to Stopped)

Resource Utilisation

For Azure Data Factory:

  • Data Integration Units (DIU) consumption
  • Activity run counts vs. limits
  • Mapping data flow cluster spin-up times

For Fabric Data Factory:

  • CU consumption (are you hitting capacity limits?)
  • Throttling events
  • Queue times for pipeline activities

Building a Platform Dashboard

We build an operational dashboard for every Data Factory implementation. Here's what goes on it:

Page 1 - Executive Summary

  • Total pipelines: running / succeeded / failed in last 24 hours
  • Pipeline success rate (target: >99%)
  • Average pipeline duration trend (week over week)
  • Outstanding alerts

Page 2 - Pipeline Details

  • Individual pipeline run history (last 7 days)
  • Top 10 slowest pipelines
  • Top 10 most frequently failing pipelines
  • Pipelines with increasing duration trend

Page 3 - Infrastructure

  • Integration runtime node status
  • Resource utilisation trends
  • Monthly cost trend

You can build this in Power BI using the ADF diagnostic logs (sent to Azure Log Analytics) or the ADF monitoring APIs. For Fabric, the admin APIs provide similar data.

Level 3 - Data Quality Monitoring

This is where most Data Factory monitoring stops short, and it's where the real business impact lives.

A pipeline can succeed technically - every activity completes without error - while delivering wrong, incomplete, or stale data. Without data quality checks, you won't know until someone downstream complains.

Data Quality Checks to Implement

Row count validation

After each pipeline run, compare the number of rows loaded against:

  • The source row count (did we get everything?)
  • The previous run's row count (is the change reasonable?)
  • A minimum threshold (did we get at least something?)

Implementation: Add a Lookup activity at the end of your pipeline that counts rows in the destination and compares against expected values. If the check fails, trigger an alert.
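The comparison itself reduces to a few rules. A hedged Python sketch of the three checks listed above (the 25% change limit is an illustrative default, not a recommendation - tune it per pipeline):

```python
def validate_row_counts(source_count, loaded_count, previous_count,
                        min_rows=1, max_change_pct=25.0):
    """Return human-readable failures; an empty list means the load passed."""
    failures = []
    if loaded_count < min_rows:
        failures.append(f"loaded {loaded_count} rows, below minimum {min_rows}")
    if loaded_count != source_count:
        failures.append(f"loaded {loaded_count} rows but source had {source_count}")
    if previous_count:
        change = abs(loaded_count - previous_count) / previous_count * 100
        if change > max_change_pct:
            failures.append(f"row count changed {change:.0f}% vs previous run "
                            f"(limit {max_change_pct:.0f}%)")
    return failures
```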

Schema validation

Source schemas change. A column gets renamed, a new column appears, a data type changes. These changes can break pipelines or, worse, load data into wrong columns.

Implementation: Compare source schema at runtime against a stored baseline. Alert on any changes. This can be done with a Script activity that queries system tables.
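The diff against the baseline is simple once both schemas are in hand (from `INFORMATION_SCHEMA.COLUMNS` or equivalent). A sketch, representing each schema as a column-name-to-type mapping:

```python
def diff_schema(baseline, current):
    """Compare a stored baseline schema against the schema seen at runtime.

    Both arguments map column name -> data type string. Returns a list of
    change descriptions; an empty list means the schemas match.
    """
    changes = []
    for col, dtype in baseline.items():
        if col not in current:
            changes.append(f"column removed: {col}")
        elif current[col] != dtype:
            changes.append(f"type changed: {col} {dtype} -> {current[col]}")
    for col in current:
        if col not in baseline:
            changes.append(f"column added: {col}")
    return changes
```

Any non-empty result should alert before the load runs, not after - loading through a changed schema is how data ends up in the wrong columns.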

Freshness checks

Is the data in your destination tables actually current? Sometimes a pipeline runs successfully but processes an empty or stale source file.

Implementation: Check the maximum timestamp in your destination table after each load. If it hasn't advanced beyond the previous run, something is wrong.
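The check is a comparison of two watermarks, captured before and after the load:

```python
def is_fresh(max_ts_after_load, max_ts_before_load):
    """A load is fresh only if the destination's max timestamp advanced.

    Both arguments are datetimes (or None if the table was empty).
    """
    return (max_ts_after_load is not None
            and (max_ts_before_load is None
                 or max_ts_after_load > max_ts_before_load))
```

Note the strict inequality: a load that re-delivers the same stale file leaves the watermark unchanged, and that must fail the check.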

Value range validation

For business-critical fields, validate that values fall within expected ranges:

  • Revenue figures should be positive
  • Dates should be within reasonable bounds
  • Percentages should be between 0 and 100
  • Foreign keys should exist in reference tables

Implementation: Post-load SQL queries that check constraints and alert on violations.
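In production these live as SQL constraints or post-load queries; the same rules, expressed over loaded rows in Python for illustration (the field names `revenue`, `discount_pct`, and `order_date` are hypothetical):

```python
from datetime import date

def check_business_rules(rows):
    """Validate rows against simple range rules; returns (index, rule) violations."""
    violations = []
    for i, row in enumerate(rows):
        if row["revenue"] < 0:
            violations.append((i, "revenue must not be negative"))
        if not 0 <= row["discount_pct"] <= 100:
            violations.append((i, "discount_pct outside 0-100 range"))
        if not date(2000, 1, 1) <= row["order_date"] <= date(2100, 1, 1):
            violations.append((i, "order_date outside reasonable bounds"))
    return violations
```

Foreign-key existence checks don't fit this row-at-a-time shape - those stay as set-based SQL against the reference tables.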

A Practical Data Quality Pattern

Here's a pattern we use frequently:

Copy Data --> Row Count Check --> Schema Check --> Business Rules Check --> Log Results
                  |                    |                    |
                  v                    v                    v
            [Alert if fail]      [Alert if fail]      [Alert if fail]

Each check writes results to a quality log table. A Power BI dashboard shows quality trends over time. Alerts fire only for failures, not for every successful check.
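The pattern above can be sketched as a small orchestrator: run each named check in sequence, log every result (pass or fail) to the quality log, and alert only on failures. The function and parameter names here are illustrative:

```python
from datetime import datetime, timezone

def run_quality_checks(pipeline_name, checks, alert_fn, quality_log):
    """Run named checks in order, logging all results, alerting only on failures.

    checks: (name, zero-arg callable returning True on pass) pairs - e.g. the
            row count, schema, and business rule checks from the diagram.
    quality_log: a list standing in for the quality log table.
    """
    all_passed = True
    for name, check in checks:
        passed = bool(check())
        quality_log.append({"pipeline": pipeline_name, "check": name,
                            "passed": passed,
                            "at": datetime.now(timezone.utc)})
        if not passed:
            all_passed = False
            alert_fn(f"{pipeline_name}: quality check '{name}' failed")
    return all_passed
```

Logging successes as well as failures is deliberate - the Power BI quality-trend dashboard needs both to compute a pass rate.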

Alerting Best Practices

The Noise Problem

The most common alerting mistake is creating too many alerts. When everything alerts, nothing alerts - the team starts ignoring notifications, and real issues get lost.

Our rules for alert management:

  1. Every alert must have a clear owner. If no one is responsible for acting on an alert, delete it.
  2. Every alert must have a clear action. The alert notification should include enough context for the recipient to know what to do next.
  3. Use severity levels consistently:
    • Sev 1 (Critical): Business impact now. Data is wrong or missing. Page someone.
    • Sev 2 (Warning): Will cause business impact if not addressed within hours. Send to team channel.
    • Sev 3 (Informational): No immediate impact but needs attention. Log for review.
  4. Review alerts monthly. If an alert hasn't fired in 3 months, consider if it's still needed. If it fires daily and is always ignored, fix the underlying issue or adjust the threshold.
  5. Use escalation paths. If a Sev 2 alert isn't acknowledged within 2 hours, escalate to Sev 1.
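Rule 5 is mechanical enough to encode directly - a sketch of the escalation decision, evaluated whenever an alert is re-checked:

```python
from datetime import timedelta

def effective_severity(severity, acknowledged, age):
    """Escalate an unacknowledged Sev 2 alert to Sev 1 after two hours.

    severity: 1, 2, or 3; age: timedelta since the alert first fired.
    """
    if severity == 2 and not acknowledged and age >= timedelta(hours=2):
        return 1
    return severity
```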

Alert Routing

| Channel | Best For | Limitations |
| --- | --- | --- |
| Email | Non-urgent notifications, audit trail | Easy to miss, slow response |
| Microsoft Teams | Team notifications, collaborative triage | Can get noisy if overused |
| SMS | Critical alerts requiring immediate action | Limited information, cost per message |
| PagerDuty/OpsGenie | On-call rotation, escalation | Requires additional tooling and cost |
| Azure Logic App webhook | Custom routing, integration with ticketing | Requires setup and maintenance |

For most Australian organisations we work with, a combination of Teams (Sev 2-3) and SMS/phone (Sev 1) works well.

Alert Content

A good alert includes:

  • What failed: Pipeline name, activity name
  • When: Timestamp in local time zone (AEST/AEDT)
  • Error message: The actual error, not a generic "pipeline failed"
  • Impact: What data or report is affected
  • Action link: Direct link to the pipeline run in ADF Studio or Fabric

A bad alert says "Pipeline XYZ failed." A good alert says "Pipeline SALES_DAILY_LOAD failed at 02:15 AEST. Error: SQL timeout connecting to SalesDB. Impact: Morning sales dashboard will show stale data. Action: Check SalesDB availability. Link: [direct link]."
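Whatever sends the notification (Logic App, Function, action group webhook), building the message from those five elements is a one-liner worth standardising. A sketch (all inputs come from the pipeline run metadata; the parameter names are illustrative):

```python
def format_alert(pipeline, activity, failed_at_local, error, impact, run_url):
    """Assemble an alert message covering what, when, error, impact, and link."""
    return (f"Pipeline {pipeline} failed at {failed_at_local} "
            f"(activity: {activity}). Error: {error}. "
            f"Impact: {impact}. Link: {run_url}")
```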

Operational Runbooks

Alerts are useless without runbooks. For each critical pipeline, document:

  1. What does this pipeline do? One-paragraph description.
  2. When does it run? Schedule and expected duration.
  3. What depends on it? Downstream reports, applications, other pipelines.
  4. Common failure scenarios and fixes:
    • Source unavailable - check source system, wait and retry
    • Timeout - check data volume, increase timeout, optimise query
    • Schema mismatch - check source for schema changes, update mapping
    • Authentication failure - check credentials, renew tokens
  5. Escalation path. Who to contact if the runbook doesn't resolve the issue.

Store runbooks in a wiki or SharePoint site linked from the alert notifications. When someone gets paged at 2 AM, they shouldn't have to search for troubleshooting steps.

Monitoring Infrastructure as Code

Don't configure monitoring manually. Define your alerts, dashboards, and diagnostic settings as code (ARM templates, Bicep, or Terraform) and deploy them alongside your Data Factory pipelines.

This ensures:

  • Monitoring is consistent across environments (dev/test/prod)
  • New pipelines automatically get the right alerts
  • Alert configuration is reviewed in pull requests alongside pipeline changes
  • You can recreate your monitoring setup if something goes wrong
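To make that concrete, here is the shape of the pipeline-failure alert from the table earlier as an ARM-style `Microsoft.Insights/metricAlerts` resource, generated in Python. This is a sketch - verify the field names and metric name against the current ARM schema before deploying (Bicep or Terraform would express the same resource more idiomatically):

```python
import json

def pipeline_failure_alert(factory_resource_id, action_group_id):
    """Emit an ARM-style metric alert definition for ADF pipeline failures.

    Field names follow the Microsoft.Insights/metricAlerts resource type;
    check them against the current ARM schema before use.
    """
    return {
        "type": "Microsoft.Insights/metricAlerts",
        "apiVersion": "2018-03-01",
        "name": "adf-pipeline-failures",
        "location": "global",
        "properties": {
            "severity": 2,                      # Sev 2 per the alert table
            "enabled": True,
            "scopes": [factory_resource_id],
            "evaluationFrequency": "PT5M",      # check every 5 minutes
            "windowSize": "PT5M",
            "criteria": {
                "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
                "allOf": [{
                    "name": "failedRuns",
                    "metricName": "PipelineFailedRuns",
                    "operator": "GreaterThan",  # failed runs > 0
                    "threshold": 0,
                    "timeAggregation": "Total",
                }],
            },
            "actions": [{"actionGroupId": action_group_id}],
        },
    }
```

Generating the definition per environment (with environment-specific action groups) is what keeps dev/test/prod alerting consistent.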

How We Set Up Monitoring at Team 400

Every Data Factory implementation we deliver includes operational monitoring as standard. It's not an add-on or a phase 2 item - it ships with the first pipeline.

Our standard monitoring package includes:

  • Pipeline failure and latency alerting
  • Integration runtime health monitoring
  • Basic data quality checks (row counts, freshness)
  • Operational dashboard in Power BI
  • Runbook templates for common failure scenarios

For organisations with more complex requirements, we build custom monitoring solutions that integrate with existing operations tools and processes.

We work across Azure Data Factory, Fabric Data Factory, and Power BI, so we can deliver end-to-end observability from data ingestion through to the reporting layer.

Get in touch if you need help setting up monitoring for your Data Factory platform, or explore our broader data and AI services.