Data Factory Monitoring and Alerting Best Practices
A data pipeline that fails at 2 AM and nobody notices until the CFO asks why the morning report has yesterday's numbers - we've seen this happen more often than we'd like to admit. And it's almost always preventable.
Monitoring and alerting for Data Factory isn't technically difficult. The tools are there. The problem is that most teams either don't set it up at all, or they set it up badly and drown in noise until everyone ignores the alerts. Both outcomes lead to the same place: missed failures and lost trust in your data platform.
Here's how we set up monitoring and alerting for Data Factory implementations that actually work in practice.
The Three Levels of Data Factory Monitoring
Think about monitoring at three levels:
- Pipeline level - Did this specific pipeline run succeed or fail?
- Platform level - Is the Data Factory service healthy overall?
- Data quality level - Did the pipeline run successfully but move wrong or incomplete data?
Most teams only implement level 1. The ones who get burned eventually implement all three.
Level 1 - Pipeline Monitoring
Azure Data Factory
ADF has built-in monitoring in the ADF Studio under the Monitor tab. It shows:
- All pipeline runs with status (Succeeded, Failed, Cancelled, In Progress)
- Activity-level details within each run
- Trigger run history
- Integration runtime status
This is useful for ad-hoc investigation, but it's not a monitoring solution. Nobody sits watching the Monitor tab all day.
Setting up proper pipeline alerts:
- In the Azure portal, go to your Data Factory resource
- Click Alerts, then Create alert rule
- Set the condition:
  - Signal: Pipeline Failed Runs (metric)
  - Threshold: Greater than 0
  - Aggregation: Total
  - Period: 5 minutes
- Set the action group (email, SMS, Teams webhook, or Logic App)
- Name the alert and set severity
Recommended pipeline alerts:
| Alert | Condition | Severity | Action |
|---|---|---|---|
| Pipeline failure | Failed runs > 0 in 5 min | Sev 2 (Warning) | Email + Teams |
| Critical pipeline failure | Failed runs on tagged pipelines | Sev 1 (Error) | Email + SMS + Teams |
| Pipeline running long | Duration > threshold | Sev 3 (Informational) | Log for review |
| Pipeline didn't run | No successful runs in expected window | Sev 2 (Warning) | Email + Teams |
The "didn't run" alert is the one most people miss. A pipeline that fails generates a failure event. A pipeline that never triggers generates nothing - and that silence is dangerous. Set up a separate check (Azure Logic App or Azure Function) that verifies expected pipelines ran within their scheduled windows.
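The core of that check is just comparing each pipeline's last successful run against its schedule interval. A minimal sketch of the logic, assuming you can fetch last-run timestamps from the ADF REST API or Log Analytics (the function and parameter names here are illustrative):

```python
from datetime import datetime, timedelta

def find_missed_runs(expected_schedules, last_success, now, grace=timedelta(minutes=30)):
    """Return names of pipelines whose most recent successful run is older
    than one schedule interval plus a grace period.

    expected_schedules: {pipeline_name: timedelta between scheduled runs}
    last_success: {pipeline_name: datetime of last successful run}
    """
    missed = []
    for name, interval in expected_schedules.items():
        last = last_success.get(name)
        # A pipeline with no run on record at all is also "missed" -
        # that's exactly the silent case this check exists to catch.
        if last is None or now - last > interval + grace:
            missed.append(name)
    return missed
```

In practice this would run in an Azure Function or Logic App on a timer trigger, with the results posted to your action group's webhook.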
Fabric Data Factory
Fabric's Monitoring hub shows pipeline run history within the Fabric workspace. The alerting model is different:
- Fabric doesn't yet have the same granular metric-based alerting as Azure Monitor
- You can set up alerts through Power BI dashboards that query the Fabric monitoring data
- For critical alerting, many teams use a small Azure Function that polls the Fabric APIs and sends notifications
This is one area where Fabric is still catching up to standalone ADF. If pipeline monitoring is critical to your operations (and it should be), plan for additional work when using Fabric Data Factory.
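Once the poller has fetched recent job history, filtering out the failures is straightforward. A sketch of that filtering step, written against a response shaped like the list output of Fabric's job-instances API - the field names and payload shape here are assumptions to verify against the current Fabric REST API documentation:

```python
def failed_jobs(job_instances, since_iso):
    """Filter failed job instances newer than an ISO-8601 cutoff.

    job_instances: list of dicts shaped like the 'value' array the Fabric
    job-instances endpoint returns (field names assumed, not verified).
    Returns (itemId, failureReason) pairs to feed into a notification.
    """
    return [
        (job.get("itemId"), job.get("failureReason"))
        for job in job_instances
        if job.get("status") == "Failed" and job.get("endTimeUtc", "") >= since_iso
    ]
```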
Level 2 - Platform Monitoring
Platform monitoring answers the question: "Is our Data Factory healthy, or are we about to have problems?"
What to Track
Integration Runtime Health
If you're using a self-hosted integration runtime (for on-premises data sources), monitor:
- Node status: Are all IR nodes online?
- CPU and memory utilisation: High utilisation indicates the IR is underpowered or handling too many concurrent jobs
- Concurrent job count: Are you hitting limits?
- Version status: Is the IR up to date?
Set up alerts for:
- Any IR node going offline (Sev 1 - immediate attention)
- CPU utilisation above 80% sustained for 15 minutes (Sev 2 - capacity planning needed)
- IR version more than 2 releases behind (Sev 3 - schedule an update)
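The "sustained for 15 minutes" condition is worth pinning down: it means the breach holds across the whole window, not just one spike. Azure Monitor evaluates an aggregation over the window natively; if you are polling IR metrics yourself, one conservative equivalent is to require every one-minute sample in the trailing window to exceed the threshold, as in this sketch:

```python
def sustained_breach(cpu_samples, threshold=80.0, window_minutes=15):
    """True when every one-minute CPU sample in the trailing window
    exceeds the threshold - a single spike does not fire the alert."""
    if len(cpu_samples) < window_minutes:
        return False  # not enough history yet to judge
    return all(s > threshold for s in cpu_samples[-window_minutes:])
```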
Trigger Health
Triggers that stop firing are a silent killer. Monitor:
- Trigger run success/failure
- Gap between expected and actual trigger fires
- Trigger state changes (Active to Stopped)
Resource Utilisation
For Azure Data Factory:
- Data Integration Units (DIU) consumption
- Activity run counts vs. limits
- Mapping data flow cluster spin-up times
For Fabric Data Factory:
- Capacity unit (CU) consumption (are you hitting capacity limits?)
- Throttling events
- Queue times for pipeline activities
Building a Platform Dashboard
We build an operational dashboard for every Data Factory implementation. Here's what goes on it:
Page 1 - Executive Summary
- Total pipelines: running / succeeded / failed in last 24 hours
- Pipeline success rate (target: >99%)
- Average pipeline duration trend (week over week)
- Outstanding alerts
Page 2 - Pipeline Details
- Individual pipeline run history (last 7 days)
- Top 10 slowest pipelines
- Top 10 most frequently failing pipelines
- Pipelines with increasing duration trend
Page 3 - Infrastructure
- Integration runtime node status
- Resource utilisation trends
- Monthly cost trend
You can build this in Power BI using the ADF diagnostic logs (sent to Azure Log Analytics) or the ADF monitoring APIs. For Fabric, the admin APIs provide similar data.
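Whichever source feeds the dashboard, the Page 1 headline numbers reduce to simple aggregations over run records. A sketch of that computation - the field names are illustrative, not the literal diagnostic log schema:

```python
def summarize_runs(runs, top_n=10):
    """Compute headline dashboard numbers from pipeline run records.

    runs: list of {'pipeline': str, 'status': str, 'duration_s': float}
    (field names are assumptions; map them to your actual log columns).
    """
    total = len(runs)
    succeeded = sum(1 for r in runs if r["status"] == "Succeeded")
    slowest = sorted(runs, key=lambda r: r["duration_s"], reverse=True)[:top_n]
    return {
        "total": total,
        "failed": sum(1 for r in runs if r["status"] == "Failed"),
        "success_rate": succeeded / total if total else 1.0,
        "slowest": [r["pipeline"] for r in slowest],
    }
```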
Level 3 - Data Quality Monitoring
This is where most Data Factory monitoring stops short, and it's where the real business impact lives.
A pipeline can succeed technically - every activity completes without error - while delivering wrong, incomplete, or stale data. Without data quality checks, you won't know until someone downstream complains.
Data Quality Checks to Implement
Row count validation
After each pipeline run, compare the number of rows loaded against:
- The source row count (did we get everything?)
- The previous run's row count (is the change reasonable?)
- A minimum threshold (did we get at least something?)
Implementation: Add a Lookup activity at the end of your pipeline that counts rows in the destination and compares against expected values. If the check fails, trigger an alert.
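The comparison logic behind that check might look like the following sketch (thresholds and names are illustrative - tune them per pipeline):

```python
def check_row_counts(loaded, source=None, previous=None, minimum=1, max_change_pct=50.0):
    """Return human-readable failures; an empty list means the load passed.

    loaded: rows written this run; source/previous: optional comparison counts.
    """
    failures = []
    if loaded < minimum:
        failures.append(f"loaded {loaded} rows, below minimum {minimum}")
    if source is not None and loaded != source:
        failures.append(f"loaded {loaded} rows but source reports {source}")
    if previous:
        change_pct = abs(loaded - previous) / previous * 100
        if change_pct > max_change_pct:
            failures.append(f"row count moved {change_pct:.0f}% vs previous run")
    return failures
```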
Schema validation
Source schemas change. A column gets renamed, a new column appears, a data type changes. These changes can break pipelines or, worse, load data into wrong columns.
Implementation: Compare source schema at runtime against a stored baseline. Alert on any changes. This can be done with a Script activity that queries system tables.
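The baseline comparison itself is a diff of two column-to-type mappings. A minimal sketch, assuming you have already queried the system tables into dicts:

```python
def schema_drift(baseline, current):
    """Compare {column: data_type} mappings; return drift descriptions
    (empty list means no change against the stored baseline)."""
    drift = []
    for col, dtype in baseline.items():
        if col not in current:
            drift.append(f"column removed: {col}")
        elif current[col] != dtype:
            drift.append(f"type changed: {col} {dtype} -> {current[col]}")
    drift += [f"column added: {col}" for col in current if col not in baseline]
    return drift
```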
Freshness checks
Is the data in your destination tables actually current? Sometimes a pipeline runs successfully but processes an empty or stale source file.
Implementation: Check the maximum timestamp in your destination table after each load. If it hasn't advanced beyond the previous run, something is wrong.
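Both freshness conditions - the timestamp failing to advance, and the data lagging too far behind the clock - fit in one small check. A sketch with an illustrative 24-hour lag allowance:

```python
from datetime import datetime, timedelta

def freshness_failures(max_loaded_ts, previous_max_ts, now, max_lag=timedelta(hours=24)):
    """Fail when the destination's max timestamp did not advance this run,
    or when it lags too far behind the current time."""
    failures = []
    if previous_max_ts is not None and max_loaded_ts <= previous_max_ts:
        failures.append("max timestamp did not advance since the previous run")
    if now - max_loaded_ts > max_lag:
        failures.append("data is staler than the allowed lag")
    return failures
```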
Value range validation
For business-critical fields, validate that values fall within expected ranges:
- Revenue figures should be positive
- Dates should be within reasonable bounds
- Percentages should be between 0 and 100
- Foreign keys should exist in reference tables
Implementation: Post-load SQL queries that check constraints and alert on violations.
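In production these checks belong in post-load SQL, but the rule structure is worth seeing in miniature. A sketch expressing each rule as a predicate per field (names are illustrative):

```python
def rule_violations(rows, rules):
    """rules: {field: predicate returning True when the value is valid}.
    Returns (row_index, field) pairs for every violated rule."""
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field, is_valid in rules.items()
        if field in row and not is_valid(row[field])
    ]
```

The same shape translates directly into one SQL `WHERE NOT (...)` clause per rule, with violation counts written to the quality log table.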
A Practical Data Quality Pattern
Here's a pattern we use frequently:
    Copy Data --> Row Count Check --> Schema Check --> Business Rules Check --> Log Results
                         |                  |                    |
                         v                  v                    v
                  [Alert if fail]    [Alert if fail]      [Alert if fail]
Each check writes results to a quality log table. A Power BI dashboard shows quality trends over time. Alerts fire only for failures, not for every successful check.
Alerting Best Practices
The Noise Problem
The most common alerting mistake is creating too many alerts. When everything alerts, nothing alerts - the team starts ignoring notifications, and real issues get lost.
Our rules for alert management:
- Every alert must have a clear owner. If no one is responsible for acting on an alert, delete it.
- Every alert must have a clear action. The alert notification should include enough context for the recipient to know what to do next.
- Use severity levels consistently:
  - Sev 1 (Critical): Business impact now. Data is wrong or missing. Page someone.
  - Sev 2 (Warning): Will cause business impact if not addressed within hours. Send to team channel.
  - Sev 3 (Informational): No immediate impact but needs attention. Log for review.
- Review alerts monthly. If an alert hasn't fired in 3 months, consider if it's still needed. If it fires daily and is always ignored, fix the underlying issue or adjust the threshold.
- Use escalation paths. If a Sev 2 alert isn't acknowledged within 2 hours, escalate to Sev 1.
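That last rule is easy to state and easy to get wrong in implementation, so it helps to pin it down precisely. A sketch of the escalation decision (assuming your alert store tracks acknowledgement and age):

```python
from datetime import timedelta

def effective_severity(severity, acknowledged, age):
    """Escalate an unacknowledged Sev 2 to Sev 1 once it is 2 hours old;
    all other alerts keep their original severity."""
    if severity == 2 and not acknowledged and age >= timedelta(hours=2):
        return 1
    return severity
```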
Alert Routing
| Channel | Best For | Limitations |
|---|---|---|
| Email | Non-urgent notifications, audit trail | Easy to miss, slow response |
| Microsoft Teams | Team notifications, collaborative triage | Can get noisy if overused |
| SMS | Critical alerts requiring immediate action | Limited information, cost per message |
| PagerDuty/OpsGenie | On-call rotation, escalation | Requires additional tooling and cost |
| Azure Logic App webhook | Custom routing, integration with ticketing | Requires setup and maintenance |
For most Australian organisations we work with, a combination of Teams (Sev 2-3) and SMS/phone (Sev 1) works well.
Alert Content
A good alert includes:
- What failed: Pipeline name, activity name
- When: Timestamp in local time zone (AEST/AEDT)
- Error message: The actual error, not a generic "pipeline failed"
- Impact: What data or report is affected
- Action link: Direct link to the pipeline run in ADF Studio or Fabric
A bad alert says "Pipeline XYZ failed." A good alert says "Pipeline SALES_DAILY_LOAD failed at 02:15 AEST. Error: SQL timeout connecting to SalesDB. Impact: Morning sales dashboard will show stale data. Action: Check SalesDB availability. Link: [direct link]."
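If alerts are assembled in a Logic App or Function, a single formatting helper keeps every notification answering those five questions. A sketch (the example values are illustrative, echoing the alert above):

```python
def format_alert(pipeline, failed_at, error, impact, action, link):
    """Assemble an alert body that answers what, when, impact, and next step
    in one message, so the recipient never sees a bare 'pipeline failed'."""
    return (
        f"Pipeline {pipeline} failed at {failed_at}. "
        f"Error: {error}. Impact: {impact}. "
        f"Action: {action}. Link: {link}"
    )
```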
Operational Runbooks
Alerts are useless without runbooks. For each critical pipeline, document:
- What does this pipeline do? One-paragraph description.
- When does it run? Schedule and expected duration.
- What depends on it? Downstream reports, applications, other pipelines.
- Common failure scenarios and fixes:
  - Source unavailable - check source system, wait and retry
  - Timeout - check data volume, increase timeout, optimise query
  - Schema mismatch - check source for schema changes, update mapping
  - Authentication failure - check credentials, renew tokens
- Escalation path. Who to contact if the runbook doesn't resolve the issue.
Store runbooks in a wiki or SharePoint site linked from the alert notifications. When someone gets paged at 2 AM, they shouldn't have to search for troubleshooting steps.
Monitoring Infrastructure as Code
Don't configure monitoring manually. Define your alerts, dashboards, and diagnostic settings as code (ARM templates, Bicep, or Terraform) and deploy them alongside your Data Factory pipelines.
This ensures:
- Monitoring is consistent across environments (dev/test/prod)
- New pipelines automatically get the right alerts
- Alert configuration is reviewed in pull requests alongside pipeline changes
- You can recreate your monitoring setup if something goes wrong
How We Set Up Monitoring at Team 400
Every Data Factory implementation we deliver includes operational monitoring as standard. It's not an add-on or a phase 2 item - it ships with the first pipeline.
Our standard monitoring package includes:
- Pipeline failure and latency alerting
- Integration runtime health monitoring
- Basic data quality checks (row counts, freshness)
- Operational dashboard in Power BI
- Runbook templates for common failure scenarios
For organisations with more complex requirements, we build custom monitoring solutions that integrate with existing operations tools and processes.
We work across Azure Data Factory, Fabric Data Factory, and Power BI, so we can deliver end-to-end observability from data ingestion through to the reporting layer.
Get in touch if you need help setting up monitoring for your Data Factory platform, or explore our broader data and AI services.