Data Factory Monitoring and Alerting Best Practices
Most Data Factory monitoring setups I review fall into one of two failure modes. Either there's almost nothing in place and the team finds out about broken pipelines from an angry finance director on Monday morning, or there's so much alerting that everyone has muted the channel and nobody notices when something actually breaks. Neither is acceptable for a platform that's moving the data your business runs on.
I'm Michael Ridland, founder of Team 400. We do a lot of Data Factory consulting work for Australian businesses, and monitoring is one of the topics we revisit constantly with clients. This is the practical guide I wish more teams had when they first stood up their data platform - what to monitor, what to ignore, and how to build an alerting setup that actually helps you sleep.
Why Default Monitoring Isn't Enough
When you spin up a Fabric Data Factory workspace or an Azure Data Factory instance, you get a monitoring page. You can see pipeline runs, drill into activities, view error messages, and check trigger history. For a developer building a pipeline, that's fine.
For an operations team running 200 pipelines across a production data platform, it's nowhere near enough.
The built-in monitoring tells you what happened. It doesn't tell you what should have happened and didn't. It won't ping you when a pipeline that's supposed to run every hour silently stops being triggered. It won't catch a pipeline that completes successfully but only ingested 12 rows when it should have ingested 50,000. It won't show you a slow degradation in pipeline duration that's heading toward an SLA breach next week.
If you stop at the default monitoring page, you've built a dashboard, not an operations practice. Those are different things.
What You Actually Need to Monitor
After dozens of Data Factory implementations across mining, financial services, retail, and government clients, we've settled on five layers of monitoring. Each one catches different classes of issues.
Layer 1 - Pipeline Success and Failure
This is the obvious one. Every pipeline run finishes in one of three states - succeeded, failed, or cancelled. You need to know about failures fast, and you need to know about cancellations almost as fast (cancellations often indicate someone manually killed a misbehaving run, which is itself a signal).
The mistake teams make here is alerting on every single failure. Some pipelines fail routinely due to upstream issues that resolve on retry. If you alert on each one, your on-call rotation will burn out in a week.
What we recommend instead - alert on consecutive failures, not single failures. A pipeline that fails once and succeeds on the next run is usually fine. A pipeline that fails three times in a row needs human attention. Configure your alerts accordingly.
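As a minimal sketch of the consecutive-failure rule, assuming you already have a pipeline's recent run statuses in oldest-to-newest order: the status strings, the function names, and the default threshold of three (taken from the example above) are illustrative assumptions, not a Data Factory API.

```python
from typing import Iterable


def consecutive_failures(statuses: Iterable[str]) -> int:
    """Count how many of the most recent runs failed back to back.

    `statuses` is ordered oldest to newest. The status strings are
    assumed values, not read from any Data Factory API here.
    """
    count = 0
    for status in reversed(list(statuses)):
        if status != "Failed":
            break
        count += 1
    return count


def should_alert(statuses: Iterable[str], threshold: int = 3) -> bool:
    """Alert only once the trailing failure streak reaches the threshold."""
    return consecutive_failures(statuses) >= threshold


# One failure followed by a success stays quiet; three in a row alerts.
print(should_alert(["Failed", "Succeeded"]))
print(should_alert(["Succeeded", "Failed", "Failed", "Failed"]))
```

The threshold is a parameter so that pipelines with flaky upstreams and pipelines that should never fail can use different values.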
Layer 2 - Trigger Health and Schedule Adherence
This is the layer most teams miss entirely. A pipeline that fails will scream at you. A pipeline that never runs will say nothing.
We had a client last year whose nightly customer ingest pipeline had stopped being triggered for nine days before anyone noticed. The trigger had been accidentally disabled during a release. No errors appeared anywhere because no runs were happening. Finance discovered it when their weekly report looked weird.
The fix is heartbeat monitoring. For any pipeline with a schedule, you should be tracking when it last ran successfully and alerting when that gap exceeds the expected interval plus a buffer. If a pipeline is supposed to run every hour, alert if there's no successful run within 90 minutes.
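The heartbeat check itself is small. Here is a rough sketch, assuming you can look up the timestamp of each pipeline's last successful run from wherever you store run history. The function name and the 1.5 multiplier (which gives the 90-minute window for an hourly pipeline) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def heartbeat_missed(
    last_success: Optional[datetime],
    expected_interval: timedelta,
    now: datetime,
    buffer_factor: float = 1.5,
) -> bool:
    """Return True when the gap since the last success exceeds the allowed window.

    A pipeline with no recorded success at all is treated as missed.
    """
    if last_success is None:
        return True
    return now - last_success > expected_interval * buffer_factor


# Hourly pipeline whose last success was 95 minutes ago: past the 90-minute window.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(heartbeat_missed(now - timedelta(minutes=95), timedelta(hours=1), now))
```

Note that this has to run on its own schedule, outside the pipeline it watches. A check that lives inside the pipeline goes silent at the same moment the pipeline does.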
Layer 3 - Data Quality and Volume Anomalies
A pipeline that completes successfully isn't the same as a pipeline that did its job correctly. If your usual daily customer file contains 80,000 records and today's run loaded 47, the pipeline will report success. The data is broken anyway.
For every important ingestion, track the rowcount loaded and alert on significant deviations from the moving average. We typically use a 14-day moving average with an alert threshold at plus or minus 40 percent for high-volume sources, and tighter thresholds for predictable ones.
This is where most teams need to write custom monitoring. The built-in Data Factory views don't give you this. We push metrics into a Log Analytics workspace or directly into a Fabric warehouse and run our anomaly checks from there.
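To make the rowcount check concrete, here is a minimal sketch assuming you have the daily loaded rowcounts for the previous 14 days in a list. The function name and the handling of an empty or zero baseline are assumptions. The 40 percent tolerance is the high-volume default mentioned above.

```python
from statistics import mean
from typing import Sequence


def rowcount_anomaly(
    history: Sequence[int],
    today: int,
    tolerance: float = 0.40,
) -> bool:
    """Flag today's rowcount if it deviates from the moving average by more than `tolerance`.

    `history` holds the rowcounts for the preceding window (14 days here).
    """
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(history)
    if baseline == 0:
        return today != 0  # any rows at all is a change from an all-zero baseline
    return abs(today - baseline) / baseline > tolerance


# A source that normally loads about 80,000 rows and today loaded 47.
print(rowcount_anomaly([80_000] * 14, 47))
```

The check is symmetric on purpose: a load that is far too large is as suspicious as one that is far too small, since it often points at duplicates or a changed source query.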
Layer 4 - Performance and Duration Drift
Pipelines slow down over time. Source systems grow. Indexes get fragmented. Self-hosted Integration Runtime VMs get noisy neighbours on the host. Models in transformation steps process more rows. The pipeline that ran in 18 minutes six months ago now runs in 47 minutes, and nobody noticed because it still finishes before the SLA window closes.
Until one day it doesn't.
Track pipeline duration as a metric and alert on duration that exceeds the moving average by some threshold. We use plus 50 percent as a default warning and plus 100 percent as a critical alert. This catches the slow-creep problems before they become incidents.
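A sketch of the two-tier duration check, assuming run durations in minutes are already collected per pipeline. The function name and the "ok", "warning", and "critical" labels are illustrative assumptions. The ratios 1.5 and 2.0 correspond to the plus 50 and plus 100 percent thresholds above.

```python
from statistics import mean
from typing import Sequence


def classify_duration(
    history_minutes: Sequence[float],
    latest_minutes: float,
    warn_ratio: float = 1.5,
    critical_ratio: float = 2.0,
) -> str:
    """Compare the latest run duration against the moving average of earlier runs."""
    if not history_minutes:
        return "ok"  # no baseline yet
    baseline = mean(history_minutes)
    if latest_minutes > baseline * critical_ratio:
        return "critical"
    if latest_minutes > baseline * warn_ratio:
        return "warning"
    return "ok"


# A pipeline that used to take about 18 minutes and now takes 47.
print(classify_duration([18, 18, 19, 17, 18], 47))
```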
Layer 5 - Cost and Capacity
If you're on Fabric, your pipelines consume capacity units. If you're on Azure Data Factory, you're billed for activity runs, data movement in DIU-hours, and execution time for pipeline and external activities. Either way, costs can spiral if a pipeline starts behaving badly without failing.
We had one client with a Fabric F64 capacity who burned through their daily smoothing window in four hours one Tuesday. Turned out a copy activity had been pointed at a new source table that was 12x larger than expected, and the run was being throttled but not failing. The cost overrun was visible in the capacity metrics before any pipeline reported an issue.
If you're running on Fabric, hook capacity metrics into your monitoring stack. For ADF, push billing metrics from Azure Cost Management into your alerting setup. Set a daily ceiling and alert when projected daily spend exceeds it.
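One simple way to get a projected daily figure is straight-line extrapolation from spend so far today. The sketch below assumes you can obtain a running spend total for the current day from your cost or capacity data. The function names and the linear projection itself are assumptions, and other projection methods may suit spiky workloads better.

```python
from datetime import datetime


def projected_daily_spend(spend_so_far: float, now: datetime) -> float:
    """Extrapolate today's spend linearly to a full 24 hours."""
    elapsed_hours = now.hour + now.minute / 60 + now.second / 3600
    if elapsed_hours == 0:
        return spend_so_far  # start of day, nothing to extrapolate from
    return spend_so_far * 24 / elapsed_hours


def over_ceiling(spend_so_far: float, now: datetime, daily_ceiling: float) -> bool:
    """True when the projected full-day spend exceeds the ceiling."""
    return projected_daily_spend(spend_so_far, now) > daily_ceiling


# 50 units spent by 06:00 projects to 200 for the day, over a ceiling of 150.
print(over_ceiling(50.0, datetime(2026, 1, 1, 6, 0), daily_ceiling=150.0))
```

A linear projection is noisy in the first hours of the day, so you may want to suppress the alert until a few hours of data exist.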
A Practical Monitoring Architecture
Here's the setup we recommend for any Australian business running production Data Factory workloads in 2026.
Metrics collection. For Azure Data Factory, send the pipeline run, activity run, and trigger run logs to a Log Analytics workspace using diagnostic settings. For Fabric, this means enabling the Fabric monitoring hub integration and pushing detailed run data into a Lakehouse or Eventhouse. The native monitoring page is fine for ad-hoc investigation but a poor primary source of monitoring data.
Custom telemetry. Inside your pipelines, write structured log entries for important business metrics - rowcount loaded, source file size, target table size, validation result counts. These need to be in your monitoring system, not just stored as pipeline variables that disappear after the run.
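What a structured log entry looks like depends on where you send it, but the shape is usually a flat record per run. Below is a minimal sketch that builds one as JSON. Every field name is a hypothetical choice for illustration, and shipping the record to Log Analytics or a Fabric warehouse is deliberately left out.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_metric_record(
    pipeline: str,
    run_id: str,
    rows_loaded: int,
    source_bytes: int,
    validation_failures: int,
    logged_at: Optional[datetime] = None,
) -> str:
    """Serialise one run's business metrics as a single JSON line."""
    logged_at = logged_at or datetime.now(timezone.utc)
    record = {
        "pipeline": pipeline,
        "run_id": run_id,
        "rows_loaded": rows_loaded,
        "source_bytes": source_bytes,
        "validation_failures": validation_failures,
        "logged_at": logged_at.isoformat(),
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical pipeline name and run id, purely for illustration.
print(build_metric_record("customer_ingest", "run-001", 80_000, 12_500_000, 0))
```

Keeping the record flat and the field names stable across pipelines is what makes the later anomaly queries simple to write.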
Anomaly checks. Run scheduled queries against your collected metrics to detect anomalies. Volume drift, duration drift, missing runs. We typically write these as KQL queries in Log Analytics, set up as scheduled alert rules.
Alert routing. Different classes of alerts go to different places. Critical production failures go to PagerDuty or an on-call rotation. Volume anomalies that need investigation but aren't urgent go to a Microsoft Teams channel. Performance drift goes to an email digest. Don't send everything to the same channel.
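The routing rule can be as simple as a lookup table that lives in source control. This sketch assumes three alert classes matching the ones above. The enum, the destination labels, and the function name are all hypothetical, and the actual delivery to PagerDuty, Teams, or email is out of scope.

```python
from enum import Enum


class AlertClass(Enum):
    CRITICAL_FAILURE = "critical_failure"
    VOLUME_ANOMALY = "volume_anomaly"
    PERFORMANCE_DRIFT = "performance_drift"


# Destination labels are placeholders for whatever integrations you use.
ROUTES = {
    AlertClass.CRITICAL_FAILURE: "on_call_pager",
    AlertClass.VOLUME_ANOMALY: "teams_channel",
    AlertClass.PERFORMANCE_DRIFT: "email_digest",
}


def route(alert_class: AlertClass) -> str:
    """Return the destination for an alert class."""
    return ROUTES[alert_class]


print(route(AlertClass.VOLUME_ANOMALY))
```

Using an enum rather than free-text strings means a new alert class cannot be raised without someone deciding where it goes.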
Dashboards. Power BI is the obvious choice if you're already in the Microsoft stack. Build one dashboard for executives (SLA percentages, daily run counts, cost trends) and one for the operations team (current run status, failure rates by pipeline, recent anomalies). Don't try to put everything on one dashboard.
This whole stack ties into our Power BI consulting practice frequently, since the operations dashboards are where executive visibility lives.
Comparison - Default Monitoring vs Production Monitoring
| Capability | Default Data Factory Monitoring | Production Monitoring Setup |
|---|---|---|
| Pipeline failures | Yes, in-product | Routed to on-call with deduplication |
| Trigger health | No native heartbeat | Heartbeat alerts on missed schedules |
| Rowcount anomalies | Not captured | Tracked with moving average baselines |
| Duration drift | Visible in history but no alerts | Alerts on drift from a moving baseline |
| Cost anomalies | Visible in Azure billing only | Real-time capacity and spend alerts |
| Cross-pipeline dependencies | No native view | Mapped in custom dashboard |
| Historical trending | 45 days in product | Retained as long as you need in Log Analytics |
| Custom business metrics | Not supported | First-class citizens in telemetry |
Common Mistakes I See
A few patterns come up so often that they deserve their own section.
Alerting on info-level events. Some teams configure alerts on every activity-level event. Then they ignore the alerts because there are thousands. The fix is to alert only on outcomes that need human attention - failures, anomalies, missed schedules. Everything else should be available in dashboards but not pushed to humans.
Treating all pipelines as equally critical. A pipeline that loads regulatory data for APRA reporting is not the same as a pipeline that refreshes the cafeteria menu. Both should be monitored. They shouldn't trigger the same alert routing or recovery procedures. Tag your pipelines by criticality and route alerts accordingly.
No runbook for common alerts. When an alert fires at 2am, the on-call engineer needs to know what to do. If every alert results in "ring the developer who built this pipeline" then your monitoring isn't operational, it's just an early-warning system for who to blame. Each alert type needs a runbook that covers the diagnostic steps and the remediation.
Monitoring without ownership. Pipelines should have named owners. Alerts should route to the owner first, with escalation to a platform team if unanswered. We've seen organisations where nobody knows who owns half the pipelines, and when something breaks, the resolution is "ask in the data engineering channel and hope someone responds." That's not an operations practice.
Forgetting Self-Hosted Integration Runtimes. If you're using SHIRs to reach on-premises sources, those VMs need their own monitoring - CPU, memory, disk, the SHIR service itself. A dead SHIR will fail every pipeline that depends on it, and the failure looks like a Data Factory issue rather than the infrastructure problem it actually is.
What Good Looks Like - Our Recommended Defaults
If you're starting from scratch and want a baseline, here's where we'd start every new client.
For every production pipeline:
- Failure alert after two consecutive failures, routed to the owning team
- Heartbeat alert if no successful run within (expected interval x 1.5)
- Duration alert if last 3 runs exceed 14-day median by 50 percent
- Rowcount alert if delta from 14-day mean exceeds 40 percent in either direction
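These per-pipeline defaults are easy to keep in one place as configuration. The sketch below captures them in a dataclass and shows the duration rule as code. The class name, field names, and function are hypothetical, and reading "last 3 runs exceed" as "each of the last 3 runs exceeds" is an assumption.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


@dataclass(frozen=True)
class PipelineAlertDefaults:
    consecutive_failures: int = 2     # failure alert threshold
    heartbeat_factor: float = 1.5     # multiple of the expected interval
    duration_runs: int = 3            # recent runs that must all be slow
    duration_ratio: float = 1.5       # 50 percent over the median
    rowcount_tolerance: float = 0.40  # deviation allowed in either direction
    baseline_days: int = 14           # window used for the median and mean


def duration_alert(
    baseline_minutes: Sequence[float],
    recent_minutes: Sequence[float],
    cfg: PipelineAlertDefaults = PipelineAlertDefaults(),
) -> bool:
    """True when each of the last `duration_runs` runs exceeds the baseline median by the ratio."""
    if not baseline_minutes or len(recent_minutes) < cfg.duration_runs:
        return False
    limit = median(baseline_minutes) * cfg.duration_ratio
    return all(d > limit for d in recent_minutes[-cfg.duration_runs:])


# Baseline median of 19 minutes gives a limit of 28.5; all three recent runs exceed it.
print(duration_alert([18, 20, 19, 18, 21], [30, 31, 29]))
```

Requiring several slow runs in a row, rather than one, keeps a single noisy run from triggering the alert.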
For the platform overall:
- Daily summary email with run counts, failure rates, and top duration drifters
- Capacity utilisation alert at 80 percent of daily Fabric smoothing window
- Weekly cost review against budget with anomaly detection
- Self-hosted IR health checks every 5 minutes with alerts on agent disconnect
This isn't an exhaustive list. It's a baseline you can grow from. Most clients add 20 to 40 percent more custom monitoring over the first six months as they identify specific business signals that matter to them.
When to Bring in Help
A competent data engineer can set up adequate monitoring for a simple Data Factory environment with a dozen pipelines in a couple of weeks. Templates, KQL queries, and Azure Monitor configurations are all well documented.
You probably want outside help when:
- You have more than 50 pipelines and no consistent monitoring practice across them
- You've had production incidents that monitoring should have caught and didn't
- You're moving from Azure Data Factory to Fabric Data Factory and need to redo your monitoring stack
- You're building a new platform and want to avoid the typical 12 months of monitoring debt
- You need APRA-aligned operational practices and don't have the documentation
We do monitoring uplifts as standalone engagements (typically $25,000 to $60,000 AUD depending on scope) or as part of broader Data Factory consulting projects. The standalone version usually takes 4 to 6 weeks and includes the architecture, implementation, runbooks, and team handover.
If you want to talk through your current setup or get a second opinion on where to invest first, get in touch. We'd rather have a 30-minute conversation about your specific situation than write generic advice that doesn't quite fit. Monitoring is one of those areas where the right answer depends heavily on what you've already got, what's broken, and what your team can realistically maintain.
Good monitoring is boring. It just works. Alerts only fire when something needs attention, dashboards show what people actually need to see, and the team trusts the system. If that's not what you have today, the gap is fixable - and fixing it pays for itself the first time it catches an incident before your business notices.