
Operationalising Microsoft Fabric Data Factory - What Actually Matters in Production

May 26, 2026 · 9 min read · Michael Ridland

Anyone can build a Fabric Data Factory pipeline that works in dev. The hard part starts the day you put it in production and it has to run unattended at 3am while you sleep. Most of the failures I see on client sites are not Data Factory failures in any interesting sense. They are operational failures - things nobody thought about during development that turn into incidents once the pipeline is actually running for the business.

I want to talk through how we actually operationalise Fabric Data Factory pipelines for Australian customers. Microsoft's documentation has good material on the mechanics. What it does not have is the consulting-shaped advice about which mechanics matter, in what order, and why.

The setup that catches everyone

The first time you build a Fabric pipeline, you build it in the browser, you click Run, and you watch the output to see if it worked. That is fine for the first hour. After that, you need to start treating pipelines as code, not as a UI toy.

The single biggest improvement most teams can make is connecting their Fabric workspace to Git. Once a pipeline lives in source control, you get diffs, you get review, you get rollback. You stop having the conversation where someone changed a pipeline in production, broke the daily refresh, and nobody can remember what it looked like yesterday.

Git integration in Fabric is decent but not perfect. The way pipelines serialise into JSON is not always pretty to read in a diff. We have learned to keep pipeline changes small and to describe them properly in the commit message, because the actual code diff is sometimes hard to interpret. If you make a change that touches 12 activities, the diff is going to be ugly. Better to make small changes one at a time.
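When the raw JSON diff is hard to read, a small helper that lists which activities were added, removed, or changed can make the commit message easier to write. The sketch below is a minimal one, and it assumes the serialised pipeline has a `properties.activities` list where each activity has a `name`. That layout and the function name `summarise_activity_changes` are assumptions for illustration, so check them against your own repository.

```python
import json


def summarise_activity_changes(old_json: str, new_json: str) -> dict:
    """Report activity-level changes between two pipeline definitions.

    Assumed layout (verify against your repo):
    {"properties": {"activities": [{"name": "...", ...}, ...]}}
    """

    def by_name(raw: str) -> dict:
        doc = json.loads(raw)
        activities = doc.get("properties", {}).get("activities", [])
        return {activity["name"]: activity for activity in activities}

    old, new = by_name(old_json), by_name(new_json)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # An activity counts as changed if any part of its definition differs.
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }
```

If the summary lists more than two or three changed activities, that is usually a sign the change should be split into smaller commits.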

The other thing we set up early is a deployment pipeline between dev, test, and production workspaces. Fabric supports this natively. The key is that nobody develops directly in production. Sounds obvious. You would be amazed how often people do it anyway, especially in smaller teams where there is no formal process.

Scheduling and triggers

Once a pipeline works, you want it to run automatically. Fabric gives you a few options.

The simplest is a schedule trigger. You pick a time, a frequency, and a time zone, and the pipeline runs. We almost always use the time zone the business actually operates in (so Sydney or Melbourne for most of our clients) rather than UTC. The reason is that if you ever need to explain to a business stakeholder why their report did not refresh, you do not want to be doing UTC arithmetic in your head while they are upset.
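To see why UTC arithmetic goes wrong under pressure, here is a small sketch using Python's standard `zoneinfo` module. The helper name `to_business_time` is made up for illustration. The same UTC clock time lands on a different local hour in Sydney depending on whether daylight saving is in effect.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_business_time(run_time_utc: datetime, tz_name: str = "Australia/Sydney") -> datetime:
    """Convert a UTC run timestamp to the business's local time."""
    return run_time_utc.astimezone(ZoneInfo(tz_name))


# The same 16:00 UTC schedule in summer and in winter.
summer_run = to_business_time(datetime(2026, 1, 14, 16, 0, tzinfo=timezone.utc))
winter_run = to_business_time(datetime(2026, 7, 14, 16, 0, tzinfo=timezone.utc))

print(summer_run.strftime("%Y-%m-%d %H:%M %Z"))  # local hour is 03:00 during daylight saving
print(winter_run.strftime("%Y-%m-%d %H:%M %Z"))  # local hour is 02:00 outside daylight saving
```

A schedule defined in the business time zone avoids this one-hour drift, which is the practical reason we prefer it.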

For more sophisticated scenarios, you can trigger pipelines from events. A new file lands in OneLake, run the pipeline. A row gets inserted in a database, run the pipeline. This is genuinely useful but it is also where things go wrong, because you are now running pipelines at unpredictable times and you need to think about concurrency. If two files land at the same time, do you want two pipeline runs in parallel? What happens if a third file lands while the first run is still going? Fabric does not answer these questions for you. You design for them.

We had a retail client where this bit us. They had a pipeline triggered by file arrivals, files were arriving every couple of minutes during a busy period, and we ended up with 30 concurrent pipeline runs all trying to write to the same destination table. The downstream chaos was impressive. The fix was a queue table that recorded file arrivals, plus a single scheduled pipeline that processed the queue in order. Less elegant but reliable.
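The queue pattern is simple enough to sketch. The version below uses an in-memory SQLite database purely so the example is runnable. The table name `file_queue`, its columns, and the function names are hypothetical, not the client's actual schema. The event trigger only records the arrival, and a single scheduled run drains the queue in arrival order.

```python
import sqlite3
from datetime import datetime, timezone


def create_queue(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS file_queue (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               file_path TEXT NOT NULL,
               arrived_at TEXT NOT NULL,
               processed_at TEXT
           )"""
    )


def record_arrival(conn: sqlite3.Connection, file_path: str) -> None:
    """Called by the event trigger: cheap, and safe to run concurrently."""
    conn.execute(
        "INSERT INTO file_queue (file_path, arrived_at) VALUES (?, ?)",
        (file_path, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def process_queue(conn: sqlite3.Connection, load_file) -> list:
    """Called by the single scheduled run: loads pending files in arrival order."""
    pending = conn.execute(
        "SELECT id, file_path FROM file_queue WHERE processed_at IS NULL ORDER BY id"
    ).fetchall()
    processed = []
    for row_id, file_path in pending:
        load_file(file_path)  # if this raises, the remaining files stay pending
        conn.execute(
            "UPDATE file_queue SET processed_at = ? WHERE id = ?",
            (datetime.now(timezone.utc).isoformat(), row_id),
        )
        conn.commit()
        processed.append(file_path)
    return processed
```

Because only one run ever writes to the destination table, the concurrency question disappears. The trade-off is latency: a file waits until the next scheduled run.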

Error handling that actually works

Most pipelines I look at on new client engagements have no real error handling. The pipeline either works or it does not. If it does not, someone notices when a report is wrong and we go from there.

This is not good enough for anything important. A few things we always wire up:

Every pipeline gets a try-catch structure with a notification step on failure. Failure should mean someone gets emailed or messaged, with enough detail to start investigating. "Pipeline failed" is not enough detail. We include the pipeline name, the run ID, the activity that failed, and the error message.

Critical pipelines get data validation checks as well. Even if the pipeline does not fail, it might silently produce wrong results - a source system might be returning empty data because of an upstream issue. We add validation steps that check row counts, sum values, or compare against expected ranges. If the validation fails, we treat it as a pipeline failure.

Retry logic on individual activities for transient failures. Network glitches happen. Database connections drop. A short retry handles 80% of transient issues without anyone needing to wake up.
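In a pipeline, retries are normally a count and an interval configured on the activity rather than code you write. The Python below only illustrates the behaviour you are asking for, under the assumption that transient failures surface as `ConnectionError` or `TimeoutError`. The helper name `run_with_retry` and its defaults are illustrative.

```python
import time


def run_with_retry(
    action,
    attempts: int = 3,
    delay_seconds: float = 30.0,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Run `action`, retrying only on errors that are likely to be transient."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the failure path take over
            sleep(delay_seconds)
```

Note that anything not listed in `retry_on` fails immediately. Retrying a genuine bug, such as a bad column name, just delays the alert.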

The thing nobody likes hearing is that good error handling more than doubles the size of a pipeline. A pipeline that does the real work in three activities can end up with another six activities for error handling and two more for validation. That is normal. The work is in the boring bits, not the headline activities.
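To make two of those boring bits concrete, here is a minimal sketch of a validation step and of the failure message we want in the alert. The thresholds, field names, and the `RunContext` class are assumptions for illustration. In a real pipeline the row count and totals would come from a lookup or script activity, and the message would go to email or a chat channel.

```python
from dataclasses import dataclass


@dataclass
class RunContext:
    pipeline_name: str
    run_id: str


def validate_load(row_count: int, total_amount: float,
                  min_rows: int, amount_range: tuple) -> list:
    """Return a list of problems; an empty list means the load looks sane."""
    problems = []
    if row_count < min_rows:
        problems.append(
            f"row count {row_count} is below the expected minimum {min_rows}"
        )
    low, high = amount_range
    if not (low <= total_amount <= high):
        problems.append(
            f"total amount {total_amount} is outside the expected range {low} to {high}"
        )
    return problems


def failure_message(ctx: RunContext, activity: str, error: str) -> str:
    """Build an alert with enough detail to start investigating."""
    return (
        f"Pipeline: {ctx.pipeline_name}\n"
        f"Run ID: {ctx.run_id}\n"
        f"Failed activity: {activity}\n"
        f"Error: {error}"
    )
```

A validation failure is then reported through the same path as any other failure, for example `failure_message(ctx, "validate_load", "; ".join(problems))`.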

Performance and cost

Fabric capacity costs money. If you are running pipelines on a shared capacity, a badly tuned pipeline can take down the capacity for other workloads. We have walked into situations where one team's nightly refresh was hammering the whole tenant and nobody knew because no one was watching capacity metrics.

A few practical things that help:

For copy activities, look at the parallelism settings. The defaults are often too conservative. Bumping parallelism on a copy from a database to OneLake can take a 90 minute job down to 15 minutes with no other changes. Test it though. Going too high can saturate your source system and cause its own problems.

For data transformations, do them as far upstream as possible. If you can push a filter or an aggregation into a source SQL query, the database does it and the data over the wire is smaller. Pulling everything into a dataflow and filtering there is almost always slower and more expensive.

Schedule heavy pipelines outside business hours. Sounds obvious, but a surprising number of clients have their biggest pipelines running at 10am because someone set them up while developing and never moved them. Run heavy work at 2am when nobody is looking at reports.

Monitor capacity utilisation. Fabric has a capacity metrics app that tells you which workloads are using what. If you do not look at it, you have no idea what is going on. We typically review capacity metrics weekly during the first month of a deployment to catch problems early.
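The point about pushing filters and aggregations into the source query is easy to demonstrate. The sketch below uses an in-memory SQLite table with made-up data, so the table and column names are purely illustrative. Both approaches produce the same totals, but the pushed-down query returns far fewer rows to the pipeline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("2026-05-01", "NSW", 120.0),
        ("2026-05-01", "VIC", 80.0),
        ("2026-05-02", "NSW", 200.0),
        ("2026-04-30", "NSW", 50.0),
    ],
)

# Option 1: pull every row, then filter and aggregate downstream.
all_rows = conn.execute("SELECT order_date, region, amount FROM orders").fetchall()
downstream = {}
for order_date, region, amount in all_rows:
    if order_date >= "2026-05-01":
        downstream[region] = downstream.get(region, 0.0) + amount

# Option 2: let the source database filter and aggregate.
pushed_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders "
    "WHERE order_date >= ? GROUP BY region ORDER BY region",
    ("2026-05-01",),
).fetchall()
pushed = dict(pushed_rows)

print(len(all_rows), "rows transferred without pushdown")
print(len(pushed_rows), "rows transferred with pushdown")
```

With four rows the difference is trivial. With a few hundred million it is the difference between a copy that finishes and one that eats the capacity.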

Source control beyond the pipeline JSON

Pipelines are the visible part of Data Factory work. The invisible parts are the connections (what Azure Data Factory called linked services), the parameters, and the credentials. These usually live outside the pipeline definition and they need their own operational story.

Connections in Fabric are workspace-level objects. Moving a pipeline between workspaces does not automatically move its connections. You have to recreate them in the target workspace, and if the connection details are different (different database server in prod, for example), you parametrise them. This is more work than it sounds, especially the first time.
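One way to think about the parametrisation is as a lookup from environment name to connection details, resolved once at the start of a run. The sketch below is illustrative only. The server and database names are placeholders, and in Fabric the equivalent values would live in pipeline parameters or deployment configuration rather than in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlTarget:
    server: str
    database: str


# Placeholder values: replace with your own environments.
ENVIRONMENTS = {
    "dev": SqlTarget("sql-dev.example.internal", "sales_dev"),
    "test": SqlTarget("sql-test.example.internal", "sales_test"),
    "prod": SqlTarget("sql-prod.example.internal", "sales"),
}


def resolve_target(environment: str) -> SqlTarget:
    """Look up connection details for an environment, failing loudly on typos."""
    try:
        return ENVIRONMENTS[environment]
    except KeyError:
        raise ValueError(
            f"Unknown environment {environment!r}; expected one of {sorted(ENVIRONMENTS)}"
        ) from None
```

The important property is that the pipeline definition itself is identical in every workspace. Only the environment value changes.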

Credentials are stored separately and ideally come from Azure Key Vault. Hardcoding credentials in a pipeline is something I still see in dev environments and it scares me every time. Get the Key Vault integration set up early and use it consistently. We do not let any client pipeline we manage have credentials sitting in pipeline definitions.
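A crude check that can run in code review or CI is to scan the serialised pipeline definitions for anything that looks like an embedded credential. The sketch below is a heuristic, not a security tool. The key and value patterns are assumptions about what a leaked secret tends to look like, and it will produce false positives (for example, a field that merely names a Key Vault secret).

```python
import json
import re

# Heuristic patterns only: tune them for your own definitions.
SUSPICIOUS_KEY = re.compile(r"(password|secret|accountkey|sas_?token)", re.IGNORECASE)
SUSPICIOUS_VALUE = re.compile(r"(password|pwd|accountkey)\s*=\s*[^;\s]+", re.IGNORECASE)


def find_hardcoded_credentials(definition_json: str) -> list:
    """Return JSON paths whose key or value looks like an embedded credential."""
    findings = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                child = f"{path}.{key}" if path else key
                if isinstance(value, str) and value and SUSPICIOUS_KEY.search(key):
                    findings.append(child)  # e.g. "password": "hunter2"
                else:
                    walk(value, child)
        elif isinstance(node, list):
            for index, item in enumerate(node):
                walk(item, f"{path}[{index}]")
        elif isinstance(node, str) and SUSPICIOUS_VALUE.search(node):
            findings.append(path)  # e.g. a connection string containing Password=...

    walk(json.loads(definition_json), "")
    return findings
```

An empty result does not prove a definition is clean, but a non-empty one is a good reason to block the merge and ask questions.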

Documentation that survives

Pipelines have a habit of becoming undocumented six months after they go live. Someone built them, then left the project, then nobody really knows what they do. Then you have to change them and you are reverse-engineering production code at 11pm on a Friday before a quarter close.

The simplest thing that works is a README in the workspace describing what each pipeline does, what data it processes, where the source is, where the destination is, and who owns the business process it supports. One paragraph per pipeline. Update it when pipelines change.

For complex pipelines, a sequence diagram in the README helps. It does not have to be fancy. A list of steps with arrows. The goal is that someone who has never seen the pipeline can read the README and have a rough mental model in three minutes.

We have started using pipeline names that carry information. Instead of "Pipeline1" or "Sales_ETL", we use names like "daily_sales_orders_dw_to_lakehouse" that tell you the schedule, the data domain, the source, and the destination. Long names are fine. Searchable beats short.
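A naming convention only holds if something checks it. As a sketch, the pattern below parses names of the form schedule, domain, source, "to", destination. The allowed schedule words and the exact pattern are assumptions, so adjust them to whatever convention your team agrees on.

```python
import re

# Assumed convention: <schedule>_<domain>_<source>_to_<destination>
NAME_PATTERN = re.compile(
    r"^(?P<schedule>hourly|daily|weekly|monthly)_"
    r"(?P<domain>[a-z0-9]+(?:_[a-z0-9]+)*)_"
    r"(?P<source>[a-z0-9]+)_to_"
    r"(?P<destination>[a-z0-9]+)$"
)


def parse_pipeline_name(name: str):
    """Return the name's parts as a dict, or None if it breaks the convention."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


print(parse_pipeline_name("daily_sales_orders_dw_to_lakehouse"))
print(parse_pipeline_name("Pipeline1"))
```

Running a check like this over the workspace's pipeline names during review catches "Pipeline1" before it reaches production.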

What we actually do for clients

For most Australian customers we work with, our standard operationalisation approach looks like this:

Set up dev, test, production workspaces with Git integration on day one. No exceptions.

Build pipelines in dev with full error handling and validation from the start, not as an afterthought.

Deploy through Fabric deployment pipelines with manual approval at the prod step. Manual is fine. Most data teams do not deploy frequently enough to need full CI/CD, and the cost of an automated bad deployment is high.

Wire up monitoring and alerting before the first production run, not after the first incident.

Schedule a quarterly review of pipeline performance, costs, and incidents. Things drift over time. Reviews catch the drift.

This sounds like a lot. It is a lot, the first time. The team gets faster as they go, and the alternative (running pipelines on hope and dealing with incidents as they come) is much more expensive in the long run.

Final thoughts

Fabric Data Factory is a good product. Better than its predecessors in most ways, and the integration with the rest of Fabric is genuinely useful. But it does not operationalise itself. The work to take a pipeline from "works in dev" to "runs reliably in production for two years" is real work and it is mostly outside the pipeline canvas.

If your team is new to this kind of operational discipline, getting some outside help on the first few pipelines is usually money well spent. Not because the work is technically hard but because you do not yet know what you do not know. We do this kind of implementation work regularly and the patterns we set up early tend to define what the team does for years.

Reference: Operationalize Data Factory on Microsoft Learn.