Fabric Data Factory Dataflows - How We Use Them to Transform Data Without the Usual Pain
If you've spent any time inside Microsoft Fabric, you've probably hit the moment where you've ingested a pile of raw data into OneLake and realised it's a mess. Column names that look like they were typed by someone in a hurry. Date fields stored as text. Numbers with currency symbols glued to them. The classic shape of data when it leaves a source system and arrives somewhere new.
Dataflows in Fabric Data Factory are designed for exactly this part of the work. They are the second tutorial Microsoft publishes for a reason. Get pipelines moving the bits around, then get dataflows shaping the bits into something usable. This post is a walkthrough of how we use them in real client work, what's actually nice about the experience, and a few of the rough edges we keep running into.
What a dataflow is in Fabric, in plain English
A dataflow in Fabric is a Power Query experience that runs on Spark under the hood and writes its results into a Lakehouse, Warehouse, or another Fabric destination of your choice. If you've used Power Query in Excel or Power BI, the editor will feel like coming home. Same ribbon. Same applied steps panel. Same right-click menu when you want to split a column or change a data type.
The difference is what happens when you hit refresh. Instead of pulling data into a local model, the steps get translated into a SQL or Spark plan that executes on the Fabric capacity. The output gets persisted to wherever you pointed it. You're not loading data into a report. You're shaping data and depositing it into the lakehouse so the rest of your stack can use it.
This is the bit that catches people out who are used to Power BI dataflows. In Fabric, the dataflow is part of the data engineering layer, not the analytics layer. The output is a table you can query, join, model, or layer on top of with a semantic model. It is not a model in itself.
The walkthrough, roughly
The Microsoft Module 2 tutorial walks you through taking the data you ingested in Module 1 and applying a set of transformations to it. The flow usually goes like this:
- Open your workspace, create a new Dataflow Gen2.
- Connect to your source - in the tutorial it's the Lakehouse you populated in Module 1. In real life it might be a SQL database, a SharePoint file, a REST API, or any of the 150 odd connectors Power Query supports.
- Apply transformations. Rename columns. Cast data types. Filter out rubbish rows. Unpivot the columns your finance team insists on putting across the top of every spreadsheet.
- Set a destination. In the tutorial it writes back to the Lakehouse. You can also send to a Warehouse, an Azure SQL DB, or a few other places.
- Save and run. The dataflow now runs on demand or on a schedule, and your transformed table is ready to be queried.
That's it as a workflow. The real work is in steps three and four, where you actually shape the data and decide where it should live.
What we like about Fabric dataflows
The biggest win, and I don't say this lightly, is the speed of building. If you have someone in your team who already knows Power Query (and in most Australian businesses that someone is the same person who built the original Excel-and-Power-BI reporting stack), they can be productive in dataflows on day one. There is no new language. There is no new editor. The thing they already use just has a different output.
This matters more than the architecture diagrams suggest. We've sat with finance teams who can describe in detail how a particular transformation needs to work, but who would never write a Spark notebook. With dataflows, they can do the transformation themselves and we step in for the bits that need scale or scheduling.
The second thing we like is the integration with the rest of Fabric. The output of a dataflow is just another table in the Lakehouse or Warehouse. That means a notebook can pick it up. A semantic model can layer on top of it. A Data Pipeline can include the dataflow as a step in a wider workflow. Nothing is locked in. Nothing requires re-exporting.
The third is incremental refresh. Once you've got a working dataflow against a reasonably sized table, you can configure it to only pull new or changed rows on each run. The Fabric implementation is a lot more accessible than the original Power BI dataflow incremental refresh, which used to require Premium and a fair bit of squinting at the docs.
If you want a deeper view of how this fits into a real Fabric build, we cover it in more detail across our Microsoft Fabric consulting work with mid-market and enterprise clients.
Where it is still rough
Honest review time. Dataflows are not perfect, and a few things bite us regularly.
Long-running dataflows are hard to debug. When a dataflow refresh fails after twenty minutes, the error message is often a wall of stack trace that points to a step you can't easily isolate. The Power Query editor lets you preview each step, but the preview is sampled. The thing that breaks in production is usually the row that wasn't in the sample. We've ended up keeping a "diagnostics" branch of every important dataflow that filters down to the rows that have historically been problematic, so we can re-run just those.
Performance varies wildly. A dataflow that does five trivial transformations on a small table can take longer to start than to actually run. The startup cost on a Fabric capacity is real, particularly on smaller SKUs. For tiny tables, a Notebook with PySpark sometimes runs end to end in less time than the dataflow takes to warm up. We don't optimise this until it bites, but you should know it's there.
Source connector quirks. Power Query has a huge connector library, and the quality varies. The first-party Microsoft ones are mostly fine. The community-contributed ones can be flaky in ways that only become obvious in production. If you're using a connector you've never used before, do a thorough test against a realistic data volume before you commit to a refresh schedule.
Output to Warehouse can be slow. Writing the result of a dataflow into a Fabric Warehouse can be much slower than writing into a Lakehouse, particularly for wide tables. For most of our recent builds we land into the Lakehouse first and use a stored procedure or a shortcut to make the data available to the Warehouse if needed. Not elegant. It works.
A pattern we keep using
When we build a Fabric data platform for a client, the typical shape ends up looking like this:
- Pipelines move raw data into a Bronze Lakehouse. No transformations.
- Dataflows or Notebooks transform Bronze into a Silver Lakehouse with cleaned, typed, deduplicated data.
- A combination of dataflows, notebooks, and stored procedures build Gold tables that match the analytics needs.
- Semantic models sit on top of Gold for Power BI.
Dataflows live in the Bronze to Silver and Silver to Gold layers, doing the work that doesn't need the full power of a notebook. When something needs Python, machine learning, or complex windowed logic, a notebook is the right tool. When something is "rename, cast, filter, join, deduplicate", a dataflow is faster to build and easier for the wider team to maintain.
This split is the difference between a Fabric build that the client can own and one where they're stuck calling us every time they want a column added. We see it as a core part of the way we approach data platform work for Australian businesses.
A few practical tips
If you're about to build your first serious dataflow in Fabric, a couple of things worth knowing up front.
Set the destination table primary key thoughtfully. Power Query will let you pick a column or set of columns. If you pick wrong, your downstream queries will perform poorly and your incremental refresh logic will produce duplicates.
Don't try to do everything in one dataflow. A dataflow with 40 steps and three sources is a debugging nightmare. Two dataflows of 20 steps each, with the first writing to a staging table that the second reads from, will save you a lot of time when something inevitably breaks. The compute cost is roughly the same. The maintenance cost is much lower.
Use the staging dataflow pattern. There's a setting that lets a dataflow stage its results in an internal storage location before writing to the final destination. This sounds wasteful and sometimes is, but for dataflows that feed multiple downstream consumers it can be a big win.
Watch your capacity usage. Fabric capacities are billed on a consumption model. A dataflow that refreshes hourly when it could refresh daily is going to show up in your bill at the end of the month. We've taken over Fabric environments where someone left every dataflow on a 15 minute schedule "just in case", and the capacity was running hot for no reason.
Worth using? Yes, with judgement
If you're already in the Microsoft stack and you have a team that knows Power Query, Fabric dataflows are an easy win. They get you from raw data to usable data with less code than a notebook, and they're approachable enough that you can hand the work off to a business analyst rather than holding it inside a small Spark team.
If you're at high data volumes, or doing transformations that are genuinely complex, you'll still want notebooks in the mix. That's fine. The two work together, and a well-designed Fabric platform will use both.
For most of the Australian businesses we work with, the right answer is "yes, use dataflows for the work they're good at, and don't try to make them do everything." Which is the same advice we'd give about any tool.
If you'd like a hand with a Fabric build, or you've got a Fabric environment that has run away from you, we can help out. And the original Microsoft tutorial is a solid starting point if you want to walk through the steps yourself: Microsoft Fabric Data Factory - Module 2: Transform data with a dataflow.