Getting Started with Dataflow Gen2 in Microsoft Fabric - A Practical Walkthrough

March 30, 2026 · 8 min read · Michael Ridland

Most data teams I talk to across Australia are still running some version of the same workflow. Data sits in a dozen places, somebody exports a CSV, somebody else loads it into a spreadsheet, and then the finance team asks why the numbers don't match last month's report. Sound familiar?

Dataflow Gen2 is Microsoft's answer to this problem in Fabric - self-service data preparation that actually works at scale. We've been rolling it out with clients for a while now, and I wanted to write up an honest walkthrough of what it takes to build your first dataflow, because the official docs cover the what but not always the why.

What Is Dataflow Gen2, Really?

Think of Dataflow Gen2 as a cloud-based ETL tool that uses the Power Query engine you already know from Excel and Power BI, but running in the cloud with proper scheduling and destination support. If you've ever written an M expression in Power BI Desktop, you already understand about 70% of how Dataflow Gen2 works.

The big difference is that Dataflow Gen2 isn't tied to a single report or semantic model. You build your data transformations once, point the output at a lakehouse or warehouse, and every downstream report, dashboard, or analysis pulls from that single cleaned-up source. No more "which version of the customer table is correct?" arguments.

For Australian organisations dealing with data sovereignty requirements, the fact that all of this runs inside your Fabric capacity - which you can pin to an Australian Azure region - matters a lot.

Setting Up Your First Dataflow

Before you start, you need two things: a Fabric workspace with capacity assigned, and the right permissions to create items in that workspace. If you're on a trial, that works too - Microsoft gives you enough capacity to experiment.

Here's the process, step by step.

1. Create the dataflow

Open your Fabric workspace, click +New item, and select Dataflow Gen2. That's it. You're in the Power Query editor, which should look familiar if you've used Power Query anywhere else in the Microsoft ecosystem.

2. Connect to a data source

Click Get data and you'll see the connector gallery. Fabric supports hundreds of connectors - SQL Server, SharePoint, REST APIs, flat files, cloud services, you name it. For a first test, the OData connector is a good choice because it doesn't require authentication setup.

Microsoft's quickstart documentation uses the Northwind sample OData service (https://services.odata.org/v4/northwind/northwind.svc/), which gives you classic Orders and Customers tables to play with. It's a decent sandbox - real enough to demonstrate actual transformations without the complexity of production data.
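Behind the connector UI, everything Fabric generates is plain M, which you can inspect in the Advanced editor. Here's a sketch of roughly what the Orders query looks like after connecting (the exact generated steps and navigation syntax may differ slightly from what the UI produces):

```powerquery
let
    // Connect to the public Northwind v4 sample OData service
    Source = OData.Feed(
        "https://services.odata.org/v4/northwind/northwind.svc/",
        null,
        [Implementation = "2.0"]
    ),
    // Navigate to the Orders entity set
    Orders = Source{[Name = "Orders", Signature = "table"]}[Data]
in
    Orders
```

Nothing here is special to dataflows - the same query would work in Power BI Desktop, which is exactly why existing Power Query skills transfer so directly.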

Select the tables you want (Orders and Customers, in this case) and click Create. Your data lands in the Power Query editor ready for transformation.

3. Transform the data

This is where the real value sits. A few tips before you start clicking buttons:

Turn on Data Profiling first. Go to Home > Options > Global Options and enable all the Column profile options. This shows you value distributions, error counts, and null percentages for every column. I'm constantly surprised by how many teams skip this step and then wonder why their output has issues. Profiling tells you what's actually in your data before you start shaping it.

Also enable Diagram View from the View tab. It gives you a visual map of how your queries relate to each other, which becomes very useful once you have more than two or three queries in a dataflow.

For a practical example, let's say you want to rank customers by order volume. Here's the sequence:

  • In the Orders table, select the CustomerID column and use Group By under the Transform tab to count rows per customer
  • Switch to the Customers query and use Merge queries as new to join the grouped Orders data back onto Customers via CustomerID
  • Use Remove other columns to keep just CustomerID, CompanyName, and the merged Orders column
  • Expand the Orders column to pull in the Count
  • Use Rank column from the Add Column tab to rank customers by order count

Each of these steps creates an Applied Step in the query, and you can go back and modify any step without starting over. This is one of Power Query's best features - the step-by-step transformation model means you can experiment without fear.
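If you open the Advanced editor, the whole sequence above boils down to a short M script. Here's a hedged sketch - step names are mine, and the UI will generate slightly different ones - assuming `Orders` and `Customers` queries already exist in the dataflow:

```powerquery
let
    // 1. Count orders per customer (Group By on CustomerID)
    OrdersByCustomer = Table.Group(Orders, {"CustomerID"},
        {{"OrderCount", each Table.RowCount(_), Int64.Type}}),

    // 2. Left-join the grouped counts onto Customers (Merge queries as new)
    Merged = Table.NestedJoin(Customers, {"CustomerID"},
        OrdersByCustomer, {"CustomerID"}, "Orders", JoinKind.LeftOuter),

    // 3. Keep only the columns we care about (Remove other columns)
    Trimmed = Table.SelectColumns(Merged, {"CustomerID", "CompanyName", "Orders"}),

    // 4. Expand the nested Orders table to surface the count
    Expanded = Table.ExpandTableColumn(Trimmed, "Orders", {"OrderCount"}),

    // 5. Rank customers by order count, highest first
    Ranked = Table.AddRankColumn(Expanded, "Rank", {"OrderCount", Order.Descending})
in
    Ranked
```

Reading the generated M like this is worth the habit even if you never hand-write it - it makes code review, documentation, and debugging of dataflows much easier.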

Where to Send the Output

Once your data looks right, you need to decide where it goes. In the Query settings pane, scroll down and click Choose data destination. If you have a lakehouse set up, that's usually the best choice - it gives you both SQL and Spark access to the same data.

You'll pick a lakehouse, name your destination table, and choose between Append (add new rows to existing data) and Replace (overwrite the table each time). For most initial setups, Replace is simpler and avoids duplicate row issues while you're still refining your transformations.

One thing to watch out for: the first time you create a Dataflow Gen2 in a workspace, Fabric creates some background infrastructure items - a staging lakehouse and warehouse. These have names starting with DataflowStaging. Don't delete them. They're shared by all dataflows in the workspace and removing them breaks things in ways that are annoying to debug.

Publishing and Scheduling

Click Publish and your dataflow saves and runs. You'll see a spinner icon next to the dataflow name in your workspace - when it disappears, the first run is complete.

For ongoing use, set up a scheduled refresh. Click the Schedule Refresh icon next to your dataflow in the workspace, turn it on, and set a time. Daily refreshes at 4am or 5am AEST tend to work well for most business reporting scenarios - the data is fresh by the time anyone opens a dashboard in the morning.

A word on refresh timing: if your dataflow feeds into a semantic model that also has scheduled refresh, make sure the dataflow finishes before the semantic model starts. This seems obvious, but we've seen it go wrong at several client sites. Build in at least a 30-minute buffer between the two schedules.

What Works Well and What Doesn't

After deploying Dataflow Gen2 across a range of Australian businesses, here's my honest take.

What works well:

The Power Query interface is mature and familiar. If your data team already knows M expressions from Power BI, they can be productive in Dataflow Gen2 on day one. The connector library is massive. Schema view makes it easy to manage wide tables without endless horizontal scrolling. And the ability to output directly to a lakehouse or warehouse, rather than just into a Power BI semantic model, is a proper upgrade over the original Dataflow Gen1.

What's still rough:

Error messages during refresh failures could be better. When a dataflow fails at 4am, you want clear information about which step broke and why. Sometimes you get that. Sometimes you get a generic error that sends you on a wild goose chase. Monitoring and alerting around dataflow refreshes is also still maturing - if you need production-grade monitoring, plan to build some of that yourself using Fabric's APIs.

Performance on very large datasets (tens of millions of rows) can be unpredictable. For those scenarios, you might be better off using a Fabric notebook or a Data Factory pipeline with Copy activities. Dataflow Gen2 is best suited for small-to-medium data volumes with complex transformation logic, not raw data movement at scale.

What to watch out for:

The merge and append operations in Power Query are powerful but can be memory-hungry. If you're merging two large tables, keep an eye on your capacity utilisation. Also, be deliberate about which columns you keep at each stage - removing unnecessary columns early in your transformation chain improves both performance and readability.
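As a rule of thumb, prune as the first applied step, not the last. A minimal sketch of the pattern in M, assuming a wide `Orders` source query (column names here are illustrative):

```powerquery
let
    Source = Orders,
    // Prune immediately so every later step works on three columns, not thirty
    Slim = Table.SelectColumns(Source, {"OrderID", "CustomerID", "OrderDate"}),
    // Subsequent filters, merges, and groups now carry far less data in memory
    Recent = Table.SelectRows(Slim, each [OrderDate] >= #date(2025, 1, 1))
in
    Recent
```

The same logic applies when you do it through the UI: put your Remove other columns step near the top of the Applied Steps list, not after the expensive merges.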

When to Use Dataflow Gen2 vs Other Fabric Tools

This is a question we get all the time at Team 400. The short answer:

  • Dataflow Gen2 - best for business-logic-heavy transformations where your data team is comfortable with Power Query. Good for cleansing, reshaping, and enriching data from multiple sources.
  • Data Factory pipelines - best for orchestrating data movement between systems. Copy activity for moving large volumes, pipeline activities for scheduling and dependencies.
  • Notebooks - best for data science workloads, very large-scale transformations, or anything that needs Python/Spark.

In practice, most Fabric deployments we build use a combination of all three. Pipelines move raw data in, dataflows clean and transform it, and notebooks handle anything that needs custom code or machine learning. The sweet spot for Dataflow Gen2 is that middle layer - taking messy source data and turning it into clean, business-ready tables.

Getting Help With Fabric

If your organisation is looking at Microsoft Fabric and not sure where to start, we work with businesses across Australia on Fabric consulting and implementation. Whether it's setting up your first workspace, designing your data architecture, or training your team on Power Query, feel free to get in touch.

Fabric is a big platform with a lot of moving parts. But starting with a simple Dataflow Gen2 - connecting to a data source, applying some transformations, and landing the output in a lakehouse - is one of the best ways to get hands-on experience without overthinking the architecture. Build something small, learn how the pieces fit together, and scale from there.

We also run data and AI strategy sessions that help organisations figure out which parts of Fabric make sense for their specific data challenges. Sometimes the answer is "all of it" and sometimes it's "start with just Power BI and Dataflow Gen2 for now." Both are valid depending on where you are in your data maturity journey.