Microsoft Fabric Data Factory - Getting Started with End-to-End Data Integration

March 28, 2026 · 9 min read · Michael Ridland

Microsoft Fabric's Data Factory is the piece of Fabric that most data teams will touch first. It's where raw data enters the platform, gets cleaned up, and starts becoming useful. And if your organisation is on the Microsoft stack - which describes most Australian enterprises I work with - it's probably going to be your primary data integration tool within the next couple of years.

I've been watching Fabric mature since its launch, and Data Factory specifically has gone from "interesting preview" to "genuinely production-ready" faster than I expected. That said, there are still things you need to understand before jumping in, and Microsoft's own tutorial - while decent - skips over some of the practical realities.

What Data Factory in Fabric Actually Is

If you've used Azure Data Factory (ADF), the Fabric version will feel familiar but also different in ways that matter. Both handle data movement and transformation. But Fabric's Data Factory sits inside the broader Fabric ecosystem, which means your pipelines, dataflows, lakehouses, and notebooks all live in the same workspace. No more juggling between Azure portal tabs.

Microsoft describes three core capabilities, and this framing is accurate:

Copy jobs move data from source to destination. Think of them as your data movers - they pull from hundreds of connectors (databases, APIs, file storage, SaaS platforms) and land data in your Lakehouse. The scale here is real. We've moved petabyte-scale datasets through copy jobs without issues.

Dataflow Gen2 handles transformation. If you've used Power Query before, you'll recognise the interface. It's the same low-code, visual transformation experience, but running at cloud scale instead of on your laptop. You get 300+ transformations and can write output to multiple destinations - Lakehouse tables, Azure SQL, warehouses, and more.

Pipelines orchestrate everything. Chain your copy jobs, dataflows, notebooks, and stored procedures together. Run things in sequence or parallel. Add conditions, loops, and error handling. Monitor the whole flow from one place.

The Microsoft tutorial walks through all three using the NYC Taxi dataset, and it's a reasonable way to spend an hour getting familiar with the mechanics. But let me tell you what the tutorial doesn't cover.

The Medallion Architecture - Why It Matters

The tutorial mentions bronze and gold tables without explaining why this matters. Here's the short version.

The medallion architecture (bronze, silver, gold) is a pattern for organising data in a lakehouse. Bronze holds raw data exactly as it arrived. Silver is cleaned and validated. Gold is business-ready, aggregated, and modelled for consumption.

Data Factory's natural workflow maps to this pattern well. Copy jobs land data in bronze. Dataflows clean it up and move it to silver or gold. It's not the only way to organise your lakehouse, but it's a sensible default that we use with most clients.

The mistake I see is teams skipping straight to gold. They pull data in, transform it immediately, and only keep the final output. Six months later, someone asks a question that requires the raw data, and it's gone. Always keep your bronze layer. Storage is cheap. Regret is expensive.
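The bronze/silver split is easier to see in code than in prose. Here's a minimal sketch in plain Python - in Fabric this would be a copy job landing raw data followed by a dataflow or notebook producing the cleaned table, and the dataset and column names here are illustrative, not from the tutorial:

```python
# Hedged sketch: bronze keeps every row exactly as it arrived; silver is the
# validated, typed subset. Names and validation rules are illustrative.

raw_trips = [  # bronze: raw data, bad rows and all
    {"trip_id": "1", "fare": "12.50"},
    {"trip_id": "2", "fare": "not_a_number"},
    {"trip_id": "3", "fare": "-4.00"},
]

def to_silver(bronze: list) -> list:
    """Validate and type the raw rows; drop anything that fails."""
    silver = []
    for row in bronze:
        try:
            fare = float(row["fare"])
        except ValueError:
            continue                      # unparseable fare: excluded from silver
        if fare < 0:
            continue                      # negative fares fail validation
        silver.append({"trip_id": int(row["trip_id"]), "fare": fare})
    return silver

silver_trips = to_silver(raw_trips)
print(silver_trips)   # only trip 1 survives; bronze still holds all three rows
```

The point of the pattern: trips 2 and 3 are gone from silver, but because bronze is untouched, you can revisit them when someone asks why the numbers changed.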

Setting Up Your First Pipeline - Practical Notes

The tutorial has you creating a pipeline with a copy job that moves data from Azure Blob Storage into a Lakehouse. The mechanics are straightforward: create a pipeline, add a copy activity, configure source and destination, run it. But here's what I'd add from experience.

Name things properly from day one. I know it sounds tedious, but when you've got 40 pipelines in a workspace, "Pipeline 1" and "Copy data" aren't going to help anyone. We use a naming convention like pl_source_destination_description for pipelines and df_description for dataflows. Pick a convention and stick to it.
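If you want to enforce the convention rather than just document it, a check like this can run in a notebook against the workspace's item names. The patterns below encode our convention, not any Fabric rule - adjust to whatever you pick:

```python
import re

# Hedged sketch: validators for the pl_source_destination_description and
# df_description naming conventions mentioned above. Patterns are our own.

PIPELINE_PATTERN = re.compile(r"^pl_[a-z0-9]+_[a-z0-9]+_[a-z0-9_]+$")
DATAFLOW_PATTERN = re.compile(r"^df_[a-z0-9_]+$")

def is_valid_name(name: str) -> bool:
    """True if the name follows either the pipeline or dataflow convention."""
    return bool(PIPELINE_PATTERN.match(name) or DATAFLOW_PATTERN.match(name))

print(is_valid_name("pl_blob_lakehouse_daily_trips"))  # True
print(is_valid_name("Pipeline 1"))                     # False
```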

Use parameterised connections. Don't hardcode your source paths and connection strings. Use pipeline parameters so the same pipeline can point at different environments (dev, test, prod) without modification. This saves enormous amounts of time when you're promoting changes through environments.
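Fabric pipelines handle this natively with pipeline parameters and expressions (the same `@pipeline().parameters.name` expression language as ADF). The logic is simple enough to sketch in plain Python - the storage paths below are made-up placeholders:

```python
# Hedged sketch: one pipeline, one "env" parameter, three environments.
# In a real pipeline the lookup happens in the copy activity's source path
# expression; paths here are invented for illustration.

SOURCE_PATHS = {
    "dev":  "abfss://dev@yourstorageacct.dfs.core.windows.net/raw/trips/",
    "test": "abfss://test@yourstorageacct.dfs.core.windows.net/raw/trips/",
    "prod": "abfss://prod@yourstorageacct.dfs.core.windows.net/raw/trips/",
}

def resolve_source_path(env: str) -> str:
    """Map an environment parameter to its source path; fail loudly otherwise."""
    if env not in SOURCE_PATHS:
        raise ValueError(f"Unknown environment: {env}")
    return SOURCE_PATHS[env]

print(resolve_source_path("dev"))
```

The same pipeline definition then promotes through dev, test, and prod with nothing changed but the parameter value.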

Understand the Lakehouse table format. When your copy job lands data in a Lakehouse, it writes Delta tables. This is important because Delta gives you versioning, time travel, and ACID transactions for free. But it also means you need to think about how data gets written - append, overwrite, or merge. The default is usually append, which is fine for initial loads but will give you duplicates if you run the pipeline twice.
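The append-versus-merge distinction is worth seeing concretely. This plain-Python sketch simulates the two write modes (in a Lakehouse you'd use Delta's actual append/merge operations; table and key names here are illustrative):

```python
# Hedged sketch: why re-running an append-mode pipeline duplicates rows,
# while a merge (upsert) keyed on a business key is safe to re-run.

def append(table: list, incoming: list) -> list:
    """Append mode: every incoming row is added, even if already present."""
    return table + incoming

def merge(table: list, incoming: list, key: str) -> list:
    """Merge mode: update rows matching on key, insert the rest."""
    by_key = {row[key]: row for row in table}
    for row in incoming:
        by_key[row[key]] = row          # update if the key exists, else insert
    return list(by_key.values())

batch = [{"trip_id": 1, "fare": 12.5}, {"trip_id": 2, "fare": 8.0}]

appended_twice = append(append([], batch), batch)
merged_twice = merge(merge([], batch, "trip_id"), batch, "trip_id")

print(len(appended_twice))  # 4 rows - the second run duplicated everything
print(len(merged_twice))    # 2 rows - the second run was idempotent
```

The practical rule: append is fine for a one-off initial load, but anything on a schedule should either merge on a key or overwrite the partition it owns.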

Dataflow Gen2 - Where Power Query Meets Cloud Scale

If you've used Power Query in Excel or Power BI Desktop, Dataflow Gen2 will feel like coming home. Same visual interface, same M language underneath, same transformation steps. The difference is that it runs in Fabric's compute, not on your machine.

This matters more than it might seem. I've had clients with Power Query transformations that take 45 minutes to refresh on a decent laptop. The same logic in Dataflow Gen2 completes in minutes because it's running on distributed compute.

Where Dataflow Gen2 really shines:

The low-code interface makes it accessible to analysts who aren't comfortable writing code. Your business analyst who already knows Power Query can build and maintain dataflows without learning Python or SQL. That's a genuine advantage for smaller teams where the BI person is also the data engineer.

Where it gets frustrating:

Complex transformations can be hard to express in the visual interface. If you're doing multi-step joins across five tables with conditional logic, the Power Query editor starts feeling cramped. For those scenarios, consider using a Fabric notebook (Python or Spark SQL) instead. You can call notebooks from your pipeline just like you call dataflows.
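To make the notebook alternative concrete, here's the shape of a lookup-enrichment step that gets awkward as visual steps but stays readable as code. In Fabric you'd typically use PySpark DataFrames; plain dicts stand in here so the sketch runs anywhere, and the table and column names are illustrative:

```python
# Hedged sketch: a left join from trips to a zone lookup, flagging unmatched
# keys instead of silently dropping rows - easy to express and review as code.

trips = [{"trip_id": 1, "zone_id": 10, "fare": 12.5},
         {"trip_id": 2, "zone_id": 99, "fare": 8.0}]
zones = {10: "Midtown"}          # zone 99 is missing from the lookup

def enrich(trips: list, zones: dict) -> list:
    """Attach zone_name to each trip, keeping rows with no matching zone."""
    return [{**t, "zone_name": zones.get(t["zone_id"], "UNKNOWN")} for t in trips]

for row in enrich(trips, zones):
    print(row)
```

Because notebooks are first-class pipeline activities, swapping a dataflow for a notebook changes nothing about the orchestration around it.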

Also, debugging dataflows is still rougher than it should be. When something fails in the middle of a 20-step transformation, the error messages don't always point you to the right step. Test your transformations incrementally - add a few steps, verify the output, add more.

Automating the Whole Flow

This is where Fabric Data Factory earns its keep. A pipeline that runs manually is a demo. A pipeline that runs on schedule, handles failures, and notifies people is a production system.

Scheduling is straightforward - you set a trigger and it runs at the specified interval. Daily, hourly, whatever makes sense for your data freshness requirements. Most of our clients run daily overnight refreshes for their warehousing workloads.

Error handling needs more thought than the tutorial gives it. Add failure paths to your pipeline activities. If a copy job fails, do you want to retry? Send an email? Log the error and continue with the next source? These decisions matter in production, and it's much easier to build them in from the start than to retrofit them after your first 3am failure.
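Pipeline activities have built-in retry settings, but the same logic applies inside notebook steps, where you have to write it yourself. A minimal retry-with-backoff sketch (attempt counts and delays are example values):

```python
import time

# Hedged sketch: retry a flaky step a few times with growing backoff before
# letting the failure propagate up to the pipeline's failure path.

def run_with_retries(step, max_attempts=3, backoff_seconds=0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:
            if attempt == max_attempts:
                raise                      # out of retries: fail the activity
            print(f"Attempt {attempt} failed ({exc}); retrying")
            time.sleep(backoff_seconds * attempt)

calls = {"n": 0}
def flaky_copy_job():
    """Stand-in for a source that times out twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient source timeout")
    return "copied"

print(run_with_retries(flaky_copy_job))   # succeeds on the third attempt
```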

Email notifications are built in through Office 365 connectors. The tutorial covers this, and it works. But for anything beyond basic notifications, consider using a Teams webhook or integrating with your existing alerting tools. We've set up pipelines that post to Teams channels with detailed success/failure summaries, which is far more useful than a generic "pipeline completed" email.
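For the Teams route, a notebook step at the end of the pipeline can post a summary to an incoming webhook. This sketch uses the simple MessageCard payload shape that Teams incoming webhooks accept; the webhook URL is a placeholder and the field values are examples:

```python
import json
import urllib.request

# Hedged sketch: build a success/failure summary card and post it to a Teams
# incoming webhook. Pipeline name, row count, and URL are illustrative.

def build_summary(pipeline: str, status: str, rows: int) -> dict:
    colour = "00FF00" if status == "Succeeded" else "FF0000"
    return {
        "@type": "MessageCard",
        "@context": "https://schema.org/extensions",
        "themeColor": colour,
        "title": f"{pipeline}: {status}",
        "text": f"Rows processed: {rows}",
    }

def post_to_teams(webhook_url: str, payload: dict) -> None:
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)   # fire-and-forget; add error handling in prod

payload = build_summary("pl_blob_lakehouse_daily_trips", "Succeeded", 48231)
print(payload["title"])
# post_to_teams("https://your-tenant.webhook.office.com/...", payload)
```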

Azure Data Factory vs Fabric Data Factory

This is the question every existing ADF user asks, and the answer is nuanced.

If you're already invested in Azure Data Factory and it's working well for your needs, there's no urgent reason to migrate. ADF isn't going anywhere. But if you're starting new data integration projects, Fabric's Data Factory is where Microsoft is putting its investment.

The main advantages of Fabric Data Factory over ADF:

  • Unified experience. Everything lives in one workspace. No switching between portals.
  • Lakehouse integration. Direct, first-class support for Lakehouse tables rather than treating them as just another destination.
  • OneLake. All your data is in one logical data lake, regardless of which Fabric workload created it.
  • Simplified licensing. Fabric capacity covers everything, rather than paying per pipeline run.

The main things you give up:

  • Mature SSIS integration. If you're running legacy SSIS packages, ADF's SSIS Integration Runtime is still the better option.
  • Some enterprise features. ADF has had longer to mature, and some governance and monitoring features are more developed.

We wrote about this comparison in more detail in our earlier post on Azure Data Factory vs Fabric Data Factory, which covers the decision framework we use with clients.

What Australian Businesses Should Think About

Most of the Australian organisations we work with are somewhere on the Fabric adoption spectrum. Some are all-in, some are evaluating, and some haven't started yet. Regardless of where you are, here's what I'd recommend.

Start with a real use case, not a sandbox. The tutorial is good for learning the mechanics, but the real learning happens when you connect to your actual data sources and deal with your actual data quality issues. Pick a small but real project - a monthly report that currently takes someone three days of manual data wrangling, for example.

Get your licensing sorted early. Fabric licensing is capacity-based (F-SKUs or P-SKUs). If your organisation already has Power BI Premium capacity, you might already have access to Fabric. If not, you'll need to provision capacity. Start with F2 or F4 for development and testing, and scale up for production.

Think about governance from the start. Fabric makes it easy to create workspaces, lakehouses, and pipelines. Too easy, sometimes. Without governance, you'll end up with sprawling workspaces and duplicated data. Define your workspace structure, naming conventions, and access controls before things get messy.

Don't ignore the learning curve for your team. Even if your people know Power Query and ADF, Fabric has enough new concepts (lakehouses, medallion architecture, OneLake, capacity management) that training is worthwhile. We've run workshops specifically on Fabric Data Factory for teams transitioning from ADF, and the ramp-up time is typically two to three weeks of focused learning.

Getting Help

Data Factory in Fabric is where modern Microsoft data integration is heading. The tooling is solid, the patterns are proven, and the integration with the rest of the Fabric ecosystem gives it advantages that standalone tools can't match.

If your organisation is planning a Fabric deployment or looking to migrate existing data pipelines, our Microsoft Fabric consulting team can help you plan and execute it properly. We've done this enough times to know where the hidden complexities live.

For broader data strategy - figuring out how Fabric fits alongside your existing data warehouse, your Power BI reports, and your AI initiatives - our data integration consultants can map out a practical roadmap.

And if you're exploring how AI fits into your data pipeline - using AI models to enrich data as it flows through Fabric - our AI consulting practice can help you think through where machine learning adds genuine value versus where it's just adding complexity.

Fabric Data Factory isn't perfect. The debugging experience needs work, the documentation has gaps, and some features still feel like they're catching up to ADF. But for new projects on the Microsoft stack, it's the right bet.