Back to Blog

Configuring the Copy Activity in Microsoft Fabric Pipelines - A Practical Guide

June 19, 20268 min readMichael Ridland

Almost every data project I've worked on starts in the same place: someone needs to get data out of one system and into another. A sales database into a warehouse. A flat file from a vendor into a lakehouse. An on-premises SQL Server into the cloud so the analysts can finally stop running reports off the production box at 6am. The Copy activity in Microsoft Fabric Data Factory is the workhorse that does this, and getting it right is the difference between a pipeline you set and forget and one that pages you every Tuesday.

We do a lot of Fabric work for Australian organisations, and the Copy activity comes up on basically every engagement. It looks simple in the UI, and for the easy cases it genuinely is. But there's a gap between "I dragged a Copy activity onto the canvas and it ran once" and "this thing reliably moves 40 million rows every night without falling over." This post is about closing that gap. Microsoft's reference documentation for configuring the Copy activity is worth keeping open while you build, but here's how we think about it.

What the Copy Activity Actually Does

At its core the Copy activity reads from a source and writes to a sink. That's it. You point it at where the data lives, you point it at where you want the data to go, and Fabric handles the movement, the type conversion, and the plumbing in between.

The reason it feels more complicated than that is the sheer number of connectors. Fabric supports well over a hundred sources and destinations, from Amazon RDS for SQL Server through to Snowflake, Salesforce, blob storage, REST APIs, and the Fabric Lakehouse and Warehouse themselves. Each connector has its own quirks, its own authentication options, and its own performance characteristics. The activity is the same; what changes is how you configure the two ends.

When I'm walking a client team through this for the first time, I tell them to stop thinking about it as one big complicated thing and instead think about three separate questions. Where is the data coming from? Where is it going? And what needs to happen to it in transit? Answer those three and the configuration falls out naturally.

Setting Up the Source

The source tab is where you tell the activity what to read. For a database source like Amazon RDS for SQL Server, you'll create a connection that holds the server address, the database name, and the credentials. My strong advice here, learned the hard way, is to use a proper credential store rather than typing passwords into the connection dialog. Fabric integrates with Azure Key Vault, and you should use it. The number of times I've seen a pipeline break because someone rotated a database password and forgot the connection had it hardcoded is too high to count.

Once the connection is in place you choose what to pull. You've got two broad options. You can select a table directly, which is the simplest path and fine for smaller datasets. Or you can write a query. For anything substantial, write a query. A query lets you filter at the source so you're only moving the rows you actually need, and that single decision often does more for pipeline performance than any amount of tuning downstream.

This is where incremental loading comes in, and it's the thing most teams underuse. Instead of copying the entire table every run, you copy only the rows that changed since last time, usually by filtering on a modified-date column or a watermark value. For a table with a few thousand rows it doesn't matter. For a table with tens of millions, full reloads every night are how you burn through your capacity and end up wondering why the bill is so high. We almost always build incremental logic into production copy pipelines, and we almost always have to retrofit it onto pipelines someone built quickly without thinking about scale.

Setting Up the Sink

The sink is the destination. If you're landing data in a Fabric Lakehouse, you'll write to a table or a file in OneLake. If it's a Warehouse, you're writing to a structured table. Cloud storage, another database, wherever.

The setting people overlook here is the write behaviour. Are you appending to the destination, or replacing it? Appending blindly is how you end up with duplicate rows after a re-run. Replacing is cleaner but throws away history. For incremental loads you usually want an upsert pattern, where existing rows get updated and new rows get inserted, keyed on some unique column. Fabric supports this, but you have to configure it deliberately. It is not the default, and the default will quietly let duplicates accumulate until someone notices the numbers don't add up.

For large loads into a Warehouse, there's a staging option worth knowing about. Fabric can stage the data in intermediate storage and then bulk-load it, which is dramatically faster than row-by-row inserts. If you're moving serious volume and it feels slow, check whether staging is switched on. It often isn't, and turning it on can take a load from forty minutes to five.

Mapping the Columns

Between source and sink sits the mapping. By default Fabric tries to auto-map columns by name, and for tidy schemas that works fine. Source has a column called CustomerID, sink has a column called CustomerID, they line up, done.

Real-world schemas are rarely that tidy. You'll have a source column called cust_id that needs to land in a column called CustomerKey. You'll have date fields stored as strings that need converting. You'll have columns in the source you don't want at all. The mapping tab is where you sort all of that out, explicitly pairing each source field to its destination and setting the type conversions.

My honest opinion: spend the time on the mapping even when auto-map seems to work. The failure mode of relying on auto-map is subtle. Everything runs fine until someone adds a column to the source table, the mapping shifts, and suddenly your data is landing in the wrong columns without a single error being thrown. Explicit mappings are a bit more work upfront and they save you from the worst kind of bug, the one that doesn't announce itself.

If you want help thinking through how all this fits into a broader data platform, our Microsoft Fabric consultants do exactly this kind of work day in and day out, and our dedicated Data Factory consulting team lives in these pipelines.

Performance, Reliability, and the Things That Bite

A few hard-won lessons from production.

Watch your degree of parallelism. The Copy activity can read and write in parallel, which is great for throughput but can also hammer your source system. If you crank parallelism up against a production database that's also serving live traffic, you can degrade the application you're copying from. We tune this carefully, and for on-premises sources behind a data gateway we tune it conservatively.

Set sensible timeouts and retries. Networks are unreliable, especially when you're pulling from on-premises systems or third-party APIs that occasionally have a bad five minutes. A Copy activity with a single attempt and no retry will fail the whole pipeline over a transient blip. Configure a couple of retries with a sensible interval and most of those transient failures heal themselves.

Use fault tolerance for messy data. If you're copying from a source where you know some rows are malformed, you can configure the activity to skip incompatible rows and log them rather than failing the entire run. This is a judgement call. Skipping bad rows silently can hide a real data quality problem, so when we use it we always log the skipped rows somewhere a human will actually look.

Monitor it. The pipeline run history in Fabric shows you rows read, rows written, throughput, and duration for every Copy activity. Get in the habit of glancing at this. A run that suddenly moves twice the usual number of rows, or takes three times as long, is telling you something changed. Catching that early beats finding out when a report breaks.

Where This Fits

The Copy activity is rarely the whole story. It's usually one step in a larger pipeline that might trigger downstream transformations, refresh a semantic model, or kick off notifications. But it's the foundation. Get your data movement solid and reliable, and everything you build on top of it inherits that reliability.

The mistake I see most often isn't technical. It's treating data movement as a one-off setup task rather than an ongoing piece of infrastructure that needs the same care as any other production system. A copy pipeline that runs every night for two years will outlive whatever urgent reason prompted you to build it, and the half hour you spend getting the incremental logic and error handling right will pay for itself many times over.

If you're standing up Fabric for the first time, or you've got pipelines that were built fast and now need to be made trustworthy, that's the kind of work we do. Have a look at how we approach data and AI platform work, or just get in touch and tell us what you're trying to move and where. Most of these conversations start with someone describing a nightly job that's become a liability, and they usually end with a pipeline nobody has to think about anymore.