Loading 1 TB of Parquet to a Fabric Warehouse with Staging - Real Cost Walkthrough

May 11, 2026 | 9 min read | Michael Ridland

The most common question I get from Australian finance directors when they are evaluating Microsoft Fabric is something like "is this going to be cheaper than what we have now, or are we just rearranging the deck chairs?" It is a fair question. Fabric pricing is not bad once you understand it, but the documentation talks in CU seconds and the CFO talks in dollars, and bridging that gap is part of the job.

Microsoft has a worked example in the Fabric docs for loading 1 TB of Parquet data from Azure Data Lake Storage Gen2 into a Fabric warehouse using a Copy activity with staging. The landed cost is about $13.37 USD at pay-as-you-go pricing in West US 2. That number is useful, but it needs some context before you can be honest with yourself about what it really tells you.

What is actually in that $13.37

The scenario uses a single Copy activity in a pipeline. The activity moves 1 TB of Parquet data from ADLS Gen2 to a Fabric warehouse, with staging enabled. Staging means the data lands in an intermediate storage area before being loaded into the warehouse tables. For larger loads or for loads where the destination cannot directly read the source format, staging is the recommended pattern.

The metrics:

  • 267,480 CU seconds consumed by the data movement operation
  • That works out to 74.3 CU-hours
  • At $0.18 per CU-hour the run costs about $13.37
  • The wall-clock duration was 1504 seconds, or just over 25 minutes
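
The arithmetic behind those metrics is easy to reproduce. The following is a minimal sketch that uses only the figures quoted above; the function name is illustrative and not part of any Fabric API.

```python
SECONDS_PER_HOUR = 3600


def cu_seconds_to_cost(cu_seconds: float, rate_per_cu_hour: float) -> float:
    """Convert consumed CU seconds into a cost at a given per-CU-hour rate."""
    cu_hours = cu_seconds / SECONDS_PER_HOUR
    return cu_hours * rate_per_cu_hour


# Figures from the worked example: 267,480 CU seconds at $0.18 per CU-hour.
cu_hours = 267_480 / SECONDS_PER_HOUR
cost = cu_seconds_to_cost(267_480, 0.18)
print(f"{cu_hours:.1f} CU-hours, ${cost:.2f}")
```

Note that the 1504-second duration never appears in the calculation, which is the point: cost follows CU seconds, not wall-clock time.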

A few things are worth noticing. The run duration is reported as a metric, but it does not directly drive cost. You pay for CU seconds, which already account for duration, so two runs that take different amounts of time can still cost the same. That is useful to know because it means you should not pay for faster compute unless you actually need the faster wall-clock outcome.

Also worth noticing: the activity run cost is null because there are no non-copy activities in this pipeline. If you add a lookup, a stored procedure call, or a notebook step before or after the copy, those will accumulate their own CU seconds on top of the data movement.

How this compares to the CSV scenario

The Microsoft docs have a similar scenario for loading 1 TB of CSV data into a Fabric Lakehouse, and the cost there is about $14.11. So loading the same volume as Parquet into a Warehouse with staging is actually slightly cheaper than loading as CSV into a Lakehouse. That is not because the Warehouse is magic. It is because Parquet is column-oriented and compressed, so the data movement work is more efficient even for nominally the same volume of data.
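
To put the gap between the two scenarios in perspective, here is a rough sketch of the difference per run and over a year. The daily cadence is an assumption for illustration only, and the two scenarios also differ in destination, so treat the result as indicative rather than a pure format comparison.

```python
parquet_cost = 13.37  # USD, 1 TB Parquet to Warehouse with staging
csv_cost = 14.11      # USD, 1 TB CSV to Lakehouse

saving_per_run = csv_cost - parquet_cost
saving_pct = saving_per_run / csv_cost * 100

# Assumption: one 1 TB load per day for a year (not from the Microsoft scenario).
runs_per_year = 365
annual_saving = saving_per_run * runs_per_year

print(f"${saving_per_run:.2f} per run ({saving_pct:.1f}%), ${annual_saving:.2f} per year")
```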

This is a small example of a pattern I see often with clients. The format of the data you are landing matters a lot more than the destination you are landing it in. When we work with Australian clients on Fabric Data Factory implementations, one of the early conversations is about whether the source systems can produce Parquet or whether they are stuck producing CSV. If they can produce Parquet, you usually save money on every ingest run for the lifetime of the platform.

Why staging actually matters

The Microsoft scenario specifies staging is enabled. That is doing something important.

Without staging, the Copy activity reads from the source and writes directly to the destination. For a warehouse destination, that means each row has to be inserted through the warehouse's transactional path. For 1 TB of data that is millions of small writes and the overhead is substantial.

With staging, the Copy activity first lands the data in an intermediate storage location (usually a Fabric-managed area), then issues a bulk load command to the warehouse to pull the data in. The bulk load path is significantly more efficient than per-row inserts. For large loads this is the pattern you want.

For small loads (anything under, say, 5 GB) you do not need staging. The overhead of the staging hop costs more in time than it saves. But for the kind of 1 TB scenario this pricing example describes, staging is essentially mandatory.

The trap to watch for here is that some clients leave staging off by default and only enable it for their "big" loads. Then a daily load that used to be 100 MB grows over six months to 80 GB and is suddenly running slow and consuming more capacity than expected. We had a client where this exact pattern was the cause of nightly pipeline failures. Six lines of configuration would have prevented it.
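
One way to catch that drift is a periodic check of load sizes against the staging rule of thumb. This is a hedged sketch: the 5 GB threshold is the rough figure mentioned above rather than a Fabric limit, and the load names, sizes, and the way you would obtain them are all hypothetical.

```python
STAGING_THRESHOLD_GB = 5.0  # rule of thumb from this article, not a platform limit


def should_enable_staging(load_size_gb: float,
                          threshold_gb: float = STAGING_THRESHOLD_GB) -> bool:
    """Return True once a load is large enough that staging is worth the extra hop."""
    return load_size_gb >= threshold_gb


def loads_needing_staging(loads: dict[str, tuple[float, bool]]) -> list[str]:
    """List loads that have crossed the threshold but still run without staging."""
    return [name for name, (size_gb, staged) in loads.items()
            if should_enable_staging(size_gb) and not staged]


# Hypothetical inventory: name -> (current size in GB, staging enabled?)
loads = {
    "daily_sales": (80.0, False),   # started life at roughly 100 MB
    "ref_data": (0.1, False),
    "ledger_full": (1000.0, True),
}
print(loads_needing_staging(loads))
```

Running a check like this on a schedule flags the load that quietly outgrew its original configuration before it starts failing overnight.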

The Australia East price difference

This is the question that always comes up. The pricing example uses West US 2 at $0.18 per CU-hour. What does it look like in Australia?

The honest answer is that Fabric capacity pricing varies by region and you should always check the Microsoft Fabric pricing page for your specific region before quoting any number to your finance team. Australia East and Australia Southeast both have their own per-CU-hour rates and reserved capacity options.

For most Australian production deployments, you also do not want to be running pay-as-you-go anyway. Reserved capacity is meaningfully cheaper if you can commit to a one-year or three-year term, and for any workload that runs daily that commitment is usually easy to justify. The pay-as-you-go pricing in the worked example is a unit cost demonstration, not a recommendation.
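
To compare the two purchasing models for your own region, a small calculator is enough. This sketch makes several assumptions that you should replace with real values: a 730-hour billing month, an F64 capacity chosen purely as an example, and a placeholder reserved discount that is not a quoted Microsoft price. Only the $0.18 rate comes from the worked example.

```python
HOURS_PER_MONTH = 730  # assumption: common monthly billing approximation


def monthly_payg_cost(capacity_units: int, rate_per_cu_hour: float) -> float:
    """Cost of running a capacity for a full month at a pay-as-you-go rate."""
    return capacity_units * rate_per_cu_hour * HOURS_PER_MONTH


def monthly_reserved_cost(payg_cost: float, reserved_discount: float) -> float:
    """Apply a fractional reserved-capacity discount to a pay-as-you-go cost."""
    return payg_cost * (1 - reserved_discount)


# Placeholder only: look up the real discount on the pricing page for your region.
ASSUMED_RESERVED_DISCOUNT = 0.40

payg = monthly_payg_cost(capacity_units=64, rate_per_cu_hour=0.18)
reserved = monthly_reserved_cost(payg, ASSUMED_RESERVED_DISCOUNT)
print(f"PAYG ${payg:,.2f}/month vs reserved ${reserved:,.2f}/month")
```

Swap in the per-CU-hour rate for Australia East or Australia Southeast and the actual reserved discount before showing any output to a finance team.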

So when you take this $13.37 number into your own planning, think of it as "an indication that data movement is not the expensive part of Fabric". The expensive part is everywhere else.

Where Fabric projects actually leak money

Across the Australian Fabric projects I have seen go well or poorly, the data movement cost has almost never been the issue. The patterns that hurt budgets are different.

Refresh schedules that are too aggressive. A daily refresh that needs to run within four hours of midnight is one set of CU consumption. A "near real time" refresh that runs every fifteen minutes is a very different number, and most business users cannot tell you why they need fifteen minutes instead of an hour.
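
The effect of cadence is simple multiplication, sketched below. The per-run CU figure is hypothetical, and the sketch assumes every run consumes the same amount regardless of cadence, which incremental refreshes may not.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(interval_minutes: int) -> int:
    """Number of scheduled runs in a day for a fixed refresh interval."""
    return MINUTES_PER_DAY // interval_minutes


def daily_cu_seconds(interval_minutes: int, cu_seconds_per_run: float) -> float:
    """Total CU seconds per day, assuming a constant cost per run."""
    return runs_per_day(interval_minutes) * cu_seconds_per_run


CU_SECONDS_PER_RUN = 1_000  # hypothetical figure for illustration

for label, interval in [("daily", 1440), ("hourly", 60), ("every 15 min", 15)]:
    total = daily_cu_seconds(interval, CU_SECONDS_PER_RUN)
    print(f"{label}: {runs_per_day(interval)} runs, {total:,.0f} CU seconds")
```

Under these assumptions, moving from hourly to fifteen-minute refreshes quadruples daily consumption for the same dataset.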

Dataflows Gen2 doing transformation work that should be in T-SQL. Dataflows are convenient and Power Query developers know them well, but the CU consumption per row of work is meaningfully higher than running the same transformation in a stored procedure or a notebook. For one-off transforms it does not matter. For high-volume daily transformations it adds up.

Semantic models with no aggregation strategy. A Power BI report against a 500 GB Fabric Warehouse with no aggregations defined will issue queries that scan a lot of data. Each of those scans consumes capacity. Designing the aggregations up front is one of the cheapest things you can do to keep a Fabric tenant running efficiently. This is something we cover in any serious Power BI consulting engagement.

Notebooks running on Spark compute when they could run on the SQL endpoint. Spark is great when you need it. For straightforward ETL work it is usually overkill, and the capacity overhead is real.

How to read your own Capacity Metrics App

The Microsoft worked example uses the Fabric Capacity Metrics App to attribute cost. That is the right tool for monitoring real workloads too. If you are running Fabric in production and you are not looking at the Capacity Metrics App weekly, you are flying blind.

The metric to focus on is CU seconds by operation type and by workload. That tells you whether your capacity is being consumed by Data Factory pipelines, Dataflows, Notebooks, semantic models, Lakehouse queries, or something else. Once you know where the consumption is, you can decide whether it is appropriate or whether something needs to be tuned.
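
If you export usage rows for offline analysis, the same breakdown takes a few lines. This sketch assumes you already have rows as (workload, CU seconds) pairs; that shape and every figure except the 267,480 pipeline run are hypothetical, and it does not call any Capacity Metrics App API.

```python
from collections import defaultdict


def cu_seconds_by_workload(records):
    """Sum CU seconds per workload and return them largest consumer first."""
    totals = defaultdict(float)
    for workload, cu_seconds in records:
        totals[workload] += cu_seconds
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Hypothetical export rows; only the 267,480 figure comes from the worked example.
records = [
    ("Pipeline", 267_480),
    ("Dataflow Gen2", 410_000),
    ("Notebook", 95_000),
    ("Pipeline", 12_500),
    ("Semantic model", 180_000),
]

totals = cu_seconds_by_workload(records)
for workload, cu_seconds in totals.items():
    print(f"{workload}: {cu_seconds:,.0f} CU seconds")
```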

One pattern worth setting up early: alerting on capacity throttling events. When your capacity exceeds its allocated CU for a sustained period, Fabric will throttle operations to bring it back in line. Throttling looks like "the platform is broken" to end users. Setting up a Teams alert when throttling occurs lets you respond before users start raising tickets.
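
The detection half of such an alert can be sketched as a check for consecutive over-limit samples. This is an early-warning heuristic under assumed inputs (a list of utilisation percentages sampled at regular intervals), not a reproduction of how Fabric itself decides to throttle, and delivering the Teams message is left out.

```python
def sustained_overage(utilisation_pct, limit_pct=100.0, min_consecutive=3):
    """Return True if utilisation stays above the limit for enough samples in a row."""
    streak = 0
    for value in utilisation_pct:
        # Extend the streak while over the limit, otherwise reset it.
        streak = streak + 1 if value > limit_pct else 0
        if streak >= min_consecutive:
            return True
    return False


# Hypothetical samples: a three-sample spike versus isolated blips.
spike = [90, 105, 110, 120, 80]
blips = [105, 90, 110, 95, 120]
print(sustained_overage(spike), sustained_overage(blips))
```

Tuning `limit_pct` and `min_consecutive` below the point where throttling actually starts gives you time to react before users notice.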

What a 1 TB scenario tells you about your real workload

The Microsoft example is one Copy activity moving one Parquet dataset once. Real Fabric workloads are dozens of pipelines, hundreds of refresh runs, thousands of report queries per day. The unit cost from the worked example is useful as a sanity check, but it does not size your capacity for production.

For Australian businesses sizing Fabric capacity from scratch, the pattern I would recommend is:

  1. Start with an F4 or F8 capacity for development and a smaller capacity for a pilot
  2. Run real workloads (not synthetic benchmarks) for two to three weeks
  3. Watch the Capacity Metrics App to understand peak consumption
  4. Right-size the production capacity based on actual peak data, with about 30% headroom
  5. Switch to reserved capacity once the sizing is stable
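
Step 4 can be sketched as a lookup against the F SKU ladder, where the number in the SKU name is its capacity units. The peak values here are hypothetical, and how you derive a single "peak CU" figure from the Capacity Metrics App is an assumption left to your own analysis.

```python
# F SKU sizes in capacity units (F2 = 2 CU, F4 = 4 CU, and so on).
F_SKUS = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]


def recommend_sku(peak_cu: float, headroom: float = 0.30) -> str:
    """Pick the smallest F SKU that covers observed peak CU plus headroom."""
    target = peak_cu * (1 + headroom)
    for capacity_units in F_SKUS:
        if capacity_units >= target:
            return f"F{capacity_units}"
    raise ValueError(f"Peak of {peak_cu} CU with headroom exceeds the largest SKU")


# Hypothetical pilot peaks: 6 CU, 20 CU and 50 CU.
for peak in (6, 20, 50):
    print(peak, recommend_sku(peak))
```

Because SKUs double at each step, a peak just over a boundary lands on a much larger capacity, which is one more reason to measure real workloads before committing to a reservation.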

If you skip the pilot and just size based on industry benchmarks or a sales engineer's spreadsheet, you will either over-provision (which is expensive but recoverable) or under-provision (which is painful and visible to users).

The bottom line

The $13.37 number in the Microsoft pricing example is real and it is genuinely cheap for moving 1 TB of Parquet data. It tells you that data movement on Fabric is not the part to worry about. The parts to worry about are everything that happens to the data after it lands, and the operational patterns that determine how often that movement happens.

For Australian businesses looking at Fabric for the first time, the right question is not "what will the pipelines cost" but "what is the all-in capacity cost of running my analytics workload at the scale I actually need". That requires a pilot, real workloads, and someone who has seen a few of these projects from inside.

If you would like help working through that, get in touch via our contact page.

Reference: Pricing scenario using a pipeline to load 1 TB of Parquet data to a data warehouse with staging.