Power BI Dataflows - How to Develop Solutions That Actually Scale
Most Power BI projects start small. A few reports, a direct connection to a SQL database, maybe a couple of imported datasets. But the moment an organisation starts sharing data across teams - or needs to apply consistent business logic to a dozen different reports - things get complicated fast.
That's where dataflows come in. Power BI dataflows are essentially a data prep layer that sits between your raw data sources and your semantic models. Think of them as reusable, centrally managed data transformations. Instead of every report author writing their own version of "clean up the customer table and apply region mappings," you define it once in a dataflow and everyone references the same output.
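To make the "define once, reference everywhere" idea concrete, here's a minimal sketch in plain Python (not Power Query M, and nothing to do with Power BI's actual engine - the table shape and region codes are invented):

```python
# Conceptual sketch only: one shared "dataflow" transform, consumed by
# several reports instead of each report re-implementing it.

REGION_MAP = {"N": "NSW", "V": "VIC", "Q": "QLD"}  # hypothetical mapping

def clean_customers(rows):
    """Centralised prep: drop keyless rows, tidy names, map regions."""
    cleaned = []
    for row in rows:
        if not row.get("customer_id"):
            continue  # no key, no row
        cleaned.append({
            "customer_id": row["customer_id"],
            "name": row["name"].strip().title(),
            "region": REGION_MAP.get(row["region_code"], "UNKNOWN"),
        })
    return cleaned

raw = [
    {"customer_id": 1, "name": "  alice smith ", "region_code": "N"},
    {"customer_id": None, "name": "ghost", "region_code": "V"},
]

# Every downstream report references the same cleaned output.
customers = clean_customers(raw)
```

The point is the shape of the solution, not the code: the cleaning logic lives in exactly one place, and every consumer gets the same answer.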
We've implemented dataflows across a range of Australian organisations at Team 400, and I want to share some honest observations about what works, where the friction is, and how to get the most out of them.
Premium vs Pro - the real difference
Let me be blunt about this: dataflows on a Pro licence are limited. They work for small-scale, straightforward scenarios, but the features that make dataflows genuinely useful in enterprise settings are Premium-only.
Here's what you get with Premium that you don't get with Pro:
- Enhanced compute engine - accelerates ETL performance and enables DirectQuery against dataflows
- Incremental refresh - only loads data that has changed since the last refresh
- Linked entities - reference tables from other dataflows
- Computed entities - perform in-storage computations on tables already in the dataflow, so you can chain transformation steps together
Incremental refresh alone is worth the Premium investment for any dataflow touching a large data source. Without it, every refresh reloads the entire dataset from scratch. For a table with millions of rows, that's the difference between a 5-minute refresh and a 45-minute one.
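For intuition, here's a rough Python sketch of the watermark idea behind incremental refresh. Power BI manages all of this internally, so the function and field names below are purely illustrative:

```python
# Sketch: only rows modified since the last successful refresh are
# reloaded and merged (upserted) into the cached partition.

from datetime import datetime

def incremental_refresh(source_rows, cached_rows, last_refresh):
    """Upsert only the rows changed since last_refresh."""
    changed = [r for r in source_rows if r["modified"] > last_refresh]
    by_id = {r["id"]: r for r in cached_rows}
    for r in changed:
        by_id[r["id"]] = r  # replace the stale cached row
    return list(by_id.values()), changed

source = [
    {"id": 1, "value": 10, "modified": datetime(2024, 1, 1)},
    {"id": 2, "value": 99, "modified": datetime(2024, 3, 1)},  # changed
]
cache = [
    {"id": 1, "value": 10, "modified": datetime(2024, 1, 1)},
    {"id": 2, "value": 20, "modified": datetime(2024, 1, 1)},
]
merged, changed = incremental_refresh(source, cache, datetime(2024, 2, 1))
```

Only one of the two source rows crosses the watermark, so only one row moves - that's the whole trick, and it's why refresh time scales with the change volume rather than the table size.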
Which Premium model do you choose?
Microsoft offers two Premium paths: Premium per capacity and Premium per user (PPU). The decision comes down to team size and how broadly you need to share content.
For organisations with more than 5,000 users, Premium per capacity is the straightforward choice. You assign a capacity to your workspace, and individual users bring their own Pro licences. Consumers in Power BI Desktop don't need special licences to connect to and build on the dataflows.
For smaller teams, PPU can make sense. But here's the catch that trips people up: PPU is an all-or-nothing environment. Every person who creates or consumes PPU content needs a PPU licence. You cannot mix PPU with Free, Pro, or even Premium capacity content. I've seen organisations adopt PPU for a data team of 15 people, only to discover they can't share the resulting reports with the 200-person sales team on Pro licences without migrating the workspace.
My recommendation: if you're a smaller team that needs Premium features and you're confident your audience will stay within the PPU boundary, go for it. If there's any chance you'll need to share broadly, invest in a Premium capacity instead. The flexibility is worth it.
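That recommendation can be encoded as a trivial helper. The rule and the threshold below are just this article's rule of thumb, not official Microsoft guidance:

```python
def recommend_licence(total_users, needs_premium_features, may_share_broadly):
    """Rough rule of thumb for picking a Premium path.
    Thresholds are illustrative, not official Microsoft guidance."""
    if not needs_premium_features:
        return "Pro"
    if may_share_broadly or total_users > 5000:
        return "Premium per capacity"
    return "Premium per user (PPU)"  # only if the PPU boundary will hold

recommend_licence(15, True, False)   # small, self-contained data team
recommend_licence(215, True, True)   # must reach a Pro-licensed sales team
```

The second case is exactly the trap described above: a 15-person data team on PPU that later needs to reach a broader Pro audience should have been on capacity from day one.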
Security patterns - linked entities vs computed entities
This is where things get interesting. One of the most common requests we see at Team 400 is "we want to centralise data preparation but control who sees what."
The obvious approach is to use linked entities. Create your master dataflows in a back-end workspace, then create linked entities in user-facing workspaces. Users get viewer access to the consumer workspace, no access to the back-end workspace. Clean separation.
Except it doesn't work the way you'd expect.
Linked entities are pointers, not copies. They inherit permissions from the source. So if a user connects to a linked entity through Power BI Desktop, they'll see the dataflow in the Navigator but the tables will appear empty. The permissions haven't changed just because you put a pointer in a different workspace. Microsoft designed it this way deliberately - if linked entities used destination permissions, anyone could bypass source security by creating a linked entity pointing to restricted data.
The fix - computed entities
If you're on Premium, computed entities solve this problem properly. A computed entity in the destination workspace creates an actual copy of the data from the linked entity. You can then filter rows, remove columns, and shape the output however you need. Users with permission on the destination workspace access data through the computed entity, and they never need access to the source.
Here's a practical example. Say you have a national sales dataflow in a central workspace. You create three user workspaces - NSW, VIC, and QLD - each with computed entities that filter the sales data to the relevant state. The NSW team sees NSW data. They don't even know the national dataset exists unless they have privileged access to the lineage view.
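In data terms, the pattern looks roughly like this (plain Python standing in for the dataflow engine; table shapes are invented):

```python
# Sketch of the computed-entity pattern: each state workspace
# materialises a filtered copy of the national table, and its
# consumers never touch the source.

national_sales = [
    {"order_id": 1, "state": "NSW", "amount": 100},
    {"order_id": 2, "state": "VIC", "amount": 250},
    {"order_id": 3, "state": "NSW", "amount": 75},
]

def computed_entity(source, state):
    """Materialise a per-state copy of the source table."""
    return [row for row in source if row["state"] == state]

nsw_sales = computed_entity(national_sales, "NSW")  # what the NSW team sees
```

Because the output is a real copy, workspace permissions on the destination are all the NSW team ever needs.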
This pattern works well. The main thing to watch out for is refresh scheduling. Because computed entities copy data, they need to refresh after the source dataflow completes. You end up with a dependency chain: source refreshes at 2am, downstream computed entities refresh at 3am. Build in enough buffer time, and test your refresh windows with production-scale data - not just your dev sample.
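A simple way to reason about those refresh windows is to lay the chain out with explicit buffers. The durations and names below are illustrative, and in practice you'd measure real refresh times against production-scale data first:

```python
# Sketch: stagger downstream computed-entity refreshes after the source
# dataflow, leaving a safety buffer on top of each expected duration.

from datetime import datetime, timedelta

def schedule_chain(start, steps, buffer_minutes=30):
    """Return (name, start_time) pairs; each step starts after the
    previous step's expected duration plus a safety buffer."""
    schedule = []
    t = start
    for name, expected_minutes in steps:
        schedule.append((name, t))
        t += timedelta(minutes=expected_minutes + buffer_minutes)
    return schedule

chain = schedule_chain(
    datetime(2024, 1, 1, 2, 0),  # source kicks off at 2am
    [("national_sales", 40), ("nsw_computed", 10), ("vic_computed", 10)],
)
```

If the source overruns its expected 40 minutes plus buffer, every downstream entity refreshes against stale data - which is why the buffer should come from measured worst-case timings, not optimism.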
Reducing refresh times - the enable load trap
Here's something that burned us on a project early on, and I see other teams make the same mistake.
When you have a large dataflow with tables that are used as building blocks for other tables, it's tempting to disable load on the intermediate tables. The thinking goes: "I don't need this table in the output, it's just an ingredient for the final calculation." Disabling load means Power BI won't evaluate that query as a standalone table, which should save time.
In practice, disabling load on referenced tables can make things slower.
When a table has load disabled but is referenced by other tables, Power BI treats it as if no cached result exists. Instead of looking up the pre-computed output, the engine re-evaluates the entire query chain from scratch every time. A join between two disabled-load tables becomes a join of two full data source queries, and you lose the query folding and compute engine optimisations that Premium provides.
The better approach: keep enable load on, make sure the compute engine in your Premium capacity is set to Optimised (which is the default), and let Power BI cache and optimise the intermediate results. You also preserve full lineage visibility, which matters when you're troubleshooting refresh failures at 3am.
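Here's a toy illustration of the difference. The counter stands in for an expensive source query; nothing below is real Power BI behaviour, just the caching principle:

```python
# Without a cached intermediate result, every downstream reference
# re-runs the whole upstream query; with caching, it runs once.

evaluations = {"count": 0}

def expensive_source_query():
    evaluations["count"] += 1  # pretend this hits the database
    return [{"id": i, "value": i * 2} for i in range(5)]

# "Enable load off": each downstream table re-evaluates the source.
a = [r for r in expensive_source_query() if r["id"] % 2 == 0]
b = [r for r in expensive_source_query() if r["value"] > 4]
uncached_runs = evaluations["count"]  # two full evaluations

# "Enable load on": evaluate once, downstream tables reuse the cache.
evaluations["count"] = 0
cached = expensive_source_query()
a = [r for r in cached if r["id"] % 2 == 0]
b = [r for r in cached if r["value"] > 4]
cached_runs = evaluations["count"]  # one evaluation
```

Two downstream references double the source cost in the uncached case; with a dozen references against a slow source, the multiplier gets painful fast.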
DirectQuery against dataflows
This one is worth calling out because it solves a genuine operational headache.
With DirectQuery enabled against a dataflow, you don't need to import data into your semantic model at all. Your reports query the dataflow directly. This means:
- No separate refresh schedule for the semantic model. The dataflow refreshes, and your reports see the updated data automatically.
- Smaller semantic models, because you're not duplicating data into an import model.
- Filtering happens at query time, so you can work with subsets of a large dataflow without importing everything.
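The trade-off in miniature: an import model copies everything up front, while DirectQuery ships the filter to the source at query time (plain Python, invented data):

```python
# Sketch of import vs DirectQuery: full copy up front versus
# filtering at query time against the source.

warehouse = [{"id": i, "region": "NSW" if i % 2 else "VIC"} for i in range(10)]

def import_model(source):
    return list(source)  # full copy duplicated into the model

def directquery(source, predicate):
    return [r for r in source if predicate(r)]  # filter runs at the source

imported = import_model(warehouse)  # all 10 rows, refreshed on a schedule
nsw_only = directquery(warehouse, lambda r: r["region"] == "NSW")
```

The import path pays its cost once per refresh; the DirectQuery path pays per query, which is exactly why the capacity sizing caveat below matters.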
To enable DirectQuery, you need the enhanced compute engine (ECE) in your Premium capacity set to On (not just Optimised - it has to be explicitly On). When you first enable it, you'll need to trigger a refresh so Power BI writes the data to both storage and the managed SQL engine. That initial refresh will be slower than usual. Subsequent refreshes are normal.
One thing to be aware of: DirectQuery performance depends on the ECE's capacity. If you're on a small Premium SKU and you have 50 users hitting DirectQuery reports simultaneously, you might find the queries slower than a well-tuned import model. Test with realistic concurrency before committing to DirectQuery for a high-traffic dashboard.
When dataflows make sense (and when they don't)
Dataflows are a good fit when:
- Multiple reports need the same prepared data
- You want to separate data preparation from report building
- Different teams need different slices of the same source data
- You need to apply consistent business logic across the organisation
Dataflows are probably overkill when:
- You have one report connected to one data source
- Your data prep is minimal (a few column renames and type changes)
- You're prototyping something quickly and don't need reusability yet
I've seen organisations try to put everything into dataflows because it seemed like the "right" architecture. It added complexity without adding value. Start with direct connections, and move to dataflows when you hit the point where multiple teams are duplicating the same data preparation work. That's the inflection point where the investment pays off.
Getting started
If you're exploring dataflows for the first time, Microsoft's official documentation on developing dataflow solutions covers the full feature set in detail.
For organisations that want help designing a dataflow architecture that fits their specific data landscape - especially around security patterns and refresh optimisation - get in touch with our team. We've done this enough times to know where the common pitfalls are, and more importantly, how to avoid them.