Moving Cosmos DB for NoSQL Data into Microsoft Fabric with Data Factory
A lot of Australian businesses end up with operational data sitting in Azure Cosmos DB and analytical ambitions sitting in Microsoft Fabric, and the gap between the two is where a surprising amount of consulting time goes. Cosmos DB is brilliant at what it does. It serves your app fast, at scale, with flexible JSON documents that don't need a fixed schema. But the moment someone wants to report across that data, join it to finance numbers, or feed it into a model, you need to get it out of Cosmos and into something built for analytics. That's where the Cosmos DB for NoSQL connector in Fabric Data Factory comes in.
We've done this move a few times now, and it's mostly straightforward once you understand what the connector is and isn't. Let me walk through how it works, where it fits, and the things that have tripped us up so you can skip the same lessons.
What the connector is for
Microsoft Fabric's Data Factory is the data movement and transformation engine inside Fabric. It's the part that pulls data from wherever it lives into your lakehouse or warehouse so the rest of Fabric (Power BI, notebooks, the SQL endpoint) can do something with it.
The Azure Cosmos DB for NoSQL connector is the specific bridge between your Cosmos DB account and Fabric. It lets you read documents out of a Cosmos container and write them into your Fabric environment, and it can also write back into Cosmos if you need to. In practice the read direction is what most people want: take the JSON documents your application has been generating, land them in a lakehouse, and turn them into tables you can actually analyse.
You use it in two main ways. A copy activity in a data pipeline is the workhorse, good for bulk movement of data on a schedule. A Dataflow Gen2 is the option when you want to shape and clean the data with a more visual, Power Query-style approach as you bring it across. Which one fits depends on the job, and I'll get to that.
This kind of plumbing is a chunk of what our Microsoft Fabric consultants do day to day, because getting data into Fabric cleanly is the unglamorous foundation everything else sits on.
How a typical move looks
Say you've got a Cosmos DB account holding customer activity, with each document a JSON record of an event. You want that in a Fabric lakehouse so the analytics team can join it to sales and build a churn model.
You start by setting up a connection to your Cosmos DB account. Fabric needs to know how to reach it and how to authenticate, which usually means an account key or, better, an Entra ID identity. Once the connection's in place, you point a copy activity at the container you want, set the lakehouse as the destination, and run it.
The connector reads the documents and lands them in your lakehouse, where they become queryable. From there the analytics team works in the lakehouse rather than hitting the live Cosmos database, which is exactly what you want. You're not putting analytical load on the database that's serving your application, and that separation is half the reason you're doing this in the first place.
For an initial load you typically pull the whole container. For ongoing movement you set up a schedule and bring across what's changed, so you're not re-copying everything every night. How cleanly you can do incremental loads depends on how your documents are structured, which is one of the gotchas worth its own section below.
Copy activity or Dataflow Gen2
People ask which to use, and the honest answer is it depends on what you need to do to the data on the way through.
If you mostly want to move documents from A to B on a schedule, with little or no transformation, the copy activity in a pipeline is the right tool. It's built for throughput and it handles large volumes well. You point it at the source, point it at the destination, schedule it, and it runs.
If you need to reshape the data as it lands (flatten nested JSON, rename fields, filter out junk, combine it with another source), Dataflow Gen2 gives you a visual transformation surface that's friendlier for that work. The trade-off is that it's generally better suited to moderate volumes and shaping work than to moving enormous datasets as fast as possible.
In real projects we often use both. A copy activity does the heavy lifting of getting raw documents into the lakehouse, and then transformation happens after landing, either in a dataflow or in notebooks. Landing raw first and transforming second is usually the more maintainable pattern, because your raw data stays intact and you can rebuild your transformations without re-hitting Cosmos.
The gotchas worth knowing
Here's where the experience pays off, because the connector is straightforward but document data has its own personality.
Schema drift is the big one. Cosmos DB doesn't enforce a schema. That's a feature for your application and a headache for analytics. Two documents in the same container can have different fields, different nesting, different types for the same field name. When you pull that into a lakehouse, which wants something more tabular, the mismatch surfaces. You have to decide how to handle documents that don't all look the same, and if you ignore this you'll get a load that either fails or quietly drops fields you cared about. Look at your actual documents before you assume they're uniform. They usually aren't.
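One cheap way to look at your actual documents is to profile a sample before you design any tables. The sketch below is plain Python over a list of dicts, the sort of thing you could run in a notebook against a sample you've already pulled. The function names and the sample documents are made up for illustration, and it only inspects top-level fields.

```python
from collections import defaultdict


def profile_fields(documents):
    """Record, per top-level field, the type names seen and how many documents carry it."""
    types_seen = defaultdict(set)
    presence = defaultdict(int)
    for doc in documents:
        for field, value in doc.items():
            types_seen[field].add(type(value).__name__)
            presence[field] += 1
    return dict(types_seen), dict(presence)


def find_drift(documents):
    """Return the fields that are missing from some documents or carry mixed types."""
    types_seen, presence = profile_fields(documents)
    total = len(documents)
    drift = {}
    for field, type_names in types_seen.items():
        issues = []
        if presence[field] < total:
            issues.append(f"present in {presence[field]} of {total} documents")
        if len(type_names) > 1:
            issues.append("mixed types: " + ", ".join(sorted(type_names)))
        if issues:
            drift[field] = issues
    return drift


# Hypothetical sample: same container, two shapes of document.
sample = [
    {"id": "1", "customerId": 42, "amount": 19.95},
    {"id": "2", "customerId": "42", "amount": 5, "channel": "web"},
]

for field, issues in find_drift(sample).items():
    print(field, "->", "; ".join(issues))
```

On this sample it flags customerId and amount for mixed types and channel for being absent from one document. A report like that is usually enough to start the conversation about which fields get a fixed type and which get left as raw JSON.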
Nested JSON needs flattening. Cosmos documents are happily nested: objects inside objects, arrays of things. Analytics tools generally want flatter tables. Part of the job is deciding how to unpack that nesting: which arrays to explode into rows, and which nested objects to pull up into columns. This is real design work, not a checkbox, and getting it wrong early means painful rework later.
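To make the two decisions concrete, here's a minimal sketch in plain Python: one function pulls nested objects up into columns, the other explodes a chosen array into one row per element. The helper names, the underscore separator and the sample order document are all assumptions for illustration, not anything the connector does for you.

```python
def flatten(doc, parent_key="", sep="_"):
    """Pull nested objects up into columns; arrays are left as they are."""
    flat = {}
    for key, value in doc.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def explode(doc, array_field):
    """Emit one flattened row per element of array_field, repeating the parent columns."""
    parent = {k: v for k, v in doc.items() if k != array_field}
    parent_flat = flatten(parent)
    rows = []
    for item in doc.get(array_field, []):
        row = dict(parent_flat)
        if isinstance(item, dict):
            # Prefix child columns with the array name to avoid clashing with parent columns.
            row.update(flatten(item, array_field))
        else:
            row[array_field] = item
        rows.append(row)
    return rows


# Hypothetical order document with one nested object and one array.
order = {
    "id": "o-1",
    "customer": {"id": "c-9", "address": {"city": "Brisbane"}},
    "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
}

rows = explode(order, "items")
for row in rows:
    print(row)
```

Note the design choice hiding in explode: a document with an empty or missing array produces no rows at all. Whether that's right, or whether you want one row with empty item columns, is exactly the kind of decision worth making deliberately.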
Request units and cost. Cosmos DB meters every operation in request units (RUs), and a big read against a large container consumes a lot of them. If you fire off a full copy during business hours you can compete with your live application for throughput and either slow the app down or run up a bill. Schedule big loads for quiet periods, and work out your incremental strategy so you're not re-reading the whole container constantly. This is the sort of thing we plan up front on Data Factory work, because a naive setup can cost real money on a busy account.
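A back-of-envelope estimate helps before you schedule anything. The sketch below is only arithmetic: every input is a placeholder you'd replace with your own figures, and the RUs-per-document number in particular is an assumption you'd need to measure on your own container, because it varies with document size and how the data is read.

```python
import math


def estimate_full_read(doc_count, ru_per_doc, provisioned_ru_per_sec, share_for_copy):
    """Return (total RUs, minimum seconds) for reading a whole container.

    share_for_copy is the fraction of provisioned throughput you are willing
    to let the copy use, leaving the rest for the live application.
    """
    if not 0 < share_for_copy <= 1:
        raise ValueError("share_for_copy must be in (0, 1]")
    total_ru = doc_count * ru_per_doc
    budget_per_sec = provisioned_ru_per_sec * share_for_copy
    return total_ru, math.ceil(total_ru / budget_per_sec)


# Placeholder figures for illustration only, not measurements.
total_ru, seconds = estimate_full_read(
    doc_count=10_000_000,
    ru_per_doc=0.5,
    provisioned_ru_per_sec=4_000,
    share_for_copy=0.25,
)
print(f"{total_ru:,.0f} RUs, at least {seconds / 60:.0f} minutes")
```

With those made-up inputs the full read needs 5,000,000 RUs and at least 5,000 seconds, a bit over 83 minutes, if the copy is held to a quarter of the throughput. The point isn't the number, it's that the sum takes a minute to do and tells you whether a nightly full reload is even plausible.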
Incremental loading isn't free. Bringing across only what changed sounds simple, but it depends on having something reliable you can filter on, like a timestamp or a change marker. Cosmos DB stamps every document with a system _ts last-modified value, which helps, but filtering on it won't tell you about deleted documents, and a timestamp your application writes itself is only as reliable as the code that sets it. If you don't have a dependable way to identify what's new or changed, you're stuck either re-loading everything or doing more engineering to track changes. Worth checking before you promise anyone a tidy nightly incremental.
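The usual shape of an incremental load is a high-water mark: remember the latest change you've seen, and next run ask only for documents newer than that. The sketch below simulates it locally in plain Python, using the _ts system property that Cosmos DB sets on each document (last-modified time in epoch seconds). The query string shows the kind of filter you'd aim for; how you pass the watermark into a pipeline is setup-specific and not shown, and the function name and sample documents are hypothetical.

```python
# The kind of filter an incremental run would use; the watermark value
# would be supplied by whatever tracks state between runs (an assumption here).
INCREMENTAL_QUERY = "SELECT * FROM c WHERE c._ts > @watermark"


def select_changed(documents, last_watermark):
    """Return documents modified after last_watermark, plus the new watermark."""
    changed = [doc for doc in documents if doc["_ts"] > last_watermark]
    # If nothing changed, keep the old watermark rather than resetting it.
    new_watermark = max((doc["_ts"] for doc in changed), default=last_watermark)
    return changed, new_watermark


# Hypothetical container contents.
container = [
    {"id": "a", "_ts": 1_700_000_000},
    {"id": "b", "_ts": 1_700_000_500},
    {"id": "c", "_ts": 1_700_001_000},
]

changed, watermark = select_changed(container, last_watermark=1_700_000_000)
print([doc["id"] for doc in changed], watermark)
```

Two caveats the sketch makes visible. A deleted document simply stops appearing, so this pattern never tells the lakehouse to remove it. And because _ts is in whole seconds, a strict greater-than can miss a document written in the same second as the previous watermark, so some teams use greater-than-or-equal and de-duplicate on id instead.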
Authentication setup. Account keys are the quick way to get going, but for anything heading to production you want an Entra ID identity with the right permissions instead. It's a bit more setup and worth doing properly, because account keys scattered through pipelines are exactly the sort of thing that comes back to bite you in a security review.
Where this fits in the bigger picture
Pulling Cosmos data into Fabric is rarely the whole project. It's usually one source among several. The point of landing it in a lakehouse is that, once it's there, it sits alongside everything else (your finance data, your CRM, your operational systems) in one place where you can join across all of it and build reporting or models on top.
That's the real value. Cosmos DB on its own gives you fast operational access to one slice of data. Getting it into Fabric is what lets you ask questions that span your whole business instead of one application's database. The connector is the unglamorous bit of plumbing that makes the interesting work possible, and like most plumbing, you only notice it when it's done badly.
If you're at the stage of designing a Fabric setup and wondering how to bring operational data from Cosmos and elsewhere together sensibly, that's squarely the kind of thing we help with. Have a look at how we approach business intelligence, or get in touch and we can talk through your data sources and what a clean Fabric foundation would look like.
Reference: Azure Cosmos DB for NoSQL connector overview, Microsoft Fabric Data Factory documentation.