How to Build Your First Data Pipeline with Data Factory - A Step-by-Step Walkthrough
If you have just been handed Azure Data Factory or Fabric Data Factory and asked to build "a pipeline," you are probably about to spend more time than you should on something that looks simple on the surface. The tutorials online tend to skip the bits that actually matter in production, like error handling, parameter design, and how to avoid hard-coding your way into a corner.
This post is the walkthrough I wish someone had handed me the first time I built a Data Factory pipeline. It is opinionated, based on what we have seen work and fail across dozens of engagements, and it covers both Azure Data Factory (ADF) and Fabric Data Factory because they overlap heavily but have a few differences worth knowing.
Decide Which Data Factory You Should Be Using
Before you even open the portal, there is a fork in the road.
Azure Data Factory is the mature standalone product. It has been around since 2015. It is the right tool if you are integrating data across Azure services and your destination is Azure SQL, Synapse, Cosmos DB, or Azure Data Lake. It supports SSIS package migration. It has a deep connector library.
Fabric Data Factory is the integrated version inside Microsoft Fabric. It is the right tool if your destination is a Fabric Lakehouse or Warehouse, or if your organisation has already committed to Fabric. The UI is similar but not identical. Fabric Data Factory has Dataflow Gen2 as a first-class transformation tool, which ADF does not.
If you are unsure, our Azure Data Factory vs Fabric Data Factory article covers the trade-offs in detail. For a first pipeline, either will work. We will note differences as we go.
What You Need Before You Start
For Azure Data Factory:
- An Azure subscription (a free trial is fine for learning)
- An Azure resource group
- A Data Factory instance, which takes about two minutes to create from the portal
- One source system to read from (Azure Blob Storage is the easiest starting point)
- One destination to write to (Azure SQL Database or Azure Data Lake Storage Gen2 are both fine for a first project)
- An Azure Key Vault is strongly recommended even for a first pipeline, because hard-coded credentials are a mistake you will not be able to undo cleanly later
For Fabric Data Factory:
- A Microsoft Fabric capacity (a trial capacity is enough)
- A Fabric workspace
- A source data system
- A Fabric Lakehouse or Warehouse as the destination
You will also need permissions. The most common reason a first pipeline fails is that the identity running the pipeline does not have access to the source or destination. Sort this out before you start, not after.
The Example We Will Build
To keep this concrete, we will build a pipeline that does something realistic:
- Reads CSV files from a folder in Azure Blob Storage
- Loads them into a staging table in Azure SQL Database
- Runs a stored procedure to merge staging into a target table
- Logs the run to a metadata table
- Sends a Teams message if anything fails
This is not the simplest possible pipeline. It is the simplest pipeline that is actually useful in production. If you are doing less than this, you might as well use a SQL Agent job.
Step 1 - Set Up Linked Services
Linked Services in Data Factory are the connection definitions. Think of them as connection strings with extra structure.
For our example, you need:
- A linked service for Azure Blob Storage (the source)
- A linked service for Azure SQL Database (the destination)
- Optionally, a linked service for Azure Key Vault (to hold credentials)
Create them through the Manage tab in ADF Studio, or the Workspace settings in Fabric. The key decisions here:
Authentication method. Use Managed Identity wherever possible. Account keys and SQL passwords work, but Managed Identity means no credential rotation and no secrets to leak. The Data Factory's system-assigned managed identity needs to be granted access on the source and destination resources. This takes 10 minutes to set up properly the first time and saves you hours of pain later.
Integration runtime. For Azure-to-Azure pipelines, the default AutoResolveIntegrationRuntime is fine. If your source is on-premises, you need a Self-Hosted Integration Runtime. We have a separate post on connecting Data Factory to on-premises sources if that applies to you.
Naming convention. Pick one and stick to it. Our default is ls_[type]_[purpose]. So ls_blob_source_csv and ls_sql_staging. You will create dozens of linked services eventually. Naming matters.
Step 2 - Create Datasets
Datasets are the schema and location definitions on top of a linked service. They tell Data Factory "this CSV file with these columns lives at this path."
For our example:
- A dataset for the source CSV files (DelimitedText format, pointing at the blob container)
- A dataset for the staging SQL table (Azure SQL Database table)
When you create the CSV dataset, parameterise the file path. Do not hard-code mycontainer/data/2026-05-02.csv. Make the date or filename a parameter. This is the single most important thing you can do early. Pipelines that take parameters are reusable. Pipelines with hard-coded paths are not.
For the SQL dataset, you can either point at a specific table or parameterise the table name. For a first pipeline, point at a specific table. Generic table loading is a pattern worth learning later, but it adds complexity that obscures the basics.
Step 3 - Build the Copy Activity
This is the actual data movement. In the Data Factory authoring canvas, drag a Copy data activity onto the design surface.
In the Source tab, select your CSV dataset and pass in the parameter for the filename or path. You will use dynamic content here. For example, @concat('inbound/', formatDateTime(utcnow(), 'yyyy-MM-dd'), '.csv') will read today's file.
In the Sink tab, select your staging SQL dataset. Set the write behavior to "Truncate" so the staging table starts clean each run.
In the Mapping tab, you can either let Data Factory auto-map the columns or explicitly map source to destination. Explicit mapping is more work but it fails loudly if the source schema changes. Auto-mapping fails silently, which is worse. For production pipelines, always use explicit mapping.
In the Settings tab, set a reasonable retry count (3 is fine) and a retry interval (30 seconds is fine).
Step 4 - Add the Stored Procedure Activity
After the Copy activity, add a Stored Procedure activity. This is where you merge staging into your target table.
Connect the Copy activity's success output (green arrow) to the Stored Procedure activity. This means the proc only runs if the copy succeeded.
Inside the Stored Procedure, you write standard T-SQL. A typical merge looks like this:
MERGE target_table AS t
USING staging_table AS s
ON t.id = s.id
WHEN MATCHED AND (
t.col1 <> s.col1 OR
t.col2 <> s.col2
) THEN UPDATE SET
t.col1 = s.col1,
t.col2 = s.col2,
t.updated_at = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
INSERT (id, col1, col2, created_at)
VALUES (s.id, s.col1, s.col2, GETDATE());
A few notes from experience:
- The MERGE statement above is intentionally simple. Real ones get longer. Try not to put complex business logic in the MERGE itself, because it gets hard to debug.
- Always handle the case where the source has duplicates. Either dedupe in the Copy activity using a Data Flow, or use a CTE inside the MERGE to dedupe first.
- Wrap the MERGE in a transaction with explicit error handling. If the MERGE fails halfway through, you want it rolled back, not partially applied.
Step 5 - Add Logging
Most tutorials skip this. It is the difference between a hobby pipeline and a production one.
Add a Lookup activity at the end that calls a stored procedure to insert a row into a pipeline_runs metadata table. Pass in the pipeline run ID using @pipeline().RunId, the start time using @pipeline().TriggerTime, the row count from the Copy activity output, and a status of 'Success'.
Now add a failure path. From both the Copy activity and the Stored Procedure activity, draw a failure arrow (red) to a Web activity that posts to a Teams webhook. Include the pipeline name, the run ID, the error message, and a link back to the run in the Data Factory monitoring UI.
This gives you two things you will appreciate within a month:
- A queryable history of every pipeline run, with row counts and timings
- An alert when something fails, with enough context to start diagnosing
Without these, you find out about failures when someone notices the dashboard is stale. With these, you find out within seconds.
Step 6 - Parameterise the Pipeline
Open the pipeline parameters and add a parameter called run_date of type String, with a default value of @utcnow().
Update your Copy activity's dynamic file path to use this parameter instead of utcnow() directly. The change is small. The benefit is that you can now manually trigger the pipeline for a specific date if you ever need to backfill a missed run.
We see this skipped constantly, and it always comes back to bite teams when they need to reprocess data. Parameterise from day one.
Step 7 - Set Up the Trigger
Triggers control when the pipeline runs. The four types:
- Schedule trigger: cron-like, runs on a recurrence
- Tumbling window trigger: time-bucketed, with built-in dependency and retry handling
- Storage event trigger: fires when a blob is created or deleted
- Custom event trigger: fires on Event Grid events
For a first pipeline, a Schedule trigger is usually right. Set it to run at the cadence you need. If you are loading daily files, run it once a day at a fixed time. Do not run it every 15 minutes "just in case." Empty runs cost money and clutter the monitoring view.
If you need event-based triggering (file arrives in blob storage, pipeline kicks off immediately), use a Storage event trigger. These work well but watch for cases where multiple files arrive in quick succession and trigger overlapping runs.
Step 8 - Test It
Run the pipeline manually using the Debug button before you publish. Watch the activity outputs in the lower pane. Verify:
- The Copy activity reports the row count you expect
- The Stored Procedure activity completes without error
- The metadata logging row is inserted
- A deliberately broken file triggers the Teams alert
Test failure paths explicitly. Drop a malformed CSV file in the source folder and run the pipeline. Make sure the failure path fires and the alert lands in Teams. If you only ever test the happy path, you will not know your error handling is broken until production.
Step 9 - Source Control
This is the step everyone wants to skip. Don't.
Connect your Data Factory to an Azure DevOps Git repo or a GitHub repo. In ADF, this is done from the Manage tab. In Fabric Data Factory, source control integration is improving but still less mature than ADF, so use it where it is available.
Once connected, all your changes are saved to the repo as JSON. You can branch, do pull requests, review changes, and roll back. Without source control, the only history you have is whatever the audit log gives you, which is not enough to recover from a bad change.
Common Mistakes on a First Pipeline
A handful of patterns we see almost every time someone is building their first Data Factory pipeline:
Hard-coding paths and table names. Already covered, but it bears repeating. Parameterise from the start.
Using account keys instead of Managed Identity. Fine for a demo, painful for production. Use Managed Identity.
No source control. You will eventually break something and need to roll back. Without Git, this is much harder.
Using a Web activity for everything. Web activities are useful, but if you find yourself making API calls to Azure services from a Web activity when there is a built-in activity for the same purpose, use the built-in activity. It handles authentication and retries better.
Building one giant pipeline. Pipelines should be small and composable. If your pipeline has 30 activities, split it into multiple pipelines and call them with Execute Pipeline activities.
Ignoring the Integration Runtime sizing. For Azure-to-Azure copies, the default is fine. For self-hosted IRs, sizing matters. If your on-prem source is slow, the IR might be the bottleneck.
Not setting a timeout on activities. Default is 7 days. A Copy activity that runs for 7 days is a Copy activity that has been stuck for 7 days and is racking up cost. Set realistic timeouts, usually 1 to 2 hours for a Copy activity.
Cost Awareness
For a first pipeline like the one above, running daily on small files (a few hundred MB), you are looking at roughly $5 to $30 AUD per month in Data Factory orchestration costs, plus whatever the SQL Database costs separately.
The pricing has two main components: pipeline orchestration activity runs (cheap, a few cents per 1000 activity runs) and data movement (priced per DIU-hour). Most first pipelines are dominated by the DIU-hour cost on the Copy activity. For small data volumes, this is negligible. For larger volumes (hundreds of GB per day), it becomes worth tuning.
Our Data Factory implementation cost guide for Australia covers cost optimisation in more depth.
When You Should Bring in a Consultant
You can absolutely build a first pipeline yourself. The Microsoft documentation is solid and the platform is approachable. We typically get called in when:
- The first pipeline becomes 50 pipelines and the team realises they need a framework, not just a collection of one-off jobs
- The data volumes have grown past what the original design supports and performance is dropping
- There are compliance or security requirements (PII handling, data residency) that need a more rigorous architecture
- The team has built something that works but is hard to maintain, and they want a review and refactor
If you are starting out and stuck, you might also just want a few hours of advice. We do short engagements where we walk a team through their architecture, point out the issues, and leave them with a clear path forward. Not every problem needs a 12 week project.
If you would like a hand getting your Data Factory environment set up properly, or you want a second opinion on an existing setup, get in touch with our Microsoft Data Factory consultants. You can also see our other data and AI services or contact us directly. We work with organisations across Sydney, Melbourne, Brisbane, and elsewhere in Australia.