Back to Blog

How to Connect Data Factory to On-Premises Data Sources

April 21, 202611 min readMichael Ridland

Despite the cloud migration wave, most Australian businesses we work with still have critical data sitting on-premises. SQL Server databases, Oracle instances, SAP systems, network file shares, AS/400 systems - the data that runs your business often hasn't moved to the cloud yet.

Connecting Data Factory to these on-premises sources is one of the most common requirements we see, and one of the most common sources of project delays. The technology works well once it's set up correctly, but getting through the networking, security, and infrastructure requirements takes more effort than most teams expect.

Here's how to do it properly.

The Two Options for On-Premises Connectivity

Option 1 - Self-Hosted Integration Runtime (ADF)

The self-hosted integration runtime (SHIR) is an agent you install on a Windows machine inside your network. It acts as a bridge between Data Factory in the cloud and your on-premises data sources.

How it works:

  1. You install the SHIR software on a Windows server (physical or VM) that can reach your on-premises data sources
  2. The SHIR registers with your Azure Data Factory instance
  3. When a pipeline needs on-premises data, Data Factory sends the instruction to the SHIR
  4. The SHIR connects to the on-premises source, extracts the data, and pushes it to the cloud destination
  5. All communication is outbound from the SHIR to Azure (HTTPS on port 443) - no inbound firewall rules needed

This is the recommended approach for Azure Data Factory. It's mature, well-documented, and handles most on-premises connectivity scenarios.

Option 2 - On-Premises Data Gateway (Fabric)

Fabric Data Factory uses the on-premises data gateway (the same gateway used by Power BI, Power Apps, and Logic Apps). It works similarly to the SHIR but is designed for the Microsoft 365 and Fabric ecosystem.

Key differences from SHIR:

  • Shared across Fabric, Power BI, Power Apps, and Logic Apps
  • Managed through the Power Platform admin centre
  • Supports gateway clusters for high availability
  • Different connector support (aligned with Power Query connectors)

If you're using Fabric Data Factory, the on-premises data gateway is your only option for on-premises connectivity.

Setting Up the Self-Hosted Integration Runtime

Infrastructure Requirements

Hardware (minimum):

  • 4 CPU cores
  • 8 GB RAM
  • 80 GB disk space

Hardware (recommended for production):

  • 8+ CPU cores
  • 16+ GB RAM
  • SSD storage
  • Dedicated machine (not shared with other workloads)

Software:

  • Windows Server 2016 or later (2019 or 2022 recommended)
  • .NET Framework 4.7.2 or later
  • TLS 1.2 enabled

Network:

  • Outbound HTTPS (port 443) to Azure service endpoints
  • Access to your on-premises data sources (SQL Server port 1433, Oracle port 1521, etc.)
  • No inbound ports required from the internet

Step-by-Step Installation

1. Create the SHIR in Azure Data Factory

In ADF Studio:

  • Go to the Manage tab
  • Click Integration runtimes then New
  • Select Self-Hosted
  • Give it a name (e.g., "SHIR-Sydney-DC" - include the location for clarity)
  • Copy the authentication keys (you'll need these during installation)

2. Install the SHIR on Your Server

  • Download the latest SHIR installer from the Microsoft Download Centre (or use the link provided in ADF Studio)
  • Run the installer on your Windows server
  • During setup, paste the authentication key from step 1
  • The SHIR registers with your Data Factory and shows as "Running" in ADF Studio

3. Verify the Connection

  • In ADF Studio, check that the SHIR node shows as "Running" under Integration runtimes
  • Create a test linked service to an on-premises data source
  • Click "Test connection" to verify end-to-end connectivity

High Availability Configuration

For production workloads, run at least two SHIR nodes:

  1. Install the SHIR software on a second server
  2. During setup, use the same authentication key as the first node
  3. Both nodes register as part of the same SHIR
  4. Data Factory automatically distributes work across nodes and fails over if one goes down

We strongly recommend high availability for any on-premises integration that supports business-critical reporting or operations. The additional cost of a second server is trivial compared to the cost of a data outage.

Setting Up the On-Premises Data Gateway (Fabric)

Installation

  1. Download the on-premises data gateway from the Microsoft Power Platform site
  2. Install on a Windows server that can reach your on-premises data sources
  3. Sign in with your Microsoft 365 account
  4. Register the gateway with your Fabric tenant
  5. Configure gateway members for high availability (add additional servers to the same gateway cluster)

Configuration in Fabric

  1. In the Fabric admin portal, go to Manage connections and gateways
  2. Verify your gateway appears and is online
  3. When creating connections in Fabric Data Factory, select your gateway as the connectivity option

Connecting to Common On-Premises Data Sources

SQL Server

The most straightforward on-premises connection.

Requirements:

  • SQL Server instance accessible from the SHIR/gateway server
  • SQL authentication or Windows authentication credentials
  • TCP/IP protocol enabled on the SQL Server instance
  • Firewall allowing connection on port 1433 (or your custom port)

Configuration tips:

  • Use SQL authentication for simpler setup. Windows authentication requires additional configuration (the SHIR service account needs access to the SQL Server).
  • Always specify the database name in the connection, not just the server.
  • For named instances, use the format ServerName\InstanceName with the appropriate port.

Common issues:

  • SQL Server Browser service not running (required for named instances)
  • Windows Firewall blocking connections between SHIR server and SQL Server
  • SQL Server not configured to allow remote connections

Oracle

Oracle connectivity requires additional setup.

Requirements:

  • Oracle Client (or Oracle Instant Client) installed on the SHIR/gateway server
  • TNS names file configured, or use Easy Connect strings
  • Appropriate Oracle user with SELECT permissions

Configuration tips:

  • Install the 64-bit Oracle client to match the SHIR
  • Test the Oracle connection from the SHIR server directly (using sqlplus or a similar tool) before configuring Data Factory
  • Use the Oracle Managed Driver connection type for better performance

Common issues:

  • Oracle client architecture mismatch (32-bit vs 64-bit)
  • TNS configuration errors
  • Oracle listener not configured to accept connections from the SHIR server's IP

SAP

SAP connectivity is more complex and deserves its own section. See our guide on Data Factory for SAP integration for detailed instructions.

In brief:

  • For SAP HANA: Use the SAP HANA ODBC driver on the SHIR server
  • For SAP BW: Use the SAP .NET Connector
  • For SAP ECC/S4HANA: Use the SAP Table connector or SAP ODP connector

Network File Shares

Requirements:

  • SHIR service account has read access to the file share
  • Network path accessible from the SHIR server (UNC path like \\fileserver\share)

Configuration tips:

  • Use a service account (not a personal account) for file share access
  • Map the drive on the SHIR server to verify access before configuring Data Factory
  • For large files, ensure sufficient network bandwidth between the SHIR and the file share

AS/400 / IBM DB2

Still common in Australian manufacturing, logistics, and government.

Requirements:

  • IBM Data Server Driver installed on the SHIR server
  • Connection string with host, port, database, and credentials

Networking Scenarios

Simple - Direct Connectivity

The SHIR server sits on the same network as the on-premises data sources. No special networking required beyond outbound HTTPS to Azure.

This is the most common setup for organisations with a single data centre.

VPN or ExpressRoute

The SHIR server runs on an Azure VM, and you connect to on-premises through a site-to-site VPN or ExpressRoute circuit.

When to use this:

  • You want the SHIR managed as Azure infrastructure (easier patching, scaling)
  • You already have ExpressRoute for other Azure workloads
  • Network security requirements mandate that the SHIR doesn't sit on the corporate network

Configuration notes:

  • Ensure the VPN/ExpressRoute circuit has sufficient bandwidth for your data volumes
  • Latency over VPN can affect pipeline performance - test with realistic data volumes
  • The SHIR on the Azure VM connects to on-premises data sources through the VPN, and to Azure Data Factory through the Azure backbone

Private Endpoints

For organisations with strict security requirements, you can configure Data Factory to communicate with the SHIR through Azure Private Link, keeping all traffic on the Microsoft backbone.

When to use this:

  • Regulatory requirements prohibit data traversing the public internet
  • Zero-trust network architecture
  • Financial services and government environments

Configuration:

  • Create a managed virtual network for your Data Factory
  • Set up private endpoints for the data sources
  • Configure the SHIR to use the private endpoint

This adds complexity and cost. We recommend it only when security requirements explicitly demand it.

Performance Optimisation

Sizing the SHIR Server

Under-sizing the SHIR is the most common performance mistake. The SHIR handles data extraction, serialisation, compression, and transfer. For large data volumes, it needs real compute power.

Sizing guidelines:

Daily Data Volume Recommended SHIR Spec
Under 10 GB 4 cores, 8 GB RAM
10-50 GB 8 cores, 16 GB RAM
50-200 GB 16 cores, 32 GB RAM
Over 200 GB Multiple SHIR nodes, 16+ cores each

Tuning Copy Activities

  • Parallelism: Set the degree of copy parallelism in the Copy activity settings. Default is automatic, but you can increase it for throughput.
  • Data Integration Units: For the cloud-to-cloud portion of the transfer, increase DIU allocation for larger datasets.
  • Staging: For some scenarios (especially when writing to Azure Synapse or SQL Data Warehouse), enable staging through Azure Blob Storage for better performance.
  • Compression: Enable compression on the SHIR for data transfer. This reduces network bandwidth requirements at the cost of some CPU.

Batch Size and Timeout

  • Set appropriate batch sizes for database reads (10,000-50,000 rows is typical)
  • Set query timeouts appropriate for your source system (don't use the default 2 minutes for a query that needs 10)
  • Use pagination for REST API sources to avoid timeout issues

Security Best Practices

1. Use Azure Key Vault for credentials. Never store database passwords or connection strings directly in Data Factory linked services. Reference them from Key Vault.

2. Least-privilege access. The database accounts used by Data Factory should have read-only access (SELECT permission only) unless you need to write back to the source.

3. Encrypt data in transit. Enable SSL/TLS for all database connections. This is especially important when data traverses a VPN.

4. Restrict SHIR network access. The SHIR server should only have network access to the data sources it needs and outbound to Azure. No inbound internet access, no unnecessary network routes.

5. Keep the SHIR updated. Microsoft releases monthly updates for the SHIR. Enable auto-update or schedule manual updates at least quarterly.

6. Audit and log. Enable diagnostic logging on the SHIR and send logs to Azure Log Analytics. This gives you visibility into what data is being accessed, when, and by which pipelines.

Troubleshooting Common Issues

Problem Likely Cause Fix
SHIR shows "Offline" in ADF Server restarted, service stopped, or network issue Check the SHIR service is running on the server. Check outbound connectivity to Azure.
"Cannot connect to SQL Server" Firewall, SQL config, or credentials Test connection directly from SHIR server using SSMS. Check firewall rules.
Slow data transfer SHIR under-resourced or network bottleneck Check SHIR CPU/memory usage. Test network bandwidth. Consider scaling SHIR.
"Gateway is unreachable" (Fabric) Gateway service stopped or registration expired Restart the gateway service. Re-register if needed.
Timeout errors Query too slow or timeout too short Optimise the source query. Increase the timeout setting.
"The given key was not present" SHIR version mismatch or corrupt installation Update SHIR to latest version. Reinstall if needed.

How Team 400 Can Help

On-premises connectivity is often the most time-consuming part of a Data Factory implementation. Networking approvals, firewall changes, infrastructure provisioning, and security reviews can add weeks to a project timeline if not managed proactively.

As Microsoft Data Factory consultants, we've set up on-premises integration for dozens of Australian organisations. We know what to ask for up front, how to work with your IT team to get approvals through quickly, and how to troubleshoot the inevitable connectivity issues.

We also help with the broader data platform and reporting layer, so your on-premises data doesn't just get to the cloud - it gets turned into actionable insights.

Get in touch to discuss your on-premises integration requirements, or learn more about our AI and data consulting services.