Azure Data Factory Best Practices

Amit Damle
Jul 13, 2021


Content Contributors — RK Iyer

Hello, here is a quick best-practices checklist to help you build effectively using Azure Data Factory. Based on our experience working with various customers, we realized that there is no consolidated list of ADF best practices to help new customers effectively build data pipelines. We have tried to address this across four categories: Development, Performance, Security, and Disaster Recovery.

Development —

6 Important Learnings from Mistakes

#1. Naming convention —

Most of the time, new users do not follow naming conventions and keep the default ADF artefact names, which causes confusion at a later point in time.

Refer to this for naming rules

Sample rules
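For illustration only, one possible prefix-based convention could look like the sketch below; these names are hypothetical and not an official ADF standard.

```python
# Purely illustrative, prefix-based artefact names (hypothetical, not an official standard).
naming_examples = {
    "pipeline":       "PL_Copy_SalesDB_To_ADLS",
    "dataset":        "DS_ASQL_SalesDB_Customers",
    "linked_service": "LS_ASQL_SalesDB",
    "trigger":        "TR_Daily_0600_UTC",
    "data_flow":      "DF_Clean_Customers",
}
```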

#2. Creating one pipeline per Table / Object —

New users often make the mistake of creating one pipeline per table or object for extracting data. This leads to a cluttered, unmanageable ADF workspace. Instead, check whether you can group the tables / files / objects and create a single pipeline for data extraction. This will reduce the number of pipelines to be managed. The dynamic pipeline concept will help in creating extraction and processing pipelines for multiple objects.

Dynamic Pipeline Sample

Note: Though the above sample shows copying multiple files into Synapse, the same concept can be used with any storage supported by ADF.
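For a rough idea of how the dynamic approach works end to end, here is a minimal sketch, assuming the azure-identity and azure-mgmt-datafactory packages: a single parameterized pipeline is triggered once per entry of a small control list. The pipeline name CopyGenericTablePipeline and its parameter names are assumptions for illustration.

```python
# pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # fill in your own values
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"

# A tiny "control table" kept inline for the sketch; in practice this would
# come from a SQL control table or a config file.
objects_to_copy = [
    {"schema": "sales", "table": "Customers"},
    {"schema": "sales", "table": "Orders"},
    {"schema": "hr",    "table": "Employees"},
]

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# One parameterized pipeline handles every object instead of one pipeline per table.
for obj in objects_to_copy:
    run = adf_client.pipelines.create_run(
        RESOURCE_GROUP,
        FACTORY_NAME,
        "CopyGenericTablePipeline",  # hypothetical pipeline name
        parameters={"schemaName": obj["schema"], "tableName": obj["table"]},
    )
    print(f"Started run {run.run_id} for {obj['schema']}.{obj['table']}")
```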

#3. Incremental Copy Dilemma —

New users are often either unaware of ADF’s incremental copy capabilities or get confused during implementation.

Here is a simple guideline –

1. If your tables have a column that uniquely identifies rows or a timestamp column, then you can make use of ADF’s incremental copy logic.

2. If you need to perform complex logic to identify the incremental dataset, then it is better to use specialized tools, e.g. Informatica, GoldenGate, or StreamSets.

Copy Multiple Tables

Incremental Copy Pipeline using Template

Using SQL Server Change Tracking
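For context, the high-watermark pattern behind these templates can be summarized as follows; a minimal sketch, assuming a timestamp column and a control table that stores the last watermark (the table and column names are hypothetical).

```python
from datetime import datetime, timezone

# Hypothetical example: incrementally copy rows changed since the last successful run.
TABLE = "sales.Orders"
WATERMARK_COLUMN = "LastModifiedDateTime"  # a timestamp (or ever-increasing key) column

# 1. Read the old watermark (normally a SELECT from a control/watermark table).
old_watermark = datetime(2021, 7, 1, tzinfo=timezone.utc)

# 2. Capture the new watermark before extraction so late-arriving rows are not skipped.
new_watermark = datetime.now(timezone.utc)

# 3. Build the delta query that the copy activity would run as its source query.
source_query = (
    f"SELECT * FROM {TABLE} "
    f"WHERE {WATERMARK_COLUMN} > '{old_watermark:%Y-%m-%d %H:%M:%S}' "
    f"AND {WATERMARK_COLUMN} <= '{new_watermark:%Y-%m-%d %H:%M:%S}'"
)
print(source_query)

# 4. After the copy succeeds, persist new_watermark back to the control table
#    so the next run only picks up rows modified after this point.
```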

#4. Missing Application Logging & Auditing —

In most cases, users forget to handle application logging and auditing, assuming ADF will capture those finer details. Since ADF is a platform, it cannot capture application-specific details on success or failure; instead, it provides a way for you to do so.

Error Handling –

As you can see in the diagram, ADF activities allow actions to be taken based on status, e.g. Success / Failure / Completion / Skipped. For more information, please refer here.

Sending Email Notification

Monitoring —

Using Azure Monitor & ADF Analytics

Inbuilt / Custom Monitoring — https://www.youtube.com/watch?v=zyqf8e-6u4w
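Where custom auditing is needed, a run’s status and timings can also be pulled programmatically and written to your own audit store. A minimal sketch, assuming the azure-mgmt-datafactory SDK; the audit record shape is an assumption.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"
RUN_ID = "<pipeline-run-id>"  # e.g. captured when the run was triggered

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Fetch the run details from ADF and build an application-level audit record.
run = adf_client.pipeline_runs.get(RESOURCE_GROUP, FACTORY_NAME, RUN_ID)

audit_record = {
    "pipeline_name": run.pipeline_name,
    "run_id": run.run_id,
    "status": run.status,        # e.g. Succeeded / Failed / InProgress
    "run_start": run.run_start,
    "run_end": run.run_end,
    "message": run.message,      # error details when the run failed
}

# In a real pipeline this record would be inserted into an audit table
# (e.g. via a Stored Procedure activity or a small script); here we just print it.
print(audit_record)
```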

#5. No provision for Rerunning the Pipelines —

As a new ETL user, it is easy to get stuck while implementing the reprocessing workflow. Reprocessing may need simple or complex logic depending on the scenario. ADF provides an option to trigger a rerun of the entire pipeline or of specific activities within the pipeline. Please refer here for more information. If your scenario demands complex logic, then custom logic can be built using the control table used in #2.
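The rerun can also be requested programmatically. A hedged sketch, assuming the azure-mgmt-datafactory SDK; the recovery-related keyword arguments mirror the public CreateRun REST parameters (referencePipelineRunId, isRecovery, startFromFailure) and may differ slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"
FAILED_RUN_ID = "<previous-failed-run-id>"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Rerun the pipeline in "recovery" mode, restarting from the failed activity
# instead of re-executing everything that already succeeded.
rerun = adf_client.pipelines.create_run(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "CopyGenericTablePipeline",  # hypothetical pipeline name from the earlier sketch
    reference_pipeline_run_id=FAILED_RUN_ID,
    is_recovery=True,
    start_from_failure=True,
)
print(f"Recovery run started: {rerun.run_id}")
```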

#6. Lack of Clarity on Source Control —

Most users do not make use of ADF’s capability to connect to a version control system to store their artefacts and share them among multiple members of the team. This causes accidental updates or deletion of pipelines marked for release. The suggestion is to make use of ADF’s version control capabilities and set up a robust continuous development process with strict review and release cycles via Azure DevOps. Please refer here for CI/CD using Azure DevOps for ADF.

Performance —

Azure Data Factory is a managed service, i.e. the compute required for data movement and processing can be scaled based on need (for the Azure IR). If you want to run your pipelines / activities in parallel, then design your pipeline to make use of the ForEach activity, which can execute a maximum of 50 inner activities simultaneously.

Performance tuning is an iterative process that requires changing parameters until you get the desired performance. While using the Azure IR, you can set the number of Data Integration Units (DIUs), which represents the CPU, memory, and network resources allocated to the copy.
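To show where these two knobs live, here is the rough shape of the relevant pipeline JSON, written as a Python dict for readability; the activity and dataset names are hypothetical.

```python
# Rough shape of the parallelism and DIU settings inside a pipeline definition
# (expressed as a Python dict; in ADF this is authored as pipeline JSON).
for_each_activity = {
    "name": "CopyEachTable",
    "type": "ForEach",
    "typeProperties": {
        "items": {"value": "@pipeline().parameters.tableList", "type": "Expression"},
        "isSequential": False,   # run iterations in parallel
        "batchCount": 50,        # maximum number of parallel inner executions
        "activities": [
            {
                "name": "CopyOneTable",
                "type": "Copy",
                "typeProperties": {
                    # Data Integration Units for a copy running on the Azure IR;
                    # raise or lower iteratively while measuring throughput.
                    "dataIntegrationUnits": 8,
                    "source": {"type": "AzureSqlSource"},
                    "sink": {"type": "ParquetSink"},
                },
            }
        ],
    },
}
```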

If you want to tune a copy activity that uses the Azure Integration Runtime, then please refer here for performance tuning steps

If using a Self-Hosted Integration Runtime, then refer to this tuning guide

Please refer here for Mapping Data Flow Performance Tuning

Security —

Authentication and Authorization —

Authentication — users accessing the ADF workspace are authenticated using their Azure AD credentials. Based on their assigned roles, they will be allowed to create or make changes to pipelines.

Please refer to this link for details on ADF roles

Securing Data Store Credentials –

Data Factory supports Managed Service Identity (MSI), which relieves users from creating and managing a service principal. In case a data store is not supported by ADF for MSI authentication, make sure to store the credentials in Azure Key Vault.

Please refer here for more information
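As an illustration, a linked service that pulls its connection string from Key Vault looks roughly like the sketch below (expressed as a Python dict; LS_KeyVault and the secret name are hypothetical).

```python
# Sketch of an Azure SQL linked service whose connection string is stored in Key Vault.
# "LS_KeyVault" is a hypothetical Key Vault linked service already defined in the factory.
sql_linked_service = {
    "name": "LS_AzureSql_SalesDB",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "LS_KeyVault",
                    "type": "LinkedServiceReference",
                },
                "secretName": "salesdb-connection-string",
            }
        },
    },
}
```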

Encryption during Transit

If the data store supports HTTPS or TLS, then all data will be transferred over a secured channel; TLS 1.2 is used by default. When transferring data from an on-premises source or from another cloud, make sure you have a Site-to-Site VPN configured.

Please refer here for more information

Secure SHIR on Azure –

Management of a SHIR installed on an Azure VM is the responsibility of the user:

- Patching of VMs managed by Azure

- Install malware defense on the VM and make sure it is regularly updated

- The VM should be part of your Virtual Network

- Make sure it has only outbound access to the internet

- Only a few authorized users should have access to the SHIR VM; if possible, adopt a just-in-time access policy

Network Security –

If users want their storage to be accessed from within their own Virtual Network on Azure, then the Data Factory Self-Hosted Integration Runtime needs to be installed on a Virtual Machine (VM) inside that network.

Using Private Link —

Using Private Link, users can connect to various Azure PaaS services over private IP addresses. For more details and limitations, please refer here

Disaster Recovery —

At present, ADF only provides data redundancy (see here). In case of a regional datacenter loss, ADF is restored in the paired region; this is controlled by Microsoft. If customer-controlled DR is required, then a secondary ADF needs to be set up in a different region, and the metadata can be replicated using an Azure DevOps release pipeline.

Disclaimer: Ideas / views expressed are personal opinions
