Walkthrough of CICD in Azure Data Factory (ADF)

Important concepts & Best Practices

One of the key requirements of enterprise projects using Azure Data Factory (ADF) is to ensure version control, controlled releases, and automated deployment of ADF pipelines from lower environments (Dev) to higher environments (UAT, Prod).

In this blog, I will discuss a CI/CD strategy for ADF using Azure DevOps to ensure seamless, automated deployment across environments.

Continuous Integration & Delivery (CICD) in Azure Data Factory (ADF)

Continuous Integration (CI) is a development practice of merging a developer’s work/code into a shared repository several times a day. Each check-in is then verified by an automated build, allowing teams to detect problems early. Continuous Delivery (CD) picks up where continuous integration ends. CD automates the delivery of applications to selected infrastructure environments.

Continuous Integration (CI) in Azure Data Factory

Azure Data Factory uses Azure Resource Manager (ARM) templates to store the configuration of your various ADF entities (pipelines, datasets, data flows, and so on). Azure Data Factory allows you to configure a Git repository with either Azure Repos or GitHub out of the box.

Each Git repository associated with a data factory has a “collaboration branch” that is used for publishing. “main” is the default collaboration branch, and this can be changed to publish from a different branch.

Each developer creates a feature branch to make changes. After a developer is satisfied with their changes, they create a pull request from their feature branch to get their changes reviewed by peers. Once a pull request is approved, the changes are merged into the collaboration (main) branch.

On manually publishing from the main branch (by clicking Publish in the ADF UI), Resource Manager templates are generated and pushed to the “adf_publish” branch. It’s recommended not to allow direct check-ins to the collaboration branch, so that only a designated set of developers or the team lead can publish to the data factory.

Organizations using other source control systems, such as GitLab or Bitbucket, can export the Resource Manager template from the ADF UI, check it in to their source control manually, and then use the continuous deployment capability of Azure DevOps pipelines to perform automated deployments to higher environments.

Continuous delivery(CD) in Azure Data Factory

The following methods can be used to promote a data factory to higher environments:

  • Manually upload a Resource Manager template using Data Factory UX integration with Azure Resource Manager.

In this approach, the Resource Manager template is manually exported from the development environment using “Export ARM Template”.

The exported template is then imported into the higher environments (UAT, Prod) using “Import ARM Template”. Refer to the detailed steps at https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment#manually-promote-a-resource-manager-template-for-each-environment
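When promoting this way, the value that typically changes between environments is the target factory name in the exported parameters file. The sketch below shows the idea in Python; the parameter shape mirrors ADF's exported ARMTemplateParametersForFactory.json, but the factory names are made up for illustration:

```python
import json


def retarget_parameters(params: dict, target_factory: str) -> dict:
    """Return a copy of the exported ARM parameters with the factory name overridden."""
    out = json.loads(json.dumps(params))  # cheap deep copy via round-trip
    out["parameters"]["factoryName"]["value"] = target_factory
    return out


# Minimal illustration of the shape of ARMTemplateParametersForFactory.json
# as exported from the Dev factory (factory names here are hypothetical).
dev_params = {
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {"factoryName": {"value": "adf-contoso-dev"}},
}

# Retarget the Dev export at the UAT factory before importing it there.
uat_params = retarget_parameters(dev_params, "adf-contoso-uat")
```

The rewritten parameters can then be supplied alongside the exported ARM template when running “Import ARM Template” in the target environment.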

This approach works well if you have a small number of pipelines, the publish cadence is very low (for example, once a year), or you need selective publishing in certain production hotfix scenarios where a bug must be fixed right away rather than going through the regular approval process.

  • Automated deployment using Data Factory’s integration with Azure Pipelines

In this approach, Azure Pipelines release is used to automate the deployment of a data factory to multiple environments.

As part of release management, the Resource Manager artifacts are pulled, a release definition is created, and the release is deployed to the different environments using either a manual or an automated trigger, ensuring that environment-specific parameters are replaced during deployment.

It is recommended to store the connection information for linked services in Azure Key Vault. It is also recommended to keep separate key vaults with the same secret names for each environment; that way, you don’t need to parameterize each connection string across CI/CD environments, because the only thing that changes is the key vault name, which is itself a separate parameter.
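To make the key-vault-per-environment convention concrete, here is a minimal Python sketch. The vault names are hypothetical, but the linked-service payload follows the AzureKeyVault linked service shape ADF uses, where only the `baseUrl` differs per environment:

```python
# Per-environment Key Vault base URLs. The secret names inside each vault are
# identical across environments, so this URL is the only value that changes.
# Vault names below are hypothetical.
KEY_VAULT_URLS = {
    "dev": "https://kv-contoso-dev.vault.azure.net/",
    "uat": "https://kv-contoso-uat.vault.azure.net/",
    "prod": "https://kv-contoso-prod.vault.azure.net/",
}


def key_vault_linked_service(environment: str) -> dict:
    """Sketch of the AzureKeyVault linked-service definition for one environment."""
    return {
        "name": "ls_keyvault",
        "properties": {
            "type": "AzureKeyVault",
            "typeProperties": {"baseUrl": KEY_VAULT_URLS[environment]},
        },
    }


# During release, only this one parameter is swapped per environment;
# every other linked service keeps referencing the same secret names.
uat_ls = key_vault_linked_service("uat")
```

Because every other linked service resolves its connection string through this single vault reference, swapping the `baseUrl` at release time retargets all of them at once.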

By design, Data Factory doesn’t allow cherry-picking of commits or selective publishing of resources from the ADF UI. Publishes include all changes made in the data factory, since selective publishing of a subset of resources could lead to unexpected behaviors and errors. It is therefore recommended to ensure that only those resources which need to be deployed to higher environments are pushed to the collaboration branch and published. In cases where selective cherry-picking of commits must be performed, use the Azure DevOps cherry-pick feature.

I would like to thank Ameet for co-authoring the blog & special thanks to Abhishek Narain & Damle Amit for reviewing the content.

I hope this blog helped you in understanding the basics of CICD in Azure Data Factory along with recommended best practices.

We have a separate blog with a sample pipeline and detailed steps covering the same setup for a deeper understanding. Happy Learning!!!

Please Note — All opinions expressed here are my personal views and not of my employer.
