About Azure Data Factory (ADF)

Azure Data Factory (ADF) is a fully managed service for composing data storage, processing, and movement services into streamlined, scalable, and reliable data production pipelines. Its Copy Activity copies data from a source to a sink (destination) in batch, and can be used to move data from on-premises to Azure, from Azure to on-premises, or from Azure to Azure.

Problem Statement

While ADF (Azure Data Factory) is a great tool for performing ETL activities effectively, it comes with its own set of limitations, the primary one being its lack of support for parameterization. Parameters are essential for loading only the data that has changed since the last successful load instead of performing a full load every time. When dealing with millions of rows, the inability to load incrementally makes the whole ETL process seriously inefficient.
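For illustration, the difference between a full load and a watermark-based incremental load comes down to the source query the Copy Activity runs. This is only a sketch: SalesOrders and ModifiedDate are hypothetical names, and $LastRunDate is the watermark kept in a configuration table, as described later in this post.

# Full load: every row is copied on every run
$SqlCmd.CommandText = "SELECT * FROM SalesOrders"

# Incremental load: only rows changed since the last successful run are copied
# (SalesOrders and ModifiedDate are hypothetical; $LastRunDate is the stored watermark)
$SqlCmd.CommandText = "SELECT * FROM SalesOrders WHERE ModifiedDate > '" + $LastRunDate + "'"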

We expect Microsoft to address this limitation in an upcoming version of ADF, but until that happens we have built an effective stopgap solution. We have also made sure the solution is configurable and scales to each new client and each new source/target set, reducing the overhead for our customers in building, configuring, and maintaining ETL activities.

Our Proposed Solution

With the proposed solution we plan to address the following issues:

  1. Ability to add and manage clients (source/target systems) dynamically (the configuration tables this relies on are sketched below)
  2. Ability to pass parameters to ADF to load data incrementally
  3. Ability to drive and manage multiple jobs in Azure
  4. Ability to edit jobs dynamically

We have used PowerShell scripting to build our solution.
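The client metadata and watermarks live in two SQL tables. The sketch below shows the assumed schemas, run once against the configuration database using the same $SqlCmd pattern used later in this post; the column names come from the queries shown further down, while the data types and keys are assumptions. Adding a new client then becomes a matter of inserting a row into ServerDetails rather than editing any ADF artifacts by hand.

# Minimal sketch of the assumed configuration tables (data types and keys are assumptions)
$SqlCmd.CommandText = @"
CREATE TABLE ServerDetails (
    ID           INT IDENTITY PRIMARY KEY,
    ServerName   NVARCHAR(200),
    DatabaseName NVARCHAR(200),
    UserName     NVARCHAR(200),
    SecureId     NVARCHAR(200),   -- reference to the stored credential (assumption)
    StartTime    DATETIME,
    SQLQuery     NVARCHAR(MAX),   -- source query used by the Copy Activity
    Client       NVARCHAR(100)
);
CREATE TABLE ClientConfigurations (
    Client       NVARCHAR(100),
    LastRunDate  DATETIME          -- watermark for incremental loads
);
"@
$SqlCmd.ExecuteNonQuery() | Out-Null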

Solution Overview

Our approach can be depicted in the following diagram:

Technical Details

We will use the following PowerShell commands for our solution.

Note: A generic naming convention is used for linked services, datasets, and pipelines.

  • Get the LastRunDate for each client from the target table and update it in the configuration table
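A hedged sketch of this step; TargetSalesOrders and LastModifiedDate are hypothetical names, and $SqlCmd is assumed to be connected to the appropriate database in each call:

# Read the most recent change date already present in the target (hypothetical names)
$SqlCmd.CommandText = "SELECT MAX(LastModifiedDate) FROM TargetSalesOrders"
$lastRun = $SqlCmd.ExecuteScalar()

# Persist it as the watermark for this client's next incremental run
$SqlCmd.CommandText = "UPDATE ClientConfigurations SET LastRunDate = '" + $lastRun + "' WHERE Client = '" + $ClientName + "'"
$SqlCmd.ExecuteNonQuery() | Out-Null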
  • Get the list of clients, and their respective details, whose data needs to be copied to the cloud
$SqlCmd.CommandText = "SELECT ID, ServerName, DatabaseName, UserName, SecureId, StartTime, SQLQuery, Client FROM ServerDetails"
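The $SqlCmd object used throughout is assumed to be an ADO.NET SqlCommand bound to the configuration database; a minimal sketch of how it might be created and executed (the connection string values are placeholders):

# Connect to the configuration database (placeholder connection string)
$SqlConn = New-Object System.Data.SqlClient.SqlConnection
$SqlConn.ConnectionString = "Server=<config-server>;Database=<config-db>;Integrated Security=True"
$SqlConn.Open()
$SqlCmd = $SqlConn.CreateCommand()

# Load the client list into a DataTable for the ForEach loop in the next step
$SqlCmd.CommandText = "SELECT ID, ServerName, DatabaseName, UserName, SecureId, StartTime, SQLQuery, Client FROM ServerDetails"
$serverDetails = New-Object System.Data.DataTable
$serverDetails.Load($SqlCmd.ExecuteReader())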

  • Use a ForEach loop within PowerShell to go through the list of clients, as sketched below
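A sketch of the loop, assuming the $serverDetails DataTable built above; the per-client values feed the JSON generation and queries in the following steps:

foreach ($row in $serverDetails.Rows) {
    # Per-client values used to build the linked service, dataset and pipeline JSON
    $ClientName   = $row.Client
    $serverName   = $row.ServerName
    $databaseName = $row.DatabaseName
    $sourceQuery  = $row.SQLQuery
    # ... create the linked service, datasets and pipeline for this client (steps below) ...
}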
  • Create the linked service if it is not available
$IsLinkedServiceExist = Get-AzureRmDataFactoryLinkedService -ResourceGroupName $resource `
    -DataFactoryName $datafactory -Name $linkedServiceName
if ([string]::IsNullOrEmpty($IsLinkedServiceExist)) {
    # $linkedServiceName and $FinalLinkedServiceJSONFileName follow the same generic
    # naming convention as the dataset variables below
    New-AzureRmDataFactoryLinkedService -ResourceGroupName $resource `
        -DataFactoryName $datafactory -File $FinalLinkedServiceJSONFileName -Force
}
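The per-client JSON files referenced here and below ($FinalLinkedServiceJSONFileName, $FinalInputdatasetJSONFileName, $FinaloutputdatasetJSONFileName, $FinalPipelineJSONFileName) can be generated from generic templates by token replacement. A hedged sketch; the template path and token names are assumptions:

# Generate a per-client linked service definition from a generic template
# (template file and __TOKEN__ names are assumptions)
$template = Get-Content ".\Templates\LinkedServiceTemplate.json" -Raw
$template = $template.Replace("__SERVERNAME__", $serverName)
$template = $template.Replace("__DATABASENAME__", $databaseName)
$FinalLinkedServiceJSONFileName = ".\Generated\${ClientName}-LinkedService.json"
$template | Set-Content $FinalLinkedServiceJSONFileName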
  • Similarly, create Input and Output datasets if they are not available

Input dataset

$IsInputDataSetExist = Get-AzureRmDataFactoryDataset -ResourceGroupName $resource `
    -DataFactoryName $datafactory -Name $inputdatasetName
if ([string]::IsNullOrEmpty($IsInputDataSetExist)) {
    New-AzureRmDataFactoryDataset -ResourceGroupName $resource `
        -DataFactoryName $datafactory -File $FinalInputdatasetJSONFileName -Force
}

Output dataset

$IsOutputDataSetExist = Get-AzureRmDataFactoryDataset -ResourceGroupName $resource `
    -DataFactoryName $datafactory -Name $outputdatasetName
if ([string]::IsNullOrEmpty($IsOutputDataSetExist)) {
    New-AzureRmDataFactoryDataset -ResourceGroupName $resource `
        -DataFactoryName $datafactory -File $FinaloutputdatasetJSONFileName -Force
}
  • Get the LastRunDate from the configuration table for each client
$SqlCmd.CommandText = "SELECT LastRunDate FROM ClientConfigurations WHERE Client = '" + $ClientName + "'"
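This LastRunDate is how a parameter is effectively passed to ADF: before the pipeline is deployed, the watermark is substituted into the Copy Activity's source query inside the pipeline JSON. A hedged sketch; the template path and the __LASTRUNDATE__ token are assumptions:

$LastRunDate = $SqlCmd.ExecuteScalar()

# Substitute the watermark into the Copy Activity source query in the pipeline template
# (__LASTRUNDATE__ and the template path are assumptions)
$pipelineJson = Get-Content ".\Templates\PipelineTemplate.json" -Raw
$pipelineJson = $pipelineJson.Replace("__LASTRUNDATE__", $LastRunDate.ToString("yyyy-MM-ddTHH:mm:ss"))
$FinalPipelineJSONFileName = ".\Generated\${ClientName}-Pipeline.json"
$pipelineJson | Set-Content $FinalPipelineJSONFileName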
  • Create/update the pipeline with a one-time run configuration
New-AzureRmDataFactoryPipeline -ResourceGroupName $resource `
    -DataFactoryName $datafactory -File $FinalPipelineJSONFileName -Force
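To make the deployment a one-time run and to track its status, the pipeline's active period can be pinned to a single window and the output slice polled afterwards. A hedged sketch; $pipelineName, $windowStart and $windowEnd are assumptions following the same generic naming convention:

# Pin the pipeline to a one-time active period (a single slice)
Set-AzureRmDataFactoryPipelineActivePeriod -ResourceGroupName $resource `
    -DataFactoryName $datafactory -PipelineName $pipelineName `
    -StartDateTime $windowStart -EndDateTime $windowEnd

# Track the run: each slice's State shows whether it is Waiting, InProgress, Ready or Failed
Get-AzureRmDataFactorySlice -ResourceGroupName $resource `
    -DataFactoryName $datafactory -DatasetName $outputdatasetName `
    -StartDateTime $windowStart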

Solution Prerequisites

In order to use the solution efficiently, the following prerequisites need to be in place.

  • ADF (including Data Management Gateway for on-premises source databases) as the ETL solution
  • The Azure PowerShell cmdlets package installed on the local machine (a sign-in sketch follows this list)
  • An Azure Active Directory account
  • The list of on-premises servers whose data should be copied to the cloud
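Before running the script, sign in with the Azure Active Directory account and select the subscription that hosts the data factory; a minimal sketch (the subscription name is a placeholder):

# Sign in and point the session at the right subscription
Login-AzureRmAccount
Select-AzureRmSubscription -SubscriptionName "<your-subscription>"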

Conclusion

This custom solution gives our customers the ability to manage multiple source/target systems through configuration, helping them avoid ETL overhead. It performs incremental data loads instead of full loads, avoiding the inefficiencies described above, and it automates passing parameters to the pipeline while providing the ability to track the status of each job. If these are some of the challenges you are facing, this solution should fit well into your scheme of things. Do talk to us about any of your challenges around Azure Data Factory; we would be glad to help.