ADF Studio Designing Data Pipelines 1: Data Copy

Explore data pipeline design using the Azure portal.

This lesson focuses on designing data pipelines using the Azure portal UI. We’ll delve into the tools and functionality the portal provides, like the intuitive drag-and-drop interface, interactive canvas, and comprehensive toolbox, that enable seamless pipeline design.

Designing data pipelines in ADF

Designing data pipelines in ADF involves creating a workflow that defines how data moves from source systems to target systems. A pipeline consists of activities, each representing a processing step such as copying data from one location to another, transforming data, or running a custom activity. The pipeline can also include data flow activities that define the structure of the data as it moves through the pipeline.
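To make this structure concrete, here is a minimal sketch of what a pipeline definition looks like, written as a Python dictionary that mirrors the JSON ADF stores behind the scenes. The pipeline, activity, and dataset names are placeholders for illustration.

```python
# A minimal sketch of an ADF pipeline definition, shown as a Python dict that
# mirrors the pipeline JSON. All names here are hypothetical placeholders.
pipeline_definition = {
    "name": "CopySyntheticDataPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyCsvToSql",          # one processing step in the workflow
                "type": "Copy",                  # the built-in Copy activity
                "inputs": [{"referenceName": "SourceBlobDataset", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SinkSqlDataset", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},  # how to read the source
                    "sink": {"type": "AzureSqlSink"},           # how to write the destination
                },
            }
        ]
    },
}
```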

Designing a pipeline includes selecting the right data sources and destinations, choosing the appropriate data integration technologies, defining the data transformation logic, and optimizing pipeline performance. ADF provides a wide range of tools and features for this, including a visual designer, a code editor, and integration with other Azure services such as Azure Functions, Azure Databricks, and Azure Stream Analytics.

Data movement activities in ADF

Data movement activities in ADF copy and transform data between different data sources and destinations. They enable the movement of data across various on-premises and cloud-based data stores, including SQL Server, Oracle, MySQL, PostgreSQL, Azure SQL Database, Azure Blob Storage, Azure Data Lake Storage, and more. Active loading activities copy data in real time or near real time, which is useful when data must be processed as soon as it becomes available, such as in streaming scenarios.

Azure allows for both incremental and bulk data copy using Data Factory, and the Azure documentation explains the steps for setting up that architecture.
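As a rough illustration of the incremental pattern, the copy source can filter rows against a stored watermark so that only new or changed records are read on each run. The sketch below expresses such a source definition as a Python dict mirroring the ADF JSON; the table, column, and Lookup activity names are hypothetical.

```python
# Hypothetical incremental-copy source: read only rows modified since the last
# recorded watermark, which an earlier Lookup activity is assumed to have fetched.
incremental_source = {
    "type": "AzureSqlSource",
    "sqlReaderQuery": (
        "SELECT * FROM dbo.Orders "
        "WHERE LastModifiedDate > "
        "'@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}'"
    ),
}
```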

Types of data copy functionality

  • Copy data activity: This is the primary data movement activity in Azure Data Factory. It is a serverless data copy solution that can copy data between a wide variety of sources and destinations, replicating data from file-based systems like Azure Blob Storage, Azure Data Lake Storage, and FTP, as well as database systems like SQL Server, Oracle, MySQL, and PostgreSQL. A programmatic sketch of this activity appears after this list.

  • Data Flow: Azure Data Factory Data Flow is a cloud-native data transformation service that lets users visually design, build, debug, and execute data transformations at scale. It is used when users need to transform data during the copy process. Data flow is also available for mapping data from a source to a sink in ADF pipelines, similar to the mapping data flow feature in Azure Synapse Analytics.

  • Bulk copy: This is a high-speed data copy operation that efficiently copies large amounts of data. It is available for copying data from SQL Server, Oracle, and MySQL sources, and in bulk mode it can also copy data to a SQL Server destination.

  • Incremental copy: This is a data copy operation that copies only the data that has changed since the last copy operation. It is useful for systems with large volumes of frequently changing data, such as transactional systems, and is available for both file-based and database-based systems.

  • Data Migration Assistant (DMA): This is a tool that can help users migrate databases from on-premises or other cloud platforms to Azure SQL Database, Azure SQL Managed Instance, or SQL Server on Azure Virtual Machines. It can also be used to copy data from one database to another, both on-premises and in the cloud.
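The portal’s authoring UI builds these activities for you, but the Copy data activity can also be sketched programmatically. The snippet below uses the azure-mgmt-datafactory Python SDK and assumes the data factory, linked services, and the two blob datasets already exist; all resource names are placeholders, and exact model names can vary slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    PipelineResource,
)

# Hypothetical placeholders: substitute your own subscription, resource group,
# and data factory names.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "edu-rg"
FACTORY_NAME = "edu-adf"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# A Copy activity that reads from one blob dataset and writes to another.
# "SourceBlobDataset" and "SinkBlobDataset" are assumed to exist already.
copy_activity = CopyActivity(
    name="CopySyntheticPiiData",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SourceBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SinkBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Publish a pipeline containing the activity, then trigger an on-demand run.
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(RESOURCE_GROUP, FACTORY_NAME, "CopyPipeline", pipeline)
run = adf_client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, "CopyPipeline", parameters={})
print(f"Started pipeline run {run.run_id}")
```

Calling create_run here is roughly what triggering the pipeline from the ADF Studio UI does.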

Running a data copy pipeline

In this section, we’ll build a data pipeline for performing a source-to-destination copy activity in ADF. For this copy activity, we will use the synthetic-pii-data.csv data file, which is available in the course contents.

Setting up source data

  1. Log in to the Azure portal.

  2. In the top search navigation, type “Storage” and select “Storage accounts.”

  3. Now, click the “+Create” button to create a new storage account.

  4. Note that the name chosen for the storage account must be globally unique. In this example, the storage account is named “edu123storage.”

  5. In the storage account’s left menu pane, select “Containers.”

  6. Then click the “+Container” button to create a new storage container. This example uses a container named “edu-az-storage.”

The reason for creating a container inside the storage account is that raw data files cannot be stored directly in a storage account; containers are the storage layer that holds the data files. The hierarchy is: Azure account > resource group > storage account > storage container > raw data files.

  7. Now, inside the container, click “Upload” and upload the synthetic-pii-data.csv file downloaded earlier. This brings the data file into the Azure ecosystem.

The images below give reference to the steps performed above to upload our raw data file into an Azure storage container:
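If you prefer to script the container creation and upload steps instead of clicking through the portal, a minimal sketch using the azure-storage-blob Python SDK is shown below. It assumes the storage account’s connection string is exported as an environment variable and that the CSV file sits in the working directory.

```python
import os

from azure.storage.blob import BlobServiceClient

# Assumed: the storage account connection string is available as an environment variable.
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
service = BlobServiceClient.from_connection_string(conn_str)

# Create the container if it does not exist yet (mirrors the "+Container" portal step).
container = service.get_container_client("edu-az-storage")
if not container.exists():
    container.create_container()

# Upload the raw CSV file into the container (mirrors the "Upload" portal step).
with open("synthetic-pii-data.csv", "rb") as data:
    container.upload_blob(name="synthetic-pii-data.csv", data=data, overwrite=True)

print("Uploaded synthetic-pii-data.csv to container edu-az-storage")
```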
