You will learn to:
Understand Apache Airflow fundamentals for data pipeline orchestration.
Design and implement scalable data collection workflows with Airflow.
Integrate diverse data sources into Airflow pipelines.
Build end-to-end data pipelines with Airflow.
Store and organize data in a data lake using Airflow DAGs.
Schedule and manage daily data pipelines using Airflow.
Skills
Data Collection
Data Cleaning
Data Engineering
Data Pipeline Engineering
Task Automation
Prerequisites
Proficiency in the Python programming language
Understanding of ETL processes and data pipeline fundamentals
Familiarity with Airflow
Technologies
Python
Pandas
Apache Airflow
Project Description
Data collection from multiple sources requires automation, scheduling, and reliability to avoid manual errors and ensure consistent updates. Apache Airflow is the industry-standard platform for orchestrating ETL pipelines (extract, transform, load), enabling data teams to schedule workflows, manage dependencies, and monitor execution across cloud environments. Mastering Airflow is essential for data engineers building scalable data pipelines that integrate with databases, APIs, and data lakes.
In this project, we'll build an automated ETL pipeline using Python, Pandas, and Apache Airflow that collects data from multiple sources, stores it in a data lake structure, and organizes it into scheduled Airflow DAGs (Directed Acyclic Graphs). We'll handle two data collection patterns: snapshot data captured at specific points in time, and time-based data collected continuously. For each pattern, we'll implement the complete ETL process: extracting raw data, saving it to a raw folder, transforming and cleaning it with Pandas, and transferring refined data to production storage.
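The raw-to-refined flow described above can be sketched in plain Python and Pandas before it is wired into Airflow. The folder layout, file formats, and function names below are illustrative assumptions, not the project's actual structure:

```python
import pandas as pd
from pathlib import Path

# Hypothetical data lake layout: raw landing zone and refined production zone.
RAW_DIR = Path("data_lake/raw")
REFINED_DIR = Path("data_lake/refined")

def extract_snapshot(records: list[dict]) -> pd.DataFrame:
    """Extract: wrap source records in a DataFrame (stands in for an API call)."""
    return pd.DataFrame(records)

def save_raw(df: pd.DataFrame, name: str) -> Path:
    """Load step 1: persist the untouched extract to the raw folder."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    path = RAW_DIR / f"{name}.csv"
    df.to_csv(path, index=False)
    return path

def transform_to_refined(raw_path: Path, name: str) -> Path:
    """Transform and load step 2: clean the raw file and move it to refined."""
    df = pd.read_csv(raw_path)
    df = df.dropna().drop_duplicates()  # minimal cleaning for the sketch
    REFINED_DIR.mkdir(parents=True, exist_ok=True)
    out = REFINED_DIR / f"{name}.csv"
    df.to_csv(out, index=False)
    return out
```

Keeping the unmodified extract in a raw zone, separate from the cleaned refined zone, is what later lets a pipeline re-run transformations without re-collecting the data.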
We'll then automate these workflows by building Airflow DAGs with task dependencies, adding DAG parameters for flexibility, and implementing missing data detection and backfilling for stock data gaps. We'll optimize performance, configure Airflow variables for dynamic settings, and set up access control for DAG management. By the end, you'll have a production-ready data pipeline demonstrating Apache Airflow orchestration, ETL workflow automation, Pandas data transformation, DAG scheduling, and pipeline monitoring applicable to any data engineering or data warehousing project.
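The missing-data detection mentioned above can be sketched in pure Pandas: compare the dates actually recorded against the business days expected in that range, then re-run extraction for each gap. The function names and the weekday-only assumption (market holidays are ignored) are ours, not the project's:

```python
import pandas as pd

def find_missing_dates(df: pd.DataFrame, date_col: str = "date") -> list:
    """Return business days absent between the first and last recorded date."""
    recorded = pd.to_datetime(df[date_col]).dt.normalize()
    expected = pd.bdate_range(recorded.min(), recorded.max())
    return sorted(set(expected) - set(recorded))

def backfill(df: pd.DataFrame, missing: list, fetch) -> pd.DataFrame:
    """Re-run extraction (via `fetch`) for each gap and append the results."""
    frames = [df] + [fetch(day) for day in missing]
    out = pd.concat(frames, ignore_index=True)
    out["date"] = pd.to_datetime(out["date"])
    return out.sort_values("date", ignore_index=True)
```

In the Airflow version of this logic, `fetch` would be the same extraction task the DAG runs daily, invoked with a past execution date rather than today's.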
Project Tasks
1
Introduction
Task 0: Get Started
2
ETL of the Snapshot Data
Task 1: Collect the Snapshot Data
Task 2: Save the Data in the Raw Folder
Task 3: Transfer the Data to the Refined Folder
3
ETL of the Time-Based Data
Task 4: Collect the Time-Based Data
Task 5: Save the Data to the Raw Folder
Task 6: Transfer the Data to the Refined Folder
4
Leverage Your Solution with Airflow
Task 7: Sign in to Airflow
Task 8: Build the First DAG
Task 9: Add the Snapshot Data to the DAG
Task 10: Add Parameters to the DAG
Task 11: Optimize the Snapshot Data Collection
Task 12: Add the Time-Based Data to the DAG
Task 13: Identify the Missing Dates of the Stock
Task 14: Fill the Missing Stock Data
5
Advanced Configurations
Task 15: Add Variables
Task 16: Control Access to the DAGs
Congratulations!
Atabek BEKENOV
Senior Software Engineer
Pradip Pariyar
Senior Software Engineer
Renzo Scriber
Senior Software Engineer
Vasiliki Nikolaidi
Senior Software Engineer
Juan Carlos Valerio Arrieta
Senior Software Engineer
Relevant Courses
Use the following content to review prerequisites or explore specific concepts in detail.