If you work in Big Data, you’ve most likely heard of Apache Airflow. It started as an open-source project at Airbnb in 2014 to help the company handle its batch data pipelines. Since then, it has become one of the most popular open-source workflow management platforms within data engineering.
Apache Airflow is written in Python, which enables flexibility and robustness. Its powerful and well-equipped user interface simplifies workflow management tasks, like tracking jobs and configuring the platform. Since it relies on code to define its workflows, users can write what code they want to execute at every step of the process.
Today, we’re going to explore the basics of this popular tool, along with the fundamentals.
Apache Airflow is a robust scheduler for programmatically authoring, scheduling, and monitoring workflows. It’s designed to handle and orchestrate complex data pipelines. It was initially developed to tackle the problems that correspond with long-term cron tasks and substantial scripts, but it has grown to be one of the most powerful data pipeline platforms on the market.
We can describe Airflow as a platform for defining, executing, and monitoring workflows. We can define a workflow as any sequence of steps you take to achieve a specific goal. A common issue occurring in growing Big Data teams is the limited ability to stitch together related jobs in an end-to-end workflow. Before Airflow, there was Oozie, but it came with many limitations, but Airflow has exceeded it for complex workflows.
Airflow is also a code-first platform, designed with the idea that data pipelines are best expressed as code. It was built to be extensible, with available plugins that allow interaction with many common external systems, along with the platform to make your own platforms if you want. It has the capability to run thousands of different tasks per day, streamlining workflow management.
Airflow is used in many industries:
Listed below are some of the differences between Airflow and other workflow management platforms.
Airflow can be used for nearly all batch data pipelines, and there are many different documented use cases, the most common being Big Data related projects. Here are some examples of use cases listed in Airflow’s Github repository:
Now that we’ve discussed the basics of Airflow along with benefits and use cases, let’s dive into the fundamentals of this robust platform.
Workflows are defined using Directed Acyclic Graphs (DAGs), which are composed of tasks to be executed along with their connected dependencies. Each DAG represents a group of tasks you want to run, and they show relationships between tasks in Apache Airflow’s user interface. Let’s break down the acronym:
Note: A DAG defines how to execute the tasks, but doesn’t define what particular tasks do.
A DAG can be specified by instantiating an object of the
airflow.models.dag.DAG, as shown in the below example. The DAG will show in the UI of the web server as “Example1’’ and will run once.
dag = DAG('Example1', schedule_interval='@once', start_date=days_ago(1),)
When a DAG is executed, it’s called a DAG run. Let’s say you have a DAG scheduled to run every hour. Each instantiation of that DAG establishes a DAG run. There can be multiple DAG runs connected to a DAG running at the same time.
Tasks are instantiations of operators and they vary in complexity. You can picture them as units of work that are represented by nodes in a DAG. They portray the work that is done at each step of your workflow with the actual work that they portray being defined by operators.
While DAGs define the workflow, operators define the work. An operator is like a template or class for executing a particular task.
All operators originate from
BaseOperator. There are operators for many general tasks, such as:
These operators are used to specify actions to execute in Python, MySQL, email, or bash.
There are three main types of operators:
Hooks allow Airflow to interface with third-party systems. With hooks, you can connect to outside databases and APIs, such as MySQL, Hive, GCS, and more. They’re like building blocks for operators. No secure information is contained in hooks. It’s stored within Airflow’s encrypted metadata database.
Note: Apache Airflow has community-maintained packages that include the core
hooksfor services such as Google and Amazon. These can be directly installed in your Airflow environment.
Airflow exceeds at defining complex relationships between tasks. Let’s say we want to designate that task
t1 executes before task
t2. There are four different statements we could use to define this exact relationship:
t1 >> t2
t2 << t1
Refresh your Java skills without scrubbing through videos or documentation. Educative’s text-based learning paths are easy to skim and feature live coding environments - making learning quick and efficient.
There are four main components that make up this robust and scalable workflow scheduling platform:
Scheduler: The scheduler monitors all DAGs and their associated tasks. When dependencies for a task are met, the scheduler will initiate the task. It periodically checks active tasks to initiate.
Web server: The web server is Airflow’s user interface. It shows the status of jobs and allows the user to interact with the databases and read log files from remote file stores, like S3, Google Cloud Storage, Microsoft Azure blobs, etc.
Database: The state of the DAGs and their associated tasks are saved in the database to ensure the schedule remembers metadata information. Airflow uses SQLAlchemy and Object Relational Mapping (ORM) to connect to the metadata database. The scheduler examines all of the DAGs and stores pertinent information, like schedule intervals, statistics from each run, and task instances.
Executor: The executor decides how work gets done. There are different types of executors to use for different use cases.
Examples of executors:
SequentialExecutor: This executor can run a single task at any given time. It cannot run tasks in parallel. It’s helpful in testing or debugging situations.
LocalExecutor: This executor enables parallelism and hyperthreading. It’s great for running Airflow on a local machine or a single node.
CeleryExecutor: This executor is the favored way to run a distributed Airflow cluster.
KubernetesExecutor: This executor calls the Kubernetes API to make temporary pods for each of the task instances to run.
So, how does Airflow work?
Airflow examines all the DAGs in the background at a certain period. This period is set using the
processor_poll_interval config and is equal to one second. Once a DAG file is examined, DAG runs are made according to the scheduling parameters. Task instances are instantiated for tasks that need to be performed, and their status is set to
SCHEDULED in the metadata database.
The schedule queries the database, retrieves tasks in the
SCHEDULED state, and distributes them to the executors. Then, the state of the task changes to
QUEUED. Those queued tasks are drawn from the queue by workers who execute them. When this happens, the task status changes to
When a task finishes, the worker will mark it as failed or finished, and then the scheduler updates the final status in the metadata database.
Now that you know the basics of Apache Airflow, you’re ready to get started! A great way to learn this tool is to build something with it. After downloading Airflow, you can design your own project or contribute to an open-source project online.
Some interesting open-source projects:
There’s still so much more to learn about Airflow. Some recommended topics to cover next are:
To get started learning these topics, check out Educative’s course An Introduction to Apache Airflow. This curated, hands-on course covers the building blocks of Apache Airflow, along with more advanced aspects, like XCom, operators and sensors, and working with the UI. You’ll master this highly coveted workflow management platform and earn a valuable certificate by the end.
The skills you gain in this course will give you a leg up on your journey!
Join a community of 500,000 monthly readers. A free, bi-monthly email with a roundup of Educative's top articles and coding tips.