Architecture

This lesson details the architecture and components that make up Airflow.

Airflow primarily consists of the following entities:

  1. Scheduler
  2. Web server
  3. Database
  4. Executor

Here is a pictorial representation of the architecture:



Figure: Big picture Airflow architecture

These four pieces work together to make up a robust and scalable workflow scheduling platform. We’ll discuss them in detail below.

Scheduler

The scheduler is responsible for monitoring all DAGs and the tasks within them. When the dependencies of a task are met, the scheduler triggers that task. Under the hood, it periodically inspects active DAG runs for tasks whose dependencies have been satisfied and queues them for execution.
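
To make this concrete, here is a minimal DAG sketch the scheduler would pick up, assuming an Airflow 2.x installation with the file placed in the configured dags folder; the DAG id, task names, and daily schedule are arbitrary choices for illustration:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # The scheduler parses this file, creates a DAG run for each schedule
    # interval, and triggers tasks once their dependencies are met.
    with DAG(
        dag_id="example_hello",          # illustrative name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",      # one DAG run per day
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        load = BashOperator(task_id="load", bash_command="echo loading")

        extract >> load  # "load" is triggered only after "extract" succeeds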

Web server

The web server is Airflow’s UI. It displays the status of the jobs, lets the user interact with the metadata database, and reads task log files from a remote file store, such as S3, Google Cloud Storage, or Azure Blob Storage.
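
The same status information the UI renders is also exposed by the web server programmatically. As a rough sketch, assuming an Airflow 2.x webserver running on localhost:8080 with the stable REST API and basic authentication enabled (the credentials below are placeholders), you can list the registered DAGs and check whether they are paused:

    import requests

    # Ask the webserver's REST API for the registered DAGs; the host, port,
    # and credentials are assumptions for this sketch.
    response = requests.get(
        "http://localhost:8080/api/v1/dags",
        auth=("admin", "admin"),
    )
    response.raise_for_status()

    for dag in response.json()["dags"]:
        print(dag["dag_id"], "paused:", dag["is_paused"])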

Executor

The executor determines how the work actually gets done. Different kinds of executors can be plugged in for different behaviors and use cases. The SequentialExecutor is the default; it runs a single task at a time and is incapable of running tasks in parallel, which makes it useful for a test environment or for debugging deeper Airflow issues. Other executors include the LocalExecutor, CeleryExecutor, and KubernetesExecutor.

  • The LocalExecutor runs tasks in parallel as separate local processes and is a good fit for running Airflow on a local machine or a single node. The Airflow installation for this course uses the LocalExecutor (see the configuration sketch after this list).

  • The CeleryExecutor is the preferred way to run a distributed Airflow cluster. It requires a message broker, such as Redis or RabbitMQ, to coordinate tasks between workers.

  • The KubernetesExecutor calls the Kubernetes API to create a temporary pod for each task instance to run in. Users can pass in custom configurations for each of their tasks.
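
Which executor is used is configured in airflow.cfg (or overridden with the AIRFLOW__CORE__EXECUTOR environment variable). A minimal sketch of the relevant [core] entries for the LocalExecutor setup used in this course, with illustrative values:

    [core]
    # Which executor the scheduler hands tasks to: SequentialExecutor,
    # LocalExecutor, CeleryExecutor, or KubernetesExecutor.
    executor = LocalExecutor

    # Maximum number of task instances allowed to run concurrently across
    # the installation (32 is the documented default in recent versions).
    parallelism = 32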

Database

The state of the DAGs and their constituent tasks needs to be saved in a database so that the scheduler remembers metadata, such as the last run of a task, and the web server can retrieve it for the end user. Airflow uses SQLAlchemy, a Python Object Relational Mapping (ORM) library, to connect to the metadata database, so any database supported by SQLAlchemy can be used to store the Airflow metadata. Configurations, connections, user information, roles, policies, and even key-value pair variables are stored there. The scheduler parses all the DAGs and stores relevant metadata, such as schedule intervals, statistics from each run, and their task instances.
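
To make the ORM connection concrete, here is a rough sketch, assuming a working Airflow 2.x installation, that uses Airflow’s own SQLAlchemy session to read recent DAG runs back out of the metadata database:

    from airflow.models import DagRun
    from airflow.utils.session import create_session

    # Open a session against the metadata database configured in airflow.cfg
    # (sql_alchemy_conn); create_session commits and closes it automatically.
    with create_session() as session:
        # The dag_run table stores one row per DAG run, including its state.
        recent_runs = (
            session.query(DagRun)
            .order_by(DagRun.start_date.desc())
            .limit(5)
            .all()
        )
        for run in recent_runs:
            print(run.dag_id, run.run_id, run.state)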

How it works

  1. Airflow parses all the DAG files in the background at a regular interval. This interval is controlled by the processor_poll_interval config option, which defaults to one second.

  2. Once a DAG file is parsed, DAG runs are created based on its scheduling parameters. Task instances are instantiated for tasks that need to be executed, and their status is set to SCHEDULED in the metadata database. Because parsing happens repeatedly, any top-level code, i.e., code written in the global scope of a DAG file, runs on every parse. This slows down the scheduler’s DAG parsing and increases memory and CPU usage, so caution is recommended when writing code in the global scope.

  3. The scheduler queries the database, retrieves the tasks in the SCHEDULED state, and distributes them to the executor. The state of these tasks is changed to QUEUED.

  4. The QUEUED tasks are drained from the queue by the workers and executed. The task status is changed to RUNNING.

  5. When a task finishes, the worker running it marks it as either failed or finished, and the scheduler updates the final status in the metadata database (the sketch after this list shows how to inspect these states).
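
To observe these state transitions, here is a minimal sketch, again assuming an Airflow 2.x installation, that counts the task instances recorded in the metadata database grouped by state:

    from sqlalchemy import func

    from airflow.models import TaskInstance
    from airflow.utils.session import create_session

    # Count task instances per state (scheduled, queued, running, success,
    # failed, ...) to see the lifecycle described above.
    with create_session() as session:
        rows = (
            session.query(TaskInstance.state, func.count(TaskInstance.task_id))
            .group_by(TaskInstance.state)
            .all()
        )
        for state, count in rows:
            print(state, count)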

airflow.cfg

Airflow comes with lots of knobs and levers that can be tweaked to extract the desired performance from an Airflow cluster. For instance, the processor_poll_interval config value, discussed previously, controls how frequently Airflow parses all the DAGs in the background. Another config value, scheduler_heartbeat_sec, controls how frequently the scheduler looks for new tasks to trigger; the lower the value, the more frequently it checks, and the more pressure it places on the metadata database. Yet another config, job_heartbeat_sec, determines how frequently task instances listen for external kill signals, e.g., when using the CLI or the UI to clear a task. A detailed configuration reference is available in the official Airflow documentation.
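
As a sketch, the options above live in the [scheduler] section of airflow.cfg; the values shown are the documented defaults in recent releases, but option names and defaults can change between versions, so verify them against the configuration reference for your installation:

    [scheduler]
    # How long the scheduler sleeps between processing loops while parsing
    # DAGs in the background.
    processor_poll_interval = 1

    # How often the scheduler looks for new tasks to trigger; lower values
    # mean faster triggering but more load on the metadata database.
    scheduler_heartbeat_sec = 5

    # How often running task instances check for external kill signals,
    # e.g., a task cleared from the UI or the CLI.
    job_heartbeat_sec = 5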