Building Blocks

This lesson explains the fundamentals of Airflow, including DAGs, DAG runs, operators, tasks, and task instances.

[Figure: Relationship between DAGs, DAG runs, tasks, and task instances]

Directed Acyclic Graph (DAG)

Workflows are defined using DAGs, short for Directed Acyclic Graphs, which are composed of the tasks to be executed and their associated dependencies. For instance, the relationships among three tasks, A, B, and C, can be expressed using a DAG. We can say, “Execute B only after A has executed, but C can be executed independently at any time.” We can also specify other constraints, such as a timeout for a task, the number of retries to perform for a failing task, and when to start a task. Note that a DAG defines how to execute the tasks (their constraints and dependencies), but it doesn’t say what a particular task will do. A DAG can be specified by instantiating an object of the airflow.models.dag.DAG class as follows:

from airflow.models import DAG
from airflow.utils.dates import days_ago

dag = DAG('Example1',
          schedule_interval='@once',
          start_date=days_ago(1))

The DAG above will appear in the web server UI as “Example1” and will run once.

Operator

The DAG defines the workflow, and the operators define the work. An operator is a template or class for performing a specific task. Once instantiated in Python code, an operator becomes a task. All operators derive from BaseOperator and acquire much of their functionality through inheritance. Operators exist for many common tasks, such as:

  • BashOperator
  • EmailOperator
  • PythonOperator
  • MySqlOperator

As the names imply, these operators can be used to define actions performed in Bash, email, Python, or MySQL.
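
For illustration, here is a minimal sketch that instantiates two of these operators as tasks, assuming Airflow 2.x import paths and reusing the dag object created earlier (the task IDs and the greet function are hypothetical):

from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# A task that runs a shell command.
print_date = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)

def greet():
    print('Hello from Airflow!')

# A task that calls a plain Python function.
greet_task = PythonOperator(
    task_id='greet',
    python_callable=greet,
    dag=dag,
)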

There are three main types of operators, as described here:

  1. Operators that perform an action or request another system to perform an action.

  2. Operators that transfer data from one system to another.

  3. Operators that run until a certain condition or criteria are met, e.g., a particular file lands in HDFS or S3, a Hive partition gets created, or a specific time of the day is reached. These special kinds of operators are also known as sensors and can allow part of your DAG to wait on some external system. All sensor operators derive from the BaseSensorOperator class.
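
As a concrete example, a filesystem sensor can hold up downstream tasks until a file appears. Below is a minimal sketch, assuming Airflow 2.x; the file path and task ID are hypothetical placeholders:

from airflow.sensors.filesystem import FileSensor

# Wait until the input file exists, checking every 60 seconds
# and giving up after one hour.
wait_for_file = FileSensor(
    task_id='wait_for_input_file',
    filepath='/data/input/2020-02-19.csv',
    poke_interval=60,
    timeout=60 * 60,
    dag=dag,
)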

Task

A task is an instantiation of an operator. It can be thought of as a unit of work and is represented as a node in a DAG. A task can be as trivial as executing a bash date command or as complex as running a remote job on a Hadoop cluster.

Task instance

A task instance represents an actual run of a task. Task instances belong to DAG runs, have an associated execution_date, and are instantiable, runnable entities. Task instances go through various states, such as “running,” “success,” “failed,” “skipped,” “retry,” etc. Each task instance (and task) has a life cycle through which it moves from one state to another.

DAG run

A single execution of a DAG is called a DAG run. If you have a DAG that is scheduled to run every hour, then each hourly instantiation of the DAG constitutes a DAG run. There can be multiple DAG runs of the same DAG executing at the same time.
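
To make these relationships concrete, the following sketch lists the recorded runs of the Example1 DAG and the task instances belonging to each run. It assumes an initialized Airflow metadata database and the Airflow 2.x model classes:

from airflow.models import DagRun

# List every recorded run of Example1 and the state of each of its
# task instances (requires an initialized Airflow metadata database).
for dag_run in DagRun.find(dag_id='Example1'):
    print(dag_run.run_id, dag_run.state)
    for ti in dag_run.get_task_instances():
        print('  ', ti.task_id, ti.state)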

Execution date

Say you create a DAG and want to run it immediately. The physical date of the DAG run is now, but the DAG run is also associated with a logical date of execution, which can, confusingly, be in the past. Airflow was developed as a solution for ETL (Extract, Transform, Load) needs. In the ETL world, one typically summarizes data. So, if you want to summarize data for 2020-02-19, you would do it at midnight GMT on 2020-02-20, right after all data for 2020-02-19 becomes available. The DAG run performing the ETL would have an execution date of 2020-02-19 even though the actual run happens on 2020-02-20. We’ll examine this concept further when working with examples.
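
The execution date is available inside tasks through Airflow’s built-in templating: the macro {{ ds }} renders to the logical execution date in YYYY-MM-DD form. A minimal sketch, reusing the dag object from earlier (the task ID is hypothetical):

from airflow.operators.bash import BashOperator

# "{{ ds }}" is rendered at run time to the logical execution date,
# e.g., 2020-02-19, even if the run physically starts on 2020-02-20.
summarize = BashOperator(
    task_id='summarize',
    bash_command='echo "Summarizing data for {{ ds }}"',
    dag=dag,
)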

Relationships

Airflow excels at defining complex relationships between different tasks. For instance, we can specify that task t1 occurs before task t2 as follows:

t2.set_upstream(t1)

Or equivalently,

t1.set_downstream(t2)

The same relationship can also be expressed with the bitshift syntax:

t1 >> t2

Or equivalently,

t2 << t1

All four statements above define the same relationship between tasks t1 and t2, i.e., t2 executes only after t1 has executed.
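
The bitshift syntax also accepts lists of tasks, which makes fan-out and fan-in patterns compact. Here is a self-contained sketch; the DAG name and task IDs are hypothetical:

from airflow.models import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

dag = DAG('fanout_example',
          schedule_interval='@once',
          start_date=days_ago(1))

# Four trivial tasks used only to illustrate dependency wiring.
t1, t2, t3, t4 = [
    BashOperator(task_id=f't{i}', bash_command='date', dag=dag)
    for i in range(1, 5)
]

# t1 runs first; t2 and t3 can then run in parallel; t4 runs last.
t1 >> [t2, t3]
[t2, t3] >> t4

Here, t2 and t3 become runnable as soon as t1 succeeds, and t4 waits for both to finish.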