SLAs

This lesson explores how service level agreements can be enforced for DAGs.

Service level agreements

Usually, software services adhere to contractual obligations, called service level agreements, or SLAs for short. An SLA captures the quality of service a consumer can expect. For instance, Amazon Web Services has an SLA for its cloud storage which states that if the service has an uptime of less than 95% for a given month, the customer will not be charged.

The concept of SLAs also exists for Airflow tasks. In the Airflow context, the SLA is specified using the sla parameter when instantiating a task, and its value is a timedelta. For instance, if we want an email sent when a task isn't finished within an hour of the start of the DAG run, we can pass sla=timedelta(hours=1) as an argument to the task. So, for a DAG that is set to run @daily, an email would be generated at 1:00 a.m. on Sept. 15th, 2020 (start date) if the task for the DAG run of Sept. 14th, 2020 (execution date) is still in progress past 1:00 a.m. Appropriate email configurations have to be in place for emails to be sent, which we don't cover here.

Note that an SLA specified at the task level is measured from the beginning of the DAG run execution, not from the start of the task's own execution. Also, if the SLA is violated while the task is running, the email is sent only after the task has completed execution.
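As an illustration, here is a minimal sketch of a daily DAG with a task-level SLA of one hour. It assumes an Airflow 2.x installation; the DAG id, task id, and alert address are hypothetical placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# A daily DAG whose single task must finish within one hour of the
# start of the DAG run; otherwise an SLA-miss alert is generated.
with DAG(
    dag_id="sla_example",              # hypothetical DAG id
    start_date=datetime(2020, 9, 14),
    schedule_interval="@daily",
) as dag:

    report = BashOperator(
        task_id="generate_report",     # hypothetical task id
        bash_command="sleep 30",
        sla=timedelta(hours=1),        # measured from the DAG run's start, not the task's start
        email=["alerts@example.com"],  # requires a working email/SMTP configuration
    )
```

With this setup, the run for execution date Sept. 14th starts at midnight on Sept. 15th, so an SLA miss would be recorded if generate_report is still running past 1:00 a.m. on Sept. 15th.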

If we want to set an SLA across the entire DAG, we set the sla key in the default_args dictionary that is passed to the DAG constructor. At the end of each task, a check is made to test whether the completed task's end time exceeded the SLA or whether the start time of the next task exceeded it. If so, an email alert is fired.
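A sketch of a DAG-wide SLA set through default_args might look like the following; again, the DAG id, task ids, and email address are hypothetical placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# The sla in default_args applies to every task in the DAG.
default_args = {
    "sla": timedelta(hours=1),
    "email": ["alerts@example.com"],   # needs a working email configuration
}

with DAG(
    dag_id="dag_wide_sla_example",     # hypothetical DAG id
    start_date=datetime(2020, 9, 14),
    schedule_interval="@daily",
    default_args=default_args,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="sleep 10")
    load = BashOperator(task_id="load", bash_command="sleep 10")
    extract >> load
```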

Bear in mind that SLAs aren’t tracked for manually triggered DAGs in the latest version of Airflow, though that may change in the future, as there’s an open work item for this missing feature. For now, SLAs are only tracked for scheduled runs, i.e., those whose execution_date falls after the point at which the DAG is activated.

Within the web server UI, we can list all the SLA misses under Browse -> SLA Misses.

Consider the Example10 DAG shown below. It consists of a single Bash task that always generates an SLA miss: the task sleeps for a little over a minute, while the SLA is set to one second. The DAG is scheduled to run every two minutes, and we pass another parameter, catchup=False, to let Airflow know not to create DAG runs for the past when the DAG is unpaused.
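The original code widget isn't reproduced here, so the following is a minimal sketch of what Example10 might look like based on the description above, assuming Airflow 2.x and the BashOperator; the task id is hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Runs every two minutes; catchup=False prevents Airflow from
# backfilling past runs when the DAG is unpaused.
with DAG(
    dag_id="Example10",
    start_date=datetime(2020, 9, 14),
    schedule_interval=timedelta(minutes=2),
    catchup=False,
) as dag:

    # Sleeps for a little over a minute while the SLA is one second,
    # so every scheduled run records an SLA miss.
    sleep_task = BashOperator(
        task_id="sleep_task",          # hypothetical task id
        bash_command="sleep 65",
        sla=timedelta(seconds=1),
    )
```

After unpausing the DAG and letting a few runs complete, the resulting misses appear under Browse -> SLA Misses in the web server UI.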
