Scheduling

This lesson clarifies how schedule_interval and start_date work, which can be confusing, especially with complex crontab expressions.

When first working with Airflow, it is common to be confused by how the scheduling parameters work together. In this lesson, we’ll explore the differences between the various parameters.

When creating a DAG, we can specify the start_date and a schedule_interval as parameters to the constructor. Let’s see an example:

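from datetime import datetime
from airflow import DAG

default_args = {'owner': 'airflow'}  # placeholder task defaults (assumed)

dag = DAG(
    'Example9',
    default_args=default_args,
    description='Example DAG 9',
    schedule_interval='@daily',
    start_date=datetime(2020, 9, 5))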

In the Example9 DAG, we set start_date to 5th Sept 2020 and schedule_interval to @daily. Note that @daily is an alias for the 0 0 * * * crontab expression. There are other aliases for commonly used schedules, such as @weekly, @monthly, and @yearly, which all translate to crontab expressions under the hood. For more complex schedules, you can pass a crontab expression directly as the schedule_interval parameter. A good resource for working with crontab expressions is crontab.guru. Remember that Airflow works in UTC by default but can be configured to work with your local time zone too.

Airflow will also schedule DAG runs for the previous days even though we are running the DAG now. Each DAG run is associated with an execution date and a start date. The execution date is the date for which the DAG run was scheduled, and the start date is when Airflow actually runs it. Don’t confuse the start_date that we pass into the DAG constructor with the start date associated with a DAG run; the two are distinct.
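As a small illustration of how the schedule aliases relate to crontab expressions, the sketch below maps each alias to its cron equivalent. The names CRON_PRESETS and to_cron are hypothetical helpers written for this lesson, not part of Airflow’s API; Airflow performs a similar translation internally.

```python
# Cron equivalents of common schedule aliases.
# CRON_PRESETS and to_cron are illustrative names, not Airflow functions.
CRON_PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def to_cron(schedule: str) -> str:
    """Return the crontab expression for an alias, or the input unchanged."""
    return CRON_PRESETS.get(schedule, schedule)


print(to_cron("@daily"))        # 0 0 * * *
print(to_cron("30 6 * * 1-5"))  # 30 6 * * 1-5 (already a crontab expression)
```

Anything that is not a known alias is passed through untouched, which mirrors the idea that schedule_interval accepts either form.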

The combination of start_date and schedule_interval implies that the DAG Example9 should have run every day from 5th Sept 2020 up until now. Let’s further assume that now is 9th Sept 2020, i.e., you are running the DAG on the 9th of Sept 2020. Airflow is smart enough to create the missing DAG runs for the 5th, 6th, 7th, and 8th of Sept for you too. All four runs will have a start date of Sept. 9th (since all of them were kicked off on Sept. 9th), each with a different timestamp on the 9th. The execution dates for these DAG runs, however, will span from Sept. 5th to Sept. 8th.
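The catch-up behavior described above can be reproduced with plain date arithmetic. This is a minimal sketch under the assumption of a fixed daily interval; catchup_execution_dates is a hypothetical helper, and the time of day chosen for "now" is made up for illustration.

```python
from datetime import datetime, timedelta


def catchup_execution_dates(start_date, now, interval=timedelta(days=1)):
    """List execution dates whose full interval has elapsed by `now`."""
    dates = []
    execution_date = start_date
    # A run is due only after its interval has completely passed.
    while execution_date + interval <= now:
        dates.append(execution_date)
        execution_date += interval
    return dates


start_date = datetime(2020, 9, 5)
now = datetime(2020, 9, 9, 10, 30)  # hypothetical moment on Sept. 9th

runs = catchup_execution_dates(start_date, now)
for execution_date in runs:
    print(execution_date.date())
```

The loop yields Sept. 5th through Sept. 8th; the run for Sept. 9th is not included because its interval has not finished yet.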

As you can see from the screenshot above, Airflow created DAG runs for each of the prior days. The execution date for each DAG run is distinct, but the start dates for all of the runs are the 9th of Sept. Only the hours, minutes, etc., differ because the runs were kicked off at different times on the 9th of Sept.

The astute reader will observe that the run for the 9th of Sept. is missing from the list in the screenshot above. This is because Airflow has its roots in ETL, which involves running batch jobs at the end of the day for that day. For instance, the data for the 4th of July is collected over the entire day, and the ETL job for the 4th of July actually runs at 12 a.m. on the 5th of July. This makes sense because the job for a day should wait until all the data for that day is available before running. In our example, the DAG run for the 9th of Sept. isn’t executed until 12 a.m. on the 10th of Sept., which is why it doesn’t show up in the listing in the screenshot.

Along the same lines, if the schedule for a DAG is 3 p.m. every day, then the DAG run for a given day starts just after 3 p.m. on the following day (and not at 12 a.m.). There is a full 24-hour delay before the DAG run for the previous day executes.
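The rule that a run waits for its whole interval to pass can be written as one line of arithmetic. The sketch below assumes a daily schedule; earliest_start is a hypothetical helper used only to illustrate the midnight and 3 p.m. cases discussed above.

```python
from datetime import datetime, timedelta


def earliest_start(execution_date, interval=timedelta(days=1)):
    """A run is triggered only once its whole interval has passed."""
    return execution_date + interval


# Daily at midnight: the run for Sept. 9th starts at 12 a.m. on Sept. 10th.
midnight_run = earliest_start(datetime(2020, 9, 9, 0, 0))
# Daily at 3 p.m.: the run for Sept. 8th starts at 3 p.m. on Sept. 9th.
afternoon_run = earliest_start(datetime(2020, 9, 8, 15, 0))

print(midnight_run)   # 2020-09-10 00:00:00
print(afternoon_run)  # 2020-09-09 15:00:00
```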

It is interesting to consider what happens if you set the schedule_interval to run every Monday and Friday. Let’s say we set the start_date for the DAG to Sunday, August 15th, 2021.

Day | Date | DAG Execution Date | DAG Start Date
Sunday | Aug. 15th, 2021 | - | -
Monday | Aug. 16th, 2021 | - | -
Tuesday | Aug. 17th, 2021 | - | -
Wednesday | Aug. 18th, 2021 | - | -
Thursday | Aug. 19th, 2021 | - | -
Friday | Aug. 20th, 2021 | Aug. 16th, 2021 | Aug. 20th, 2021
Saturday | Aug. 21st, 2021 | - | -
Sunday | Aug. 22nd, 2021 | - | -
Monday | Aug. 23rd, 2021 | Aug. 20th, 2021 | Aug. 23rd, 2021
Tuesday | Aug. 24th, 2021 | - | -
Wednesday | Aug. 25th, 2021 | - | -
Thursday | Aug. 26th, 2021 | - | -
Friday | Aug. 27th, 2021 | Aug. 23rd, 2021 | Aug. 27th, 2021
Saturday | Aug. 28th, 2021 | - | -

The DAG’s start date is set to Aug. 15th, 2021 (not to be confused with a DAG run’s start date, as shown in the table above). The first DAG run is for Aug. 16th, 2021, but it starts on Friday, Aug. 20th, 2021! This is an idiosyncrasy of Airflow that can confuse even seasoned engineers. The DAG run for Aug. 16th, 2021 actually runs on Aug. 20th, 2021; its execution date is Aug. 16th, 2021, but its start date is Aug. 20th, 2021. Likewise, the DAG run with the execution date of Aug. 20th, 2021 has a start date of the following Monday, i.e., Aug. 23rd, 2021, and so on.
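The Monday/Friday pattern in the table can be derived by listing the schedule’s ticks and pairing each one with the next. This is a rough sketch, not Airflow’s scheduler logic: monday_friday_ticks is a hypothetical helper, the schedule is assumed to fire at midnight, and the sketch uses August 2021, in which the 15th falls on a Sunday as in the table.

```python
from datetime import datetime, timedelta

MONDAY, FRIDAY = 0, 4  # datetime.weekday() numbering


def monday_friday_ticks(start_date, count):
    """Return the first `count` Monday/Friday midnights on or after start_date."""
    ticks = []
    day = start_date
    while len(ticks) < count:
        if day.weekday() in (MONDAY, FRIDAY):
            ticks.append(day)
        day += timedelta(days=1)
    return ticks


ticks = monday_friday_ticks(datetime(2021, 8, 15), 4)

# Each tick is an execution date; its run starts at the following tick.
schedule = list(zip(ticks, ticks[1:]))
for execution_date, run_start in schedule:
    print(f"execution date {execution_date.date()} -> starts {run_start.date()}")
```

The pairs come out as Aug. 16th to Aug. 20th, Aug. 20th to Aug. 23rd, and Aug. 23rd to Aug. 27th: each run’s start date is the next scheduled tick after its execution date.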

Finally, you may notice that a DAG run doesn’t start at exactly the time it is scheduled for, e.g., a DAG run may start at 10:01 p.m. when its schedule asks it to run at 10:00 p.m. Usually the delay is a few seconds or so. A configuration parameter, scheduler_heartbeat_sec, defined in airflow.cfg, controls how often the Airflow scheduler runs. On each pass, the scheduler looks for tasks to trigger, so there may be a delay between when a task becomes due and when the scheduler is able to run it. Making the scheduler run at a higher frequency can put pressure on the database, so any tweaks should be made cautiously.
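To make the setting concrete, the sketch below parses an illustrative fragment of airflow.cfg with Python’s standard configparser module. The fragment and the value 5 are examples only; check the airflow.cfg of your own installation for the actual value.

```python
import configparser

# Illustrative fragment of airflow.cfg; the value 5 is only an example.
EXAMPLE_CFG = """
[scheduler]
scheduler_heartbeat_sec = 5
"""

# Interpolation is disabled so '%' characters in a config are read literally.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(EXAMPLE_CFG)

heartbeat = parser.getint("scheduler", "scheduler_heartbeat_sec")
print(f"scheduler heartbeat: every {heartbeat} seconds")
```

Lowering this number makes the scheduler check for due tasks more often, at the cost of more frequent database queries.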
