Airflow: why doesn't a newly submitted DAG run when its first upcoming scheduled time arrives?

Author: Shelton Ma

1. Why doesn't a newly submitted DAG run when its first upcoming scheduled time arrives?

Reference: Scheduling & Triggers

Let’s Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.

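The Airflow scheduler monitors all tasks and DAGs and triggers task instances once their conditions are met. After a DAG is enabled, the scheduler creates runs based on start_date and plans the next execution.

For example, take a job that runs every Tuesday at 08:01, with the cron expression 1 8 * * 2. If the DAG is enabled on 2023-05-17 (a Wednesday) with start_date=2023-05-17, the first schedule interval only begins on 2023-05-23, and the run for that interval executes when the interval ends, on 2023-05-30. For the first run to happen on 2023-05-23, the interval starting on 2023-05-16 must fall after start_date, so setting start_date=2023-05-15 or earlier is enough. To avoid working this out each time, a safe rule of thumb is a start_date at least one full schedule interval before the moment the DAG is enabled, roughly start_date = now - datetime.timedelta(<one schedule_interval>), written into the DAG as a fixed date rather than computed with datetime.now() on every parse.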
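
The date arithmetic in the weekly example can be checked with a small sketch. It uses only the standard library and hard-codes the "every Tuesday at 08:01" schedule instead of parsing a cron expression; the helper names `first_tick_at_or_after` and `first_run` are hypothetical and are not part of Airflow. It models only the rule quoted above (a run fires at the END of its interval), not the scheduler itself.

```python
from datetime import datetime, timedelta

WEEK = timedelta(days=7)  # one schedule interval for a weekly job


def first_tick_at_or_after(moment: datetime) -> datetime:
    """Return the first Tuesday 08:01 at or after `moment` (cron: 1 8 * * 2)."""
    # datetime.weekday(): Monday == 0, Tuesday == 1
    days_ahead = (1 - moment.weekday()) % 7
    tick = (moment + timedelta(days=days_ahead)).replace(
        hour=8, minute=1, second=0, microsecond=0
    )
    if tick < moment:  # already past 08:01 on a Tuesday
        tick += WEEK
    return tick


def first_run(start_date: datetime) -> tuple[datetime, datetime]:
    """Return (interval_start, actual_run_time) for the first run."""
    interval_start = first_tick_at_or_after(start_date)
    # The run is triggered when the interval ends, one interval later.
    return interval_start, interval_start + WEEK


print(first_run(datetime(2023, 5, 17))[1])  # 2023-05-30 08:01:00
print(first_run(datetime(2023, 5, 15))[1])  # 2023-05-23 08:01:00
```

Moving start_date back from 2023-05-17 to 2023-05-15 pulls the first interval start from 2023-05-23 to 2023-05-16, which is why the first run lands a week earlier.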

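2. After a DAG is enabled in Airflow, why does it backfill so many historical runs?

  1. With catchup=True, the documentation notes that a DAG which was turned off and then re-enabled will also execute the runs it missed in the meantime. This is usually not what we want and can sometimes be dangerous, so pay special attention to this setting and turn it off.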

    A key capability of Airflow is that these DAG Runs are atomic, idempotent items, and the scheduler, by default, will examine the lifetime of the DAG (from start to end/now, one interval at a time) and kick off a DAG Run for any interval that has not been run (or has been cleared). This concept is called Catchup. So, Catchup is also triggered when you turn off a DAG for a specified period and then re-enable it.

  2. Configuration

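        from airflow import DAG

        # DAG_ID, default_args, CRON, utc_logic_start_date and PROJECT
        # are assumed to be defined earlier in the DAG file.
        dag = DAG(
            dag_id=DAG_ID,
            default_args=default_args,
            schedule_interval=CRON,
            start_date=utc_logic_start_date,
            tags=[PROJECT],
            catchup=False,  # do not backfill intervals missed before enabling or while paused
        )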
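
To get a feel for how many runs catchup can generate, here is a rough standard-library sketch. The function `pending_runs` is hypothetical and greatly simplifies the scheduler: it assumes no run has executed yet, and it models catchup=False as "only the latest completed interval", which is how the Airflow documentation describes that setting.

```python
from datetime import datetime, timedelta


def pending_runs(first_interval_start: datetime, interval: timedelta,
                 now: datetime, catchup: bool) -> list[datetime]:
    """List interval starts whose interval has fully ended by `now`."""
    runs = []
    start = first_interval_start
    while start + interval <= now:  # a run only fires once its interval is over
        runs.append(start)
        start += interval
    if not catchup:
        runs = runs[-1:]  # keep only the most recent completed interval
    return runs


week = timedelta(days=7)
first = datetime(2023, 1, 3, 8, 1)  # a Tuesday, 08:01
now = datetime(2023, 5, 17, 12, 0)  # moment the DAG is enabled

print(len(pending_runs(first, week, now, catchup=True)))   # 19
print(pending_runs(first, week, now, catchup=False)[0])    # 2023-05-09 08:01:00
```

With an early start_date and catchup left on, enabling the DAG in mid-May would queue 19 weekly runs at once; with catchup off, only the latest one remains.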
    

Putting the two points together: once you have made sure catchup is disabled in Airflow, start_date can safely be set fairly early.