Airflow is a tool for task orchestration. Anyone with basic Linux experience is likely familiar with crontab, but it has limitations: it cannot express complex task dependencies, and reviewing logs is cumbersome. In such cases, a full-fledged ETL (Extract, Transform, Load) tool is needed. This note briefly documents my learning process; the results are shared in a GitHub repo.
Note that Airflow has reached version 2, while many tutorials online still cover version 1, so pay attention to the version when following along. Official documentation can be found here.
Features
- Open-source
- User-friendly UI
- Rich plugin ecosystem
- Purely Python-based
Getting Started
For a quick start, you can use the official Docker Compose setup: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html
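The quick start boils down to downloading the official compose file. A minimal sketch (the version segment in the URL, 2.7.1 here, is a placeholder; match it to the release you want to run):

```bash
# Download the official docker-compose.yaml; replace the version in the
# URL with the Airflow release you intend to run (2.7.1 is a placeholder).
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.7.1/docker-compose.yaml'
```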
Next, create the necessary volumes and initialize the Airflow environment:
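A sketch of these steps, following the official guide (the exact folders and steps may differ slightly between Airflow releases):

```bash
# Create the folders that docker-compose.yaml mounts into the containers.
mkdir -p ./dags ./logs ./plugins

# Record the host user id so files created inside the containers are owned by you.
echo -e "AIRFLOW_UID=$(id -u)" > .env

# Initialize the metadata database and create the default admin account.
docker compose up airflow-init

# Start all services (webserver, scheduler, workers, ...) in the background.
docker compose up -d
```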
DAG
DAG stands for Directed Acyclic Graph. In Airflow, a DAG defines a workflow: a set of tasks and the dependencies between them. Each task represents a unit of work and can be any operation Airflow can execute, such as running a Python script, executing an SQL query, or calling an external API.
Example
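A minimal sketch of an Airflow 2 DAG with two dependent tasks; the dag_id, schedule, and task contents are all placeholders (on Airflow versions before 2.4, the parameter is `schedule_interval` rather than `schedule`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def greet():
    print("Hello from Airflow!")


with DAG(
    dag_id="example_hello",           # shown in the UI; placeholder name
    start_date=datetime(2023, 1, 1),  # first logical date to schedule
    schedule="@daily",                # run once per day
    catchup=False,                    # do not backfill past runs
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pretend we are extracting data'",
    )
    transform = PythonOperator(
        task_id="transform",
        python_callable=greet,
    )

    # >> defines the dependency: extract runs before transform
    extract >> transform
```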
Scheduler
The scheduler checks the DAGs folder at regular intervals:
- Checks whether any DAG is due for a new DAG Run.
- Creates and schedules task instances for the tasks in that DAG Run.
To add a workflow, place its DAG Python file in the DAGs folder. You can copy and adapt an example from the official Airflow documentation, or use the example from the previous section.
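For example, assuming the Docker Compose setup above and the `example_hello` DAG from the previous section:

```bash
# Copy the DAG file into the folder mounted into the containers;
# the scheduler will pick it up on its next scan.
cp example_hello.py ./dags/

# Verify that Airflow has parsed the DAG (run against a running container).
docker compose exec airflow-scheduler airflow dags list
```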
UI Operations
Once the scheduler has picked up the change, the newly added DAG appears in the UI. A detailed UI walkthrough would require many screenshots, which do not fit this format, so here is the official UI documentation link instead.
Focus on mastering the basics:
- View DAG operation status
- Manually trigger DAG runs
- Review DAG execution logs
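Most of these also have CLI equivalents, which can be handy for scripting. A sketch, again assuming the Docker Compose setup and the `example_hello` DAG:

```bash
# Trigger a manual run of the DAG (same as the "play" button in the UI).
docker compose exec airflow-scheduler airflow dags trigger example_hello

# List recent runs of the DAG and their states.
docker compose exec airflow-scheduler airflow dags list-runs -d example_hello

# Task logs are also written to the mounted ./logs folder on the host.
```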