Apache Airflow: Python DAGs as the data-engineering default since 2014 (v3.x as of May 2026)

Apache Airflow is the standard platform for data pipelines defined as Python DAGs: Apache 2.0 licensed, self-hostable, or available as a managed service via Astronomer or AWS MWAA.

Researched & fact-checked by: DuneDive LLC · As of: 2026-05

What is Apache Airflow?

Apache Airflow is the de-facto standard platform for data pipelines and workflow orchestration in data engineering. It was started in 2014 at Airbnb by Maxime Beauchemin, donated to the Apache Software Foundation in 2016, and has been a top-level Apache project since 2019. As of May 2026, Airflow 3.x is in production use – a major architecture rework (task-execution isolation, new scheduler engine, better UI) whose first release, 3.0, shipped in April 2025.

The core concept is the DAG (Directed Acyclic Graph): a Python script defines tasks and their dependencies. On trigger (cron or event-based), the scheduler instantiates the DAG as a DAG run and executes its tasks in dependency order. Every task is an instance of an operator: Python operator (own code), Bash operator, SQL operators (Postgres, MySQL, Snowflake, BigQuery), Kubernetes pod operator (runs containers), or a sensor (waits for external events).

The architecture has several components: webserver (UI), scheduler (decides which tasks run), executor (runs tasks – several implementations: Local, Celery, Kubernetes), workers (with the Celery or Kubernetes executor), and a metadata database (Postgres or MySQL). A production installation has at least 3-4 server components.

Airflow is licensed under Apache 2.0 – fully free, no embedded restrictions. There are three managed cloud options: Astronomer (a major commercial contributor to Airflow, from USD 800/month for the smallest tier, enterprise with SLA), AWS Managed Workflows for Apache Airflow (MWAA, from USD 0.40/hour for the smallest environment plus compute), and Google Cloud Composer (managed Airflow on GCP, from USD 350/month). Self-hosting is the more usual choice for teams with DevOps experience.

For a Swiss fiduciary, Airflow is rarely directly applicable – the platform is optimised for data engineering, not for client workflows. Airflow makes sense for fiduciary platforms with data-warehouse wiring, ELT pipelines from bookkeeping systems, or daily report generation.

Why it matters

Airflow has the largest installed base in data engineering. Anyone building data pipelines finds the largest ecosystem of operators (1,000+ in the community), the most integrations with data warehouses (Snowflake, BigQuery, Redshift, Databricks), the most documented best practices, and the largest talent pool for hiring.

For SMEs with data requirements, three cases are realistic. First – ELT from business systems into the data warehouse. Bexio, Abacus, HubSpot, Stripe, Google Analytics – every source can be pulled via Airflow DAG daily (or hourly), transformed, and loaded into a data warehouse (Postgres, BigQuery). Anyone with a 20-50-DAG setup is well served by Airflow.

Second – complex ML pipelines. Preparing training data, training the model, validating, deploying, and triggering retraining can all be expressed in Airflow as a DAG. With the Kubernetes pod operator, heavy compute jobs run in dedicated containers, and the UI shows success or failure per step.

Third – daily report generation. PDF reports from the data warehouse, mail dispatch to clients, Slack updates to internal teams. Every step is a task in the DAG, failure notifications run via the integrated alerting system (Slack, mail, PagerDuty).

The maturity of the platform is both its strength and its weakness. More than ten years of production use mean many documented patterns, Stack Overflow answers, and community plugins. But also: many legacy patterns (e.g. XCom is only suited to small payloads when passing data between tasks), an architecture that gets unwieldy without DevOps know-how, and a UI not as modern as younger tools (Dagster, Prefect).

For a Swiss fiduciary we see Airflow most realistically as a backend component in a data platform. For client workflows directly, n8n, Windmill, or Activepieces are more practical.

How it works

An Airflow DAG is a Python file. The DAG object is created with a schedule (cron expression or preset like @daily), tasks are instantiated from operators, and dependencies are defined via bitshift operators (a >> b >> c). On trigger the scheduler creates a DAG run; tasks are executed in the defined order, in parallel where possible and sequentially where dependencies exist.
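The bitshift syntax is ordinary Python operator overloading. The following is a toy model of the idea, not Airflow's own implementation: the Task class and the run_waves helper are hypothetical names used only for illustration. It shows how a >> b can record a dependency and how a dependency graph yields "waves" of tasks that could run in parallel.

```python
from graphlib import TopologicalSorter


class Task:
    """Toy stand-in for an Airflow operator: a name plus upstream links."""

    def __init__(self, name):
        self.name = name
        self.upstream = set()

    def __rshift__(self, other):
        # a >> b or a >> [b, c]: every task on the right depends on self
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            target.upstream.add(self)
        return other

    def __rrshift__(self, others):
        # [a, b] >> c: a list has no __rshift__, so Python calls c.__rrshift__
        for task in others:
            self.upstream.add(task)
        return self


def run_waves(tasks):
    """Group tasks into waves; tasks within one wave do not depend on each other."""
    graph = {t.name: {u.name for u in t.upstream} for t in tasks}
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


a, b, c, d = (Task(name) for name in "abcd")
[a, b] >> c >> d  # a and b are independent; c waits for both; d waits for c
waves = run_waves([a, b, c, d])
print(waves)
```

In this model a and b land in the first wave because neither depends on the other, which mirrors the "parallel where possible, sequential where dependencies exist" behaviour described above.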

Key operator types: PythonOperator (own Python code), BashOperator (shell command), SQL operators (Postgres, Snowflake, BigQuery – very mature), KubernetesPodOperator (container in a K8s cluster), HttpOperator (REST calls), Sensor (wait for external event, e.g. "file appears in S3"). Since Airflow 2.0 there is the TaskFlow API: a @task decorator turns any Python function into a task, and data flows via regular Python return values instead of explicit XCom push/pull calls (XCom still carries the values underneath).
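A minimal TaskFlow sketch, under a few assumptions: the import path airflow.decorators is the Airflow 2.x one (Airflow 3.x also exposes dag and task under airflow.sdk), the DAG name taskflow_demo and the helper to_cents are hypothetical, and the payload is a placeholder. The import is guarded so that the plain helper can still be run and unit-tested on a machine without Airflow.

```python
from datetime import datetime

try:
    # Airflow 2.x import path; Airflow 3.x also offers these names in airflow.sdk
    from airflow.decorators import dag, task
except ImportError:
    dag = task = None  # Airflow not installed: only the plain helper is usable


def to_cents(amounts):
    """Plain helper holding the logic, kept outside the DAG so it is testable."""
    return [round(float(amount) * 100) for amount in amounts]


if dag is not None:

    @dag(schedule="@daily", start_date=datetime(2026, 1, 1), catchup=False)
    def taskflow_demo():
        @task
        def extract():
            # Hypothetical payload; a real task would call an API here
            return ["120.50", "80"]

        @task
        def transform(amounts):
            return to_cents(amounts)

        # Calling the functions wires extract >> transform and hands over
        # the return value (transported via XCom behind the scenes)
        transform(extract())

    taskflow_demo()
```

Keeping the business logic in a plain function is a design choice rather than a requirement: it lets the transformation be tested without a scheduler or metadata database.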

A typical ELT DAG for a fiduciary: schedule @daily, task 1 "extract_bexio" (PythonOperator, calls Bexio API, writes JSON to S3), task 2 "extract_stripe" (PythonOperator, parallel to task 1), task 3 "transform" (PythonOperator, reads S3, normalises with pandas, writes back), task 4 "load_warehouse" (PostgresOperator, or the generic SQLExecuteQueryOperator in newer provider versions, INSERT INTO ...), task 5 "generate_report" (PythonOperator, builds PDF), task 6 "send_report" (EmailOperator, sends PDF to a client distribution list). Dependencies: [1, 2] >> 3 >> 4 >> 5 >> 6. If task 3 fails, tasks 4-6 are not executed and a Slack alert fires via on_failure_callback.
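The failure behaviour of this six-task example can be modelled in a few lines of plain Python. This is a simplified simulation, not Airflow code: simulate_run and the UPSTREAM mapping are hypothetical, and the callback here receives only the task name, whereas Airflow passes a context dictionary. The state name upstream_failed is the one Airflow shows for tasks that were skipped because an upstream task failed.

```python
from graphlib import TopologicalSorter

# Upstream tasks per task, mirroring the six-task example above
UPSTREAM = {
    "extract_bexio": [],
    "extract_stripe": [],
    "transform": ["extract_bexio", "extract_stripe"],
    "load_warehouse": ["transform"],
    "generate_report": ["load_warehouse"],
    "send_report": ["generate_report"],
}


def simulate_run(upstream, failing, on_failure):
    """Walk tasks in dependency order; tasks below a failure are not run."""
    states = {}
    for name in TopologicalSorter(upstream).static_order():
        if any(states[parent] != "success" for parent in upstream[name]):
            states[name] = "upstream_failed"  # never started
        elif name in failing:
            states[name] = "failed"
            on_failure(name)  # stands in for on_failure_callback
        else:
            states[name] = "success"
    return states


alerts = []
states = simulate_run(UPSTREAM, failing={"transform"}, on_failure=alerts.append)
for task_name, state in states.items():
    print(f"{task_name}: {state}")
```

Only the task that actually failed triggers the callback; the three downstream tasks are marked as not run, so a single alert is sent per root cause.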

The platform architecture is distributed: the scheduler continuously decides which tasks may run, and the executor distributes them to workers. In small setups everything runs on one server (LocalExecutor). In larger setups (>50 DAGs, >1,000 task runs/day) the CeleryExecutor or KubernetesExecutor is needed: with Celery, multiple worker containers pull tasks from a Redis queue; with Kubernetes, each task is launched in its own pod.

Versioning is a weak point. DAGs are Python files in the filesystem, and the scheduler rescans the filesystem at regular intervals. Changes take effect "live", without migration logic. Airflow 3.x adds DAG versioning, so a run can be traced to the DAG version it used, but deployment discipline remains the team's job. Best practice: version DAGs in Git and deploy via CI/CD into the Airflow filesystem. The younger competition (Dagster, Prefect) has more modern patterns here (asset-oriented model, native versioning).
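Because changes go live as soon as the file lands in the DAG folder, a cheap guard in the CI/CD step is worth considering. The sketch below is one possible pre-deploy check using only the standard library; find_broken_dag_files is a hypothetical helper, and it only catches syntax errors, not failed imports or wrong operator arguments, which would need Airflow itself in the CI environment.

```python
import ast
from pathlib import Path


def find_broken_dag_files(dags_dir):
    """Return {file name: error} for Python files in dags_dir that do not parse."""
    broken = {}
    for path in sorted(Path(dags_dir).rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            broken[path.name] = f"line {exc.lineno}: {exc.msg}"
    return broken
```

A CI job could call find_broken_dag_files("dags") and abort the deployment when the returned dictionary is not empty, so a file with a typo never reaches the scheduler.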

Monitoring runs via the Airflow UI plus Prometheus metrics plus Slack/mail/PagerDuty alerts. The UI shows DAG runs in grid and graph view (the grid view replaced the older tree view) – very well thought out for debugging, one of the historical strengths of Airflow.

Airflow self-hosted in 5 steps

01 Prepare the Docker-Compose stack: Airflow webserver, scheduler, worker (Celery executor), Postgres as metadata DB, Redis as queue, dags volume.
02 Configure the reverse proxy: Nginx or Caddy with TLS, HTTP basic auth or OAuth for the webserver, rate limit on /api/*.
03 Write the first DAGs: Python file in the dags/ volume, TaskFlow API with @dag and @task decorators, test with airflow dags test.
04 Set connections and variables: API keys, DB connections, cloud credentials in the Airflow connection store (encrypted), not in DAG code.
05 Set up monitoring: Prometheus scraper on the statsd exporter, Slack/mail alerts on DAG failures via on_failure_callback, Grafana dashboard.
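For step 05, a failure callback might look like the following sketch. It assumes a Slack incoming webhook that accepts a JSON body with a "text" field; the function names and the SLACK_WEBHOOK_URL environment variable are hypothetical, and in a real deployment the URL would come from the Airflow connection store as step 04 recommends. The context keys task_instance and exception are the ones Airflow passes to failure callbacks.

```python
import json
import os
import urllib.request
from types import SimpleNamespace


def build_failure_message(context):
    """Turn the context dict Airflow passes to callbacks into a short alert."""
    ti = context["task_instance"]
    error = context.get("exception", "unknown error")
    return f"Airflow task failed: {ti.dag_id}.{ti.task_id} ({error})"


def notify_slack(context):
    """Candidate on_failure_callback: post the alert to a Slack webhook."""
    url = os.environ.get("SLACK_WEBHOOK_URL")  # hypothetical variable name
    if not url:
        return  # no webhook configured, nothing to send
    body = json.dumps({"text": build_failure_message(context)}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10)


# Dry run without Airflow: a stand-in for the real context dictionary
fake_context = {
    "task_instance": SimpleNamespace(dag_id="fiduciary_elt", task_id="transform"),
    "exception": ValueError("bad row"),
}
message = build_failure_message(fake_context)
print(message)
```

The callback can then be attached to every task of a DAG with default_args={"on_failure_callback": notify_slack}. Splitting message building from sending keeps the formatting testable without network access.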

When to use Airflow

Airflow is the right choice when (a) the main problem is data pipelines (ELT, reporting, ML training), (b) Python is the default language of the data team, (c) integration with data warehouses (Snowflake, BigQuery, Redshift) is needed, and (d) DevOps know-how for the setup is available.

Concrete cases: daily ELT from business systems into the data warehouse, ML training pipelines with GPU containers, report generation (PDF, Excel, mail dispatch), data-quality checks with Slack alerts, backup orchestration with multi-step logic, complex cron jobs with dependencies.

For fiduciary platforms with data-warehouse wiring, Airflow is a valid pick. Anyone serving hundreds of clients and producing daily data consolidation, liquidity monitoring, or client reporting benefits from the platform maturity.

For engineering teams with an existing Python stack, Airflow is often a faster entry into data pipelines than the younger alternatives (Dagster, Prefect) – community resources are larger, hiring is easier (more Airflow engineers in the market), and integrations are more mature.

When not to use

Airflow is the wrong choice for marketing or sales workflows. Classify incoming mail, enter into CRM, notify Slack – that is not an Airflow problem. n8n, Make, Zapier, or Activepieces are more productive there. Airflow is optimised for data pipelines with a clear DAG structure and scheduled execution.

Unsuited for real-time processing. The scheduler has a tick interval (typically 5 seconds), and DAG runs are built for scheduled or batch-style triggers, not for low-latency handling of individual events. Anyone needing to process events in real time (incoming webhooks, streaming data) uses Kafka, Apache Flink, or dedicated streaming tools.

Does not fit small setups without DevOps know-how. A production installation needs at least 3-4 server components, Postgres, reverse proxy, and monitoring. Anyone needing only 5-10 DAGs without a DevOps team is faster productive with n8n, cron jobs, or Windmill.

For interactive workflows with user approvals, Airflow is unsuited. The "sensor" concept can wait for external events, but the UI for user approvals is missing. Temporal or Windmill are the right tools there.

Airflow offers no language variety: workflows are written in Python only, not in TypeScript, Go, or other languages. Anyone with a multi-language team is more flexible with Temporal (7 SDKs) or Windmill (TS/Python/Go/Bash).

Trade-offs

STRENGTHS

De-facto standard in data engineering, largest community and hiring pool
1,000+ community operators, very broad integration with data warehouses and cloud services
Apache 2.0 without embedded restrictions, fully OSS
UI with grid and graph view very strong for debugging and operations

WEAKNESSES

Setup effort (at least 3-4 server components), DevOps know-how mandatory
Python-only, no language variety like Temporal or Windmill
No visual, no-code SaaS connectors – integrations are wired in Python code (provider packages or SDKs)
Versioning and live reload are legacy patterns, younger alternatives are more modern

FAQ

How is Airflow different from n8n?

Airflow is optimised for data pipelines and ELT (Python DAGs, SQL operators, schedule-driven execution). n8n is optimised for business workflows (visual editor, app connectors, event triggers). Airflow has no visual, click-to-configure connectors for SaaS apps like HubSpot or Slack – integrations run as Python code, using provider packages or SDK imports. Both run self-hosted and both have their source code available, but they solve different problems.

What does Airflow cost in production?

Self-hosted is free as software (Apache 2.0). Running cost: servers (CHF 80-300/month on Hetzner for webserver+scheduler+worker+Postgres+Redis), depending on volume. Astronomer (managed) from USD 800/month. AWS MWAA from USD 0.40/hour plus compute (typically USD 500-2,000/month). Google Cloud Composer from USD 350/month. Self-hosting almost always pays off for mid-size setups; a managed cloud service is worth it mainly when high SLAs are required.
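To compare the hourly MWAA price with the monthly figures, it helps to convert it. The sketch below uses only the entry prices quoted above and assumes 730 hours per month (365 days × 24 h / 12); monthly_from_hourly is a hypothetical helper, and the MWAA figure covers the environment fee only, without the compute that the typical range above includes.

```python
HOURS_PER_MONTH = 730  # assumption: 365 days * 24 h / 12 months


def monthly_from_hourly(rate_per_hour, hours=HOURS_PER_MONTH):
    """Convert an always-on hourly price into a monthly price."""
    return round(rate_per_hour * hours, 2)


# Entry prices quoted in this article, in USD per month
entry_prices = {
    "Astronomer": 800.0,
    "AWS MWAA (environment only)": monthly_from_hourly(0.40),
    "Google Cloud Composer": 350.0,
}

for name, usd in sorted(entry_prices.items(), key=lambda item: item[1]):
    print(f"{name}: USD {usd:.0f}/month")
```

The environment fee alone comes to roughly USD 292/month, which shows that most of the typical MWAA bill is compute rather than the base environment.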

What is new in Airflow 3.x?

Airflow 3.x (first release, 3.0, in April 2025) brings three core changes: (1) task-execution isolation – tasks run in their own processes with dedicated resource limits, which reduces crash exposure; (2) new scheduler engine – better performance on DAG parsing, faster triggers; (3) modernised UI – grid view with better filtering, better lineage visualisation. Migration from 2.x to 3.x is feasible but not trivial (API changes in some operators).

Are Dagster and Prefect alternatives to Airflow?

Yes, both are younger competitors. Dagster (started 2018) has a "software-defined assets" model: data assets are central, not tasks. Very modern, but younger community. Prefect (2018) has a leaner API than Airflow with better user experience, but smaller operator library. Both are valid alternatives for new setups; Airflow wins on existing codebase and hiring availability.

Sources

Apache Airflow documentation – DAGs, operators, executors · 2026-05
Airflow 3.x release notes and migration guide · 2026-05
Astronomer pricing – managed Airflow · 2026-04
AWS Managed Workflows for Apache Airflow (MWAA) · 2026-04
