Today's Deep-Dive: Apache Airflow
Ep. 309

Episode description

Apache Airflow is an open-source orchestration tool designed to programmatically author, schedule, and monitor complex workflows, from simple scripts to machine learning pipelines. It treats workflows as software, allowing them to be version-controlled, tested, and collaborated on effectively. Airflow’s core concept is the Directed Acyclic Graph (DAG), which defines tasks and their dependencies, ensuring tasks run in the correct order. Workflows are written in Python, offering flexibility and ease of use for developers familiar with the language. Key principles of Airflow include being dynamic, allowing pipelines to be generated programmatically; scalable, with a modular architecture designed to distribute tasks across multiple workers; extensible, enabling custom operators and hooks; and elegant, promoting lean, clear, and explicit pipeline code. The platform features a robust web UI for monitoring, managing, and triggering workflows, offering visual insights through DAGs, graphs, and grids, which is a significant improvement over traditional command-line scheduling. Airflow integrates with numerous services across major cloud providers, databases, and messaging systems. For production use, Airflow requires a Linux environment and a robust database like PostgreSQL or MySQL, moving beyond local development setups using SQLite. The project follows semantic versioning for predictability and actively manages dependencies to ensure stability, making it a reliable standard for complex automation tasks that need to be maintainable and observable.
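To make the DAG idea concrete, here is a minimal sketch of what a workflow file can look like, assuming a recent Airflow 2.x release (older releases use schedule_interval instead of schedule); the DAG name and commands are placeholders.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A three-step pipeline: extract -> transform -> load.
with DAG(
    dag_id="nightly_report",          # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # older releases: schedule_interval="@daily"
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        # {{ ds }} is Jinja templating: Airflow fills in the logical run date.
        bash_command="echo extracting data for {{ ds }}",
    )
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Dependencies are explicit, so the scheduler runs tasks in the right order.
    extract >> transform >> load

Because the whole workflow is one Python file, it can be version-controlled, tested, and reviewed like any other code.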

Gain digital sovereignty now and save costs

Let’s have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? How is the state of backups and security updates?

Digital sovereignty is easily achieved with Open Source software (which usually costs way less, too). Our division Safeserver offers hosting, operation, and maintenance for countless Free and Open Source tools.

Try it now!

Download transcript (.srt)
0:00

Okay, let's dive in. Welcome back to the deep dive where we try to make sense of

0:04

complex tech

0:04

stuff and make it useful for you. Today we're tackling Apache Airflow. It's this,

0:10

well,

0:10

this orchestration tool you hear about everywhere, handling simple scripts right up

0:14

to really complex

0:15

ML pipelines. And our mission today, especially for you, the listener maybe just

0:19

starting out,

0:20

is to really pull back the curtain on Airflow. We want you to walk away

0:24

understanding what it

0:25

actually is, its core ideas, and why, frankly, it's become such a standard for

0:31

managing workflows.

0:33

Yeah, absolutely. I mean, if you've ever had that horrible 2 a.m. wake-up call

0:36

because some

0:36

critical nightly script failed and you're scrambling across like a dozen systems

0:41

just

0:41

to figure out what went wrong, well, that's the core problem Airflow really sets

0:44

out to solve.

0:45

The sources define it pretty clearly. It's a platform built by the community

0:49

to programmatically author, schedule, and monitor workflows.

0:52

Programmatically. Okay, that sounds key. It is key. That's the shift in thinking we

0:57

want you to grasp. It's not just about setting timers. It's treating the whole

1:01

workflow,

1:01

the entire process, as actual software. And when you do that, when it's all code,

1:05

suddenly it's maintainable. You can version it like any other code. You can test it

1:09

properly.

1:10

And maybe most importantly, multiple people can collaborate on it effectively.

1:14

Right. That leap from just having scattered scripts everywhere to having this codified,

1:18

managed process. That's the aha moment. That's what makes Airflow feel almost

1:23

indispensable

1:23

once you've used it. Okay. That makes sense. And handling these complex automations,

1:28

well,

1:28

it needs solid infrastructure. So we really want to thank the supporter of this

1:32

Deep Dive Safe

1:33

Server. Safe Server supports hosting for tools like Airflow and helps with your

1:37

digital

1:38

transformation journey. You can find out more at www.safeserver.de. All right. Let's

1:44

get into

1:44

the nuts and bolts then. The building blocks. The sources really stress that Airflow

1:49

defines these

1:51

sometimes really complex multi-step processes entirely in pure Python. Yeah. And

1:56

that's a huge,

1:57

huge advantage for getting started and for keeping things maintainable down the

2:00

line.

2:00

If you know Python, you're basically good to go. You just use standard Python

2:05

features you already

2:06

know, like datetime formats for scheduling. You can use loops to generate tasks

2:11

dynamically.

2:12

It gives you complete flexibility. So no weird XML files or obscure command line

2:17

flags to learn?

2:18

Nope. None of that black magic. Just Python. Okay. So we define all the steps in

2:22

Python.

2:23

How does Airflow actually know the order to run things in? What manages the

2:27

dependencies?

2:28

Ah, that brings us to a really core concept. The DAG. That stands for Directed Acyclic

2:33

Graph.

2:34

Right. DAGs. Heard that term. It's the fundamental unit of work in Airflow.

2:37

The DAG is essentially the blueprint for your workflow. It lays out all the

2:41

individual tasks

2:42

and, crucially, the dependencies between them. Then the Airflow scheduler looks at

2:46

that DAG

2:47

and executes the tasks, making absolutely sure that, say, step B doesn't even think

2:52

about starting

2:53

until step A has finished successfully. That control aspect sounds really powerful.

2:58

But hang on. If my process is always the same, you know, pretty static,

3:01

why wouldn't I just use a basic scheduled script or maybe a simple serverless

3:06

function?

3:06

Why add the overhead of Airflow? That's a really good question,

3:09

and it touches on the difference between basic scheduling and proper orchestration.

3:14

Airflow brings all the extras. Handling complex failures and retries, figuring out

3:18

dependencies,

3:19

providing visibility, things simple schedulers just don't do. But you're right

3:24

about the scope.

3:25

Airflow really shines when your workflows are, let's say, mostly static or change

3:29

slowly over

3:30

time. And critically, the tasks inside your workflow. Ideally, they need to be idempotent.

3:36

Idempotent. Okay, define that for us. Idempotent means that running the same

3:39

task multiple times produces the exact same result as running it successfully just

3:43

once.

3:44

Ah, so if a task fails halfway through and Airflow automatically reruns it.

3:48

Exactly. You don't want it creating duplicate data or sending the same email twice

3:53

or charging

3:54

a credit card again. You have to design your tasks so that rerunning them is safe

3:59

and leads

3:59

to the same final state. It's a vital discipline for stable pipelines.
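A hedged sketch of that discipline, assuming the Postgres provider package is installed; the connection ID, table names, and SQL are hypothetical. The task deletes the rows for its run date before inserting, so a retry converges to the same final state instead of duplicating data.

from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook

@task
def load_daily_sales(ds=None):  # Airflow injects ds, the logical run date
    hook = PostgresHook(postgres_conn_id="reporting_db")  # hypothetical connection
    # Delete-then-insert keyed on the run date: running this twice for the
    # same date yields exactly the same rows as running it once.
    hook.run("DELETE FROM daily_sales WHERE sales_date = %s", parameters=(ds,))
    hook.run(
        "INSERT INTO daily_sales (sales_date, total) "
        "SELECT %s, SUM(amount) FROM raw_sales WHERE sales_date = %s",
        parameters=(ds, ds),
    )

Inside a DAG definition you would simply call load_daily_sales().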

4:03

Got it. So Airflow is like the orchestra conductor, making sure everyone plays

4:07

their part,

4:08

even if they need to restart a measure. But I remember the sources warning,

4:11

it's not for moving huge amounts of data between tasks.

4:14

That's correct. Definitely not a streaming solution. Tasks can pass small bits of

4:19

information,

4:19

little pieces of metadata between each other using something called XComs, or cross

4:23

communication.

4:24

XComs.

4:25

But think of XComs like passing a little note, maybe a file path, a database

4:29

record ID,

4:30

or just a status flag. They're absolutely not designed for shuffling gigabytes of

4:35

data around.

4:36

If you have tasks that need to process large volumes of data, the best practice is

4:40

always,

4:41

always to delegate that heavy lifting to an external system built for it, like a

4:44

database

4:45

query, a Spark job, or a dedicated data processing service. Airflow just triggers

4:49

and monitors it.
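A small sketch of that hand-off using the TaskFlow API, assuming a recent Airflow 2.x release; the bucket path and DAG name are made-up examples. Only the path string travels through XCom, while the heavy data stays in external storage.

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def xcom_handoff():  # placeholder DAG name
    @task
    def export_raw(ds=None):
        path = f"s3://example-bucket/raw/{ds}.csv"  # hypothetical location
        # ...trigger the external system that writes the file there...
        return path  # the return value is stored as an XCom

    @task
    def load(path: str):
        # The task receives only the small reference and pulls the real
        # data from the external system itself.
        print(f"loading from {path}")

    load(export_raw())  # passing the value also wires the dependency

xcom_handoff()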

4:50

Okay, makes sense. Keep Airflow focused on the orchestration. Now, thinking about

4:54

best practices,

4:55

let's talk about the core design ideas that make Airflow so popular. The source has

5:00

mentioned four

5:01

key principles. Let's start with dynamic. What's the benefit there? This is a

5:06

massive difference

5:07

compared to older scheduling tools. Because your pipelines are just Python code,

5:11

you can use all

5:12

the power of Python loops, functions, classes, imports, conditional logic to

5:17

actually generate

5:18

your pipelines dynamically. Wait, so say I onboard like 50 new clients and each one

5:23

needs a slightly

5:24

different daily reporting pipeline. I don't have to manually create 50 separate DAG

5:28

files. Exactly.

5:30

You could write Python code that reads a list of clients and generates a unique

5:33

parameterized

5:34

DAG instance for each one automatically. That's incredibly powerful for managing

5:38

complexity at

5:38

scale. Wow, okay. That saves a ton of manual effort. What's next? Scalable. Right.
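To picture the dynamic-generation idea from a moment ago, here is a rough sketch in which the client list is hard-coded for illustration; in practice it might come from a config file or a small lookup table.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

CLIENTS = ["acme", "globex", "initech"]  # placeholder client names

for client in CLIENTS:
    with DAG(
        dag_id=f"daily_report_{client}",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="build_report",
            bash_command=f"echo building report for {client}",
        )
    # Expose each generated DAG at module level so the scheduler finds it.
    globals()[f"daily_report_{client}"] = dag

One file, as many parameterized DAGs as you have clients.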

5:43

Airflow

5:44

is built with a modular architecture. It uses things like a message queue, think Celery

5:48

or Rabbit

5:49

MQ, to distribute tasks out to potentially many worker machines. It's designed from

5:55

the ground up

5:55

to scale out horizontally. You can add more workers as your workload grows. It's

6:00

meant to scale,

6:01

theoretically, to infinity. Okay. Dynamic generation, scaling out. What about

6:07

connecting

6:07

to everything? That sounds like extensible. Precisely. You're not stuck with just

6:12

the built-in

6:13

tools. Airflow has this concept of operators. If the standard operator for, say,

6:18

interacting with

6:19

your specific database or cloud service doesn't quite do what you need... You can

6:22

just write your

6:23

own. Yep. You can easily define your own custom operators, hooks, and sensors. You

6:26

can extend the

6:27

libraries to create the exact level of abstraction that makes sense for your team

6:30

and your environment.
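As an illustration of that extensibility, a bare-bones custom operator; the class name and the imaginary publishing step are placeholders for whatever your team's internal service needs.

from airflow.models.baseoperator import BaseOperator

class PublishReportOperator(BaseOperator):
    """Wraps a recurring chore behind a team-friendly abstraction."""

    def __init__(self, report_name: str, **kwargs):
        super().__init__(**kwargs)
        self.report_name = report_name

    def execute(self, context):
        # context carries run metadata such as the logical date, "ds".
        self.log.info("Publishing %s for %s", self.report_name, context["ds"])
        # ...call your internal reporting service here (hypothetical)...
        return f"published:{self.report_name}"  # pushed to XCom by default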

6:31

It's very flexible. And the last principle, elegant. That sounds a bit subjective.

6:36

It speaks more to

6:36

the developer experience, I think. The idea is that the pipelines themselves, the

6:40

Python code

6:41

defining the DAGs, should be lean, clear, and explicit. It also uses the Jinja

6:46

templating engine

6:47

pretty heavily, which is built right into the core. This lets you parameterize your

6:52

tasks,

6:53

really effectively passing in dates, configurations, things like that, without

6:58

making the Python code

6:59

itself overly complicated or messy. It keeps things readable. That internal elegance

7:05

seems

7:05

to carry over to the outside, too, because, honestly, having spent way too many

7:09

hours staring

7:10

at cryptic Cron logs in a terminal, the fact that Airflow has a proper visual UI

7:16

feels like

7:17

more than just a nice-to-have. It feels essential. Oh, it's a huge part of the

7:20

appeal. It's definitely

7:21

the anti-Cron experience. One of its absolute standout features is the useful UI.

7:26

It gives you

7:26

this really robust modern web application where you can see everything, you monitor

7:30

workflows,

7:30

you can trigger them manually, you can manage connections, variables. It's all

7:34

visual. You get

7:35

full insight into what's running, what failed, and access to the logs for every

7:39

single task run.

7:40

And it's not just one basic dashboard, right? There are specific views.

7:43

Yeah, several really helpful ones. There's the main DAGs overview showing all your

7:48

pipelines.

7:49

There's the grid view, which is great for seeing task statuses laid out over time.

7:54

And critically,

7:54

there's the graph view. This actually draws out your DAG showing all the tasks and

7:59

their dependencies,

8:00

and it colors them based on the status of a specific run. You can instantly see the

8:05

flow

8:05

and pinpoint exactly where something went wrong and often why. There's also a code

8:09

view to see

8:10

the DAG's source code directly in the UI. That visual debugging is invaluable. And

8:15

this ties

8:16

into the robust integrations, doesn't it? The UI manages connections to all sorts

8:20

of systems.

8:21

Absolutely. Airflow comes packed with plug and play operators for basically

8:24

everything you'd

8:24

expect in a modern tech stack. All the major clouds, Google Cloud Platform, AWS,

8:29

Microsoft Azure plus

8:30

databases, messaging systems, container orchestrators, data warehouses, tons of

8:35

third-party services too. So chances are whatever infrastructure you're using now

8:39

or planning to use,

8:40

Airflow probably has ready-made components to interact with it. That makes adoption

8:44

much

8:44

smoother. It definitely feels enterprise ready, which probably explains the huge

8:49

community around

8:50

it. It's an official Apache Software Foundation project, right? Open source.

8:54

Completely open

8:55

source. And the community is massive and very active. You mentioned the GitHub

9:00

stars earlier,

9:01

tens of thousands, thousands of contributors. There's a really busy Slack channel

9:05

where people

9:06

help each other out. It's a very vibrant ecosystem. That open source nature makes

9:10

it easy to get

9:11

started, which is great for beginners. But you mentioned earlier, there's a

9:14

difference between

9:14

just running it locally and setting it up for real, for production. Yes, that's a

9:19

critical distinction.

9:20

Anyone with some Python knowledge can probably get a simple workflow running on

9:25

their laptop

9:26

fairly quickly. It is easy to use in that sense. However, deploying Airflow for

9:31

production

9:31

workloads has some strict requirements you absolutely need to be aware of. Okay,

9:35

like what?

9:35

The big one is the operating system. Airflow is only officially supported for

9:39

production on

9:40

POSIX-compliant operating systems. Basically, that means Linux. The community

9:46

maintains a

9:46

reference Docker image based on Debian Bookworm, which is a good standard to follow. What

9:51

if you're a

9:51

Windows user? No native support? No native production support. You must use either

9:56

the

9:56

Windows Subsystem for Linux version 2, WSL 2, or run Airflow within Linux

10:01

containers,

10:02

perhaps using Docker Desktop. That's non-negotiable for a stable production setup.

10:07

Okay, that's a major infrastructure point. Linux first. What else? The database.

10:11

When you first install Airflow, it defaults to using SQLite. That's fine,

10:15

only for local development and testing, just to try things out. But not for

10:18

production.

10:19

Absolutely not recommended for production. SQLite doesn't handle concurrent

10:23

access well,

10:24

which you definitely have in a production Airflow setup with multiple components

10:27

hitting the

10:27

database. For production, you really need a proper, robust database like PostgreSQL

10:32

or MySQL.
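For reference, switching the metadata database to PostgreSQL is a one-line change in airflow.cfg (or the matching environment variable); the credentials and host below are placeholders, and on older 2.x releases the key sits under [core] rather than [database].

[database]
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pw@db-host:5432/airflow

# Equivalent environment variable:
# AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow_user:airflow_pw@db-host:5432/airflow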

10:33

Got it. Use a real database. That need for stability seems reflected in how they

10:37

manage the project itself, too. Yeah, they become quite rigorous about it.

10:41

Since version 2.0.0, Airflow follows strict semantic versioning major dot minor dot

10:46

patch.

10:46

This gives you predictability. You know a patch release won't break things,

10:51

and a minor release might add features but should be backward compatible within

10:54

that

10:55

major version. They also actively manage their dependencies on other big libraries,

11:00

things like SQLAlchemy for the database interaction, Flask for the web UI,

11:04

Celery or Kubernetes for scaling. They often set upper version bounds on these

11:08

dependencies.

11:09

Why do that? To ensure stability. It prevents

11:11

a situation where you upgrade Airflow, but an underlying library it depends on has

11:15

also

11:16

updated with a breaking change you weren't expecting. By pinning or capping

11:19

dependency

11:20

versions, they provide a more predictable and stable upgrade experience for you.
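In practice, the project also publishes constraint files that pin the whole tested dependency set for each release, and the documented install pattern points pip at one of them; the Airflow and Python versions below are only examples to substitute with your own.

pip install "apache-airflow==2.9.3" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.9.3/constraints-3.11.txt"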

11:24

Okay, so wrapping this up, what's the main takeaway for someone listening, maybe

11:28

new to this?

11:29

It seems like if you're dealing with complex automation tasks, especially ones that

11:34

run

11:34

regularly and need to be reliable, Airflow is pretty much the standard way to go.

11:38

It lets you

11:39

define everything in Python, making it maintainable and testable, and gives you

11:43

that powerful UI to

11:44

see what's going on, ditching those old command line headaches. That sums it up

11:48

well. It brings

11:49

software engineering best practices to your automation workflows. And maybe a final

11:54

thought,

11:54

a challenge for you to consider as you start building your own DAGs. We talked a

11:57

lot about

11:58

idempotency, how crucial it is that tasks can be rerun safely. So pause and really

12:03

think about the

12:04

real world consequences when that principle is violated. What are the most common,

12:08

maybe even

12:09

disastrous pipeline failures you can imagine that happen precisely because a step

12:13

wasn't idempotent?

12:15

Think about things like duplicate financial transactions, or sending marketing

12:19

emails out

12:19

multiple times by accident. Then consider, how does the very act of defining your

12:23

workflow,

12:24

step by step, in structured Python code, force you to confront and design against

12:28

those kinds

12:29

of devastating mistakes up front? That enforced discipline. That's where Airflow

12:33

adds a huge layer

12:34

of safety and security. That's a great point. Designing for failure and reruns

12:37

right from the

12:38

start. Excellent food for thought. And once again, a big thank you to Safe Server

12:42

for supporting this

12:42

deep dive. Safe Server helps you host software like Airflow and manage your digital

12:46

transformation.

12:47

We'll catch you on the next deep dive.
