Today's Deep-Dive: OpenLineage
Ep. 262


Episode description

The deep dive discusses the challenges of data observability and the role of OpenLineage in solving them. Before OpenLineage, companies had to build custom tracking solutions for data lineage, which was time-consuming and prone to breaking with tool updates. OpenLineage standardizes data tracking by embedding lineage collection within data tools, making it more reliable and maintainable. The standard defines core entities (runs, jobs, and datasets) and uses facets to add flexible metadata. OpenLineage captures and sends this information through a standard API, allowing different tools to communicate lineage data consistently. The standard is widely adopted and supported by major tools like Apache Spark and dbt. The episode also covers related projects: Marquez, which visualizes lineage data, and Egeria, which manages metadata across enterprises. OpenLineage is transforming data observability from a fragmented, custom-built process into a consistent, shared framework, embedding observability directly into data infrastructure.

Gain digital sovereignty now and save costs

Let’s have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? What is the state of your backups and security updates?

Digital sovereignty is easily achieved with Open Source software (which usually costs way less, too). Our division SafeServer offers hosting, operation, and maintenance for countless Free and Open Source tools.

Try it now for 1 Euro - 30 days free!

Download transcript (.srt)
0:00

If you've ever found yourself staring at a dashboard,

0:03

you know, the one showing just garbage data all of a sudden,

0:06

you know, that panic, you realize your core data pipeline

0:09

just broke somewhere, and then trying to trace back

0:12

which transformation job caused it,

0:15

it feels like you're navigating this huge maze in the dark.

0:20

Yeah, that feeling, being totally exposed.

0:23

Right, that's precisely the problem

0:25

that standardizing data observability is trying to solve.

0:28

Because as these data systems get bigger and more complex,

0:31

tracking the data's whole life cycle,

0:33

I mean, where it started,

0:35

every single transformation step, who used it,

0:38

it's not just theoretical anymore.

0:40

It's absolutely critical for operations,

0:43

for trust, for everything.

0:44

Absolutely, and well, that's our mission for today, really.

0:47

We're taking a deep dive into OpenLineage.

0:50

It's the open standard designed to finally bring some order

0:53

to that chaos we were talking about.

0:56

We really wanna give you a clear, beginner-friendly way in

1:00

to understand how this standard works, why it matters,

1:02

and how it changes the game for data integrity.

1:05

Yeah, it's foundational stuff.

1:06

It really is.

1:07

It's kind of the shortcut to understanding

1:09

how your data is actually born and transformed.

1:12

And we've looked at the core specs, the adoption docs.

1:15

The goal is you walk away really understanding

1:18

this key piece of the modern data world.

1:22

Okay, cool.

1:23

Now, before we jump into defining the problem

1:25

that OpenLineage actually solves,

1:26

just a quick note that this deep dive

1:29

is supported by SafeServer.

1:30

Ah, good point.

1:32

SafeServer focuses on hosting software

1:34

and helping out with your digital transformation.

1:36

You can find more info and resources

1:38

over at www.safeserver.de.

1:41

Check them out.

1:42

Definitely.

1:42

So let's start at the beginning then.

1:44

Before OpenLineage, what was the status quo?

1:48

I mean, how bad was the pain point that made everyone say,

1:50

okay, we need a common standard like right now?

1:53

Well, the pain point was pretty simple, actually.

1:55

It was duplication, cost, and things breaking all the time.

2:00

Every company that needed data lineage,

2:02

which is basically everyone, let's be honest,

2:04

had to build it themselves from scratch.

2:06

So every single project had to instrument,

2:09

set up the tracking for all of its jobs individually.

2:12

Yeah, so if you had say 10 different data tools,

2:15

five orchestration layers,

2:17

you were trying to maintain

2:18

like 15 separate custom tracking solutions.

2:21

That just sounds like a constant grinding drain

2:24

on engineering time.

2:25

It was. A massive maintenance liability,

2:27

because those custom tracking things,

2:29

they were external to the actual data tools

2:31

like Spark or Airflow.

2:32

Oh, okay, so not built in.

2:34

Not built in.

2:34

They relied on specific internal APIs,

2:37

often undocumented ones.

2:39

So the moment the tool underneath, say Spark,

2:42

released a new version, poof,

2:44

your custom lineage script just broke.

2:47

Nightmare.

2:47

Constantly playing catch up, spending thousands,

2:50

literally just to keep your basic visibility running,

2:53

it was not sustainable.

2:54

Okay, so OpenLineage comes along

2:57

and completely flips that script, right?

2:59

Instead of every company building

3:00

these fragile external scripts,

3:03

the effort is shared across the community.

3:06

Exactly.

3:07

That's the beauty of an open standard.

3:09

And the really elegant part.

3:10

Now, the integration itself

3:11

can be pushed inside each project.

3:13

So instead of some external script

3:15

trying to like spy on a pipeline,

3:18

the pipeline component itself

3:20

speaks the lineage language, natively.

3:23

So it's embedded.

3:24

It's embedded.

3:25

The collection is intrinsic.

3:26

So you stop worrying about versions

3:27

breaking your tracking logic.

3:28

It's just there.

3:30

It makes so much sense.

3:31

And you know, for context,

3:32

this isn't some small side project.

3:35

OpenLineage is an LF AI

3:37

& Data Foundation graduate project.

3:40

That means it's recognized, it's battle tested,

3:42

it's a proper industry standard.

3:44

Got it.

3:44

And fundamentally, this data lineage,

3:47

it provides the foundation.

3:48

It lets you build these powerful context-aware data tools

3:52

because it tracks all that metadata

3:54

about data sets, jobs, runs.

3:56

And with that deeper understanding,

3:57

you can pinpoint the root cause

3:59

of complex problems much faster.

4:01

And crucially, you can understand the impact of changes

4:04

before you make them,

4:05

before you accidentally break something downstream.

4:08

That really highlights the operational win, doesn't it?

4:10

Reducing that oops factor.

4:11

Exactly.

4:12

Fewer oops moments.

4:13

Okay, so we know why the standard exists now.

4:16

Let's dig into what it actually defines.

4:19

If OpenLineage is creating this shared language

4:22

for data tracking, what are the basic words,

4:26

the components?

4:26

Yeah, good question.

4:28

Think of OpenLineage as defining

4:30

a kind of universal data passport.

4:32

Okay, I like that, a passport.

4:34

Yeah, and this passport dictates the consistent naming,

4:38

the structure around three core things,

4:41

three core entities that you'll find in any data flow.

4:43

Those are the run, the job, and the data set.

4:45

Okay, run, job, data set.

4:47

So the run, that's like a specific execution

4:50

of some process.

4:51

Exactly, a single instance of a job running.

4:53

And the job is the transformation logic itself,

4:56

the code, the definition.

4:58

That's right, the definition of the work to be done.

5:00

And the data set is, well, the data,

5:02

the input or the output asset.

5:03

Precisely, the thing being read from or written to,

5:06

those three are the absolute foundational building blocks.
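
To make those three nouns concrete, here is a minimal sketch of what a single OpenLineage run event can look like, written as a plain Python dict following the spec's structure. All namespace, job, and dataset names are illustrative.

```python
import uuid
from datetime import datetime, timezone

# A minimal OpenLineage run event as a plain dict (names are illustrative).
# Real events also carry a top-level "schemaURL" pointing at the spec version.
event = {
    "eventType": "COMPLETE",  # lifecycle state: this run finished
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},  # the run: one concrete execution
    "job": {"namespace": "my_pipeline", "name": "daily_orders_etl"},  # the job: the logic
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],  # dataset read
    "outputs": [{"namespace": "warehouse", "name": "analytics.daily_orders"}],  # dataset written
    "producer": "https://example.com/my-custom-integration",
}
```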

5:09

Okay, the minimum required pieces.

5:11

They are, but the real magic, I think,

5:13

the reason the standard can grow and adapt over time

5:16

is this concept they call the facet.

5:18

Okay, facet.

5:19

So the data world is always changing, right?

5:21

New regulations pop up, new kinds of tools,

5:23

new transformations.

5:24

How does a standard like OpenLineage

5:27

avoid becoming obsolete?

5:29

Is that where facets come in, like a plugin system?

5:31

That's exactly it, you hit it.

5:33

Facets are the plugin model.

5:35

A facet is defined as an atomic,

5:37

sort of self-contained piece of metadata.

5:40

Atomic meaning?

5:41

Meaning it describes one specific thing,

5:44

and you can attach this facet to any of those core entities,

5:48

the run, the job, or the data set.

5:50

Ah, okay, so if run, job, and data set are the nouns,

5:54

facets are like the descriptive adjectives

5:57

you can stick onto them.

5:58

That's a great way to put it, yes.

5:59

And this is absolutely key for things like governance.

6:02

How so?

6:03

Well, you can attach things like,

6:04

say, a regulatory compliance tag,

6:06

maybe the GDPR status of a data set.

6:09

Or you could attach a detailed schema fingerprint

6:11

right onto the data set entity.

6:13

Okay.

6:13

Or maybe attach quality check results to the run entity,

6:16

and because the whole specification is defined

6:18

using OpenAPI.

6:19

That standard API stuff.

6:21

Exactly.

6:22

Developers can extend the standard basically endlessly.

6:24

They just define their own custom facets

6:27

to track whatever proprietary details they need.

6:30

This ensures the standard evolves with the industry

6:32

and not behind it.

6:33

That makes the whole system incredibly flexible.

6:36

Yeah.

6:37

Kind of future-proof, isn't it?

6:38

That's the goal, yeah.
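
As a sketch of the facet idea, here is the output dataset from the earlier event decorated with the standard schema facet and a made-up custom facet. The `_producer` and `_schemaURL` keys follow the spec's facet convention, while the facet names, fields, and URLs are illustrative.

```python
# Sketch: attaching facets to a dataset entity.
# "schema" is a standard OpenLineage dataset facet; "myCompanyGdpr" stands in
# for a hypothetical custom facet a team might define for governance tags.
dataset_with_facets = {
    "namespace": "warehouse",
    "name": "analytics.daily_orders",
    "facets": {
        "schema": {
            "_producer": "https://example.com/my-custom-integration",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",
            "fields": [
                {"name": "order_id", "type": "BIGINT"},
                {"name": "order_total", "type": "DECIMAL(10,2)"},
            ],
        },
        # Custom facets extend the standard without changing it:
        "myCompanyGdpr": {
            "_producer": "https://example.com/my-custom-integration",
            "_schemaURL": "https://example.com/facets/MyCompanyGdprFacet.json",
            "containsPii": True,
        },
    },
}
```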

6:38

So we have these core entities, run, job, data set,

6:42

and they can be decorated, enriched with these facets.

6:45

How does OpenLineage actually capture and send

6:49

this information around, especially with all the different

6:51

tools people use?

6:52

Right.

6:53

So OpenLineage also defines a standard API specifically

6:57

for capturing these lineage events.

6:59

An API call, basically.

7:00

Yeah.

7:01

So the different pipeline components,

7:03

think your schedulers like Airflow, your data warehouses,

7:06

your analysis tools, SQL engines, whatever.

7:08

They use the standard API call.

7:10

To report back.

7:10

To send data about the runs, the jobs, the data

7:13

sets, and any relevant facets.

7:15

They package it up according to the standard

7:17

and ship this event off to a compatible OpenLineage backend.

7:20

Gotcha.

7:21

It sounds like the essential plumbing needed

7:23

to make all these different tools finally

7:25

speak the same lineage language consistently.

7:28

That's exactly what it is.

7:29

It's the common language.

7:30

And I saw the system allows for a configurable backend

7:34

so users can choose how those events are sent,

7:37

like which protocol.

7:38

Yeah, that gives you flexibility in your architecture.

7:40

You can choose how you want to receive and process

7:42

those events.

7:43

Makes sense.
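
Here is a hedged sketch of that plumbing: the run event from the earlier sketch posted to an OpenLineage-compatible HTTP backend. The URL assumes a local Marquez instance on its default port; the official openlineage-python client wraps this same step and supports other configurable transports.

```python
import requests

# Post the event dict from the earlier sketch to a compatible backend.
# The URL assumes a local Marquez instance on its default port; adjust it
# (or use another configured transport) for your own setup.
LINEAGE_ENDPOINT = "http://localhost:5000/api/v1/lineage"

response = requests.post(LINEAGE_ENDPOINT, json=event, timeout=10)
response.raise_for_status()  # any 2xx status means the event was accepted
```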

7:44

OK, now that we understand the how, let's talk adoption.

7:47

Because looking around, this standard

7:49

seems to be getting, well, pretty significant traction.

7:52

It really is, yeah.

7:53

What's the biggest challenge you see folks facing

7:55

when they try to implement this?

7:57

Is it getting started, doing the initial instrumentation?

8:01

Or is it more about handling the sheer volume of metadata

8:05

once it's flowing?

8:06

That's a good question.

8:08

The initial instrumentation effort,

8:10

it's actually decreasing pretty rapidly now,

8:12

thanks to all the community contributions,

8:14

building integrations.

8:16

The bigger challenge often is achieving true column-level

8:20

lineage, especially at scale.

8:22

Column-level versus table-level.

8:23

Can you quickly break that down?

8:25

Sure.

8:25

Table-level lineage is basically knowing that, OK, data

8:29

moved from table A to table B. Useful, but limited.

8:32

Column-level lineage is knowing that this specific column

8:35

in table A was used to calculate that specific column in table

8:39

B. It's much more granular.

8:41

It's like knowing not just that the package arrived,

8:43

but exactly which truck carried the crucial piece of equipment

8:47

inside that package.

8:48

Ah, OK.

8:49

So that lets you trace a specific calculation error

8:52

or maybe a compliance issue with PII right back to the source

8:56

column.

8:56

Exactly.

8:56

It's essential for that deep analysis and root cause

8:59

finding.
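
For a feel of how that granularity is expressed, here is a sketch of a column-level lineage facet in the shape of the standard columnLineage dataset facet. The dataset and column names are illustrative, and the exact facet schema URL and version may differ in your client.

```python
# Sketch: a column-level lineage facet on an output dataset, mapping each
# output column back to the input columns it was derived from.
output_dataset = {
    "namespace": "warehouse",
    "name": "analytics.daily_orders",
    "facets": {
        "columnLineage": {
            "_producer": "https://example.com/my-custom-integration",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json",
            "fields": {
                # output column -> input columns it was derived from
                "order_total": {
                    "inputFields": [
                        {"namespace": "warehouse", "name": "raw.orders", "field": "amount"},
                        {"namespace": "warehouse", "name": "raw.orders", "field": "tax"},
                    ]
                }
            },
        }
    },
}
```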

9:00

And we see that some heavy hitters, like Apache Spark

9:03

and dbt, they're all in. They support both table-level

9:06

and that more granular column-level lineage.

9:09

Yeah, that strong adoption by major tools

9:11

is absolutely critical for the standard's success.
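
As an illustration of what that support looks like in practice, here is a hedged PySpark sketch of enabling the OpenLineage Spark listener. The listener class and the spark.openlineage.* keys follow the OpenLineage Spark integration docs; the package version, backend URL, and namespace are placeholders to adapt to your environment.

```python
from pyspark.sql import SparkSession

# Sketch: turning on OpenLineage for a Spark job. Once the listener is
# registered, reads and writes through Spark emit lineage events automatically.
spark = (
    SparkSession.builder.appName("daily_orders_etl")
    # Placeholder version; pin a real openlineage-spark release for your setup.
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.x.y")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")  # your backend
    .config("spark.openlineage.namespace", "my_pipeline")
    .getOrCreate()
)
```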

9:14

However, there's always a nuance, right?

9:16

Listeners should know that column-level lineage, while super valuable,

9:19

is also inherently complex to capture perfectly in all situations.

9:24

So you will find some, let's say, edge cases or specific ways tools

9:29

like Spark or Airflow handle certain complex SQL queries

9:32

or maybe specific connectors.

9:34

Well, as the docs note, for Spark, sometimes tracking lineage

9:38

through select queries that hit a JDBC source can be tricky.

9:42

Or for Airflow, column-level might work great for most SQL operators,

9:47

but maybe not for a very specific BigQuery operator

9:50

doing something complex.

9:51

Got it.

9:52

So the standard provides the map, but sometimes there

9:54

are tricky intersections depending on the specific tool.

9:57

That's a good way to put it.

9:58

The standard provides the blueprint, yeah.

10:00

But the devil can be in the integration details for each tool.

10:04

The good news, though, is the community

10:06

is super active in identifying and resolving

10:08

these tool-specific exceptions.

10:10

It's constantly improving.

10:12

That's great to hear.

10:12

It's really impressive how widely embraced it is becoming.

10:15

But OpenLineage isn't the only name people hear in this space.

10:18

How does it fit into the wider data ecosystem?

10:21

Are there other key projects that work alongside it

10:24

or maybe depend on it?

10:26

Yeah, definitely.

10:27

It's helpful to think of the ecosystem here.

10:29

OpenLineage, as we said, defines the standard format,

10:33

the language, the electrical plug, if you like,

10:36

for data lineage.

10:37

OK, the standard plug.

10:38

It guarantees the format and structure

10:40

of the metadata signal.

10:41

Now, the question is, what do you plug

10:43

in to that standard outlet?

10:45

Right, so tell us about Marquez.

10:47

That name comes up a lot.

10:49

Marquez is essentially the reference implementation

10:52

of the OpenLineage API.

10:54

Think of it as the back end service and the UI

10:57

you plug into the wall.

10:58

OK.

10:58

It focuses on collecting all those OpenLineage events,

11:01

aggregating them, storing the history,

11:04

and then visualizing the metadata.

11:05

It gives you that dashboard view of your lineage.

11:08

So OpenLineage provides the raw data feed.

11:11

Marquez helps you actually see it and explore the history.

11:14

Exactly.

11:14

OpenLineage is the language.

11:15

Marquez helps you read the story told in that language.

11:18

Got it.
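
As a small sketch of consuming what Marquez collects, here is a query against its REST API, assuming the local instance from the earlier sketch. The endpoint and response shape follow Marquez's documented API, but treat the details as illustrative.

```python
import requests

# Sketch: asking a local Marquez instance what it has collected so far.
# Assumes the default Marquez API port from the earlier example.
MARQUEZ_API = "http://localhost:5000/api/v1"

namespaces = requests.get(f"{MARQUEZ_API}/namespaces", timeout=10).json()
for ns in namespaces.get("namespaces", []):
    print(ns["name"])  # each namespace groups related jobs and datasets
```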

11:19

And then there's another project, Egeria.

11:21

Where does that fit in?

11:22

Is it similar to Marquez?

11:24

Egeria is a bit different.

11:25

Think bigger picture.

11:27

If Marquez is the visualizer plugged into the OpenLineage

11:29

outlet, Egeria is more like the central switchboard

11:33

for your entire enterprise's metadata.

11:36

The switchboard.

11:37

OK.

11:37

It offers open metadata and governance capabilities

11:40

across the whole organization.

11:42

It's designed to automatically capture, manage,

11:44

and importantly, exchange metadata

11:47

between lots of different tools and platforms,

11:49

regardless of the vendor.

11:51

So it connects things.

11:52

Yeah.

11:53

So OpenLineage collects the raw standardized lineage data.

11:57

Marquez can visualize that specific data.

11:59

Egeria can take that traceable lineage data

12:02

from OpenLineage and other metadata sources

12:04

and share it intelligently across your entire governance

12:07

system, maybe feeding it to risk management tools or data

12:10

catalogs or data science platforms.

12:13

It helps integrate lineage into broader processes.

12:15

OK, that makes sense.

12:17

OpenLineage for the standard feed,

12:18

Marquez for visualization and history,

12:21

Egeria for broader enterprise metadata management

12:24

and governance integration.

12:25

You got it.

12:26

They complement each other nicely.

12:27

And it's clearly a vibrant ecosystem

12:30

developing around this.

12:31

I mean, looking at the community stats for OpenLineage itself,

12:35

it's not just a theoretical paper, right?

12:37

People are actively using this.

12:39

2.1 thousand stars on GitHub, nearly 400 forks.

12:44

That's real activity.

12:45

That activity is crucial, absolutely.

12:47

It shows it's solving real problems.

12:49

And look at the primary languages being used

12:51

for the core project: Java at over 60%, Python around 25%.

12:56

That mix perfectly mirrors the modern data stack, doesn't it?

13:00

You often have execution engines like Spark running on the JVM,

13:03

that's Java, and then orchestration layers like Airflow

13:05

heavily using Python.

13:07

So that language mix ensures it fits naturally

13:10

into the places where lineage actually

13:11

needs to be generated, broad applicability.

13:14

Fantastic.

13:15

OK, so to kind of bring this all back home,

13:17

the core takeaway here seems to be that OpenLineage is really

13:20

this essential open standard.

13:21

It's transforming data tracking from this fragmented, custom

13:25

built mess.

13:25

Yeah, the old way.

13:26

Into a consistent, shared framework

13:29

for collecting lineage.

13:31

It's letting all the different systems in your stack

13:34

finally communicate lineage effectively,

13:37

and hopefully killing off those maintenance

13:38

nightmares of the past.

13:39

That's the promise, and increasingly the reality.

13:42

And if we connect this to the bigger picture for a second.

13:46

Please do.

13:46

What OpenLineage really means is that data observability,

13:49

it's no longer this specialized, separate tool

13:52

that you have to somehow bolt onto the outside

13:54

of your systems.

13:55

It's becoming a native function, an inherent capability that's

13:59

getting embedded directly into every key piece of your data

14:02

infrastructure, from Spark to Airflow to dbt.

14:06

It's built in.

14:06

That feels like a fundamental shift.

14:08

It is, and it raises a really important question

14:10

for the future, I think.

14:11

When every data movement, every transformation

14:14

is inherently traceable, because the tools speak

14:17

OpenLineage natively, how will that standardized,

14:20

built-in lineage fundamentally change things like compliance

14:24

or automated data governance?

14:26

What becomes possible then?

14:29

That is a compelling thought.

14:30

What happens when observability is just part of the fabric?

14:34

Something to definitely chew on as you're

14:36

planning your next data modernization project.

14:38

Well, thank you so much for walking us

14:40

through all that.

14:40

Really helpful.

14:41

My pleasure.

14:41

Thanks for having me.

14:42

And finally, a big thank you once again to SafeServer

14:45

for supporting this deep dive into OpenLineage.

14:49

SafeServer supports your hosting needs

14:50

and digital transformation.

14:52

We'll catch you on the next deep dive.
