Today's Deep-Dive: OpenLineage
Ep. 262


Episode description

The deep dive discusses the challenges of data observability and the role of OpenLineage in solving them. Before OpenLineage, companies had to build custom tracking solutions for data lineage, which was time-consuming and prone to breaking with tool updates. OpenLineage standardizes data tracking by embedding lineage collection within data tools, making it more reliable and maintainable. The standard defines core entities (runs, jobs, and datasets) and uses facets to add flexible metadata. OpenLineage captures and sends this information through a standard API, allowing different tools to communicate lineage data consistently. The standard is widely adopted and supported by major tools like Apache Spark and dbt. The episode also covers related projects: Marquez, which visualizes lineage data, and Egeria, which manages metadata across enterprises. OpenLineage is transforming data observability from a fragmented, custom-built process into a consistent, shared framework, embedding observability directly into data infrastructure.

Gain digital sovereignty now and save costs

Let’s have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? What is the state of your backups and security updates?

Digital sovereignty is easily achieved with Open Source software (which usually costs way less, too). Our division SafeServer offers hosting, operation, and maintenance for countless Free and Open Source tools.

Try it now for 1 Euro - 30 days free!

Download transcript (.srt)
0:00

If you've ever found yourself staring at a dashboard,

0:03

you know, the one showing just garbage data all of a sudden,

0:06

you know, that panic, you realize your core data pipeline

0:09

just broke somewhere, and then trying to trace back

0:12

which transformation job caused it,

0:15

it feels like you're navigating this huge maze in the dark.

0:20

Yeah, that feeling, being totally exposed.

0:23

Right, that's precisely the problem

0:25

that standardizing data observability is trying to solve.

0:28

Because as these data systems get bigger and more complex,

0:31

tracking the data's whole life cycle,

0:33

I mean, where it started,

0:35

every single transformation step, who used it,

0:38

it's not just theoretical anymore.

0:40

It's absolutely critical for operations,

0:43

for trust, for everything.

0:44

Absolutely, and well, that's our mission for today, really.

0:47

We're taking a deep dive into OpenLineage.

0:50

It's the open standard designed to finally bring some order

0:53

to that chaos we were talking about.

0:56

We really wanna give you a clear, beginner-friendly way in

1:00

to understand how this standard works, why it matters,

1:02

and how it changes the game for data integrity.

1:05

Yeah, it's foundational stuff.

1:06

It really is.

1:07

It's kind of the shortcut to understanding

1:09

how your data is actually born and transformed.

1:12

And we've looked at the core specs, the adoption docs.

1:15

The goal is you walk away really understanding

1:18

this key piece of the modern data world.

1:22

Okay, cool.

1:23

Now, before we jump into defining the problem

1:25

that OpenLineage actually solves,

1:26

just a quick note that this deep dive

1:29

is supported by SafeServer.

1:30

Ah, good point.

1:32

SafeServer focuses on hosting software

1:34

and helping out with your digital transformation.

1:36

You can find more info and resources

1:38

over at www.safeserver.de.

1:41

Check them out.

1:42

Definitely.

1:42

So let's start at the beginning then.

1:44

Before OpenLineage, what was the status quo?

1:48

I mean, how bad was the pain point that made everyone say,

1:50

okay, we need a common standard like right now?

1:53

Well, the pain point was pretty simple, actually.

1:55

It was duplication, cost, and things breaking all the time.

2:00

Every company that needed data lineage,

2:02

which is basically everyone, let's be honest,

2:04

had to build it themselves from scratch.

2:06

So every single project had to instrument,

2:09

set up the tracking for all of its jobs individually.

2:12

Yeah, so if you had say 10 different data tools,

2:15

five orchestration layers,

2:17

you were trying to maintain

2:18

like 15 separate custom tracking solutions.

2:21

That just sounds like a constant grinding drain

2:24

on engineering time.

2:25

It was. A massive maintenance liability,

2:27

because those custom tracking things,

2:29

they were external to the actual data tools

2:31

like Spark or Airflow.

2:32

Oh, okay, so not built in.

2:34

Not built in.

2:34

They relied on specific internal APIs,

2:37

often undocumented ones.

2:39

So the moment the tool underneath, say Spark,

2:42

released a new version, poof,

2:44

your custom lineage script just broke.

2:47

Nightmare.

2:47

Constantly playing catch up, spending thousands,

2:50

literally just to keep your basic visibility running,

2:53

it was not sustainable.

2:54

Okay, so OpenLineage comes along

2:57

and completely flips that script, right?

2:59

Instead of every company building

3:00

these fragile external scripts,

3:03

the effort is shared across the community.

3:06

Exactly.

3:07

That's the beauty of an open standard.

3:09

And the really elegant part.

3:10

Now, the integration itself

3:11

can be pushed inside each project.

3:13

So instead of some external script

3:15

trying to like spy on a pipeline,

3:18

the pipeline component itself

3:20

speaks the lineage language, natively.

3:23

So it's embedded.

3:24

It's embedded.

3:25

The collection is intrinsic.

3:26

So you stop worrying about versions

3:27

breaking your tracking logic.

3:28

It's just there.

3:30

It makes so much sense.

3:31

And you know, for context,

3:32

this isn't some small side project.

3:35

OpenLineage is an LF AI

3:37

& Data Foundation graduate project.

3:40

That means it's recognized, it's battle tested,

3:42

it's a proper industry standard.

3:44

Got it.

3:44

And fundamentally, this data lineage,

3:47

it provides the foundation.

3:48

It lets you build these powerful context-aware data tools

3:52

because it tracks all that metadata

3:54

about data sets, jobs, runs.

3:56

And with that deeper understanding,

3:57

you can pinpoint the root cause

3:59

of complex problems much faster.

4:01

And crucially, you can understand the impact of changes

4:04

before you make them,

4:05

before you accidentally break something downstream.

4:08

That really highlights the operational win, doesn't it?

4:10

Reducing that oops factor.

4:11

Exactly.

4:12

Fewer oops moments.

4:13

Okay, so we know why the standard exists now.

4:16

Let's dig into what it actually defines.

4:19

If OpenLineage is creating this shared language

4:22

for data tracking, what are the basic words,

4:26

the components?

4:26

Yeah, good question.

4:28

Think of OpenLineage as defining

4:30

a kind of universal data passport.

4:32

Okay, I like that, a passport.

4:34

Yeah, and this passport dictates the consistent naming,

4:38

the structure around three core things,

4:41

three core entities that you'll find in any data flow.

4:43

Those are the run, the job, and the data set.

4:45

Okay, run, job, data set.

4:47

So the run, that's like a specific execution

4:50

of some process.

4:51

Exactly, a single instance of a job running.

4:53

And the job is the transformation logic itself,

4:56

the code, the definition.

4:58

That's right, the definition of the work to be done.

5:00

And the data set is, well, the data,

5:02

the input or the output asset.

5:03

Precisely, the thing being read from or written to,

5:06

those three are the absolute foundational building blocks.
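
To make those three nouns concrete, here is a minimal sketch of what a single OpenLineage run event can look like, written as a plain Python dict following the spec's structure. All namespace, job, and dataset names are illustrative.

```python
import uuid
from datetime import datetime, timezone

# A minimal OpenLineage run event as a plain dict (names are illustrative).
# Real events also carry a top-level "schemaURL" pointing at the spec version.
event = {
    "eventType": "COMPLETE",  # lifecycle state: this run finished
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},  # the run: one concrete execution
    "job": {"namespace": "my_pipeline", "name": "daily_orders_etl"},  # the job: the logic
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],  # dataset read
    "outputs": [{"namespace": "warehouse", "name": "analytics.daily_orders"}],  # dataset written
    "producer": "https://example.com/my-custom-integration",
}
```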

5:09

Okay, the minimum required pieces.

5:11

They are, but the real magic, I think,

5:13

the reason the standard can grow and adapt over time

5:16

is this concept they call the facet.

5:18

Okay, facet.

5:19

So the data world is always changing, right?

5:21

New regulations pop up, new kinds of tools,

5:23

new transformations.

5:24

How does a standard like OpenLineage

5:27

avoid becoming obsolete?

5:29

Is that where facets come in, like a plugin system?

5:31

That's exactly it, you hit it.

5:33

Facets are the plugin model.

5:35

A facet is defined as an atomic,

5:37

sort of self-contained piece of metadata.

5:40

Atomic meaning?

5:41

Meaning it describes one specific thing,

5:44

and you can attach this facet to any of those core entities,

5:48

the run, the job, or the data set.

5:50

Ah, okay, so if run, job, and data set are the nouns,

5:54

facets are like the descriptive adjectives

5:57

you can stick onto them.

5:58

That's a great way to put it, yes.

5:59

And this is absolutely key for things like governance.

6:02

How so?

6:03

Well, you can attach things like,

6:04

say, a regulatory compliance tag,

6:06

maybe the GDPR status of a data set.

6:09

Or you could attach a detailed schema fingerprint

6:11

right onto the data set entity.

6:13

Okay.

6:13

Or maybe attach quality check results to the run entity,

6:16

and because the whole specification is defined

6:18

using OpenAPI.

6:19

That standard API stuff.

6:21

Exactly.

6:22

Developers can extend the standard basically endlessly.

6:24

They just define their own custom facets

6:27

to track whatever proprietary details they need.

6:30

This ensures the standard evolves with the industry

6:32

and not behind it.

6:33

That makes the whole system incredibly flexible.

6:36

Yeah.

6:37

Kind of future-proof, isn't it?

6:38

That's the goal, yeah.
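
As a sketch of the facet idea, here is the output dataset from the earlier event decorated with the standard schema facet and a made-up custom facet. The `_producer` and `_schemaURL` keys follow the spec's facet convention, while the facet names, fields, and URLs are illustrative.

```python
# Sketch: attaching facets to a dataset entity.
# "schema" is a standard OpenLineage dataset facet; "myCompanyGdpr" stands in
# for a hypothetical custom facet a team might define for governance tags.
dataset_with_facets = {
    "namespace": "warehouse",
    "name": "analytics.daily_orders",
    "facets": {
        "schema": {
            "_producer": "https://example.com/my-custom-integration",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",
            "fields": [
                {"name": "order_id", "type": "BIGINT"},
                {"name": "order_total", "type": "DECIMAL(10,2)"},
            ],
        },
        # Custom facets extend the standard without changing it:
        "myCompanyGdpr": {
            "_producer": "https://example.com/my-custom-integration",
            "_schemaURL": "https://example.com/facets/MyCompanyGdprFacet.json",
            "containsPii": True,
        },
    },
}
```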

6:38

So we have these core entities, run, job, data set,

6:42

and they can be decorated, enriched with these facets.

6:45

How does OpenLineage actually capture and send

6:49

this information around, especially with all the different

6:51

tools people use?

6:52

Right.

6:53

So OpenLineage also defines a standard API specifically

6:57

for capturing these lineage events.

6:59

An API call, basically.

7:00

Yeah.

7:01

So the different pipeline components,

7:03

think your schedulers like Airflow, your data warehouses,

7:06

your analysis tools, SQL engines, whatever.

7:08

They use the standard API call.

7:10

To report back.

7:10

To send data about the runs, the jobs, the data

7:13

sets, and any relevant facets.

7:15

They package it up according to the standard

7:17

and ship this event off to a compatible OpenLineage backend.

7:20

Gotcha.

7:21

It sounds like the essential plumbing needed

7:23

to make all these different tools finally

7:25

speak the same lineage language consistently.

7:28

That's exactly what it is.

7:29

It's the common language.

7:30

And I saw the system allows for a configurable backend

7:34

so users can choose how those events are sent,

7:37

like which protocol.

7:38

Yeah, that gives you flexibility in your architecture.

7:40

You can choose how you want to receive and process

7:42

those events.

7:43

Makes sense.
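
Here is a hedged sketch of that plumbing: the run event from the earlier sketch posted to an OpenLineage-compatible HTTP backend. The URL assumes a local Marquez instance on its default port; the official openlineage-python client wraps this same step and supports other configurable transports.

```python
import requests

# Post the event dict from the earlier sketch to a compatible backend.
# The URL assumes a local Marquez instance on its default port; adjust it
# (or use another configured transport) for your own setup.
LINEAGE_ENDPOINT = "http://localhost:5000/api/v1/lineage"

response = requests.post(LINEAGE_ENDPOINT, json=event, timeout=10)
response.raise_for_status()  # any 2xx status means the event was accepted
```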

7:44

OK, now that we understand the how, let's talk adoption.

7:47

Because looking around, this standard

7:49

seems to be getting, well, pretty significant traction.

7:52

It really is, yeah.

7:53

What's the biggest challenge you see folks facing

7:55

when they try to implement this?

7:57

Is it getting started, doing the initial instrumentation?

8:01

Or is it more about handling the sheer volume of metadata

8:05

once it's flowing?

8:06

That's a good question.

8:08

The initial instrumentation effort,

8:10

it's actually decreasing pretty rapidly now,

8:12

thanks to all the community contributions,

8:14

building integrations.

8:16

The bigger challenge often is achieving true column-level

8:20

lineage, especially at scale.

8:22

Column-level versus table-level.

8:23

Can you quickly break that down?

8:25

Sure.

8:25

Table-level lineage is basically knowing that, OK, data

8:29

moved from table A to table B. Useful, but limited.

8:32

Column-level lineage is knowing that this specific column

8:35

in table A was used to calculate that specific column in table

8:39

B. It's much more granular.

8:41

It's like knowing not just that the package arrived,

8:43

but exactly which truck carried the crucial piece of equipment

8:47

inside that package.

8:48

Ah, OK.

8:49

So that lets you trace a specific calculation error

8:52

or maybe a compliance issue with PII right back to the source

8:56

column.

8:56

Exactly.

8:56

It's essential for that deep analysis and root cause

8:59

finding.
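
For a feel of how that granularity is expressed, here is a sketch of a column-level lineage facet in the shape of the standard columnLineage dataset facet. The dataset and column names are illustrative, and the exact facet schema URL and version may differ in your client.

```python
# Sketch: a column-level lineage facet on an output dataset, mapping each
# output column back to the input columns it was derived from.
output_dataset = {
    "namespace": "warehouse",
    "name": "analytics.daily_orders",
    "facets": {
        "columnLineage": {
            "_producer": "https://example.com/my-custom-integration",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json",
            "fields": {
                # output column -> input columns it was derived from
                "order_total": {
                    "inputFields": [
                        {"namespace": "warehouse", "name": "raw.orders", "field": "amount"},
                        {"namespace": "warehouse", "name": "raw.orders", "field": "tax"},
                    ]
                }
            },
        }
    },
}
```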

9:00

And we see that some heavy hitters, like Apache Spark

9:03

and dbt, they're all in. They support both table-level

9:06

and that more granular column-level lineage.

9:09

Yeah, that strong adoption by major tools

9:11

is absolutely critical for the standard's success.
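
As an illustration of what that support looks like in practice, here is a hedged PySpark sketch of enabling the OpenLineage Spark listener. The listener class and the spark.openlineage.* keys follow the OpenLineage Spark integration docs; the package version, backend URL, and namespace are placeholders to adapt to your environment.

```python
from pyspark.sql import SparkSession

# Sketch: turning on OpenLineage for a Spark job. Once the listener is
# registered, reads and writes through Spark emit lineage events automatically.
spark = (
    SparkSession.builder.appName("daily_orders_etl")
    # Placeholder version; pin a real openlineage-spark release for your setup.
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.x.y")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")  # your backend
    .config("spark.openlineage.namespace", "my_pipeline")
    .getOrCreate()
)
```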

9:14

However, there's always a nuance, right?

9:16

Listeners should know that column-level lineage, while super valuable,

9:19

is also inherently complex to capture perfectly in all situations.

9:24

So you will find some, let's say, edge cases or specific ways tools

9:29

like Spark or Airflow handle certain complex SQL queries

9:32

or maybe specific connectors.

9:34

Well, as the docs note, for Spark, sometimes tracking lineage

9:38

through select queries that hit a JDBC source can be tricky.

9:42

Or for Airflow, column-level might work great for most SQL operators,

9:47

but maybe not for a very specific BigQuery operator

9:50

doing something complex.

9:51

Got it.

9:52

So the standard provides the map, but sometimes there

9:54

are tricky intersections depending on the specific tool.

9:57

That's a good way to put it.

9:58

The standard provides the blueprint, yeah.

10:00

But the devil can be in the integration details for each tool.

10:04

The good news, though, is the community

10:06

is super active in identifying and resolving

10:08

these tool-specific exceptions.

10:10

It's constantly improving.

10:12

That's great to hear.

10:12

It's really impressive how widely embraced it is becoming.

10:15

But OpenLineage isn't the only name people hear in this space.

10:18

How does it fit into the wider data ecosystem?

10:21

Are there other key projects that work alongside it

10:24

or maybe depend on it?

10:26

Yeah, definitely.

10:27

It's helpful to think of the ecosystem here.

10:29

OpenLineage, as we said, defines the standard format,

10:33

the language, the electrical plug, if you like,

10:36

for data lineage.

10:37

OK, the standard plug.

10:38

It guarantees the format and structure

10:40

of the metadata signal.

10:41

Now, the question is, what do you plug

10:43

in to that standard outlet?

10:45

Right, so tell us about Marquez.

10:47

That name comes up a lot.

10:49

Marquez is essentially the reference implementation

10:52

of the OpenLineage API.

10:54

Think of it as the back end service and the UI

10:57

you plug into the wall.

10:58

OK.

10:58

It focuses on collecting all those OpenLineage events,

11:01

aggregating them, storing the history,

11:04

and then visualizing the metadata.

11:05

It gives you that dashboard view of your lineage.

11:08

So OpenLineage provides the raw data feed.

11:11

Marquez helps you actually see it and explore the history.

11:14

Exactly.

11:14

OpenLineage is the language.

11:15

Marquez helps you read the story told in that language.

11:18

Got it.
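
As a small sketch of consuming what Marquez collects, here is a query against its REST API, assuming the local instance from the earlier sketch. The endpoint and response shape follow Marquez's documented API, but treat the details as illustrative.

```python
import requests

# Sketch: asking a local Marquez instance what it has collected so far.
# Assumes the default Marquez API port from the earlier example.
MARQUEZ_API = "http://localhost:5000/api/v1"

namespaces = requests.get(f"{MARQUEZ_API}/namespaces", timeout=10).json()
for ns in namespaces.get("namespaces", []):
    print(ns["name"])  # each namespace groups related jobs and datasets
```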

11:19

And then there's another project, Egeria.

11:21

Where does that fit in?

11:22

Is it similar to Marquez?

11:24

Egeria is a bit different.

11:25

Think bigger picture.

11:27

If Marquez is the visualizer plugged into the OpenLineage

11:29

outlet, Egeria is more like the central switchboard

11:33

for your entire enterprise's metadata.

11:36

The switchboard.

11:37

OK.

11:37

It offers open metadata and governance capabilities

11:40

across the whole organization.

11:42

It's designed to automatically capture, manage,

11:44

and importantly, exchange metadata

11:47

between lots of different tools and platforms,

11:49

regardless of the vendor.

11:51

So it connects things.

11:52

Yeah.

11:53

So OpenLineage collects the raw standardized lineage data.

11:57

Marquez can visualize that specific data.

11:59

Egeria can take that traceable lineage data

12:02

from OpenLineage and other metadata sources

12:04

and share it intelligently across your entire governance

12:07

system, maybe feeding it to risk management tools or data

12:10

catalogs or data science platforms.

12:13

It helps integrate lineage into broader processes.

12:15

OK, that makes sense.

12:17

OpenLineage for the standard feed,

12:18

Marquez for visualization and history,

12:21

Egeria for broader enterprise metadata management

12:24

and governance integration.

12:25

You got it.

12:26

They complement each other nicely.

12:27

And it's clearly a vibrant ecosystem

12:30

developing around this.

12:31

I mean, looking at the community stats for OpenLineage itself,

12:35

it's not just a theoretical paper, right?

12:37

People are actively using this.

12:39

2.1 thousand stars on GitHub, nearly 400 forks.

12:44

That's real activity.

12:45

That activity is crucial, absolutely.

12:47

It shows it's solving real problems.

12:49

And look at the primary languages being used

12:51

for the core project: Java at over 60%, Python around 25%.

12:56

That mix perfectly mirrors the modern data stack, doesn't it?

13:00

You often have execution engines like Spark running on the JVM,

13:03

that's Java, and then orchestration layers like Airflow

13:05

heavily using Python.

13:07

So that language mix ensures it fits naturally

13:10

into the places where lineage actually

13:11

needs to be generated, broad applicability.

13:14

Fantastic.

13:15

OK, so to kind of bring this all back home,

13:17

the core takeaway here seems to be that OpenLineage is really

13:20

this essential open standard.

13:21

It's transforming data tracking from this fragmented, custom

13:25

built mess.

13:25

Yeah, the old way.

13:26

Into a consistent, shared framework

13:29

for collecting lineage.

13:31

It's letting all the different systems in your stack

13:34

finally communicate lineage effectively,

13:37

and hopefully killing off those maintenance

13:38

nightmares of the past.

13:39

That's the promise, and increasingly the reality.

13:42

And if we connect this to the bigger picture for a second.

13:46

Please do.

13:46

What OpenLineage really means is that data observability,

13:49

it's no longer this specialized, separate tool

13:52

that you have to somehow bolt onto the outside

13:54

of your systems.

13:55

It's becoming a native function, an inherent capability that's

13:59

getting embedded directly into every key piece of your data

14:02

infrastructure, from Spark to Airflow to dbt.

14:06

It's built in.

14:06

That feels like a fundamental shift.

14:08

It is, and it raises a really important question

14:10

for the future, I think.

14:11

When every data movement, every transformation

14:14

is inherently traceable, because the tools speak

14:17

OpenLineage natively, how will that standardized,

14:20

built-in lineage fundamentally change things like compliance

14:24

or automated data governance?

14:26

What becomes possible then?

14:29

That is a compelling thought.

14:30

What happens when observability is just part of the fabric?

14:34

Something to definitely chew on as you're

14:36

planning your next data modernization project.

14:38

Well, thank you so much for walking us

14:40

through all that.

14:40

Really helpful.

14:41

My pleasure.

14:41

Thanks for having me.

14:42

And finally, a big thank you once again to SafeServer

14:45

for supporting this deep dive into OpenLineage.

14:49

SafeServer supports your hosting needs

14:50

and digital transformation.

14:52

We'll catch you on the next deep dive.
