Today's Deep-Dive: OpenLineage
Ep. 275


Episode description

This deep dive explores the complexities of data flow and the challenges data teams face in tracking and managing data lineage. It introduces OpenLineage as a solution that brings order to the chaotic data journey. OpenLineage is described as an open standard for collecting metadata, which helps teams understand data history, trust their data, and see the impact of changes. The episode defines data lineage as the traceable history of data: metadata about datasets, jobs, and their execution times. Before OpenLineage, tracking data lineage was a massive headache due to duplicated effort, fragile integrations, and incomplete data. OpenLineage addresses these issues through collaboration, sharing the integration effort across platforms and capturing metadata in real time. The standard uses a flexible model with core entities (dataset, job, run) and extensible facets for detailed metadata. The episode also highlights real-world adoption, mentioning integrations with major platforms such as Apache Spark, Airflow, and dbt, and discusses related projects like Marquez and Egeria, which help visualize and integrate lineage data. It concludes by emphasizing the potential of OpenLineage to enable data trust, security, and new applications.

Gain digital sovereignty now and save costs

Let’s have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? What is the state of your backups and security updates?

Digital sovereignty is easily achieved with Open Source software (which usually costs far less, too). Our division Safeserver offers hosting, operation, and maintenance for countless Free and Open Source tools.

Try it now for 1 Euro - 30 days free!

Download transcript (.srt)
0:00

Have you ever stopped to really think about the journey your data takes, not just

0:06

where

0:06

it starts and ends up, but that whole messy path in between the changes, the

0:11

systems,

0:12

the jobs?

0:13

Right.

0:14

It's complicated.

0:15

Especially now, everything's so connected that data flow can be, well, pretty

0:19

chaotic.

0:19

It really can.

0:20

And when something breaks down or, say, a compliance report gets questioned.

0:25

Oh, yeah.

0:26

Then data teams are scrambling, trying to figure out where things went wrong across

0:29

maybe dozens of different tools.

0:31

It's like chasing ghosts.

0:32

A real productivity killer, I bet.

0:34

Absolutely.

0:35

Yeah.

0:36

That operational friction is huge.

0:37

Well, today we're diving into something designed to bring some order to all that

0:41

chaos.

0:41

Open lineage. And this deep dive.

0:43

It's really for you, the learner.

0:44

We want to give you a clear starting point on what data lineage actually is, why we

0:48

need

0:48

an open standard for it, and how open lineage tackles some really big industry

0:53

problems.

0:54

And before we jump in, we want to thank the supporter of this deep dive.

0:57

Safe Server commits to hosting this software and supports you in your digital

1:03

transformation.

1:05

More information at www.safeserver.de.

1:08

So open lineage really comes at a critical time.

1:11

The sources we looked at, they call it the foundation for a new generation of

1:15

powerful

1:16

context-aware data tools.

1:18

Context-aware.

1:19

Interesting.

1:20

Yeah.

1:21

It's about collecting metadata consistently.

1:22

If you can do that, it fundamentally changes how people understand their data, how

1:26

they

1:27

trust it, how they see the impact of changes.

1:29

We're sort of moving away from siloed guesswork.

1:32

Towards a kind of universal data history.

1:34

Exactly.

1:35

That's the goal.

1:36

Okay.

1:37

Let's unpack this a bit.

1:38

Before we get into all the technical nuts and bolts, maybe we should just quickly

1:39

define

1:39

data lineage itself.

1:40

Good idea.

1:41

Keep it simple.

1:42

Right.

1:43

So stripping away the jargon, it's basically the traceable history of your data.

1:48

It tracks metadata about data sets, the jobs that process them, and the specific

1:53

times

1:53

those jobs ran, the runs.

1:56

And the real goal there is simple.

1:58

If something breaks.

1:59

You can find the root cause, like which specific job run messed up the data.

2:04

Precisely.

2:05

Or, before you deploy a change, you can understand what downstream things might be

2:08

affected.

2:09

Okay.

2:10

So that capability sounds pretty essential, but you mentioned earlier it used to be

2:13

a

2:13

massive headache.

2:14

Oh, absolutely.

2:15

Before open lineage, the main pain points were, well, friction and fragility.

2:20

Think about a big company with maybe 50 different data projects.

2:24

Using all sorts of different tools, I imagine, Spark, Airflow, maybe some custom

2:28

stuff.

2:28

Exactly.

2:29

And they all needed lineage.

2:30

But here's the kicker.

2:32

Every single project had to build its own way of tracking it.

2:36

Wow.

2:37

So massive duplication of effort.

2:38

Huge duplication.

2:39

Teams were constantly writing custom code just to get basic tracking.

2:43

It was slow, expensive, and created a ton of technical debt.

2:46

I can just picture the maintenance nightmare.

2:48

What if they tried using some external tool to just, like, scrape the logs or

2:52

something?

2:52

Ah, yeah, that led to the other big problem: broken integrations. These external

2:57

tools.

2:58

They were often really brittle.

2:59

How so?

3:00

Well, say, Spark released a new version or even just changed how it logged things.

3:05

Poof, the external integration breaks.

3:08

Instantly.

3:09

Often, yeah.

3:10

So data teams would spend weeks just trying to catch up, fixing the integration,

3:14

which

3:14

meant...

3:15

They had no lineage data during that time, right, when they might need it most.

3:19

Exactly.

3:20

The lineage ended up being expensive, unreliable, and almost always incomplete.

3:24

It just wasn't working well.

3:26

Okay, so open lineage flips that script through collaboration.

3:29

That's the core idea.

3:30

Yeah.

3:31

Instead of 50 projects building 50 systems.

3:32

The effort is shared, baked right into the platforms themselves.

3:35

Yes, precisely.

3:37

By having a standard API, the exact format for the metadata, integrations can be

3:41

written

3:41

once and built directly into tools like Spark or Airflow, like an internal module.

3:46

Ah, so when Spark updates.

3:48

The lineage tracking updates with it.

3:50

No more constant catch up.

3:52

It dramatically cuts down that maintenance burden.

3:55

And I think you mentioned something interesting.

3:56

It instruments jobs as they run.

3:59

Yeah, that's a key aspect.

4:00

It's not trying to piece things together from logs after the fact.

4:03

Which means the metadata is captured in real time.

4:06

Exactly.

4:07

Yeah.

4:08

More accurate, more complete, especially for things like stream processing where

4:10

stuff is

4:11

happening constantly.

4:12

Okay, that makes sense.

4:13

Let's get into the standard itself then.

4:15

The definition is an open standard for metadata and lineage collection, built on a

4:19

generic

4:20

model.

4:21

What are the core pieces of that model we need to understand?

4:24

Right, so there are three core entities, the sort of nouns of the model.

4:27

They use consistent naming, which is important.

4:29

They define the what, the how, and the when.

4:32

Okay.

4:33

First, the data set.

4:34

That's the simplest, just the data being read or written.

4:37

It could be a database table, a file, a Kafka topic.

4:40

It doesn't matter.

4:41

It gets a consistent name.

4:42

Got it.

4:43

What's next?

4:44

Second, the job.

4:46

This is the defined task or process itself.

4:49

Think of it like the blueprint, like run daily sales report.

4:52

That definition.

4:53

A reusable blueprint.

4:54

Okay.

4:55

And the third, the run.

4:56

Yes, the run.

4:57

And this is crucial.

4:59

It separates the blueprint from the actual execution.

5:02

So the run is the specific instance when that job executed.

5:06

If your daily sales report job runs every midnight.

5:09

The run is Tuesday's execution at 12:00:01 AM.

5:15

Why is that separation so important?

5:16

For debugging.

5:17

If a dashboard is wrong, you need to know.

5:20

Did the job definition, the code change or did this specific run fail because of

5:23

something

5:24

else like bad input data?

5:25

Yeah.

5:26

That distinction is key.

5:27

Makes perfect sense.
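
To make the three core entities concrete, here is a minimal sketch of an OpenLineage run event written as a plain Python dict. The namespaces, names, and producer URL are illustrative placeholders rather than anything from the episode; the field layout follows the published OpenLineage event spec, which remains the authoritative reference.

import json
import uuid
from datetime import datetime, timezone

# One "COMPLETE" event for a single execution (run) of the job "daily_sales_report".
event = {
    "eventType": "COMPLETE",                              # e.g. START, COMPLETE, FAIL
    "eventTime": datetime.now(timezone.utc).isoformat(),  # when this state change happened
    "run": {"runId": str(uuid.uuid4())},                  # the specific execution (the "when")
    "job": {"namespace": "example-pipelines",             # the reusable blueprint (the "how")
            "name": "daily_sales_report"},
    "inputs": [{"namespace": "postgres://warehouse",      # datasets read (the "what")
                "name": "public.orders"}],
    "outputs": [{"namespace": "postgres://warehouse",     # datasets written
                 "name": "public.sales_summary"}],
    "producer": "https://example.com/jobs/daily_sales_report",  # who emitted the event
}

print(json.dumps(event, indent=2))

In practice an integration emits events like this automatically as the job runs; printing the JSON here is only to show the structure.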

5:28

Okay.

5:29

So run job data set.

5:30

That gives us the basics, but you said it handles all sorts of environments, right?

5:33

ML, SQL.

5:34

How does it get that flexibility?

5:35

You mentioned something called a facet.

5:36

Yes.

5:37

Facets.

5:38

That's where the real power and extensibility come in.

5:40

So what is a facet?

5:42

The sources call it an atomic piece of metadata.

5:45

Think of it like a standardized label or a little packet of context that you attach

5:50

to

5:50

one of those core entities, a run, a job, or a data set.

5:54

An atomic piece, like a building block.

5:56

Exactly.

5:57

The core model is simple on purpose, but it's extensible because you can define

6:01

these specific

6:02

facets to add rich, relevant details.

6:05

Okay.

6:06

Give me an example.

6:07

Let's say a job uses Python.

6:09

Right.

6:10

You could attach a facet to the run entity that lists the exact Python library

6:14

versions

6:15

used during that specific run.

6:16

Or for a data set.

6:17

You could have a schema facet detailing the columns and types.

6:21

Or for a job, maybe a source code facet with the git commit hash that defined the

6:25

job logic

6:26

for that run.

6:27

I see.

6:28

So you can add very specific context column details, performance stats, code

6:32

versions

6:32

without bloating the core standard itself.

6:34

Precisely.

6:35

It allows the standard to adapt to new technologies and specific needs without

6:39

breaking that fundamental

6:40

model.

6:41

And you mentioned it uses OpenAPI.

6:43

How does that help?

6:44

Well, OpenAPI gives you a structured way to define APIs.

6:48

Because the open lineage standard and its facets are defined using OpenAPI, it

6:52

makes

6:52

it really easy for the community to propose, review, and add new custom facets for

6:57

specific

6:58

use cases.

6:59

Ah.

7:00

So even very specialized metadata can fit neatly into the standard structure.

7:04

Exactly.

7:05

So everything's consistent but also flexible.
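
As a rough illustration of how facets attach to the core entities, here is a hedged sketch that extends the output dataset from the earlier event with a standard schema facet and adds a custom run facet for Python library versions. The column names, library versions, and the _producer/_schemaURL values are assumptions for illustration; consult the facet definitions in the spec for the exact required fields.

# Output dataset carrying a schema facet (columns and types are invented).
output_dataset = {
    "namespace": "postgres://warehouse",
    "name": "public.sales_summary",
    "facets": {
        "schema": {
            "_producer": "https://example.com/jobs/daily_sales_report",  # assumed value
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",  # assumed URL
            "fields": [
                {"name": "order_date", "type": "DATE"},
                {"name": "total_revenue", "type": "DECIMAL"},
            ],
        }
    },
}

# A custom run facet (name and shape are hypothetical) recording the library
# versions used during this specific run, as mentioned in the episode.
run_with_facet = {
    "runId": "00000000-0000-0000-0000-000000000000",  # placeholder run id
    "facets": {
        "pythonEnvironment": {
            "_producer": "https://example.com/jobs/daily_sales_report",
            "_schemaURL": "https://example.com/facets/pythonEnvironment.json",  # hypothetical
            "libraries": {"pandas": "2.2.1", "numpy": "1.26.4"},
        }
    },
}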

7:07

So connecting this back to you, the learner, that structure core entities plus extensible

7:12

facets is why open lineage is potentially so powerful.

7:16

It gives you that deep context when you need it for any kind of data task while

7:20

maintaining

7:21

that universal language for tracking.

7:23

Right.

7:24

Structured but flexible.

7:25

OK.

7:26

So the design sounds solid.

7:27

But a standard is only good if people actually use it, right?

7:29

Open lineage is an LF AI & Data Foundation graduate project.

7:34

What does that mean, and what does real world adoption look like?

7:38

The graduate status is basically a seal of approval from the Linux Foundation.

7:41

It means the project is mature, well-governed, and has significant community

7:45

backing.

7:46

It's a big deal.

7:47

OK.

7:48

And the adoption.

7:49

How's the integration matrix looking?

7:50

It's actually pretty impressive, which tells you the pain point was real.

7:54

People are definitely implementing it.

7:55

What kind of details does the matrix track?

7:57

Generally, it looks at two main levels of detail.

8:00

First, table-level lineage.

8:02

So basically, did this job read from or write to this table?

8:07

Simple connection.

8:08

And the second.

8:09

The much more detailed column-level lineage.

8:12

Did this job use this specific column in this table to produce that specific column

8:17

in another

8:17

table?

8:18

Ah, that's the really granular stuff you need for compliance or impact analysis.

8:22

Exactly.

8:23

That's the gold standard.
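
To give a feel for the difference between the two levels: table-level lineage is already captured by the inputs and outputs lists of a run event, while column-level lineage is usually expressed through a column lineage facet on the output dataset. A hedged sketch follows; the column names are invented, and the exact facet layout should be checked against the spec.

# Illustrative column-level lineage: total_revenue in sales_summary is derived
# from the amount column of orders (all names are placeholders).
column_lineage_facet = {
    "columnLineage": {
        "fields": {
            "total_revenue": {
                "inputFields": [
                    {"namespace": "postgres://warehouse",
                     "name": "public.orders",
                     "field": "amount"}
                ]
            }
        }
    }
}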

8:24

So walk us through some highlights.

8:25

Who's integrated?

8:26

What level do they support?

8:27

OK.

8:28

Let's start with Apache Spark, huge player in big data.

8:31

Right.

8:32

They use both table-level and column-level lineage, which is great.

8:36

But there's always a but, isn't there?

8:38

Usually.

8:39

There's a key caveat.

8:40

Yeah.

8:41

It currently doesn't support column-level lineage if you use SELECT * queries with JDBC

8:45

data sources.

8:46

Interesting.

8:47

So if you want that detailed column tracking...

8:48

You have to explicitly list the columns you're selecting, which honestly is

8:52

probably bad

8:53

practice anyway.

8:54

Good point.

8:55

It nudges you towards better coding.
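
For a sense of what "baked into the platform" looks like, the Spark integration is typically enabled by registering the OpenLineage listener on the Spark session. Here is a rough PySpark sketch; the package coordinates, version number, config keys, and backend URL are assumptions that vary between OpenLineage releases, so treat it as indicative rather than copy-paste ready.

from pyspark.sql import SparkSession

# Sketch: attach the OpenLineage listener so Spark reports lineage as jobs run.
spark = (
    SparkSession.builder
    .appName("daily_sales_report")
    # Artifact coordinates and version are assumptions; check the integration docs.
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.9.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Transport settings: where the listener should send events (e.g. a Marquez instance).
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "example-pipelines")
    .getOrCreate()
)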

8:57

What about workflow tools, like Airflow?

9:01

Apache Airflow, and yeah, probably the most popular orchestrator.

9:04

It also supports both table and column-level lineage.

9:07

It can integrate directly with many SQL operators, so Airflow itself reports what

9:11

its tasks

9:12

are doing.

9:13

Nice.

9:14

And newer tools, like dbt.

9:15

Yep.

9:16

dbt is natively integrated too, supporting both levels.

9:19

Crucial for tracking those complex multi-stage data models people build with it.

9:23

Okay.

9:24

What about real-time stream processing?

9:26

Good question.

9:27

Flink, a major stream processing engine, currently supports table-level lineage.

9:32

But not column-level yet.

9:33

Not yet, according to the sources.

9:35

It highlights that instrumenting those continuous real-time flows is technically

9:39

challenging.

9:39

Yep.

9:40

But they're starting with the most critical info, which data sources are involved.

9:43

So the adoption is growing, covering key platforms.

9:46

Definitely.

9:47

And this standardization of collecting the metadata is enabling a whole ecosystem

9:51

to

9:51

build up around it.

9:52

Right.

9:53

The sources mention related projects.

9:54

Tell us about those.

9:55

How do they fit in?

9:56

Okay.

9:57

So the first is Marquez.

9:58

Marquez.

9:59

Yeah.

10:00

If Open Lineage defines the standard for creating the lineage metadata, think of

10:04

Marquez as

10:05

the reference implementation for using it.

10:07

Oh, okay.

10:08

So what does it do?

10:09

It's another LF AI & Data project.

10:11

Its job is to collect all that Open Lineage metadata, aggregate it, store it, and

10:15

crucially

10:16

provide ways to visualize it.

10:18

It gives you a UI to actually explore your data ecosystem's lineage.

10:23

The display case for the metadata Open Lineage collects.

10:26

Nice.
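
Since Marquez acts as the reference backend for collecting these events, here is a hedged sketch of delivering a minimal run event to it over HTTP. The port and the /api/v1/lineage path are assumptions based on a default Marquez setup; adjust them for your deployment.

import uuid
from datetime import datetime, timezone

import requests  # third-party: pip install requests

MARQUEZ_URL = "http://localhost:5000/api/v1/lineage"  # assumed default endpoint

# A minimal START event, same shape as the sketch earlier on this page.
event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "example-pipelines", "name": "daily_sales_report"},
    "producer": "https://example.com/jobs/daily_sales_report",
}

response = requests.post(MARQUEZ_URL, json=event, timeout=10)
response.raise_for_status()  # a non-2xx status means the event was rejected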

10:27

And the other one, Egeria.

10:28

Egeria plays a slightly different role.

10:30

It's more focused on broader open metadata management and governance across an

10:34

entire

10:34

enterprise.

10:35

How does it relate to Open Lineage?

10:37

It acts like a broker or exchange.

10:39

Egeria aims to automatically capture and share metadata between all sorts of

10:43

different tools

10:44

and platforms, whatever the vendor.

10:46

So Open Lineage provides the detailed, trusted lineage data, and Egeria helps

10:50

integrate that

10:51

data into your company-wide governance, discovery, and security tools.

10:55

I see.

10:56

So Open Lineage focuses on generating the core lineage events.

10:59

Marquez provides a way to see and explore that lineage.

11:02

And Egeria helps connect that lineage information into the bigger enterprise

11:05

picture.

11:06

You got it.

11:07

Together, they're building this foundational layer for universal data observability.

11:13

It's moving things away from being locked into one vendor's proprietary system.

11:17

Making lineage more open, more democratized.

11:19

Exactly.

11:20

Okay.

11:21

We've covered a lot of ground.

11:22

To summarize for everyone listening, open lineage is this vital open standard.

11:27

It shifts that heavy lifting of instrumentation away from individual teams.

11:32

Shared effort.

11:33

And enables consistent detailed tracking of data runs, jobs, and data sets.

11:38

And it does this using that flexible model with core entities and those extensible

11:42

facets.

11:43

Right.

11:44

And the adoption is real.

11:45

That integration matrix shows it's getting picked up by major platforms.

11:48

Yeah.

11:49

And so for the learner, the big takeaway is that having this standardized shared

11:53

metadata

11:53

governed by projects like open lineage, it really changes the game for data trust.

11:58

How so?

11:59

Well, for security, for audits, for just basic operational efficiency.

12:03

Knowing the true history of your data becomes feasible at scale.

12:06

It's fundamental for building reliable data systems.

12:09

So here's a final thought to leave you with.

12:11

Imagine if every single piece of data processing, whether it's a simple query or a

12:15

super complex

12:16

ML model training.

12:18

What if it could automatically report its own complete verifiable history using

12:23

open

12:23

lineage?

12:24

What kinds of new applications could that unlock?

12:27

Think about automated governance or real time security checks or maybe even new

12:31

forms of

12:32

business intelligence based on process history.

12:35

Yeah.

12:36

If the data reports itself, that opens up a lot of possibilities.

12:39

The future of data might just be self-reporting.

12:41

A provocative thought indeed.

12:43

Thank you for joining us on this deep dive into open lineage.

13:01

We hope this helps you see not just the data, but the flow, the connections, the history

13:04

behind it.
