Today's Deep-Dive: OpenLineage
Ep. 275


Episode description

This deep dive explores the complexities of data flow and the challenges data teams face in tracking and managing data lineage. It introduces OpenLineage as a solution that brings order to the chaotic data journey. OpenLineage is described as an open standard for collecting metadata, which helps teams understand data history, trust their data, and see the impact of changes. The episode defines data lineage as the traceable history of data: metadata about datasets, jobs, and their execution times. Before OpenLineage, tracking data lineage was a massive headache due to duplicated effort, fragile integrations, and incomplete data. OpenLineage addresses these issues through collaboration, sharing the integration effort across platforms and capturing metadata in real time. The standard uses a flexible model with core entities (dataset, job, run) and extensible facets for detailed metadata. The episode also highlights real-world adoption, mentioning integrations with major platforms such as Apache Spark, Airflow, and dbt, and discusses related projects like Marquez and Egeria, which help visualize and integrate lineage data. It concludes by emphasizing the potential of OpenLineage to enable data trust, security, and new applications.

Gain digital sovereignty now and save costs

Let’s have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? What is the state of your backups and security updates?

Digital sovereignty is easily achieved with Open Source software (which usually costs far less, too). Our division Safeserver offers hosting, operation, and maintenance for countless Free and Open Source tools.

Try it now for 1 Euro - 30 days free!

Download transcript (.srt)
0:00

Have you ever stopped to really think about the journey your data takes, not just

0:06

where

0:06

it starts and ends up, but that whole messy path in between the changes, the

0:11

systems,

0:12

the jobs?

0:13

Right.

0:14

It's complicated.

0:15

Especially now, everything's so connected that data flow can be, well, pretty

0:19

chaotic.

0:19

It really can.

0:20

And when something breaks down or, say, a compliance report gets questioned.

0:25

Oh, yeah.

0:26

Then data teams are scrambling, trying to figure out where things went wrong across

0:29

maybe dozens of different tools.

0:31

It's like chasing ghosts.

0:32

A real productivity killer, I bet.

0:34

Absolutely.

0:35

Yeah.

0:36

That operational friction is huge.

0:37

Well, today we're diving into something designed to bring some order to all that

0:41

chaos.

0:41

Open lineage. And this deep dive.

0:43

It's really for you, the learner.

0:44

We want to give you a clear starting point on what data lineage actually is, why we

0:48

need

0:48

an open standard for it, and how open lineage tackles some really big industry

0:53

problems.

0:54

And before we jump in, we want to thank the supporter of this deep dive.

0:57

Safe Server commits to hosting this software and supports you in your digital

1:03

transformation.

1:05

More information at www.safeserver.de.

1:08

So open lineage really comes at a critical time.

1:11

The sources we looked at, they call it the foundation for a new generation of

1:15

powerful

1:16

context-aware data tools.

1:18

Context-aware.

1:19

Interesting.

1:20

Yeah.

1:21

It's about collecting metadata consistently.

1:22

If you can do that, it fundamentally changes how people understand their data, how

1:26

they

1:27

trust it, how they see the impact of changes.

1:29

We're sort of moving away from siloed guesswork.

1:32

Towards a kind of universal data history.

1:34

Exactly.

1:35

That's the goal.

1:36

Okay.

1:37

Let's unpack this a bit.

1:38

Before we get into all the technical nuts and bolts, maybe we should just quickly

1:39

define

1:39

data lineage itself.

1:40

Good idea.

1:41

Keep it simple.

1:42

Right.

1:43

So stripping away the jargon, it's basically the traceable history of your data.

1:48

It tracks metadata about data sets, the jobs that process them, and the specific

1:53

times

1:53

those jobs ran, the runs.

1:56

And the real goal there is simple.

1:58

If something breaks.

1:59

You can find the root cause, like which specific job run messed up the data.

2:04

Precisely.

2:05

Or, before you deploy a change, you can understand what downstream things might be

2:08

affected.

2:09

Okay.

2:10

So that capability sounds pretty essential, but you mentioned earlier it used to be

2:13

a

2:13

massive headache.

2:14

Oh, absolutely.

2:15

Before open lineage, the main pain points were, well, friction and fragility.

2:20

Think about a big company with maybe 50 different data projects.

2:24

Using all sorts of different tools, I imagine, Spark, Airflow, maybe some custom

2:28

stuff.

2:28

Exactly.

2:29

And they all needed lineage.

2:30

But here's the kicker.

2:32

Every single project had to build its own way of tracking it.

2:36

Wow.

2:37

So massive duplication of effort.

2:38

Huge duplication.

2:39

Teams were constantly writing custom code just to get basic tracking.

2:43

It was slow, expensive, and created a ton of technical debt.

2:46

I can just picture the maintenance nightmare.

2:48

What if they tried using some external tool to just, like, scrape the logs or

2:52

something?

2:52

Ah, yeah, that led to the other big problem: broken integrations. These external

2:57

tools.

2:58

They were often really brittle.

2:59

How so?

3:00

Well, say, Spark released a new version or even just changed how it logged things.

3:05

Poof, the external integration breaks.

3:08

Instantly.

3:09

Often, yeah.

3:10

So data teams would spend weeks just trying to catch up, fixing the integration,

3:14

which

3:14

meant...

3:15

They had no lineage data during that time, right, when they might need it most.

3:19

Exactly.

3:20

The lineage ended up being expensive, unreliable, and almost always incomplete.

3:24

It just wasn't working well.

3:26

Okay, so open lineage flips that script through collaboration.

3:29

That's the core idea.

3:30

Yeah.

3:31

Instead of 50 projects building 50 systems.

3:32

The effort is shared, baked right into the platforms themselves.

3:35

Yes, precisely.

3:37

By having a standard API, the exact format for the metadata, integrations can be

3:41

written

3:41

once and built directly into tools like Spark or Airflow, like an internal module.

3:46

Ah, so when Spark updates.

3:48

The lineage tracking updates with it.

3:50

No more constant catch up.

3:52

It dramatically cuts down that maintenance burden.

3:55

And I think you mentioned something interesting.

3:56

It instruments jobs as they run.

3:59

Yeah, that's a key aspect.

4:00

It's not trying to piece things together from logs after the fact.

4:03

Which means the metadata is captured in real time.

4:06

Exactly.

4:07

Yeah.

4:08

More accurate, more complete, especially for things like stream processing where

4:10

stuff is

4:11

happening constantly.

4:12

Okay, that makes sense.

4:13

Let's get into the standard itself then.

4:15

The definition is an open standard for metadata and lineage collection, built on a

4:19

generic

4:20

model.

4:21

What are the core pieces of that model we need to understand?

4:24

Right, so there are three core entities, the sort of nouns of the model.

4:27

They use consistent naming, which is important.

4:29

They define the what, the how, and the when.

4:32

Okay.

4:33

First, the data set.

4:34

That's the simplest, just the data being read or written.

4:37

It could be a database table, a file, a Kafka topic.

4:40

It doesn't matter.

4:41

It gets a consistent name.

4:42

Got it.

4:43

What's next?

4:44

Second, the job.

4:46

This is the defined task or process itself.

4:49

Think of it like the blueprint, like run daily sales report.

4:52

That definition.

4:53

A reusable blueprint.

4:54

Okay.

4:55

And the third, the run.

4:56

Yes, the run.

4:57

And this is crucial.

4:59

It separates the blueprint from the actual execution.

5:02

So the run is the specific instance when that job executed.

5:06

If your daily sales report job runs every midnight.

5:09

The run is Tuesday's execution at 12:00:01 AM.

5:15

Why is that separation so important?

5:16

For debugging.

5:17

If a dashboard is wrong, you need to know.

5:20

Did the job definition, the code change or did this specific run fail because of

5:23

something

5:24

else like bad input data?

5:25

Yeah.

5:26

That distinction is key.

5:27

Makes perfect sense.
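
To make the three core entities concrete, here is a minimal sketch of an OpenLineage run event written as a plain Python dict. The namespaces, names, and producer URL are illustrative placeholders rather than anything from the episode; the field layout follows the published OpenLineage event spec, which remains the authoritative reference.

import json
import uuid
from datetime import datetime, timezone

# One "COMPLETE" event for a single execution (run) of the job "daily_sales_report".
event = {
    "eventType": "COMPLETE",                              # e.g. START, COMPLETE, FAIL
    "eventTime": datetime.now(timezone.utc).isoformat(),  # when this state change happened
    "run": {"runId": str(uuid.uuid4())},                  # the specific execution (the "when")
    "job": {"namespace": "example-pipelines",             # the reusable blueprint (the "how")
            "name": "daily_sales_report"},
    "inputs": [{"namespace": "postgres://warehouse",      # datasets read (the "what")
                "name": "public.orders"}],
    "outputs": [{"namespace": "postgres://warehouse",     # datasets written
                 "name": "public.sales_summary"}],
    "producer": "https://example.com/jobs/daily_sales_report",  # who emitted the event
}

print(json.dumps(event, indent=2))

In practice an integration emits events like this automatically as the job runs; printing the JSON here is only to show the structure.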

5:28

Okay.

5:29

So run job data set.

5:30

That gives us the basics, but you said it handles all sorts of environments, right?

5:33

ML, SQL.

5:34

How does it get that flexibility?

5:35

You mentioned something called a facet.

5:36

Yes.

5:37

Facets.

5:38

That's where the real power and extensibility come in.

5:40

So what is a facet?

5:42

The sources call it an atomic piece of metadata.

5:45

Think of it like a standardized label or a little packet of context that you attach

5:50

to

5:50

one of those core entities, a run, a job, or a data set.

5:54

An atomic piece, like a building block.

5:56

Exactly.

5:57

The core model is simple on purpose, but it's extensible because you can define

6:01

these specific

6:02

facets to add rich, relevant details.

6:05

Okay.

6:06

Give me an example.

6:07

Let's say a job uses Python.

6:09

Right.

6:10

You could attach a facet to the run entity that lists the exact Python library

6:14

versions

6:15

used during that specific run.

6:16

Or for a data set.

6:17

You could have a schema facet detailing the columns and types.

6:21

Or for a job, maybe a source code facet with the git commit hash that defined the

6:25

job logic

6:26

for that run.

6:27

I see.

6:28

So you can add very specific context column details, performance stats, code

6:32

versions

6:32

without bloating the core standard itself.

6:34

Precisely.

6:35

It allows the standard to adapt to new technologies and specific needs without

6:39

breaking that fundamental

6:40

model.

6:41

And you mentioned it uses OpenAPI.

6:43

How does that help?

6:44

Well, OpenAPI gives you a structured way to define APIs.

6:48

Because the open lineage standard and its facets are defined using OpenAPI, it

6:52

makes

6:52

it really easy for the community to propose, review, and add new custom facets for

6:57

specific

6:58

use cases.

6:59

Ah.

7:00

So even very specialized metadata can fit neatly into the standard structure.

7:04

Exactly.

7:05

So everything's consistent but also flexible.
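
As a rough illustration of how facets attach to the core entities, here is a hedged sketch that extends the output dataset from the earlier event with a standard schema facet and adds a custom run facet for Python library versions. The column names, library versions, and the _producer/_schemaURL values are assumptions for illustration; consult the facet definitions in the spec for the exact required fields.

# Output dataset carrying a schema facet (columns and types are invented).
output_dataset = {
    "namespace": "postgres://warehouse",
    "name": "public.sales_summary",
    "facets": {
        "schema": {
            "_producer": "https://example.com/jobs/daily_sales_report",  # assumed value
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SchemaDatasetFacet.json",  # assumed URL
            "fields": [
                {"name": "order_date", "type": "DATE"},
                {"name": "total_revenue", "type": "DECIMAL"},
            ],
        }
    },
}

# A custom run facet (name and shape are hypothetical) recording the library
# versions used during this specific run, as mentioned in the episode.
run_with_facet = {
    "runId": "00000000-0000-0000-0000-000000000000",  # placeholder run id
    "facets": {
        "pythonEnvironment": {
            "_producer": "https://example.com/jobs/daily_sales_report",
            "_schemaURL": "https://example.com/facets/pythonEnvironment.json",  # hypothetical
            "libraries": {"pandas": "2.2.1", "numpy": "1.26.4"},
        }
    },
}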

7:07

So connecting this back to you, the learner, that structure core entities plus extensible

7:12

facets is why open lineage is potentially so powerful.

7:16

It gives you that deep context when you need it for any kind of data task while

7:20

maintaining

7:21

that universal language for tracking.

7:23

Right.

7:24

Structured but flexible.

7:25

OK.

7:26

So the design sounds solid.

7:27

But a standard is only good if people actually use it, right?

7:29

Open lineage is an LF AI & Data Foundation graduate project.

7:34

What does that mean, and what does real world adoption look like?

7:38

The graduate status is basically a seal of approval from the Linux Foundation.

7:41

It means the project is mature, well-governed, and has significant community

7:45

backing.

7:46

It's a big deal.

7:47

OK.

7:48

And the adoption.

7:49

How's the integration matrix looking?

7:50

It's actually pretty impressive, which tells you the pain point was real.

7:54

People are definitely implementing it.

7:55

What kind of details does the matrix track?

7:57

Generally, it looks at two main levels of detail.

8:00

First, table-level lineage.

8:02

So basically, did this job read from or write to this table?

8:07

Simple connection.

8:08

And the second.

8:09

The much more detailed column-level lineage.

8:12

Did this job use this specific column in this table to produce that specific column

8:17

in another

8:17

table?

8:18

Ah, that's the really granular stuff you need for compliance or impact analysis.

8:22

Exactly.

8:23

That's the gold standard.
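
To give a feel for the difference between the two levels: table-level lineage is already captured by the inputs and outputs lists of a run event, while column-level lineage is usually expressed through a column lineage facet on the output dataset. A hedged sketch follows; the column names are invented, and the exact facet layout should be checked against the spec.

# Illustrative column-level lineage: total_revenue in sales_summary is derived
# from the amount column of orders (all names are placeholders).
column_lineage_facet = {
    "columnLineage": {
        "fields": {
            "total_revenue": {
                "inputFields": [
                    {"namespace": "postgres://warehouse",
                     "name": "public.orders",
                     "field": "amount"}
                ]
            }
        }
    }
}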

8:24

So walk us through some highlights.

8:25

Who's integrated?

8:26

What level do they support?

8:27

OK.

8:28

Let's start with Apache Spark, huge player in big data.

8:31

Right.

8:32

They use both table-level and column-level lineage, which is great.

8:36

But there's always a but, isn't there?

8:38

Usually.

8:39

There's a key caveat.

8:40

Yeah.

8:41

It currently doesn't support column-level lineage if you use SELECT * queries with JDBC

8:45

data sources.

8:46

Interesting.

8:47

So if you want that detailed column tracking...

8:48

You have to explicitly list the columns you're selecting, which honestly is

8:52

probably bad

8:53

practice anyway.

8:54

Good point.

8:55

It nudges you towards better coding.
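
For a sense of what "baked into the platform" looks like, the Spark integration is typically enabled by registering the OpenLineage listener on the Spark session. Here is a rough PySpark sketch; the package coordinates, version number, config keys, and backend URL are assumptions that vary between OpenLineage releases, so treat it as indicative rather than copy-paste ready.

from pyspark.sql import SparkSession

# Sketch: attach the OpenLineage listener so Spark reports lineage as jobs run.
spark = (
    SparkSession.builder
    .appName("daily_sales_report")
    # Artifact coordinates and version are assumptions; check the integration docs.
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:1.9.1")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Transport settings: where the listener should send events (e.g. a Marquez instance).
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "example-pipelines")
    .getOrCreate()
)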

8:57

What about workflow tools, like Airflow?

9:01

Apache Airflow, and yeah, probably the most popular orchestrator.

9:04

It also supports both table and column-level lineage.

9:07

It can integrate directly with many SQL operators, so Airflow itself reports what

9:11

its tasks

9:12

are doing.

9:13

Nice.

9:14

And newer tools, like dbt.

9:15

Yep.

9:16

dbt is natively integrated too, supporting both levels.

9:19

Crucial for tracking those complex multi-stage data models people build with it.

9:23

Okay.

9:24

What about real-time stream processing?

9:26

Good question.

9:27

Flink, a major stream processing engine, currently supports table-level lineage.

9:32

But not column-level yet.

9:33

Not yet, according to the sources.

9:35

It highlights that instrumenting those continuous real-time flows is technically

9:39

challenging.

9:39

Yep.

9:40

But they're starting with the most critical info, which data sources are involved.

9:43

So the adoption is growing, covering key platforms.

9:46

Definitely.

9:47

And this standardization of collecting the metadata is enabling a whole ecosystem

9:51

to

9:51

build up around it.

9:52

Right.

9:53

The sources mention related projects.

9:54

Tell us about those.

9:55

How do they fit in?

9:56

Okay.

9:57

So the first is Marquez.

9:58

Marquez.

9:59

Yeah.

10:00

If Open Lineage defines the standard for creating the lineage metadata, think of

10:04

Marquez as

10:05

the reference implementation for using it.

10:07

Oh, okay.

10:08

So what does it do?

10:09

It's another LF AI & Data project.

10:11

Its job is to collect all that Open Lineage metadata, aggregate it, store it, and

10:15

crucially

10:16

provide ways to visualize it.

10:18

It gives you a UI to actually explore your data ecosystem's lineage.

10:23

The display case for the metadata Open Lineage collects.

10:26

Nice.
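
Since Marquez acts as the reference backend for collecting these events, here is a hedged sketch of delivering a minimal run event to it over HTTP. The port and the /api/v1/lineage path are assumptions based on a default Marquez setup; adjust them for your deployment.

import uuid
from datetime import datetime, timezone

import requests  # third-party: pip install requests

MARQUEZ_URL = "http://localhost:5000/api/v1/lineage"  # assumed default endpoint

# A minimal START event, same shape as the sketch earlier on this page.
event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "example-pipelines", "name": "daily_sales_report"},
    "producer": "https://example.com/jobs/daily_sales_report",
}

response = requests.post(MARQUEZ_URL, json=event, timeout=10)
response.raise_for_status()  # a non-2xx status means the event was rejected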

10:27

And the other one, Egeria.

10:28

Egeria plays a slightly different role.

10:30

It's more focused on broader open metadata management and governance across an

10:34

entire

10:34

enterprise.

10:35

How does it relate to Open Lineage?

10:37

It acts like a broker or exchange.

10:39

Egeria aims to automatically capture and share metadata between all sorts of

10:43

different tools

10:44

and platforms, whatever the vendor.

10:46

So Open Lineage provides the detailed, trusted lineage data, and Egeria helps

10:50

integrate that

10:51

data into your company-wide governance, discovery, and security tools.

10:55

I see.

10:56

So Open Lineage focuses on generating the core lineage events.

10:59

Marquez provides a way to see and explore that lineage.

11:02

And Egeria helps connect that lineage information into the bigger enterprise

11:05

picture.

11:06

You got it.

11:07

Together, they're building this foundational layer for universal data observability.

11:13

It's moving things away from being locked into one vendor's proprietary system.

11:17

Making lineage more open, more democratized.

11:19

Exactly.

11:20

Okay.

11:21

We've covered a lot of ground.

11:22

To summarize for everyone listening, open lineage is this vital open standard.

11:27

It shifts that heavy lifting of instrumentation away from individual teams.

11:32

Shared effort.

11:33

And enables consistent detailed tracking of data runs, jobs, and data sets.

11:38

And it does this using that flexible model with core entities and those extensible

11:42

facets.

11:43

Right.

11:44

And the adoption is real.

11:45

That integration matrix shows it's getting picked up by major platforms.

11:48

Yeah.

11:49

And so for the learner, the big takeaway is that having this standardized shared

11:53

metadata

11:53

governed by projects like open lineage, it really changes the game for data trust.

11:58

How so?

11:59

Well, for security, for audits, for just basic operational efficiency.

12:03

Knowing the true history of your data becomes feasible at scale.

12:06

It's fundamental for building reliable data systems.

12:09

So here's a final thought to leave you with.

12:11

Imagine if every single piece of data processing, whether it's a simple query or a

12:15

super complex

12:16

ML model training.

12:18

What if it could automatically report its own complete verifiable history using

12:23

open

12:23

lineage?

12:24

What kinds of new applications could that unlock?

12:27

Think about automated governance or real time security checks or maybe even new

12:31

forms of

12:32

business intelligence based on process history.

12:35

Yeah.

12:36

If the data reports itself, that opens up a lot of possibilities.

12:39

The future of data might just be self-reporting.

12:41

A provocative thought indeed.

12:43

Thank you for joining us on this deep dive into open lineage.

13:01

We hope this helps you see not just the data, but the flow, the connections, the history

13:04

behind it.
