Today's Deep-Dive: OpenLineage
Ep. 275

Episode description

This deep dive explores the complexities of data flow and the challenges data teams face in tracking and managing data lineage. It introduces OpenLineage, an open standard for collecting lineage metadata, as a way to bring order to the chaotic data journey: it helps teams understand where data came from, decide whether to trust it, and see the impact of changes. The episode defines data lineage as the traceable history of data, recorded as metadata about datasets, the jobs that read and write them, and each execution of those jobs. Before OpenLineage, tracking lineage was a massive headache: every tool duplicated the integration effort, the integrations were fragile, and the resulting data was incomplete. OpenLineage addresses these issues through collaboration, sharing the integration effort across platforms and capturing metadata in real time as jobs run. The standard uses a flexible model with three core entities (dataset, job, run) and extensible facets for detailed metadata. The episode also highlights real-world adoption, mentioning integrations with major platforms like Apache Spark, Airflow, and dbt. Additionally, it discusses related projects like Marquez and Egeria, which help visualize and integrate lineage data. The episode concludes by emphasizing the potential of OpenLineage in enabling data trust, security, and new applications.
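To make the model of dataset, job, run, and facets more concrete, here is a minimal sketch of what a lineage event could look like when assembled by hand. It uses only the Python standard library rather than an OpenLineage client, and the function name, namespace, job name, dataset names, and producer URL are all hypothetical placeholders. The field names follow the general shape of an OpenLineage run event; a real event also carries a schemaURL field, which is left out here.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs,
                    namespace="example-namespace"):
    """Assemble a minimal OpenLineage-style run event as a plain dict.

    All names passed in are illustrative; this is not the official client.
    """
    def dataset(name):
        # Facets are optional, extensible metadata attached to an entity.
        # They are left empty in this sketch.
        return {"namespace": namespace, "name": name, "facets": {}}

    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # A run is one execution of a job, identified by a UUID.
        "run": {"runId": str(uuid.uuid4()), "facets": {}},
        # A job is the recurring process that reads and writes datasets.
        "job": {"namespace": namespace, "name": job_name, "facets": {}},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        # Hypothetical identifier for the tool that emitted the event.
        "producer": "https://example.com/hypothetical-producer",
    }


event = build_run_event(
    "COMPLETE",
    "daily_orders_rollup",
    inputs=["raw.orders"],
    outputs=["analytics.orders_daily"],
)
print(json.dumps(event, indent=2))
```

In practice an integration such as the ones for Spark, Airflow, or dbt would emit events like this automatically, and a backend such as Marquez would collect them.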
