Ep. 275

Today's Deep-Dive: OpenLineage

Oct 30, 2025

• 13min 06s

Episode description

This deep dive explores the complexities of data flow and the challenges data teams face in tracking and managing data lineage. It introduces OpenLineage as a way to bring order to the chaotic data journey. OpenLineage is an open standard for collecting lineage metadata, which helps teams understand data history, trust their data, and see the impact of changes. The episode defines data lineage as the traceable history of data: metadata about datasets, the jobs that read and write them, and when those jobs ran. Before OpenLineage, tracking data lineage was a massive headache due to duplicated effort, fragile integrations, and incomplete data. OpenLineage addresses these issues through collaboration, sharing the integration effort across platforms, and capturing metadata in real time as jobs run. The standard uses a flexible model with three core entities (dataset, job, run) and extensible facets for detailed metadata. The episode also highlights real-world adoption, mentioning integrations with major platforms like Apache Spark, Airflow, and dbt. Additionally, it discusses related projects like Marquez and Egeria, which help visualize and integrate lineage data. The episode concludes by emphasizing the potential of OpenLineage in enabling data trust, security, and new applications.
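To make the dataset, job, run, and facet model a little more concrete, here is a minimal sketch in plain Python that builds two dictionaries shaped like OpenLineage run events. It uses only the standard library rather than an OpenLineage client. The namespace, job and dataset names, and producer URL are hypothetical placeholders, and the sketch omits fields a real event would carry (for example the schema URL and the per-facet producer metadata), so treat it as an illustration of the shape rather than a spec-complete payload.

```python
import json
import uuid
from datetime import datetime, timezone

NAMESPACE = "example_namespace"  # hypothetical namespace, not from the episode


def dataset(name, fields=None):
    """Describe a dataset; the optional schema facet lists its columns."""
    facets = {}
    if fields:
        facets["schema"] = {
            "fields": [{"name": n, "type": t} for n, t in fields]
        }
    return {"namespace": NAMESPACE, "name": name, "facets": facets}


def run_event(event_type, run_id, job_name, inputs, outputs):
    """Build a dict shaped like a run event: one run of one job,
    with the datasets it reads (inputs) and writes (outputs)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/hypothetical-producer",  # placeholder
        "run": {"runId": run_id, "facets": {}},
        "job": {"namespace": NAMESPACE, "name": job_name, "facets": {}},
        "inputs": inputs,
        "outputs": outputs,
    }


# One run of a hypothetical job: a START event, then a COMPLETE event
# that shares the same runId so a backend can pair them up.
run_id = str(uuid.uuid4())
orders = dataset("raw.orders", [("order_id", "INTEGER"), ("amount", "DECIMAL")])
summary = dataset("analytics.daily_revenue")

start = run_event("START", run_id, "build_daily_revenue", [orders], [])
complete = run_event("COMPLETE", run_id, "build_daily_revenue", [orders], [summary])

print(json.dumps(complete, indent=2))
```

The shared run ID is what ties the two events together, and the facets dictionaries are where extra detail such as a schema is attached without changing the core entities.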

https://openlineage.io/
