1 00:00:00,000 --> 00:00:03,180 If you've ever found yourself staring at a dashboard, 2 00:00:03,180 --> 00:00:06,960 you know, the one showing just garbage data all of a sudden, 3 00:00:06,960 --> 00:00:09,740 you know that panic. You realize your core data pipeline 4 00:00:09,740 --> 00:00:12,780 just broke somewhere, and then trying to trace back 5 00:00:12,780 --> 00:00:15,540 which transformation job caused it, 6 00:00:15,540 --> 00:00:20,540 it feels like you're navigating this huge maze in the dark. 7 00:00:20,540 --> 00:00:23,300 Yeah, that feeling, being totally exposed. 8 00:00:23,300 --> 00:00:25,500 Right, that's precisely the problem 9 00:00:25,500 --> 00:00:28,720 that standardizing data observability is trying to solve. 10 00:00:28,720 --> 00:00:31,580 Because as these data systems get bigger and more complex, 11 00:00:31,580 --> 00:00:33,820 tracking the data's whole life cycle, 12 00:00:33,820 --> 00:00:35,300 I mean, where it started, 13 00:00:35,300 --> 00:00:38,300 every single transformation step, who used it, 14 00:00:38,300 --> 00:00:40,220 it's not just theoretical anymore. 15 00:00:40,220 --> 00:00:43,140 It's absolutely critical for operations, 16 00:00:43,140 --> 00:00:44,860 for trust, for everything. 17 00:00:44,860 --> 00:00:47,680 Absolutely, and well, that's our mission for today, really. 18 00:00:47,680 --> 00:00:50,500 We're taking a deep dive into OpenLineage. 19 00:00:50,500 --> 00:00:53,340 It's the open standard designed to finally bring some order 20 00:00:53,340 --> 00:00:56,220 to that chaos we were talking about. 21 00:00:56,220 --> 00:01:00,060 We really wanna give you a clear, beginner-friendly way in 22 00:01:00,060 --> 00:01:02,300 to understand how this standard works, why it matters, 23 00:01:02,300 --> 00:01:05,340 and how it changes the game for data integrity. 24 00:01:05,340 --> 00:01:06,660 Yeah, it's foundational stuff. 25 00:01:06,660 --> 00:01:07,500 It really is. 
26 00:01:07,500 --> 00:01:09,540 It's kind of the shortcut to understanding 27 00:01:09,540 --> 00:01:12,220 how your data is actually born and transformed. 28 00:01:12,220 --> 00:01:15,540 And we've looked at the core specs, the adoption docs. 29 00:01:15,540 --> 00:01:18,700 The goal is you walk away really understanding 30 00:01:18,700 --> 00:01:22,220 this key piece of the modern data world. 31 00:01:22,220 --> 00:01:23,060 Okay, cool. 32 00:01:23,060 --> 00:01:25,020 Now, before we jump into defining the problem 33 00:01:25,020 --> 00:01:26,780 that OpenLineage actually solves, 34 00:01:26,780 --> 00:01:29,060 just a quick note that this deep dive 35 00:01:29,060 --> 00:01:30,620 is supported by SafeServer. 36 00:01:30,620 --> 00:01:32,060 Ah, good point. 37 00:01:32,060 --> 00:01:34,160 SafeServer focuses on hosting software 38 00:01:34,160 --> 00:01:36,200 and helping out with your digital transformation. 39 00:01:36,200 --> 00:01:38,340 You can find more info and resources 40 00:01:38,340 --> 00:01:41,180 over at www.safeserver.de. 41 00:01:41,180 --> 00:01:42,020 Check them out. 42 00:01:42,020 --> 00:01:42,840 Definitely. 43 00:01:42,840 --> 00:01:44,580 So let's start at the beginning then. 44 00:01:44,580 --> 00:01:48,540 Before OpenLineage, what was the status quo? 45 00:01:48,540 --> 00:01:50,820 I mean, how bad was the pain point that made everyone say, 46 00:01:50,820 --> 00:01:53,700 okay, we need a common standard like right now? 47 00:01:53,700 --> 00:01:55,540 Well, the pain point was pretty simple, actually. 48 00:01:55,540 --> 00:02:00,540 It was duplication, cost, and things breaking all the time. 49 00:02:00,540 --> 00:02:02,860 Every company that needed data lineage, 50 00:02:02,860 --> 00:02:04,760 which is basically everyone, let's be honest, 51 00:02:04,760 --> 00:02:06,700 had to build it themselves from scratch. 
52 00:02:06,700 --> 00:02:09,300 So every single project had to instrument, 53 00:02:09,300 --> 00:02:12,240 set up the tracking for all of its jobs individually. 54 00:02:12,240 --> 00:02:15,680 Yeah, so if you had, say, 10 different data tools, 55 00:02:15,680 --> 00:02:17,720 five orchestration layers, 56 00:02:17,720 --> 00:02:18,740 you were trying to maintain 57 00:02:18,740 --> 00:02:21,780 like 15 separate custom tracking solutions. 58 00:02:21,780 --> 00:02:24,300 That just sounds like a constant, grinding drain 59 00:02:24,300 --> 00:02:25,140 on engineering time. 60 00:02:25,140 --> 00:02:27,380 It was. A massive maintenance liability, 61 00:02:27,380 --> 00:02:29,220 because those custom tracking things, 62 00:02:29,220 --> 00:02:31,300 they were external to the actual data tools 63 00:02:31,300 --> 00:02:32,900 like Spark or Airflow. 64 00:02:32,900 --> 00:02:34,100 Oh, okay, so not built in. 65 00:02:34,100 --> 00:02:34,940 Not built in. 66 00:02:34,940 --> 00:02:37,340 They relied on specific internal APIs, 67 00:02:37,340 --> 00:02:39,140 often undocumented ones. 68 00:02:39,140 --> 00:02:42,020 So the moment the tool underneath, say Spark, 69 00:02:42,020 --> 00:02:44,380 released a new version, poof, 70 00:02:44,380 --> 00:02:47,060 your custom lineage script just broke. 71 00:02:47,060 --> 00:02:47,900 Nightmare. 72 00:02:47,900 --> 00:02:50,380 Constantly playing catch-up, spending thousands, 73 00:02:50,380 --> 00:02:53,620 literally, just to keep your basic visibility running. 74 00:02:53,620 --> 00:02:54,740 It was not sustainable. 75 00:02:54,740 --> 00:02:57,140 Okay, so OpenLineage comes along 76 00:02:57,140 --> 00:02:59,180 and completely flips that script, right? 77 00:02:59,180 --> 00:03:00,500 Instead of every company building 78 00:03:00,500 --> 00:03:03,520 these fragile external scripts, 79 00:03:03,520 --> 00:03:06,980 the effort is shared across the community. 80 00:03:06,980 --> 00:03:07,820 Exactly. 
81 00:03:07,820 --> 00:03:09,540 That's the beauty of an open standard. 82 00:03:09,540 --> 00:03:10,540 And the really elegant part: 83 00:03:10,540 --> 00:03:11,740 now, the integration itself 84 00:03:11,740 --> 00:03:13,960 can be pushed inside each project. 85 00:03:13,960 --> 00:03:15,940 So instead of some external script 86 00:03:15,940 --> 00:03:18,700 trying to, like, spy on a pipeline, 87 00:03:18,700 --> 00:03:20,540 the pipeline component itself 88 00:03:20,540 --> 00:03:23,380 speaks the lineage language, natively. 89 00:03:23,380 --> 00:03:24,200 So it's embedded. 90 00:03:24,200 --> 00:03:25,040 It's embedded. 91 00:03:25,040 --> 00:03:26,080 The collection is intrinsic. 92 00:03:26,080 --> 00:03:27,660 So you stop worrying about versions 93 00:03:27,660 --> 00:03:28,900 breaking your tracking logic. 94 00:03:28,900 --> 00:03:30,460 It's just there. 95 00:03:30,460 --> 00:03:31,460 It makes so much sense. 96 00:03:31,460 --> 00:03:32,820 And you know, for context, 97 00:03:32,820 --> 00:03:35,580 this isn't some small side project. 98 00:03:35,580 --> 00:03:37,740 OpenLineage is an LF AI 99 00:03:37,740 --> 00:03:40,180 & Data Foundation graduate project. 100 00:03:40,180 --> 00:03:42,280 That means it's recognized, it's battle-tested, 101 00:03:42,280 --> 00:03:44,060 it's a proper industry standard. 102 00:03:44,060 --> 00:03:44,900 Got it. 103 00:03:44,900 --> 00:03:47,060 And fundamentally, this data lineage, 104 00:03:47,060 --> 00:03:48,660 it provides the foundation. 105 00:03:48,660 --> 00:03:52,540 It lets you build these powerful context-aware data tools 106 00:03:52,540 --> 00:03:54,100 because it tracks all that metadata 107 00:03:54,100 --> 00:03:56,580 about data sets, jobs, runs. 108 00:03:56,580 --> 00:03:57,900 And with that deeper understanding, 109 00:03:57,900 --> 00:03:59,420 you can pinpoint the root cause 110 00:03:59,420 --> 00:04:01,540 of complex problems much faster. 
111 00:04:01,540 --> 00:04:04,300 And crucially, you can understand the impact of changes 112 00:04:04,300 --> 00:04:05,620 before you make them, 113 00:04:05,620 --> 00:04:08,060 before you accidentally break something downstream. 114 00:04:08,060 --> 00:04:10,100 That really highlights the operational win, doesn't it? 115 00:04:10,100 --> 00:04:11,380 Reducing that oops factor. 116 00:04:11,380 --> 00:04:12,300 Exactly. 117 00:04:12,300 --> 00:04:13,860 Fewer oops moments. 118 00:04:13,860 --> 00:04:16,680 Okay, so we know why the standard exists now. 119 00:04:16,680 --> 00:04:19,040 Let's dig into what it actually defines. 120 00:04:19,040 --> 00:04:22,800 If open lineage is creating this shared language 121 00:04:22,800 --> 00:04:26,140 for data tracking, what are the basic words, 122 00:04:26,140 --> 00:04:26,980 the components? 123 00:04:26,980 --> 00:04:28,140 Yeah, good question. 124 00:04:28,140 --> 00:04:30,260 Think of open lineage as defining 125 00:04:30,260 --> 00:04:32,600 a kind of universal data passport. 126 00:04:32,600 --> 00:04:34,100 Okay, I like that, a passport. 127 00:04:34,100 --> 00:04:38,020 Yeah, and this passport dictates the consistent naming, 128 00:04:38,020 --> 00:04:41,040 the structure around three core things, 129 00:04:41,040 --> 00:04:43,600 three core entities that you'll find in any data flow. 130 00:04:43,600 --> 00:04:45,620 Those are the run, the job, and the data set. 131 00:04:45,620 --> 00:04:47,160 Okay, run, job, data set. 132 00:04:47,160 --> 00:04:50,080 So the run, that's like a specific execution 133 00:04:50,080 --> 00:04:51,160 of some process. 134 00:04:51,160 --> 00:04:53,600 Exactly, a single instance of a job running. 135 00:04:53,600 --> 00:04:56,840 And the job is the transformation logic itself, 136 00:04:56,840 --> 00:04:58,240 the code, the definition. 137 00:04:58,240 --> 00:05:00,000 That's right, the definition of the work to be done. 
138 00:05:00,000 --> 00:05:02,280 And the data set is, well, the data, 139 00:05:02,280 --> 00:05:03,680 the input or the output asset. 140 00:05:03,680 --> 00:05:06,040 Precisely, the thing being read from or written to, 141 00:05:06,040 --> 00:05:09,400 those three are the absolute foundational building blocks. 142 00:05:09,400 --> 00:05:11,080 Okay, the minimum required pieces. 143 00:05:11,080 --> 00:05:13,260 They are, but the real magic, I think, 144 00:05:13,260 --> 00:05:16,200 the reason the standard can grow and adapt over time 145 00:05:16,200 --> 00:05:18,240 is this concept they call the facet. 146 00:05:18,240 --> 00:05:19,080 Okay, facet. 147 00:05:19,080 --> 00:05:21,080 So the data world is always changing, right? 148 00:05:21,080 --> 00:05:23,360 New regulations pop up, new kinds of tools, 149 00:05:23,360 --> 00:05:24,920 new transformations. 150 00:05:24,920 --> 00:05:27,560 How does a standard like open lineage 151 00:05:27,560 --> 00:05:29,160 avoid becoming obsolete? 152 00:05:29,160 --> 00:05:31,920 Is that where facets come in, like a plugin system? 153 00:05:31,920 --> 00:05:33,280 That's exactly it, you hit it. 154 00:05:33,280 --> 00:05:35,160 Facets are the plugin model. 155 00:05:35,160 --> 00:05:37,800 A facet is defined as an atomic, 156 00:05:37,800 --> 00:05:40,680 sort of self-contained piece of metadata. 157 00:05:40,680 --> 00:05:41,520 Atomic meaning. 158 00:05:41,520 --> 00:05:44,080 Meaning it describes one specific thing, 159 00:05:44,080 --> 00:05:48,080 and you can attach this facet to any of those core entities, 160 00:05:48,080 --> 00:05:50,520 the run, the job, or the data set. 161 00:05:50,520 --> 00:05:54,960 Ah, okay, so if run, job, and data set are the nouns, 162 00:05:54,960 --> 00:05:57,520 facets are like the descriptive adjectives 163 00:05:57,520 --> 00:05:58,400 you can stick onto them. 164 00:05:58,400 --> 00:05:59,960 That's a great way to put it, yes. 
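The "nouns and adjectives" picture above can be sketched in a few lines of Python. This is a minimal illustration only, not the OpenLineage client library: the class names, the `Facets` alias, the example namespaces, and the `qualityChecks` facet are all assumptions made up for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List
from uuid import uuid4

# A facet is modeled here as a named, self-contained bag of metadata.
# (Hypothetical shape for illustration; consult the spec for real facets.)
Facets = Dict[str, Dict[str, Any]]


@dataclass
class Dataset:
    """The data asset being read from or written to."""
    namespace: str
    name: str
    facets: Facets = field(default_factory=dict)


@dataclass
class Job:
    """The definition of the work to be done."""
    namespace: str
    name: str
    facets: Facets = field(default_factory=dict)


@dataclass
class Run:
    """A single execution of a job."""
    run_id: str
    job: Job
    inputs: List[Dataset] = field(default_factory=list)
    outputs: List[Dataset] = field(default_factory=list)
    facets: Facets = field(default_factory=dict)


# Nouns: one job, one run of it, one input and one output data set.
orders = Dataset("warehouse", "public.orders")
daily_revenue = Dataset("warehouse", "public.daily_revenue")
job = Job("etl", "build_daily_revenue")
run = Run(str(uuid4()), job, inputs=[orders], outputs=[daily_revenue])

# Adjectives: facets attached to whichever entity they describe.
orders.facets["schema"] = {
    "fields": [
        {"name": "order_id", "type": "BIGINT"},
        {"name": "amount", "type": "DECIMAL"},
    ]
}
run.facets["qualityChecks"] = {"passed": 12, "failed": 0}  # made-up custom facet
```

Keeping facets as independent named entries is what lets new metadata be added later without touching the three core types.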
165 00:05:59,960 --> 00:06:02,600 And this is absolutely key for things like governance. 166 00:06:02,600 --> 00:06:03,440 How so? 167 00:06:03,440 --> 00:06:04,540 Well, you can attach things like, 168 00:06:04,540 --> 00:06:06,560 say, a regulatory compliance tag, 169 00:06:06,560 --> 00:06:09,020 maybe the GDPR status of a data set. 170 00:06:09,020 --> 00:06:11,600 Or you could attach a detailed schema fingerprint 171 00:06:11,600 --> 00:06:13,120 right onto the data set entity. 172 00:06:13,120 --> 00:06:13,960 Okay. 173 00:06:13,960 --> 00:06:16,680 Or maybe attach quality check results to the run entity, 174 00:06:16,680 --> 00:06:18,820 and because the whole specification is defined 175 00:06:18,820 --> 00:06:19,920 using OpenAPI. 176 00:06:19,920 --> 00:06:21,080 That standard API stuff. 177 00:06:21,080 --> 00:06:22,200 Exactly. 178 00:06:22,200 --> 00:06:24,960 Developers can extend the standard basically endlessly. 179 00:06:24,960 --> 00:06:27,160 They just define their own custom facets 180 00:06:27,160 --> 00:06:30,080 to track whatever proprietary details they need. 181 00:06:30,080 --> 00:06:32,840 This ensures the standard evolves with the industry 182 00:06:32,840 --> 00:06:33,900 and not behind it. 183 00:06:33,900 --> 00:06:36,460 That makes the whole system incredibly flexible. 184 00:06:36,460 --> 00:06:37,300 Yeah. 185 00:06:37,300 --> 00:06:38,120 Kind of future-proof, doesn't it? 186 00:06:38,120 --> 00:06:38,960 That's the goal, yeah. 187 00:06:38,960 --> 00:06:42,600 So we have these core entities, run, job, data set, 188 00:06:42,600 --> 00:06:45,760 and they can be decorated, enriched with these facets. 189 00:06:45,760 --> 00:06:49,480 How does OpenLineage actually capture and send 190 00:06:49,480 --> 00:06:51,720 this information around, especially with all the different 191 00:06:51,720 --> 00:06:52,720 tools people use? 192 00:06:52,720 --> 00:06:53,280 Right. 
193 00:06:53,280 --> 00:06:57,400 So OpenLineage also defines a standard API specifically 194 00:06:57,400 --> 00:06:59,480 for capturing these lineage events. 195 00:06:59,480 --> 00:07:00,640 An API call, basically. 196 00:07:00,640 --> 00:07:01,520 Yeah. 197 00:07:01,520 --> 00:07:03,160 So the different pipeline components 198 00:07:03,160 --> 00:07:06,480 think your schedulers like Airflow, your data warehouses, 199 00:07:06,480 --> 00:07:08,320 your analysis tools, SQL engines, whatever. 200 00:07:08,320 --> 00:07:10,040 They use the standard API call. 201 00:07:10,040 --> 00:07:10,880 To report back. 202 00:07:10,880 --> 00:07:13,440 To send data about the runs, the jobs, the data 203 00:07:13,440 --> 00:07:15,360 sets, and any relevant facets. 204 00:07:15,360 --> 00:07:17,240 They package it up according to the standard 205 00:07:17,240 --> 00:07:20,720 and ship this event off to a compatible OpenLineage backend. 206 00:07:20,720 --> 00:07:21,480 Gotcha. 207 00:07:21,480 --> 00:07:23,320 It sounds like the essential plumbing needed 208 00:07:23,320 --> 00:07:25,240 to make all these different tools finally 209 00:07:25,240 --> 00:07:28,040 speak the same lineage language consistently. 210 00:07:28,040 --> 00:07:29,240 That's exactly what it is. 211 00:07:29,240 --> 00:07:30,280 It's the common language. 212 00:07:30,280 --> 00:07:34,000 And I saw the system allows for a configurable backend 213 00:07:34,000 --> 00:07:37,080 so users can choose how those events are sent, 214 00:07:37,080 --> 00:07:38,320 like which protocol. 215 00:07:38,320 --> 00:07:40,400 Yeah, that gives you flexibility in your architecture. 216 00:07:40,400 --> 00:07:42,520 You can choose how you want to receive and process 217 00:07:42,520 --> 00:07:43,240 those events. 218 00:07:43,240 --> 00:07:44,160 Makes sense. 219 00:07:44,160 --> 00:07:47,720 OK, now that we understand the how, let's talk adoption. 
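To make the "package it up and ship it" step concrete, here is a rough Python sketch of building a lineage event and handing it to a pluggable transport. The field names follow our understanding of the OpenLineage run event, and the published spec may require further fields, so check it before relying on this. The `build_run_event` and `emit` helpers, the producer URL, and the namespaces are hypothetical, and no real backend is contacted.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple
from uuid import uuid4


def build_run_event(
    event_type: str,
    run_id: str,
    job_namespace: str,
    job_name: str,
    inputs: List[Tuple[str, str]],
    outputs: List[Tuple[str, str]],
) -> Dict:
    """Assemble a simplified lineage event as a plain dictionary."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/my-pipeline",  # hypothetical identifier
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }


def emit(event: Dict, transport: Callable[[str], None]) -> None:
    """Serialize the event and pass it to whichever transport was configured."""
    transport(json.dumps(event))


# Stand-in transport: collect events in memory. An HTTP client or a message
# queue producer could be swapped in without changing the code above.
sent: List[str] = []
run_id = str(uuid4())
for event_type in ("START", "COMPLETE"):
    event = build_run_event(
        event_type,
        run_id,
        "etl",
        "build_daily_revenue",
        inputs=[("warehouse", "public.orders")],
        outputs=[("warehouse", "public.daily_revenue")],
    )
    emit(event, sent.append)
```

Passing the transport in as a callable mirrors the configurable backend mentioned above: the event format stays fixed while the delivery mechanism varies.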
220 00:07:47,720 --> 00:07:49,700 Because looking around, this standard 221 00:07:49,700 --> 00:07:52,600 seems to be getting, well, pretty significant traction. 222 00:07:52,600 --> 00:07:53,480 It really is, yeah. 223 00:07:53,480 --> 00:07:55,800 What's the biggest challenge you see folks facing 224 00:07:55,800 --> 00:07:57,600 when they try to implement this? 225 00:07:57,600 --> 00:08:01,880 Is it getting started, doing the initial instrumentation? 226 00:08:01,880 --> 00:08:05,460 Or is it more about handling the sheer volume of metadata 227 00:08:05,460 --> 00:08:06,920 once it's flowing? 228 00:08:06,920 --> 00:08:08,440 That's a good question. 229 00:08:08,440 --> 00:08:10,080 The initial instrumentation effort, 230 00:08:10,080 --> 00:08:12,240 it's actually decreasing pretty rapidly now, 231 00:08:12,240 --> 00:08:14,560 thanks to all the community contributions, 232 00:08:14,560 --> 00:08:16,280 building integrations. 233 00:08:16,280 --> 00:08:20,200 The bigger challenge often is achieving true column-level 234 00:08:20,200 --> 00:08:22,280 lineage, especially at scale. 235 00:08:22,280 --> 00:08:23,800 Column level versus table level. 236 00:08:23,800 --> 00:08:25,280 Can you quickly break that down? 237 00:08:25,280 --> 00:08:25,920 Sure. 238 00:08:25,920 --> 00:08:29,040 Table-level lineage is basically knowing that, OK, data 239 00:08:29,040 --> 00:08:32,880 moved from table A to table B. Useful, but limited. 240 00:08:32,880 --> 00:08:35,880 Column-level lineage is knowing that this specific column 241 00:08:35,880 --> 00:08:39,360 in table A was used to calculate that specific column in table 242 00:08:39,360 --> 00:08:41,160 B. It's much more granular. 243 00:08:41,160 --> 00:08:43,840 It's like knowing not just that the package arrived, 244 00:08:43,840 --> 00:08:47,160 but exactly which truck carried the crucial piece of equipment 245 00:08:47,160 --> 00:08:48,480 inside that package. 246 00:08:48,480 --> 00:08:49,640 Ah, OK. 
247 00:08:49,640 --> 00:08:52,720 So that lets you trace a specific calculation error 248 00:08:52,720 --> 00:08:56,120 or maybe a compliance issue with PII right back to the source 249 00:08:56,120 --> 00:08:56,400 column. 250 00:08:56,400 --> 00:08:56,960 Exactly. 251 00:08:56,960 --> 00:08:59,920 It's essential for that deep analysis and root cause 252 00:08:59,920 --> 00:09:00,440 finding. 253 00:09:00,440 --> 00:09:03,440 And we see that some heavy hitters, like Apache Spark 254 00:09:03,440 --> 00:09:06,800 and dbt, they're all in. They support both table-level 255 00:09:06,800 --> 00:09:09,360 and that more granular column-level lineage. 256 00:09:09,360 --> 00:09:11,760 Yeah, that strong adoption by major tools 257 00:09:11,760 --> 00:09:14,120 is absolutely critical for the standard's success. 258 00:09:14,120 --> 00:09:16,800 However, there's always a nuance, right? 259 00:09:16,800 --> 00:09:19,720 Listeners should know that column-level lineage, while super valuable, 260 00:09:19,720 --> 00:09:24,080 is also inherently complex to capture perfectly in all situations. 261 00:09:24,080 --> 00:09:29,040 So you will find some, let's say, edge cases or specific ways tools 262 00:09:29,040 --> 00:09:32,720 like Spark or Airflow handle certain complex SQL queries 263 00:09:32,720 --> 00:09:34,920 or maybe specific connectors. 264 00:09:34,920 --> 00:09:38,400 Well, like the note says, for Spark, sometimes tracking lineage 265 00:09:38,400 --> 00:09:42,560 through select queries that hit a JDBC source can be tricky. 266 00:09:42,560 --> 00:09:47,040 Or for Airflow, column level might work great for most SQL operators, 267 00:09:47,040 --> 00:09:50,320 but maybe not for a very specific BigQuery operator 268 00:09:50,320 --> 00:09:51,520 doing something complex. 269 00:09:51,520 --> 00:09:52,020 Got it. 
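The table-level versus column-level distinction can be illustrated with a small Python sketch. The table names, column names, and the `COLUMN_LINEAGE` mapping are invented for this example, and the structure is a simplification rather than the format any particular integration emits.

```python
from typing import Dict, List, Set, Tuple

Column = Tuple[str, str]  # (table, column)

# Hypothetical column-level lineage: each output column maps to the
# columns it was computed from.
COLUMN_LINEAGE: Dict[Column, List[Column]] = {
    ("daily_revenue", "revenue"): [("orders", "amount"), ("orders", "discount")],
    ("daily_revenue", "day"): [("orders", "created_at")],
    ("orders", "amount"): [("raw_orders", "amount_cents")],
}


def source_columns(target: Column, lineage: Dict[Column, List[Column]]) -> Set[Column]:
    """Walk upstream from target and return the columns with no recorded parents."""
    roots: Set[Column] = set()
    seen: Set[Column] = set()
    stack = [target]
    while stack:
        column = stack.pop()
        if column in seen:
            continue  # guards against cycles and repeated work
        seen.add(column)
        upstream = lineage.get(column, [])
        if not upstream and column != target:
            roots.add(column)
        stack.extend(upstream)
    return roots


def table_edges(lineage: Dict[Column, List[Column]]) -> Set[Tuple[str, str]]:
    """Collapse column lineage into table-level (source, target) edges."""
    return {
        (source_table, target_table)
        for (target_table, _), sources in lineage.items()
        for (source_table, _) in sources
    }


# Column level: which original columns feed daily_revenue.revenue?
revenue_sources = source_columns(("daily_revenue", "revenue"), COLUMN_LINEAGE)

# Table level: only that data moved between tables, not which columns.
edges = table_edges(COLUMN_LINEAGE)
```

Here the table-level view only says that `raw_orders` feeds `orders` and `orders` feeds `daily_revenue`, while the column-level walk narrows a suspect `revenue` value down to two specific source columns.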
270 00:09:52,020 --> 00:09:54,840 So the standard provides the map, but sometimes there 271 00:09:54,840 --> 00:09:57,540 are tricky intersections depending on the specific tool. 272 00:09:57,540 --> 00:09:58,340 That's a good way to put it. 273 00:09:58,340 --> 00:10:00,480 The standard provides the blueprint, yeah. 274 00:10:00,480 --> 00:10:04,240 But the devil can be in the integration details for each tool. 275 00:10:04,240 --> 00:10:06,200 The good news, though, is the community 276 00:10:06,200 --> 00:10:08,880 is super active in identifying and resolving 277 00:10:08,880 --> 00:10:10,480 these tool-specific exceptions. 278 00:10:10,480 --> 00:10:12,000 It's constantly improving. 279 00:10:12,000 --> 00:10:12,800 That's great to hear. 280 00:10:12,800 --> 00:10:15,640 It's really impressive how widely embraced it is becoming. 281 00:10:15,640 --> 00:10:18,880 But open lineage isn't the only name people hear in this space. 282 00:10:18,880 --> 00:10:21,360 How does it fit into the wider data ecosystem? 283 00:10:21,360 --> 00:10:24,680 Are there other key projects that work alongside it 284 00:10:24,680 --> 00:10:26,320 or maybe depend on it? 285 00:10:26,320 --> 00:10:27,480 Yeah, definitely. 286 00:10:27,480 --> 00:10:29,680 It's helpful to think of the ecosystem here. 287 00:10:29,680 --> 00:10:33,240 Open lineage, as we said, defines the standard format, 288 00:10:33,240 --> 00:10:36,360 the language, the electrical plug, if you like, 289 00:10:36,360 --> 00:10:37,200 for data lineage. 290 00:10:37,200 --> 00:10:38,240 OK, the standard plug. 291 00:10:38,240 --> 00:10:40,240 It guarantees the format and structure 292 00:10:40,240 --> 00:10:41,800 of the metadata signal. 293 00:10:41,800 --> 00:10:43,560 Now, the question is, what do you plug 294 00:10:43,560 --> 00:10:45,600 in to that standard outlet? 295 00:10:45,600 --> 00:10:47,960 Right, so tell us about Marquez. 296 00:10:47,960 --> 00:10:49,000 That name comes up a lot. 
297 00:10:49,000 --> 00:10:52,600 Marquez is essentially the reference implementation 298 00:10:52,600 --> 00:10:54,040 of the open lineage API. 299 00:10:54,040 --> 00:10:57,000 Think of it as the back end service and the UI 300 00:10:57,000 --> 00:10:58,120 you plug into the wall. 301 00:10:58,120 --> 00:10:58,600 OK. 302 00:10:58,600 --> 00:11:01,440 It focuses on collecting all those open lineage events, 303 00:11:01,440 --> 00:11:04,080 aggregating them, storing the history, 304 00:11:04,080 --> 00:11:05,680 and then visualizing the metadata. 305 00:11:05,680 --> 00:11:08,600 It gives you that dashboard view of your lineage. 306 00:11:08,600 --> 00:11:11,280 So open lineage provides the raw data feed. 307 00:11:11,280 --> 00:11:14,160 Marquez helps you actually see it and explore the history. 308 00:11:14,160 --> 00:11:14,720 Exactly. 309 00:11:14,720 --> 00:11:15,880 Open lineage is the language. 310 00:11:15,880 --> 00:11:18,640 Marquez helps you read the story told in that language. 311 00:11:18,640 --> 00:11:19,240 Got it. 312 00:11:19,240 --> 00:11:21,680 And then there's another project, Egeria. 313 00:11:21,680 --> 00:11:22,720 Where does that fit in? 314 00:11:22,720 --> 00:11:24,160 Is it similar to Marquez? 315 00:11:24,160 --> 00:11:25,480 Egeria is a bit different. 316 00:11:25,480 --> 00:11:27,120 Think bigger picture. 317 00:11:27,120 --> 00:11:29,960 If Marquez is the visualizer plugged into the open lineage 318 00:11:29,960 --> 00:11:33,640 outlet, Egeria is more like the central switchboard 319 00:11:33,640 --> 00:11:36,240 for your entire enterprise's metadata. 320 00:11:36,240 --> 00:11:37,080 The switchboard. 321 00:11:37,080 --> 00:11:37,560 OK. 322 00:11:37,560 --> 00:11:40,320 It offers open metadata and governance capabilities 323 00:11:40,320 --> 00:11:42,280 across the whole organization. 
324 00:11:42,280 --> 00:11:44,840 It's designed to automatically capture, manage, 325 00:11:44,840 --> 00:11:47,200 and importantly, exchange metadata 326 00:11:47,200 --> 00:11:49,520 between lots of different tools and platforms, 327 00:11:49,520 --> 00:11:51,160 regardless of the vendor. 328 00:11:51,160 --> 00:11:52,280 So it connects things. 329 00:11:52,280 --> 00:11:53,000 Yeah. 330 00:11:53,000 --> 00:11:57,280 So open lineage collects the raw standardized lineage data. 331 00:11:57,280 --> 00:11:59,840 Marquez can visualize that specific data. 332 00:11:59,840 --> 00:12:02,040 Egeria can take that traceable lineage data 333 00:12:02,040 --> 00:12:04,960 from open lineage and other metadata sources 334 00:12:04,960 --> 00:12:07,680 and share it intelligently across your entire governance 335 00:12:07,680 --> 00:12:10,840 system, maybe feeding it to risk management tools or data 336 00:12:10,840 --> 00:12:13,200 catalogs or data science platforms. 337 00:12:13,200 --> 00:12:15,800 It helps integrate lineage into broader processes. 338 00:12:15,800 --> 00:12:17,040 OK, that makes sense. 339 00:12:17,040 --> 00:12:18,800 Open lineage for the standard feed, 340 00:12:18,800 --> 00:12:21,280 Marquez for visualization and history, 341 00:12:21,280 --> 00:12:24,280 Egeria for broader enterprise metadata management 342 00:12:24,280 --> 00:12:25,560 and governance integration. 343 00:12:25,560 --> 00:12:26,240 You got it. 344 00:12:26,240 --> 00:12:27,920 They complement each other nicely. 345 00:12:27,920 --> 00:12:30,280 And it's clearly a vibrant ecosystem 346 00:12:30,280 --> 00:12:31,200 developing around this. 347 00:12:31,200 --> 00:12:35,480 I mean, looking at the community stats for open lineage itself, 348 00:12:35,480 --> 00:12:37,160 it's not just a theoretical paper, right? 349 00:12:37,160 --> 00:12:39,040 People are actively using this. 350 00:12:39,040 --> 00:12:44,440 2.1 thousand stars on GitHub, nearly 400 forks. 
351 00:12:44,440 --> 00:12:45,960 That's real activity. 352 00:12:45,960 --> 00:12:47,880 That activity is crucial, absolutely. 353 00:12:47,880 --> 00:12:49,600 It shows it's solving real problems. 354 00:12:49,600 --> 00:12:51,520 And look at the primary languages being used 355 00:12:51,520 --> 00:12:56,920 for the core project, Java at over 60%, Python around 25%. 356 00:12:56,920 --> 00:13:00,080 That mix perfectly mirrors the modern data stack, doesn't it? 357 00:13:00,080 --> 00:13:03,560 You often have execution engines like Spark running on the JVM, 358 00:13:03,560 --> 00:13:05,760 Java, and then orchestration layers like Airflow 359 00:13:05,760 --> 00:13:07,640 heavily using Python. 360 00:13:07,640 --> 00:13:10,120 So that language mix ensures it fits naturally 361 00:13:10,120 --> 00:13:11,800 into the places where lineage actually 362 00:13:11,800 --> 00:13:14,640 needs to be generated, broad applicability. 363 00:13:14,640 --> 00:13:15,200 Fantastic. 364 00:13:15,200 --> 00:13:17,600 OK, so to kind of bring this all back home, 365 00:13:17,600 --> 00:13:20,080 the core takeaway here seems to be that open lineage is really 366 00:13:20,080 --> 00:13:21,560 this essential open standard. 367 00:13:21,560 --> 00:13:25,040 It's transforming data tracking from this fragmented, custom 368 00:13:25,040 --> 00:13:25,640 built mess. 369 00:13:25,640 --> 00:13:26,520 Yeah, the old way. 370 00:13:26,520 --> 00:13:29,840 Into a consistent, shared framework 371 00:13:29,840 --> 00:13:31,640 for collecting lineage. 372 00:13:31,640 --> 00:13:34,160 It's letting all the different systems in your stack 373 00:13:34,160 --> 00:13:37,000 finally communicate lineage effectively, 374 00:13:37,000 --> 00:13:38,960 and hopefully killing off those maintenance 375 00:13:38,960 --> 00:13:39,960 nightmares of the past. 376 00:13:39,960 --> 00:13:42,860 That's the promise, and increasingly the reality. 
377 00:13:42,860 --> 00:13:46,000 And if we connect this to the bigger picture for a second. 378 00:13:46,000 --> 00:13:46,560 Please do. 379 00:13:46,560 --> 00:13:49,280 What OpenLineage really means is that data observability, 380 00:13:49,280 --> 00:13:52,520 it's no longer this specialized, separate tool 381 00:13:52,520 --> 00:13:54,640 that you have to somehow bolt onto the outside 382 00:13:54,640 --> 00:13:55,880 of your systems. 383 00:13:55,880 --> 00:13:59,400 It's becoming a native function, an inherent capability that's 384 00:13:59,400 --> 00:14:02,880 getting embedded directly into every key piece of your data 385 00:14:02,880 --> 00:14:06,120 infrastructure, from Spark to Airflow to dbt. 386 00:14:06,120 --> 00:14:06,840 It's built in. 387 00:14:06,840 --> 00:14:08,340 That feels like a fundamental shift. 388 00:14:08,340 --> 00:14:10,560 It is, and it raises a really important question 389 00:14:10,560 --> 00:14:11,520 for the future, I think. 390 00:14:11,520 --> 00:14:14,200 When every data movement, every transformation 391 00:14:14,200 --> 00:14:17,000 is inherently traceable, because the tools speak 392 00:14:17,000 --> 00:14:20,960 OpenLineage natively, how will that standardized, 393 00:14:20,960 --> 00:14:24,880 built-in lineage fundamentally change things like compliance 394 00:14:24,880 --> 00:14:26,400 or automated data governance? 395 00:14:26,400 --> 00:14:29,000 What becomes possible then? 396 00:14:29,000 --> 00:14:30,760 That is a compelling thought. 397 00:14:30,760 --> 00:14:34,320 What happens when observability is just part of the fabric? 398 00:14:34,320 --> 00:14:36,000 Something to definitely chew on as you're 399 00:14:36,000 --> 00:14:38,540 planning your next data modernization project. 400 00:14:38,540 --> 00:14:40,160 Well, thank you so much for walking us 401 00:14:40,160 --> 00:14:40,560 through all that. 402 00:14:40,560 --> 00:14:41,320 Really helpful. 403 00:14:41,320 --> 00:14:41,920 My pleasure. 
404 00:14:41,920 --> 00:14:42,920 Thanks for having me. 405 00:14:42,920 --> 00:14:45,440 And finally, a big thank you once again to SafeServer 406 00:14:45,440 --> 00:14:49,040 for supporting this deep dive into open lineage. 407 00:14:49,040 --> 00:14:50,960 SafeServer supports your hosting needs 408 00:14:50,960 --> 00:14:52,720 and digital transformation. 409 00:14:52,720 --> 00:14:56,320 You can find more info at www.safeserver.de. 410 00:14:56,320 --> 00:14:58,920 We'll catch you on the next deep dive.