Welcome to the Deep Dive, where we cut through the noise to give you the essentials, fast. Today we're getting stuck into something really interesting, actually a pretty rebellious open source project shaking things up in the AI data world. We're talking about Crawl4AI. It's a web crawler, yes, but it's specifically designed to take the wealth, and the chaos, of the internet and turn it into clean, structured data that large language models can actually use.

So our mission today? Simple. We're going to unpack the tech docs and the GitHub buzz. I mean, this thing has over 55,000 stars; it's the most popular crawler out there right now. We'll figure out what problem it really solves and, importantly, how its whole approach is built around LLMs. We want this to be a really clear starting point for you, even if you're just dipping your toes into modern data pipelines.

But before we dive in, a quick shout-out to our supporter for this deep dive, SafeServer. SafeServer helps with hosting powerful software like Crawl4AI and can really support your digital transformation. You can find out more at www.safe-server.de.

Okay, let's set the scene. If you're building anything with AI today, especially stuff like RAG, retrieval-augmented generation...

Right, where the model needs to pull in outside info.

Exactly. You hit this wall immediately. The web is just messy for machines. It's full of junk: menus, ads, footers, all that boilerplate. Feed that raw stuff to an LLM and you're wasting tokens like crazy, and, well, the answers you get back aren't great.

That really is the heart of the problem. It's not just about grabbing data anymore; it's about cleaning it and structuring it as you grab it. And that's where Crawl4AI came from. It's open source, a crawler and scraper, yes, but its main job, its whole philosophy, is being LLM friendly. The key output, the revolutionary bit, is that it turns the web into clean, LLM-ready Markdown.

Markdown. That's the magic ingredient, isn't it? It's simple, it keeps the structure, you know, headers, lists, but it ditches all the messy HTML and CSS that just inflates token counts and confuses the AI.
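To make that concrete, here's roughly what the basic flow looks like in Python. This is a minimal sketch based on the project's documented quickstart; exact result fields can shift between versions, so treat it as illustrative rather than definitive.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Spin up a managed headless browser, crawl one page,
    # and get back LLM-ready markdown instead of raw HTML.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())
```

One call in, clean Markdown out: that's the pitch in code form.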
Precisely. And the backstory is kind of amazing, actually.

Yeah, pure developer frustration, really. The founder, who has a background in NLP, was trying to scale up data gathering back in 2023 and found the existing tools were just lacking. They were either closed source and expensive, or they pretended to be open source but then you needed accounts, API tokens, sometimes there were hidden fees.

It felt like lock-in, right? Blocking affordable access to do serious work. So what happened? Sounds like someone got pretty fired up.

Oh yeah, he literally said he went into "turbo anger mode." This wasn't some big corporate project; it was personal. He built Crawl4AI fast and put it out there as open source for availability, meaning anyone could just grab it, no strings attached, and with the goal of affordability. The whole idea is democratizing knowledge: making sure structured text, images, metadata, all prepped for AI, isn't stuck behind a paywall.

Okay, that explains the massive GitHub following, that drive for openness. But you know, developers need tools that actually work, not just ones with a good story. So let's shift gears: what are the technical chops that make this thing stand out? Speed and control seem to be big selling points.

Definitely. Performance and control are paramount. They call it "fast in practice," and that comes down to smart design choices, like an async browser pool. For anyone new to that, async just means it doesn't do things one by one; it juggles hundreds of browser requests at the same time. This cuts down waiting time and uses your hardware way more efficiently.

Right, that makes sense, a big efficiency boost right there.
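For a feel of what that async pool means in practice, here's a hedged sketch of crawling several URLs concurrently with the library's arun_many helper. The URLs are placeholders, and the exact result fields vary a bit across versions.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = [
        "https://example.com/docs",
        "https://example.com/blog",
        "https://example.com/changelog",
    ]
    async with AsyncWebCrawler() as crawler:
        # arun_many fans the URLs out across the shared browser pool
        # instead of crawling them one by one.
        results = await crawler.arun_many(urls=urls)
        for r in results:
            print(r.url, "ok" if r.success else "failed")

asyncio.run(main())
```

Because the pool handles the requests concurrently, total wall-clock time ends up closer to the slowest single page than to the sum of all pages.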
But the web fights back, doesn't it? Sites use JavaScript, load things late, and they're actively trying to block bots. How does Crawl4AI handle that minefield?

That's where full browser control comes in. It's not pretending to be a browser; it's actually driving real browser instances, using something called the Chrome DevTools Protocol. So it sees the page exactly like you would. It runs the JavaScript, waits for stuff to load in, handles those images that only appear when you scroll down, lazy loading.

Okay.

And it can even simulate scrolling down the page, what they call full-page scanning, to grab content on those infinite-scroll sites.

Okay, that's clever. But let's talk bot detection; that's the big headache for many people, right? You mentioned stealth mode. Sounds great, but doesn't running a full browser trying to look human make everything much slower and heavier? What's the real-world trade-off there for getting past Cloudflare or Akamai?

That's a fair question. Absolutely, there's always a trade-off. Running a full browser is more resource-intensive than a simple request, no doubt. But Crawl4AI tries to balance that with the async stuff and smart caching we mentioned. And honestly, the benefit of stealth mode, which uses configurations to mimic a real user, often outweighs the cost, because the alternative is just getting blocked and failing the crawl completely. Plus it handles the practical things you need, like using proxies, managing sessions so you can stay logged in, and keeping cookies persistent if you need to scrape behind a login.

Got it. So it's fast, it's sneaky when it needs to be, and it really controls the browser environment.
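Here's what wiring some of that up might look like, as a minimal sketch assuming the library's documented BrowserConfig and CrawlerRunConfig objects. The proxy URL and session name are placeholders, and parameter names may differ slightly by version.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Browser-level settings: headless, routed through a (placeholder) proxy.
    browser_cfg = BrowserConfig(
        headless=True,
        proxy="http://user:pass@proxy.example:8080",
    )
    # Run-level settings: reuse a named session so cookies persist across
    # calls, and scroll the full page so lazy-loaded content renders.
    run_cfg = CrawlerRunConfig(
        session_id="logged_in_session",
        scan_full_page=True,
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://example.com/feed", config=run_cfg)
        print(result.markdown)

asyncio.run(main())
```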
Now let's get to the AI part of Crawl4AI. The output is AI friendly, clear. But how does the crawler itself use intelligence, not just to grab stuff, but to filter it, maybe even learn? This feels like the core innovation.

Yeah, this is where you see features really tailored for optimizing tokens and boosting RAG performance. It starts with how it generates the Markdown. You've got your basic clean Markdown, which just strips out the obvious HTML junk.

Right.

But then there's "fit markdown." This uses heuristic-based filtering.

Okay, heuristic filtering. Can you break that down a bit for someone new to this? What does that actually mean for the data they get?

Sure. Think of heuristics as smart rules of thumb. The crawler uses these rules to guess what parts of a web page are probably useless: you know, navigation menus, sidebars, footers, maybe comment sections. Fit markdown tries to identify and just delete that stuff automatically. Imagine cutting, say, 40% of the useless words, the tokens, from a page just by being smart about removing the boilerplate. That saves you real money on LLM calls. But more importantly, it makes your RAG system way more accurate, because the AI is only looking at the actual content, the important stuff.

Right, efficiency through intelligence, makes sense. What if I have a really specific goal, like I only want the Q4 earnings numbers from a company site? How does it filter out all the other noise?

For that kind of targeted crawl it can use the BM25 algorithm.

BM25?

Yeah, it's a well-known ranking function from information retrieval. Basically, think of it as a sophisticated way to score how relevant a piece of text is to your specific search query. So if you tell the crawler you're looking for "Q4 2024 earnings report," BM25 helps ensure the final Markdown focuses tightly on text related to those terms. It helps ignore the CEO's blog post or the company picnic photos.
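Putting those two filtering modes side by side, here's a hedged sketch using the content-filter strategies the docs describe: a pruning filter for generic boilerplate removal, or a BM25 filter for query-focused crawls. Import paths and the threshold value reflect my reading of recent versions, so double-check them against the docs for the version you're on.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Option A: heuristic pruning of low-signal blocks (navs, footers, sidebars).
    content_filter = PruningContentFilter(threshold=0.5)
    # Option B: query-focused filtering for a targeted crawl.
    # content_filter = BM25ContentFilter(user_query="Q4 2024 earnings report")

    cfg = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=content_filter)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/investors", config=cfg)
        print(result.markdown.raw_markdown[:500])  # full cleaned markdown
        print(result.markdown.fit_markdown[:500])  # the filtered "fit" variant

asyncio.run(main())
```

Swapping the pruning filter for the BM25 one is what turns a generic crawl into the targeted earnings-report crawl described above.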
Okay. Now this is where it gets really cutting-edge, based on the source docs: adaptive crawling. My understanding is this helps the crawler know when it's found enough information, like when to stop.

Spot on, and that's huge for saving resources. Old-school crawlers just follow links deeper and deeper until they hit some arbitrary limit. Super wasteful. Adaptive crawling is smarter: it uses these advanced information-foraging algorithms, fancy term, but basically the crawler learns the site structure as it goes. It's constantly asking, is the new information I'm finding actually relevant to the original query? And it figures out when it's gathered enough information to likely answer that query, based on a confidence level you can set.

So instead of blindly following 50 links when maybe only 10 were useful, it might stop after 15 because it thinks, okay, I probably got what I need.

Exactly that. You set a threshold, it hits it, and that specific crawl job shuts down. It's optimizing based on knowledge gathering, not just link counting.

Very smart, efficiency and intelligence working together.
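In code, that might look something like the following. The AdaptiveCrawler and AdaptiveConfig names come from the project's adaptive-crawling documentation as I recall it in recent versions; the threshold, URL, and query here are illustrative, so treat this as a sketch rather than a guaranteed API.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def main():
    # Stop once the crawler is ~80% confident it can answer the query,
    # or after 20 pages, whichever comes first.
    cfg = AdaptiveConfig(confidence_threshold=0.8, max_pages=20)
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, cfg)
        await adaptive.digest(
            start_url="https://example.com/docs",
            query="Q4 2024 earnings report",
        )
        print(f"Stopped at confidence {adaptive.confidence:.2f}")

asyncio.run(main())
```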
Okay, one more intelligence piece: structured data. Tables are vital, right, for databases, for training models, but huge tables can crash scrapers.

Yeah, memory limits are a classic problem. Crawl4AI tackles this with what they call revolutionary LLM table extraction. You can still use the old way, CSS selectors, XPath, but it can also use LLMs directly for pulling out table data. The clever part is intelligent chunking: instead of trying to load a massive multi-page table into memory all at once, which often fails, it breaks the table into smaller, manageable pieces, uses the LLM to process each chunk, extract the data, and clean it up, and then stitches the results back together seamlessly. It's built for handling really big data sets.

That covers the what and how brilliantly. So, last piece: deployment. How easy is it for, say, a developer or a small team to actually get this thing running and plugged into their workflow?

Well, the basic install is super easy if you use Python: just pip install crawl4ai, standard stuff. But they clearly built it knowing that real-world, large-scale use needs more than just running it on your laptop.

Which naturally leads to their Docker setup, I imagine.

Exactly. The Docker setup is really key for production. It bundles everything up neatly: you get a ready-to-go FastAPI server for handling API requests, there's built-in security with JWT tokens, and it's designed to be deployed in the cloud and handle lots of crawl jobs simultaneously.

And I saw something crucial in the latest updates: webhooks. That feels like a major quality-of-life improvement, right? No more constantly checking if a job is done.

Oh, absolutely. It gets rid of that tedious polling process. The recent version added a full webhook system for the Docker job-queue API. So yeah, "no more polling" is the headline there. Crawl4AI can now actively tell your other systems when a crawl job is finished or when an LLM extraction task is complete. It sends out real-time notifications. They even built in retry logic, so if your receiving system hiccups, it won't just fail. It makes integration much smoother.
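On the receiving side, a webhook consumer can be as small as this. It's a hypothetical sketch: the endpoint path and the payload fields (job_id, status) are assumptions for illustration, so check the actual schema in the project's Docker deployment docs.

```python
# A minimal receiver for job-completion notifications, using FastAPI
# (the same framework the Docker image serves its own API with).
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/crawl-complete")
async def crawl_complete(request: Request):
    payload = await request.json()
    # Field names are assumed for illustration; the real schema may differ.
    job_id = payload.get("job_id")
    status = payload.get("status")
    print(f"Crawl job {job_id} finished with status: {status}")
    # Acknowledge quickly with a 2xx so the sender's retry logic
    # doesn't treat the delivery as failed and re-send it.
    return {"ok": True}
```

Run it with any ASGI server, point the crawler's webhook configuration at the endpoint, and your pipeline reacts the moment a job completes instead of polling for it.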
That really brings it all together, doesn't it? Back to that original mission: building an independent, powerful tool that's genuinely accessible. They didn't just build a better crawler; they built a whole transparent ecosystem for getting data.

Absolutely. Yeah, their mission is clear about that: fostering a shared data economy, making sure AI gets fed by real human knowledge, staying transparent. That tiered sponsorship program they have is specifically to keep the core project free and independent, true to that original rebellion against walled gardens.

It really is a great summary of Crawl4AI, then: a powerful, open source way to bridge the gap between the messy web and what modern AI actually needs, structured, clean data, driven by speed, control, and that really smart adaptive intelligence.

But you know this space never stands still. Looking at their roadmap, they've got some fascinating ideas cooking, things like an agentic crawler...

Right, like an autonomous system that could handle complex, multi-step data tasks on its own.

...and a knowledge-optimal crawler. It makes you wonder. And here's a final thought for you, our listener, to chew on: if a crawler can learn when to stop because it's satisfied an information need, what other boundaries will AI start redefining in how we find and use data? Could we see AI systems soon that just autonomously manage the entire research process, from asking the question to delivering a structured report? Something to think about.

Okay, let's thank our supporter one last time: SafeServer. Remember, they help with hosting and support your digital transformation. Check them out at www.safeserver.de.

That's all we have time for in this deep dive. Go explore these ideas further. Happy crawling.