Welcome to the Deep Dive, where we cut through the noise to give you the essentials, fast. Today we're getting stuck into something really interesting, actually a pretty rebellious open source project shaking things up in the AI data world. We're talking about Crawl4AI. It's a web crawler, yes, but it's specifically designed to take the wealth, and the chaos, of the internet and turn it into clean, structured data that large language models can actually use.

So our mission today? Simple. We're going to unpack the tech docs and the GitHub buzz. I mean, this thing has over 55,000 stars; it's the most popular crawler out there right now. We'll figure out what problem it really solves and, importantly, how its whole approach is built around LLMs. We want this to be a really clear starting point for you, even if you're just dipping your toes into modern data pipelines.

But before we dive in, a quick shout-out to our supporter for this deep dive, SafeServer. SafeServer helps with hosting powerful software like Crawl4AI and can really support your digital transformation. You can find out more at www.safe-server.de.

Okay, let's set the scene. If you're building anything with AI today, especially stuff like RAG, retrieval-augmented generation...

Right, where the model needs to pull in outside info.

Exactly. You hit this wall immediately. The web is just messy for machines. It's full of junk: menus, ads, footers, all that boilerplate. Feed that raw stuff to an LLM and you're wasting tokens like crazy, and, well, the answers you get back aren't great.

That really is the heart of the problem. It's not just about grabbing data anymore; it's about cleaning it and structuring it as you grab it. And that's where Crawl4AI came from. It's open source, a crawler and scraper, yes, but its main job, its whole philosophy, is being LLM friendly. The key output, the revolutionary bit, is that it turns the web into clean, LLM-ready Markdown.

Markdown. That's the magic ingredient, isn't it? It's simple, it keeps the structure, you know, headers, lists, but it ditches all the messy HTML and CSS that just inflates token counts and confuses the AI.
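To make that concrete, here's roughly what the basic flow looks like in Python. This is a minimal sketch based on the project's documented quickstart; exact result fields can shift between versions, so treat it as illustrative rather than definitive.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Spin up a managed headless browser, crawl one page,
    # and get back LLM-ready markdown instead of raw HTML.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())
```

One call in, clean Markdown out: that's the pitch in code form.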
Precisely. And the backstory is kind of amazing, actually.

Yeah, pure developer frustration, really. The founder, who has a background in NLP, was trying to scale up data gathering back in 2023 and found the existing tools were just lacking. They were either closed source and expensive, or they pretended to be open source but then you needed accounts, API tokens, sometimes there were hidden fees.

It felt like lock-in, right? Blocking affordable access to do serious work. So what happened? Sounds like someone got pretty fired up.

Oh yeah, he literally said he went into "turbo anger mode." This wasn't some big corporate project; it was personal. He built Crawl4AI fast and put it out there as open source for availability, meaning anyone could just grab it, no strings attached, and with the goal of affordability. The whole idea is democratizing knowledge: making sure structured text, images, metadata, all prepped for AI, isn't stuck behind a paywall.

Okay, that explains the massive GitHub following, that drive for openness. But you know, developers need tools that actually work, not just ones with a good story. So let's shift gears: what are the technical chops that make this thing stand out? Speed and control seem to be big selling points.

Definitely. Performance and control are paramount. They call it "fast in practice," and that comes down to smart design choices, like an async browser pool. For anyone new to that, async just means it doesn't do things one by one; it juggles hundreds of browser requests at the same time. This cuts down waiting time and uses your hardware way more efficiently.

Right, that makes sense, a big efficiency boost right there.
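For a feel of what that async pool means in practice, here's a hedged sketch of crawling several URLs concurrently with the library's arun_many helper. The URLs are placeholders, and the exact result fields vary a bit across versions.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = [
        "https://example.com/docs",
        "https://example.com/blog",
        "https://example.com/changelog",
    ]
    async with AsyncWebCrawler() as crawler:
        # arun_many fans the URLs out across the shared browser pool
        # instead of crawling them one by one.
        results = await crawler.arun_many(urls=urls)
        for r in results:
            print(r.url, "ok" if r.success else "failed")

asyncio.run(main())
```

Because the pool handles the requests concurrently, total wall-clock time ends up closer to the slowest single page than to the sum of all pages.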
But the web fights back, doesn't it? Sites use JavaScript, load things late, and they're actively trying to block bots. How does Crawl4AI handle that minefield?

That's where full browser control comes in. It's not pretending to be a browser; it's actually driving real browser instances, using something called the Chrome DevTools Protocol. So it sees the page exactly like you would. It runs the JavaScript, waits for stuff to load in, handles those images that only appear when you scroll down, lazy loading.

Okay.

And it can even simulate scrolling down the page, what they call full-page scanning, to grab content on those infinite-scroll sites.

Okay, that's clever. But let's talk bot detection; that's the big headache for many people, right? You mentioned stealth mode. Sounds great, but doesn't running a full browser trying to look human make everything much slower and heavier? What's the real-world trade-off there for getting past Cloudflare or Akamai?

That's a fair question. Absolutely, there's always a trade-off. Running a full browser is more resource-intensive than a simple request, no doubt. But Crawl4AI tries to balance that with the async stuff and smart caching we mentioned. And honestly, the benefit of stealth mode, which uses configurations to mimic a real user, often outweighs the cost, because the alternative is just getting blocked and failing the crawl completely. Plus it handles the practical things you need, like using proxies, managing sessions so you can stay logged in, and keeping cookies persistent if you need to scrape behind a login.

Got it. So it's fast, it's sneaky when it needs to be, and it really controls the browser environment.
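Here's what wiring some of that up might look like, as a minimal sketch assuming the library's documented BrowserConfig and CrawlerRunConfig objects. The proxy URL and session name are placeholders, and parameter names may differ slightly by version.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Browser-level settings: headless, routed through a (placeholder) proxy.
    browser_cfg = BrowserConfig(
        headless=True,
        proxy="http://user:pass@proxy.example:8080",
    )
    # Run-level settings: reuse a named session so cookies persist across
    # calls, and scroll the full page so lazy-loaded content renders.
    run_cfg = CrawlerRunConfig(
        session_id="logged_in_session",
        scan_full_page=True,
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://example.com/feed", config=run_cfg)
        print(result.markdown)

asyncio.run(main())
```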
Now let's get to the AI part of Crawl4AI. The output is AI friendly, clear. But how does the crawler itself use intelligence, not just to grab stuff, but to filter it, maybe even learn? This feels like the core innovation.

Yeah, this is where you see features really tailored for optimizing tokens and boosting RAG performance. It starts with how it generates the Markdown. You've got your basic clean Markdown, which just strips out the obvious HTML junk.

Right.

But then there's "fit markdown." This uses heuristic-based filtering.

Okay, heuristic filtering. Can you break that down a bit for someone new to this? What does that actually mean for the data they get?

Sure. Think of heuristics as smart rules of thumb. The crawler uses these rules to guess what parts of a web page are probably useless: you know, navigation menus, sidebars, footers, maybe comment sections. Fit markdown tries to identify and just delete that stuff automatically. Imagine cutting, say, 40% of the useless words, the tokens, from a page just by being smart about removing the boilerplate. That saves you real money on LLM calls. But more importantly, it makes your RAG system way more accurate, because the AI is only looking at the actual content, the important stuff.

Right, efficiency through intelligence, makes sense. What if I have a really specific goal, like I only want the Q4 earnings numbers from a company site? How does it filter out all the other noise?

For that kind of targeted crawl it can use the BM25 algorithm.

BM25?

Yeah, it's a well-known ranking function from information retrieval. Basically, think of it as a sophisticated way to score how relevant a piece of text is to your specific search query. So if you tell the crawler you're looking for "Q4 2024 earnings report," BM25 helps ensure the final Markdown focuses tightly on text related to those terms. It helps ignore the CEO's blog post or the company picnic photos.
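Putting those two filtering modes side by side, here's a hedged sketch using the content-filter strategies the docs describe: a pruning filter for generic boilerplate removal, or a BM25 filter for query-focused crawls. Import paths and the threshold value reflect my reading of recent versions, so double-check them against the docs for the version you're on.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Option A: heuristic pruning of low-signal blocks (navs, footers, sidebars).
    content_filter = PruningContentFilter(threshold=0.5)
    # Option B: query-focused filtering for a targeted crawl.
    # content_filter = BM25ContentFilter(user_query="Q4 2024 earnings report")

    cfg = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=content_filter)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/investors", config=cfg)
        print(result.markdown.raw_markdown[:500])  # full cleaned markdown
        print(result.markdown.fit_markdown[:500])  # the filtered "fit" variant

asyncio.run(main())
```

Swapping the pruning filter for the BM25 one is what turns a generic crawl into the targeted earnings-report crawl described above.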
Okay. Now this is where it gets really cutting-edge, based on the source docs: adaptive crawling. My understanding is this helps the crawler know when it's found enough information, like when to stop.

Spot on, and that's huge for saving resources. Old-school crawlers just follow links deeper and deeper until they hit some arbitrary limit. Super wasteful. Adaptive crawling is smarter: it uses these advanced information-foraging algorithms, fancy term, but basically the crawler learns the site structure as it goes. It's constantly asking, is the new information I'm finding actually relevant to the original query? And it figures out when it's gathered enough information to likely answer that query, based on a confidence level you can set.

So instead of blindly following 50 links when maybe only 10 were useful, it might stop after 15 because it thinks, okay, I probably got what I need.

Exactly that. You set a threshold, it hits it, and that specific crawl job shuts down. It's optimizing based on knowledge gathering, not just link counting.

Very smart, efficiency and intelligence working together.
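In code, that might look something like the following. The AdaptiveCrawler and AdaptiveConfig names come from the project's adaptive-crawling documentation as I recall it in recent versions; the threshold, URL, and query here are illustrative, so treat this as a sketch rather than a guaranteed API.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def main():
    # Stop once the crawler is ~80% confident it can answer the query,
    # or after 20 pages, whichever comes first.
    cfg = AdaptiveConfig(confidence_threshold=0.8, max_pages=20)
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, cfg)
        await adaptive.digest(
            start_url="https://example.com/docs",
            query="Q4 2024 earnings report",
        )
        print(f"Stopped at confidence {adaptive.confidence:.2f}")

asyncio.run(main())
```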
Okay, one more intelligence piece: structured data. Tables are vital, right, for databases, for training models, but huge tables can crash scrapers.

Yeah, memory limits are a classic problem. Crawl4AI tackles this with what they call revolutionary LLM table extraction. You can still use the old way, CSS selectors, XPath, but it can also use LLMs directly for pulling out table data. The clever part is intelligent chunking: instead of trying to load a massive multi-page table into memory all at once, which often fails, it breaks the table into smaller, manageable pieces, uses the LLM to process each chunk, extract the data, and clean it up, and then stitches the results back together seamlessly. It's built for handling really big data sets.

That covers the what and how brilliantly. So, last piece: deployment. How easy is it for, say, a developer or a small team to actually get this thing running and plugged into their workflow?

Well, the basic install is super easy if you use Python: just pip install crawl4ai, standard stuff. But they clearly built it knowing that real-world, large-scale use needs more than just running it on your laptop.

Which naturally leads to their Docker setup, I imagine.

Exactly. The Docker setup is really key for production. It bundles everything up neatly: you get a ready-to-go FastAPI server for handling API requests, there's built-in security with JWT tokens, and it's designed to be deployed in the cloud and handle lots of crawl jobs simultaneously.

And I saw something crucial in the latest updates: webhooks. That feels like a major quality-of-life improvement, right? No more constantly checking if a job is done.

Oh, absolutely. It gets rid of that tedious polling process. The recent version added a full webhook system for the Docker job-queue API. So yeah, "no more polling" is the headline there. Crawl4AI can now actively tell your other systems when a crawl job is finished or when an LLM extraction task is complete. It sends out real-time notifications. They even built in retry logic, so if your receiving system hiccups, it won't just fail. It makes integration much smoother.
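On the receiving side, a webhook consumer can be as small as this. It's a hypothetical sketch: the endpoint path and the payload fields (job_id, status) are assumptions for illustration, so check the actual schema in the project's Docker deployment docs.

```python
# A minimal receiver for job-completion notifications, using FastAPI
# (the same framework the Docker image serves its own API with).
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/crawl-complete")
async def crawl_complete(request: Request):
    payload = await request.json()
    # Field names are assumed for illustration; the real schema may differ.
    job_id = payload.get("job_id")
    status = payload.get("status")
    print(f"Crawl job {job_id} finished with status: {status}")
    # Acknowledge quickly with a 2xx so the sender's retry logic
    # doesn't treat the delivery as failed and re-send it.
    return {"ok": True}
```

Run it with any ASGI server, point the crawler's webhook configuration at the endpoint, and your pipeline reacts the moment a job completes instead of polling for it.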
That really brings it all together, doesn't it? Back to that original mission: building an independent, powerful tool that's genuinely accessible. They didn't just build a better crawler; they built a whole transparent ecosystem for getting data.

Absolutely. Yeah, their mission is clear about that: fostering a shared data economy, making sure AI gets fed by real human knowledge, staying transparent. That tiered sponsorship program they have is specifically to keep the core project free and independent, true to that original rebellion against walled gardens.

It really is a great summary of Crawl4AI, then: a powerful, open source way to bridge the gap between the messy web and what modern AI actually needs, structured, clean data, driven by speed, control, and that really smart adaptive intelligence.

But you know this space never stands still. Looking at their roadmap, they've got some fascinating ideas cooking, things like an agentic crawler...

Right, like an autonomous system that could handle complex, multi-step data tasks on its own.

...and a knowledge-optimal crawler. It makes you wonder. And here's a final thought for you, our listener, to chew on: if a crawler can learn when to stop because it's satisfied an information need, what other boundaries will AI start redefining in how we find and use data? Could we see AI systems soon that just autonomously manage the entire research process, from asking the question to delivering a structured report? Something to think about.

Okay, let's thank our supporter one last time: SafeServer. Remember, they help with hosting and support your digital transformation. Check them out at www.safeserver.de.

That's all we have time for in this deep dive. Go explore these ideas further. Happy crawling.