Welcome to the deep dive, where we cut through the noise to give you the essentials fast. Today we're getting stuck into something really interesting, actually a pretty rebellious open source project shaking things up in the AI data world. We're talking about Crawl4AI. It's a web crawler, yes, but it's specifically designed to take the wealth, the chaos, of the internet and turn it into clean, structured data that large language models can actually use.

So our mission today? Simple. We're going to unpack the tech docs and the GitHub buzz. I mean, this thing has over fifty-five thousand stars; it's the most popular crawler out there right now. We'll figure out what problem it really solves and, importantly, how its whole approach is built around LLMs. We want this to be a really clear starting point for you, even if you're just dipping your toes into modern data pipelines.

But before we dive in, a quick shout out to our supporter for this deep dive, Safe Server. Safe Server helps with hosting powerful software like Crawl4AI and can really support your digital transformation. You can find out more at www.safe-server.de.
Okay, let's set the scene. If you're building anything with AI today, especially stuff like RAG, retrieval augmented generation. Right, where the model needs to pull in outside info. Exactly. You hit this wall immediately: the web is just messy for machines. It's full of junk, menus, ads, footers, all that boilerplate. Feed that raw stuff to an LLM and you're wasting tokens like crazy, and, well, the answers you get back aren't great. That really is the heart of the problem. It's not just about grabbing data anymore; it's about cleaning it and structuring it as you grab it.

And that's where Crawl4AI came from. It's open source, a crawler and scraper, yeah, but its main job, its whole philosophy, is being LLM friendly. The key output, the revolutionary bit, is that it turns the web into clean, LLM-ready markdown. Markdown. That's the magic ingredient, isn't it? It's simple, it keeps the structure, you know, headers, lists, but it ditches all the messy HTML and CSS that just inflates token counts and confuses the AI.
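To make that concrete, here's a minimal sketch of what that looks like in code, based on the basic usage pattern in the project's documentation (an AsyncWebCrawler instance and its arun call); treat the details as something to verify against the current docs.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Spin up a managed browser, fetch the page, and get back
    # clean markdown instead of raw HTML.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # LLM-ready markdown, ready to chunk and embed

asyncio.run(main())
```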
Precisely. And the backstory is kind of amazing, actually. Yeah, pure developer frustration, really. The founder, who has a background in NLP, was trying to scale up data gathering back in 2023 and found the existing tools were just lacking. They were either closed source and expensive, or they pretended to be open source, but then you needed accounts, API tokens, and sometimes there were hidden fees. It felt like lock-in, right? Blocking affordable access to do serious work. So what happened? Sounds like someone got pretty fired up.

Oh yeah, he literally said he went into turbo anger mode. This wasn't some big corporate project; it was personal. He built Crawl4AI fast and put it out there as open source, meaning anyone could just grab it, no strings attached, with the goal of affordability. The whole idea is democratizing knowledge, making sure structured text, images, and metadata, all prepped for AI, aren't stuck behind a paywall.

Okay, that explains the massive GitHub following, that drive for openness. But you know, developers need tools that actually work, not just ones with a good story. So let's shift gears: what are the technical chops that make this thing stand out?
Speed and control seem to be big selling points. Definitely, performance and control are paramount. They call it fast in practice, and that comes down to smart design choices, like an async browser pool. For anyone new to that, async just means it doesn't do things one by one; it juggles hundreds of browser requests at the same time. This cuts down waiting time and uses your hardware way more efficiently. Right, that makes sense, a big efficiency boost right there.
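As a rough illustration of that concurrency, here's a minimal sketch using the library's batch-crawl call (arun_many, as I recall it from the docs; confirm the name and signature against the current API). The URLs are placeholders.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

URLS = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
]

async def main():
    # One crawler instance juggles many pages concurrently
    # instead of fetching them one at a time.
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=URLS)
        for r in results:
            print(r.url, "ok" if r.success else "failed")

asyncio.run(main())
```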
But the web fights back, doesn't it? Sites use JavaScript, load things late, and they're actively trying to block bots. How does Crawl4AI handle that minefield? That's where full browser control comes in. It's not pretending to be a browser; it's actually driving real browser instances, using something called the Chrome DevTools Protocol. So it sees the page exactly like you would. It runs the JavaScript, waits for stuff to load in, and handles those images that only appear when you scroll down, lazy loading. And it can even simulate scrolling down the page, what they call full page scanning, to grab content on those infinite scroll sites.
Okay, that's clever, but let's talk bot detection. That's the big headache for many people, right? Yeah, you mentioned stealth mode. Sounds great, but doesn't running a full browser and trying to look human make everything much slower and heavier? What's the real-world trade-off there for getting past Cloudflare or Akamai? That's a fair question. Absolutely, there's always a trade-off: running a full browser is more resource intensive than a simple request, no doubt, but Crawl4AI tries to balance that with the async stuff and smart caching we mentioned. And honestly, the benefit of stealth mode, which uses configurations to mimic a real user, often outweighs the cost, because the alternative is just getting blocked and failing the crawl completely. Plus it handles the practical things you need, like using proxies, managing sessions so you can stay logged in, and keeping cookies persistent if you need to scrape behind a login. Got it. So it's fast, it's sneaky when it needs to be, and it really controls the browser environment.
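For the session side of that, here's a minimal sketch of keeping one logged-in browser session alive across several requests. The session_id option on the run config is how I recall the docs describing session reuse; proxy and stealth settings live on the browser config, but the exact field names vary between versions, so check the current documentation. The URLs are placeholders.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # Reusing the same session_id keeps cookies and login state
        # between calls instead of opening a fresh browser context.
        config = CrawlerRunConfig(session_id="my_logged_in_session")

        dashboard = await crawler.arun(url="https://example.com/dashboard", config=config)
        reports = await crawler.arun(url="https://example.com/reports", config=config)
        print(dashboard.success, reports.success)

asyncio.run(main())
```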
Now let's get to the AI part of Crawl4AI. The output being AI friendly is clear, but how does the crawler itself use intelligence, not just to grab stuff but to filter it, maybe even learn? This feels like the core innovation. Yeah, this is where you see features really tailored for optimizing tokens and boosting RAG performance. It starts with how it generates the markdown. You've got your basic clean markdown, which just strips out the obvious HTML junk, right? But then there's fit markdown, which uses heuristic-based filtering.

Okay, heuristic filtering. Can you break that down a bit for someone new to this? What does that actually mean for the data they get? Sure. Think of heuristics as smart rules of thumb. The crawler uses these rules to guess which parts of a web page are probably useless, you know, navigation menus, sidebars, footers, maybe comment sections. Fit markdown tries to identify and just delete that stuff automatically. Imagine cutting, say, 40% of the useless words, the tokens, from a page just by being smart about removing the boilerplate. That saves you real money on LLM calls. But more importantly, it makes your RAG system way more accurate, because the AI is only looking at the actual content, the important stuff.
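Here's a rough sketch of what that looks like in code, based on the pruning-style content filter and markdown generator described in the project's docs; the class names (PruningContentFilter, DefaultMarkdownGenerator), the import paths, and the threshold value are my best recollection of that API, so treat them as assumptions to verify.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # The pruning filter applies heuristic rules to drop likely boilerplate
    # (menus, sidebars, footers) before the markdown is generated.
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48)
    )
    config = CrawlerRunConfig(markdown_generator=md_generator)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/article", config=config)
        # raw_markdown keeps everything; fit_markdown is the filtered version.
        print(len(result.markdown.raw_markdown), len(result.markdown.fit_markdown))

asyncio.run(main())
```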
Right, efficiency through intelligence. Makes sense. What if I have a really specific goal, like I only want the Q4 earnings numbers from a company site? How does it filter out all the other noise? For that kind of targeted crawl it can use the BM25 algorithm. BM25? Yeah, it's a well-known ranking function from information retrieval. Basically, think of it as a sophisticated way to score how relevant a piece of text is to your specific search query. So if you tell the crawler you're looking for a Q4 2024 earnings report, BM25 helps ensure the final markdown focuses tightly on text related to those terms. It helps ignore the CEO's blog post or the company picnic photos.
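In code, that query-focused filtering is just a different content filter plugged into the same markdown pipeline. The BM25ContentFilter class and its user_query parameter are how I recall the docs describing it, so double-check the exact names; the URL is a placeholder.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Score page text against the query and keep only the parts that rank well.
    bm25_filter = BM25ContentFilter(user_query="Q4 2024 earnings report")
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=bm25_filter)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/investor-relations", config=config)
        print(result.markdown.fit_markdown)  # tightly focused on the query terms

asyncio.run(main())
```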
Okay, now this is where it gets really cutting edge, based on the source docs: adaptive crawling. My understanding is this helps the crawler know when it's found enough information, like when to stop. Spot on, and that's huge for saving resources. Old-school crawlers just follow links deeper and deeper until they hit some arbitrary limit, which is super wasteful. Adaptive crawling is smarter. It uses what they call advanced information foraging algorithms, fancy term, but basically the crawler learns the site structure as it goes. It's constantly asking: is the new information I'm finding actually relevant to the original query? And it figures out when it's gathered enough information to likely answer that query, based on a confidence level you can set. So instead of blindly following 50 links when maybe only 10 were useful, it might stop after 15 because it thinks, okay, I've probably got what I need. Exactly that. You set a threshold, it hits it, and that specific crawl job shuts down. It's optimizing based on knowledge gathering, not just link counting. Very smart.
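The core idea is simple enough to sketch in a few lines. This is not Crawl4AI's actual adaptive-crawling API, just an illustration of the stopping rule described here: keep expanding the link frontier, score how well the gathered pages answer the query, and stop once a confidence threshold is reached. The fetch_links and relevance_score callables are hypothetical stand-ins.

```python
from typing import Callable

def adaptive_crawl(
    start_urls: list[str],
    fetch_links: Callable[[str], list[str]],   # returns outgoing links for a page
    relevance_score: Callable[[str], float],   # 0..1: how relevant a page is to the query
    confidence_threshold: float = 0.8,
    max_pages: int = 50,
) -> list[str]:
    """Toy sketch of confidence-based stopping, not the library's implementation."""
    visited: list[str] = []
    frontier = list(start_urls)
    confidence = 0.0

    while frontier and len(visited) < max_pages and confidence < confidence_threshold:
        url = frontier.pop(0)
        visited.append(url)
        # Each relevant page nudges confidence upward; irrelevant pages barely move it.
        confidence = 1 - (1 - confidence) * (1 - relevance_score(url))
        frontier.extend(u for u in fetch_links(url) if u not in visited and u not in frontier)

    return visited
```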
Efficiency and intelligence working together. Okay, one more intelligence piece: structured data. Tables are vital, right, for databases, for training models, but huge tables can crash scrapers. Yeah, memory limits are a classic problem. Crawl4AI tackles this with what they call revolutionary LLM table extraction. You can still use the old way, CSS selectors, XPath, but it can also use LLMs directly for pulling out table data. The clever part is intelligent chunking. Instead of trying to load a massive multi-page table into memory all at once, which often fails, it breaks the table into smaller, manageable pieces, uses the LLM to process each chunk, extract the data, and clean it up, and then stitches the results back together seamlessly. It's built for handling really big data sets.
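That chunk-and-stitch pattern is easy to sketch independently of the library. This is an illustration of the idea only; the extract_rows helper is a hypothetical stand-in for the LLM call, and Crawl4AI's real table extraction API will look different, so consult its docs.

```python
def extract_table_in_chunks(rows: list[str], chunk_size: int = 200) -> list[dict]:
    """Process a huge table in small pieces so it never has to fit in memory at once."""

    def extract_rows(chunk: list[str]) -> list[dict]:
        # Hypothetical stand-in for an LLM call that turns raw row text into
        # structured records, e.g. {"product": ..., "q4_revenue": ...}.
        return [{"raw": row} for row in chunk]

    extracted: list[dict] = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        extracted.extend(extract_rows(chunk))  # stitch the chunk results back together
    return extracted
```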
That covers the what and how brilliantly. So, last piece: deployment. How easy is it for, say, a developer or a small team to actually get this thing running and plugged into their workflow? Well, the basic install is super easy if you use Python: just pip install crawl4ai, standard stuff. But they clearly built it knowing that real-world, large-scale use needs more than just running it on your laptop. Which naturally leads to their Docker setup, I imagine. Exactly. The Docker setup is really key for production. It bundles everything up neatly: you get a ready-to-go FastAPI server for handling API requests, there's built-in security with JWT tokens, and it's designed to be deployed in the cloud and handle lots of crawl jobs simultaneously. And I saw something crucial in the latest updates: webhooks. That feels like a major quality-of-life improvement.
Right, no more constantly checking if a job is done. Oh, absolutely, it gets rid of that tedious polling process. The recent version added a full webhook system for the Docker job queue API. So yeah, no more polling is the headline there. Crawl4AI can now actively tell your other systems when a crawl job is finished or when an LLM extraction task is complete. It sends out real-time notifications, and they even built in retry logic, so if your receiving system hiccups, it won't just fail. It makes integration much smoother.
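On the receiving end, a webhook is just an HTTP endpoint your own system exposes. Here's a minimal, hypothetical receiver using FastAPI; the payload fields shown (job_id, status) are assumptions for illustration, since the actual notification format is defined by Crawl4AI's Docker job queue API docs.

```python
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/crawl-webhook")
async def crawl_webhook(request: Request):
    # Hypothetical payload shape; check the Docker job queue API docs
    # for the real field names.
    payload = await request.json()
    job_id = payload.get("job_id")
    status = payload.get("status")

    if status == "completed":
        print(f"Crawl job {job_id} finished; fetch and process the results here")
    else:
        print(f"Crawl job {job_id} reported status: {status}")

    return {"ok": True}
```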
That really brings it all together, doesn't it? Back to that original mission: building an independent, powerful tool that's genuinely accessible. They didn't just build a better crawler; they built a whole transparent ecosystem for getting data. Absolutely, yeah, their mission is clear about that: fostering a shared data economy, making sure AI gets fed by real human knowledge, and staying transparent. The tiered sponsorship program they have is specifically there to keep the core project free and independent, true to that original rebellion against walled gardens.
It really is a great summary of Crawl4AI, then: a powerful open source way to bridge the gap between the messy web and what modern AI actually needs, structured, clean data, driven by speed, control, and that really smart adaptive intelligence. But you know, this space never stands still. Looking at their roadmap, they've got some fascinating ideas cooking, things like an agentic crawler, right, an autonomous system that could handle complex, multi-step data tasks on its own, and a knowledge-optimal crawler. It makes you wonder.

And here's a final thought for you, our listener, to chew on: if a crawler can learn when to stop because it's satisfied an information need, what other boundaries will AI start redefining in how we find and use data? Could we see AI systems soon that just autonomously manage the entire research process, from asking the question to delivering a structured report? Something to think about.

Okay, let's thank our supporter one last time: Safe Server. Remember, they help with hosting and support your digital transformation. Check them out at www.safeserver.de.

Happy crawling. Happy crawling.