Today's Deep-Dive: Firecrawl
Ep. 245

Episode description

This deep dive discusses Firecrawl, a tool designed to help AI systems access and process web data efficiently. It highlights the challenges of obtaining clean, structured data from the internet, which is often messy and complex. Firecrawl addresses these issues by providing a web data API that converts web content into LLM-ready formats, making it easier for AI models to understand and use. The tool offers features like scraping, crawling, mapping, searching, extracting, and performing actions to interact with web pages. It handles dynamic content, JavaScript-heavy sites, and anti-bot measures, ensuring reliable data extraction. Firecrawl is accessible through open-source and cloud-based options, making it suitable for both developers and beginners. The tool aims to enhance AI applications by providing up-to-date, structured data, potentially unlocking new possibilities in research, sales, and market analysis.

0:00

Hey there curious minds and welcome to the deep dive.

0:03

We're so glad you're here today.

0:04

We're getting into something really key in our AI world.

0:07

It's a it's a challenge lots of us face.

0:10

How do you get good clean data off the internet, especially for,

0:14

you know, AI? And we want to be super clear, especially if you're

0:17

maybe just starting out with this stuff. Easy to understand.

0:20

That's the goal.

0:20

But hang on, before we jump in, a huge

0:22

thank you to our supporter for this deep dive, Safeserver.

0:26

They handle the hosting for exactly this kind of cutting-edge software,

0:29

making sure it runs smoothly.

0:30

They're really partners in digital transformation.

0:33

If you need solid hosting and expert support, you can find out

0:35

more at www.safeserver.de.

0:38

Let me say that again, www.safeserver.de.

0:42

Okay.

0:42

So let's unpack this.

0:44

AI can do amazing things.

0:46

We've all seen it: writing, answering questions, incredible stuff.

0:49

But here's the thing, right?

0:50

An AI is only as smart as the info it gets. And the internet?

0:53

Well, it's not exactly a neat library.

0:54

Is it?

0:55

It's more like a massive, messy, constantly changing garage sale.

0:58

Some treasures.

0:59

Yeah, but lots of clutter too.

1:00

And that's where our topic today, Firecrawl, comes into the picture.

1:04

Exactly.

1:05

Firecrawl is basically, well, it's a specialized tool.

1:08

Technically, it's a web data API.

1:10

That just means it helps computer programs get information from the

1:13

web, and its specific job is to take any website, even the really complicated

1:18

ones, and turn its content into what we call LLM-ready data.

1:22

LLM, large language model, right?

1:25

Like the brains behind ChatGPT.

1:26

Precisely.

1:27

So LLM-ready just means the data is cleaned up, structured in a way

1:30

these AI models can easily understand and use.

1:33

Could be clean text, maybe like markdown, or more organized

1:37

info, like in a table.

1:38

Gotcha.

1:39

So our mission today is to cut through the tech talk and really show

1:42

you how Firecrawl makes this whole process simpler.

1:45

Simpler for anyone building AI tools, doing research, or honestly just

1:49

curious about how AI actually learns from the web.

1:51

We want you to have those aha moments, you know, whether you're a

1:54

builder or just watching it all happen.

1:55

So let's dive into Firecrawl.

1:57

Okay, let's start with the big why.

1:58

Why do we even need a tool like this?

1:59

What's wrong with just pointing an AI at a website?

2:02

Can't it just read it like we do?

2:03

That's a really great question actually.

2:05

Because it gets right to a core difference.

2:07

How humans and machines see the web.

2:11

For us, it's visual.

2:12

We see headings, pictures, menus, ads.

2:14

Yeah.

2:15

We kind of intuitively know what's important content and what's

2:18

just, you know, chrome.

2:19

Right.

2:20

The layout helps us understand.

2:21

Exactly.

2:22

But for an AI, a website is mostly just a big pile of code.

2:27

HTML, JavaScript, maybe some CSS for styling.

2:31

Trying to find the actual meaning in just the raw code.

2:35

It's tough.

2:36

Super overwhelming.

2:37

Okay.

2:38

And then you add in things like dynamic content, stuff that

2:40

only loads when you scroll or click a button.

2:43

Oh yeah.

2:43

Like infinite scroll pages.

2:45

Right.

2:45

Or pop-up ads getting in the way.

2:47

Trying to pull out just the meaningful structured data from

2:50

all that chaos.

2:51

That's a huge challenge.

2:52

So web scraping, that's the term, right?

2:54

Is that what Firecrawl does?

2:55

And is regular scraping really that hard?

2:57

Yes.

2:58

Firecrawl is essentially a very advanced web scraper.

3:00

Yeah.

3:00

And yeah, traditional scraping.

3:02

Yeah.

3:02

Notoriously difficult.

3:04

Why is that?

3:04

Well, for starters, websites change their layout all the

3:07

time.

3:08

So your scraper breaks.

3:09

Then many sites actively try to block scrapers.

3:12

They look for bot-like activity.

3:14

Plus, so many sites rely heavily on JavaScript now.

3:17

The content you want might not even be in the initial HTML

3:20

code.

3:20

It only appears after scripts run.

3:22

Right.

3:22

So just grabbing the source code isn't enough.

3:24

Not nearly enough.

3:25

And then you have things like CAPTCHAs, those "prove you're

3:28

human" tests.

3:29

They're specifically designed to stop automated tools like

3:32

scrapers.

3:33

Developers can spend honestly more time fighting these issues

3:37

than actually getting the data they need.

3:39

It's a constant battle.

3:40

Okay.

3:40

This is where it gets really interesting then.

3:42

Firecrawl claims it solves this hard stuff.

3:45

They say they can reliably get data from what?

3:48

96% of the web.

3:49

That sounds ambitious.

3:51

How does it actually do that, especially with those tricky

3:54

JavaScript sites or ones trying to block you?

3:55

What's the magic?

3:56

Well, it's not exactly magic, but it is clever engineering.

4:00

Firecrawl was built specifically to automate all those

4:03

complexities you just mentioned.

4:04

Think about the hidden work in traditional scraping.

4:07

Managing proxies, basically different IP addresses to hide

4:10

behind.

4:11

Right.

4:11

So the website doesn't block you from making too many

4:14

requests.

4:14

Exactly.

4:15

Rotating them, making sure they're from the right regions,

4:18

handling failures.

4:19

It's a nightmare.

4:20

Yeah.

4:20

Developers call it proxy headaches for a reason.

4:23

Firecrawl handles all that automatically behind the scenes.

4:27

You don't even see it.

4:28

Okay.

4:28

So no more managing lists of IP addresses.

4:31

Nope.

4:32

And no need for complex setups like headless browsers or Puppeteer-driven

4:36

browsers.

4:37

Those are basically running a full web browser invisibly just

4:41

to trick the site into rendering JavaScript.

4:43

Which must be slow and resource intensive.

4:46

Very.

4:46

Firecrawl has more efficient ways.

4:48

It handles rotating proxies.

4:50

It manages how fast it makes requests so it doesn't overload

4:53

the site, respecting rate limits.

4:55

It even has something they call smart wait, which means it

4:58

intelligently waits for dynamic content to load before grabbing

5:01

it.

5:01

Smart wait.

5:02

Yeah.

5:03

It's like having this really skilled agent navigating the

5:06

web for you ensuring you get clean data without all the usual

5:09

technical drama.
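
Show-notes sketch: the "wait for dynamic content before grabbing it" idea can be illustrated with a simple polling loop. This is a minimal sketch in plain Python, not Firecrawl's actual implementation; `wait_until` and `FakePage` are hypothetical names invented for this illustration, and a real tool would check the rendered page rather than a counter.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Poll is_ready() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        # Pause between checks so we do not hammer the page.
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page whose content appears after a few polls."""

    def __init__(self, polls_until_loaded: int) -> None:
        self.polls_left = polls_until_loaded

    def content_loaded(self) -> bool:
        self.polls_left -= 1
        return self.polls_left <= 0


if __name__ == "__main__":
    page = FakePage(polls_until_loaded=3)
    print(wait_until(page.content_loaded, timeout=2.0, interval=0.05))
```

The point of the design is that the caller waits only as long as needed: the loop returns as soon as the content is there and gives up cleanly when the timeout runs out.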

5:09

So it's not just grabbing raw code.

5:11

It's processing it intelligently.

5:13

Yeah.

5:13

And it's fast.

5:14

The source said results in under a second.

5:16

That seems incredibly fast for all that work.

5:19

It is remarkably fast.

5:20

Yeah, and that speed is absolutely crucial for a lot of

5:23

modern AI applications.

5:25

Like what?

5:25

Think about real-time AI agents.

5:27

Maybe a chatbot that needs to look up the latest product info

5:32

or breaking news right now to answer your question accurately.

5:36

So needs fresh data instantly.

5:38

Exactly.

5:38

That sub-second response time means the AI can access current

5:42

relevant information almost immediately.

5:44

That's a huge deal for how useful it can be.

5:47

That makes sense.

5:48

But what about that other 4%?

5:49

If it handles 96%, what happens when it hits a really, really tough site?

5:53

Does it just give up or slow right down?

5:55

That's a fair point.

5:56

No system is perfect, especially with how fast web defenses evolve.

5:59

That remaining 4%, you're often talking about extremely protected

6:03

sites, maybe highly sensitive financial portals or sites

6:07

with very aggressive anti-bot tech.

6:09

In those rare cases, yeah, it might take longer or maybe access

6:13

could still be blocked occasionally.

6:14

But Firecrawl is constantly being updated.

6:16

They're always working on improving that success rate, tackling

6:19

new challenges.

6:19

The aim is to make those failures really, really rare for most

6:23

common data needs.

6:24

Oh, okay.

6:25

That clarifies things.

6:26

So we understand the why: the web is messy for AI.

6:29

Now let's dig into the how.

6:31

What are the actual tools in Firecrawl's toolkit?

6:34

Let's start simple.

6:36

If I just want the info from one specific webpage, what do I use?

6:39

For that, you'd use the scrape feature.

6:41

Super straightforward.

6:43

You give it a URL, the web address.

6:44

Like copy-pasting from my browser.

6:46

Exactly.

6:46

You give Firecrawl that URL, and it goes and gets the content.

6:49

But crucially, it returns it in those LLM-ready formats we talked about.

6:53

And LLM-ready isn't just the raw HTML.

6:56

Right.

6:56

Raw HTML is full of stuff for presentation (fonts, colors, ads,

7:00

menus) that just confuses an AI.

7:02

Firecrawl intelligently pulls out the core meaning, the actual content.

7:06

It keeps the structure, like headings and lists, but cleans it up.

7:10

Often, it gives you markdown.

7:12

Markdown, like simple text formatting.

7:15

Yeah, it's a format that LLMs understand really well, helps them figure

7:18

out relationships in the text, summarize better, you know, fewer

7:21

mistakes or hallucinations.

7:23

So you can get clean markdown or maybe structured data like JSON, the

7:27

raw HTML, if you really need it, links on the page, even a screenshot

7:30

sometimes, or metadata.

7:32

The point is always give the AI data it can use straight away.

7:35

No extra cleaning needed.
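
Show-notes sketch: to make "raw HTML versus LLM-ready markdown" concrete, here is a small illustration using only Python's standard library. It is a simplified assumption of what such cleaning involves, not how Firecrawl does it; `MarkdownExtractor` and `html_to_markdown` are hypothetical names, and only a handful of tags are handled.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Keep headings, paragraphs and list items; drop scripts, styles and navigation."""

    SKIPPED = {"script", "style", "nav", "footer", "aside"}
    PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- ", "p": ""}

    def __init__(self) -> None:
        super().__init__()
        self.lines = []           # finished markdown lines
        self.skip_depth = 0       # > 0 while inside a skipped element
        self.current_tag = None   # content tag currently being collected
        self.buffer = []          # text pieces of the current content tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.PREFIXES and self.skip_depth == 0:
            self.current_tag = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.current_tag:
            # Collapse runs of whitespace left over from the HTML layout.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.lines.append(self.PREFIXES[tag] + text)
            self.current_tag = None

    def handle_data(self, data):
        if self.skip_depth == 0 and self.current_tag is not None:
            self.buffer.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


if __name__ == "__main__":
    page = "<nav><li>Home</li></nav><h1>Guide</h1><p>Clean <b>text</b>.</p>"
    print(html_to_markdown(page))
```

Menus, scripts and styling never reach the output, while headings and list items keep their structure, which is the property the hosts describe.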

7:37

Okay.

7:37

So Scrape is like getting a perfect snapshot of one page.

7:41

But what if I need a lot of pages?

7:43

Like say, all the product descriptions from an online store

7:46

or every blog post from a site.

7:48

Scraping one by one sounds tedious.

7:51

Definitely tedious.

7:52

And that's exactly what the crawl feature is for.

7:54

Instead of just one page, Firecrawl starts at a URL you give it, and

7:59

then it follows the links to crawl all the accessible subpages on that site.

8:02

So it maps out the site itself.

8:03

It navigates it.

8:04

Yeah.

8:04

And a really neat thing, especially for beginners, is that you

8:07

usually don't need a site map.

8:08

A site map.

8:09

That's like a table of contents for a website.

8:11

Kind of, yeah.

8:12

It tells search engines what pages exist.

8:14

Firecrawl often doesn't need one.

8:16

It just explores like a user would.

8:18

Clicking links.

8:19

You submit what they call a crawl job.

8:21

It goes off and does its thing.

8:22

And then you can check the progress and grab all the

8:24

collected data when it's done.

8:26

It gives your AI a much broader view of a site.
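
Show-notes sketch: "start at a URL and follow the links" is, at its core, a breadth-first traversal. The example below runs against a small in-memory site so it needs no network access; the `SITE` data, `crawl` and `fake_fetch_links` are hypothetical and only illustrate the general technique, not Firecrawl's crawler.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory site: URL -> links found on that page.
SITE = {
    "https://example.com/": ["/blog", "/about", "https://other.example.org/"],
    "https://example.com/blog": ["/blog/post-1", "/blog/post-2", "/"],
    "https://example.com/about": ["/"],
    "https://example.com/blog/post-1": ["/blog"],
    "https://example.com/blog/post-2": [],
}


def fake_fetch_links(url):
    """Stand-in for downloading a page and reading its links."""
    return SITE.get(url, [])


def crawl(start_url, fetch_links, max_pages=100):
    """Breadth-first crawl that stays on the start URL's host."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


if __name__ == "__main__":
    for page_url in crawl("https://example.com/", fake_fetch_links):
        print(page_url)
```

The `seen` set keeps the crawler from revisiting pages that link back to each other, and the host check keeps it from wandering off to other sites.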

8:29

Okay.

8:29

Scrape is a snapshot.

8:30

Crawl is a full exploration.

8:32

What about map then?

8:33

How's that different from crawl?

8:35

Good question.

8:36

Map is specifically about discovering the structure

8:39

or the layout of a website.

8:41

It's much faster than crawl because it doesn't actually

8:44

download the content of all the pages.

8:47

You give it a URL and it quickly comes back with a

8:49

list of most of the links it finds on that starting

8:52

page and potentially deeper links too.

8:54

So just the links, not the content?

8:56

Primarily the links.

8:57

Yeah.

8:57

You can even add a search term to find specific kinds

9:00

of links.

9:01

Like you could say map this site and show me all

9:04

URLs that have blog in them.

9:05

It's super fast for just understanding how a site

9:09

is put together.

9:09

Maybe before you decide what you want to crawl in

9:11

detail. Useful for planning.
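
Show-notes sketch: listing a page's links and filtering them by a search term, as in the "show me all URLs that have blog in them" example, might look roughly like this. It is a standard-library illustration under simplifying assumptions (one page, no deeper discovery); `LinkCollector` and `map_links` are hypothetical names, not Firecrawl functions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the unique, absolute targets of all <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                absolute = urljoin(self.base_url, href)
                if absolute not in self.links:
                    self.links.append(absolute)


def map_links(html, base_url, search=None):
    """Return the page's links, optionally only those containing `search`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    if search is None:
        return collector.links
    return [url for url in collector.links if search.lower() in url.lower()]


if __name__ == "__main__":
    page = '<a href="/blog/a">A</a> <a href="/pricing">Pricing</a>'
    print(map_links(page, "https://example.com/", search="blog"))
```

Because only URLs are collected and no page content is downloaded or cleaned, this kind of pass is cheap, which matches why the hosts describe map as the quick planning step before a crawl.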

9:13

Okay, and then there's search.

9:15

This sounds different again, like using Google but

9:18

through Firecrawl? Pretty much. Firecrawl's search

9:21

API lets you perform web searches, just like using

9:24

a search engine.

9:25

But here's the cool part.

9:27

You can tell it to automatically scrape the content

9:29

from the top search results it finds all in one

9:32

step.

9:32

Oh wow.

9:33

So search and scrape together? Exactly. You can set

9:36

things like the language or country for the search and

9:39

choose your output format.

9:40

It makes targeted research really efficient.

9:43

Imagine needing all recent news articles on a topic:

9:46

search finds them, scrape gets the content, all through

9:49

one API call.

9:51

It saves a ton of manual work.

9:52

Okay.

9:53

These all sound great for getting text, but often for

9:56

AI, you need very specific pieces of information,

9:59

structured neatly.

10:00

That's where extract comes in, right?

10:01

How does that work?

10:02

Yes, extract. This tackles a really key problem.

10:05

How do you pull out specific, organized facts, like,

10:08

say, a company's founding year, or the price of a

10:11

product, or whether it supports feature X, from a

10:14

web page

10:15

that's just, you know, paragraphs of text?

10:16

Yeah, or it's not in a nice neat table already. Precisely.

10:20

The extract feature lets you define exactly what

10:22

you're looking for.

10:23

You can use a prompt, which is just a natural

10:25

language instruction. Like telling it in plain

10:27

English?

10:28

Yeah, like "find the main contact email address" or

10:31

"extract the key features listed." Or, and this is

10:34

really powerful, you can provide a schema. A schema?

10:37

Like a template? Exactly. Think of it like a

10:40

fill-in-the-blanks form. You define the fields you want:

10:42

company name (text), founding year (number), is hiring.

10:46

Firecrawl then uses its AI smarts to read the page

10:50

and intelligently pull out only that specific

10:52

information and put it into your defined structure.

10:55

So I could say, get me the company mission statement

10:57

and tell me if they mention sustainability,

10:59

and it would return just those two pieces of info,

11:02

neatly labeled?

11:03

That's the idea, and you can do this for one page

11:06

or run it across multiple pages, or even a whole

11:08

domain, using wildcards in the URL. For developers,

11:12

it even supports common schema tools like Pydantic

11:14

in Python or Zod in Node.js, making it super

11:17

easy to integrate into their code.
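
Show-notes sketch: the "fill-in-the-blanks form" can be pictured as a typed record that extracted values must fit. The episode mentions Pydantic and Zod; to stay dependency-free, this illustration uses a standard-library dataclass instead. The field types (in particular treating "is hiring" as a yes/no value) and the `fill_schema` helper are assumptions made for the example, not Firecrawl's API.

```python
from dataclasses import dataclass, fields


@dataclass
class CompanyInfo:
    company_name: str
    founding_year: int
    is_hiring: bool  # assumed here to be a yes/no field


def fill_schema(schema, raw):
    """Keep only the schema's fields and check each value's type."""
    values = {}
    for field in fields(schema):
        if field.name not in raw:
            raise ValueError(f"missing field: {field.name}")
        value = raw[field.name]
        # Exact type match, so True is not silently accepted as a number.
        if type(value) is not field.type:
            raise TypeError(f"{field.name} should be {field.type.__name__}")
        values[field.name] = value
    return schema(**values)


if __name__ == "__main__":
    # Hypothetical values, as if an extraction step had already read the page.
    raw = {
        "company_name": "Acme",
        "founding_year": 1999,
        "is_hiring": True,
        "page_title": "About us",  # not in the schema, so it is dropped
    }
    print(fill_schema(CompanyInfo, raw))
```

Declaring the structure up front means anything that does not fit is rejected immediately instead of flowing into later steps as messy text.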

11:19

That sounds incredibly useful for turning messy

11:22

web text into actual usable data for an AI.

11:25

It really is.

11:26

It's moving beyond just reading to actually

11:29

understanding and structuring information.

11:31

Now you mentioned earlier things like clicking

11:33

buttons or dynamic content. What if the data I

11:37

need is hidden behind something like that?

11:39

A cookie banner I need to accept or a load more

11:41

button? Is Firecrawl stuck?

11:44

Ah, good point.

11:45

No, that's where their actions feature comes in.

11:47

This is particularly available in their cloud

11:49

service. It lets you tell Firecrawl how to interact

11:52

with the page before it scrapes or extracts.

11:54

Interact how?

11:55

Like you can programmatically tell it.

11:57

Click the button with the text accept cookies,

11:59

then scroll down the page twice, then type

12:02

product specs into the search bar, then click

12:04

the search button and then extract the results.

12:06

Wow.

12:07

So it can simulate a human user.

12:09

Essentially, yes.

12:10

It can click, scroll, type, wait for elements to

12:13

appear.

12:14

It handles those interactive steps needed to actually

12:17

get to the data.

12:18

This is massive because so much web data today isn't

12:21

just sitting there statically.

12:23

You need to navigate or interact to reveal it.

12:26

Actions makes that possible even at scale with their

12:29

batch scraping features.
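
Show-notes sketch: the click, scroll, type, click, extract sequence described above is essentially an ordered list of steps that can be written down as data. The structure below is a hypothetical shape chosen for illustration; the field names (`kind`, `target`, `text`, `times`) are assumptions and do not reflect Firecrawl's actual request format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Action:
    kind: str                     # "click", "scroll", "type" or "extract"
    target: Optional[str] = None  # button text or input name, if any
    text: Optional[str] = None    # text to type, if any
    times: int = 1                # how often to repeat the step


def build_plan():
    """The example sequence from the episode, in order."""
    return [
        Action("click", target="Accept cookies"),
        Action("scroll", times=2),
        Action("type", target="search bar", text="product specs"),
        Action("click", target="search button"),
        Action("extract"),
    ]


def plan_to_json(plan):
    """Serialize the plan so it could be sent along with a scrape request."""
    return json.dumps([asdict(action) for action in plan], indent=2)


if __name__ == "__main__":
    print(plan_to_json(build_plan()))
```

Keeping the steps as plain data rather than code is what makes it practical to reuse the same interaction recipe across many pages in a batch.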

12:30

Okay, this toolkit sounds incredibly comprehensive,

12:33

robust, technically impressive, but also you've

12:37

explained it in a way that feels pretty clear even if

12:39

you're new to this.

12:40

So let's bring it back to the listener.

12:42

What are the big "so what" moments here?

12:44

Why should someone maybe just starting to explore AI

12:48

really care about Firecrawl?

12:51

What does it unlock?

12:52

Yeah, connecting it to the real world is key.

12:54

Firecrawl basically enables a whole new level of

12:57

AI applications that were just much, much harder

12:59

before.

13:00

Think about smarter AI chatbots.

13:02

Imagine systems that don't just rely on old

13:05

training data but can actively look up the latest

13:08

info on the web to give you truly current, accurate

13:10

answers.

13:11

Right.

13:11

No more

13:11

"Sorry, my knowledge cutoff."

13:13

Exactly.

13:14

Or think about lead enrichment for sales or

13:15

marketing.

13:16

How would that work?

13:16

A tool could automatically visit a potential

13:18

customer's website, use extract to pull out their

13:21

industry, recent news, maybe key contacts, all

13:24

automatically updating the sales database.

13:27

Saving hours of manual research.

13:29

Totally. Or consider deep research: academics,

13:32

market analysts.

13:34

They could use crawl and extract to gather and

13:37

structure vast amounts of information from articles,

13:41

reports, industry sites, getting a comprehensive

13:44

view in potentially minutes instead of weeks.

13:47

It really does sound like giving your AI a

13:49

super-powered, always-on research assistant that

13:52

actually understands how to use the web properly.

13:55

That's huge.

13:56

So is this accessible?

13:58

Can someone listening right now actually try this

14:00

out?

14:00

Or is it only for big companies with huge budgets?

14:03

No, that's one of the great things.

14:04

It's designed to be very accessible.

14:06

There are two main ways to use it.

14:07

First, there's a powerful open source version.

14:10

The code is freely available under licenses like

14:12

AGPL and MIT.

14:14

So if you're comfortable running software yourself,

14:16

you can. Open source.

14:17

That's great for developers or tinkerers.

14:19

Definitely.

14:19

But if you don't want to manage servers and

14:21

infrastructure, they also have a fully hosted

14:23

cloud offering at firecrawl.dev.

14:25

You just sign up and use the API.

14:27

Okay, so options for different needs.

14:28

Exactly.

14:29

And it's built for developers.

14:30

They provide easy-to-use toolkits, SDKs, for

14:34

Python and Node.js, which are very popular

14:36

languages.

14:36

Plus it plugs right into many popular AI

14:40

frameworks people are already using, like

14:41

LangChain, LlamaIndex, Dify, even low-code

14:44

tools like Zapier.

14:45

So it integrates well.

14:46

Very well.

14:47

And they just raised significant funding,

14:49

a Series A, and released Firecrawl v2.

14:51

So there's a lot of active development and

14:53

support behind it.

14:54

It's definitely a tool on the rise.

14:55

Okay, Firecrawl is clearly doing something

14:58

important here: taking this messy, complex web

15:01

and making its data clean, structured, and

15:03

readily available for AI.

15:05

It feels like it removes a major bottleneck,

15:07

making AI tools potentially much smarter

15:09

and more reliable, and easier for more people

15:11

to build.

15:12

It really does, and that leads to a fascinating

15:14

thought, doesn't it? If we significantly lower

15:17

the barrier to accessing and structuring

15:20

pretty much all the information on the web

15:21

for AI, what happens next? What completely new

15:24

kinds of applications might emerge?

15:26

What insights could we uncover that were just

15:29

impossible before? How does this fundamentally

15:32

change how we gather knowledge, how we do

15:34

research, how we build intelligent systems

15:36

in the very near future?

15:39

It really pushes us to think beyond what

15:41

AI can do today.

15:42

That is a really thought-provoking question

15:45

to end on: what does happen when AI can truly

15:48

read and understand the live web?

15:50

Thank you so much for walking us through

15:51

Firecrawl today.

15:52

Hope everyone listening feels much more clued in

15:54

and maybe even a bit inspired about the

15:56

possibilities here. And one final big

15:59

thank you to our sponsor, Safeserver.

16:00

They provide the essential hosting for advanced

16:03

software like this, supporting your digital

16:04

transformation reliably and securely.

16:06

Find out more about them at www.safeserver.de.

16:10

Keep learning and definitely keep being curious.
