Okay, let's unpack this. Today we are diving deep into something pretty exciting in
web tech.
It's about controlling web browsers using real artificial intelligence. We're
looking at this
fascinating open source project. You might have seen it on GitHub called Magnitude.
It's a vision
first browser agent. Now, if you're someone who's, you know, constantly battling
fragile web
automation, maybe you're trying to scrape data or run integration tests or just
automate some
repetitive clicking, well, you're definitely going to want to listen in. Our
mission today is really
to understand how a tool like this can actually see and understand a web page well
enough to
handle complex tasks reliably. We want to get why this vision first thing is
apparently so much
better. And importantly, we want to make sure that even if you're sort of new to
this, you get a
clear idea of how you could start using this kind of power. But before we really
get into the nuts
and bolts, the architecture and all that, let's just take a moment to thank our
supporter.
This deep dive is made possible by SafeServer. SafeServer is all about hosting
software and
helping out with digital transformation. So if you're thinking about hosting
solutions,
especially for cutting edge stuff like these browser agents, check them out.
You can find out more at www.safeserver.de. Right, so back to Magnitude. The
core promise here,
it sounds almost too good to be true. Using just natural language, plain English to
control a
browser and have it actually work reliably even as the site changes underneath.
That really is the
core of it, yeah. And that reliability promise, it comes directly from how it's
built, its whole
philosophy. Moving beyond just the code. Just for context, right? A browser agent
is basically
software that does web tasks for you. Think of it like your digital assistant for
the web.
People use them for all sorts, like running really complex tests from start to
finish,
or maybe connecting to online services that don't have a proper API to talk to each
other.
Okay. And anyone who's tried, say, web scraping or running tests with the older
tools, maybe
Selenium or something similar, they know the pain points. It all depends on the
website's
code structure, right? The DOM. But tell me, why is relying on that DOM structure
such a recipe for,
well, headaches? This is really problem number one we need to tackle.
Yeah. It really boils down to just one word. Brittleness. Traditional agents, they
look at
that hidden structure, the code, and they try to click on things or type into boxes
by finding
their specific name or ID in that DOM. They're basically drawing numbered boxes
around things
based on the underlying HTML. You can't see the boxes, but that's how the agent
finds things.
But here's the problem. Modern websites are incredibly dynamic. They change all the
time.
A developer might run an A-B test, shift things around, update a tiny bit of code,
and bam,
the automation breaks instantly. It just doesn't generalize well because it's
totally dependent
on those hidden code details, not on what the user actually sees and interacts with.
So the automation script only works if the website is basically frozen in time,
which, let's be honest, never happens. Precisely. Magnitude just completely
sidesteps that whole dependency. It uses a vision AI, think of it like artificial
eyes,
to actually see and understand the layout, the interface, just like you or I would.
It doesn't
really care what the code is doing underneath. That is a massive shift. So we're
not looking at
element IDs or class names anymore. We're looking at the actual pixels, the visual
arrangement on the screen. How does that work technically? How does it make it more
robust?
Well, the architecture is centered around what's called a visually grounded LLM.
That's a large language model, an AI that's been specifically trained to connect
language commands
like click the checkout button with visual input from the screen. And here's the
absolute key
detail. Instead of trying to find some fragile code ID for that button, the LLM
tells the system
where to click using precise pixel coordinates X and Y on the screen. So the agent
sees the thing
that looks like a checkout button in the right context, and it directs the mouse
click right to
that spot on the screen. The code behind it could change completely, but as long as
the button looks
like a button and is where you'd expect it, the action works. Okay, got it. So if
the button's
ID changes from, I don't know, button-123 to button-abc, the old way breaks. But Magnitude just sees the button shape and text and clicks in the right place.
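To make that pixel-coordinate idea concrete, here is a purely conceptual TypeScript sketch. It is not Magnitude's actual internals or API, just an illustration of what grounding an instruction to screen coordinates means; the numbers are made up.

```typescript
// Conceptual illustration only, not Magnitude's actual internals or API:
// a visually grounded model maps a natural-language instruction plus a
// screenshot to a pixel-level action instead of a DOM selector.
type GroundedAction =
  | { kind: "click"; x: number; y: number }               // screen coordinates
  | { kind: "type"; x: number; y: number; text: string }; // click target + keystrokes

// "Click the checkout button" might resolve to something like this, no matter
// what the button's HTML id or class happens to be (coordinates are made up).
const action: GroundedAction = { kind: "click", x: 912, y: 640 };
console.log(action);
```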
Exactly. And if
you think bigger picture for a second, because this whole approach relies purely on
what's visually
on the screen, it's kind of inherently future-proof, isn't it? I mean, you could
potentially use this
same idea for automating tasks inside desktop apps, or even controlling things
inside a virtual
machine where there's no DOM at all. Wow. Yeah, the potential beyond just web
browsers is huge.
Okay, let's get practical. So the architecture is the brain that sees. What about
the arms and
legs? How does it actually do things? The source material breaks it down into four
key capabilities,
or pillars. Yeah, it's a really nice modular design, good for developers because
things
are clearly separated. Okay, pillar one is navigate, the little compass icon. Right,
that's the high-level planner. It uses that visual understanding to figure out the
steps
needed to get from A to B based on your natural language goal. It understands the
journey,
so to speak. Pillar two, interact, the mouse pointer icon. This is the action bit,
right? Making
the precise clicks, typing things in, even complex stuff like dragging and dropping.
Exactly, that's
the execution layer. Does the clicking, the typing, moving the mouse precisely. And
pillar three,
extract, the magnifying glass. This sounds crucial for anyone needing data. It's
about pulling
structured info out of that visual mess. Yeah, intelligently grabbing the useful
bits of structured
data from the page. And finally, pillar four, verify, the check mark. This sounds
like it's
for testing, making sure things actually worked. That's right. It integrates a test
runner with,
and this is cool, powerful visual assertions. So you can automate a process and
then use the
same vision AI to check if the visual result is what you expected. Like, did that
green success
message actually appear? Okay, let's look at how flexible this is. The examples
given show it can
handle really broad goals, but also very specific fiddly actions. Like, for a high-level
goal,
you could just say something like await agent.act('Create a task') with data like a title ('Use Magnitude') and a description, and Magnitude just figures it out. Finds the fields, clicks the button. Pretty much, yes. It interprets 'create a task' in the context of the screen and plans the necessary navigation and interaction steps itself. But then if you need super fine control, it seems it can handle that too. Like the example, await agent.act('Drag Use Magnitude to the top of the In Progress column'). That's not just finding an element, that's understanding spatial stuff,
right? Columns,
positions. Exactly. That drag and drop based purely on visual understanding and
natural language
is pretty powerful. It really shows the level of comprehension.
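Written out as code rather than spoken aloud, those two calls look roughly like the sketch below. It assumes an already-started Magnitude agent and mirrors only what is quoted in the conversation; the description value is a placeholder, and the exact option shape should be checked against the docs.

```typescript
// Assumed: an already-started Magnitude browser agent is in scope. The act()
// shape here mirrors what's quoted in the conversation; check the docs for
// the real signature.
declare const agent: {
  act(instruction: string, options?: { data?: Record<string, string> }): Promise<void>;
};

// High-level goal: the agent plans the navigation and interaction itself.
await agent.act("Create a task", {
  data: {
    title: "Use Magnitude",
    description: "placeholder description", // the source doesn't give a value
  },
});

// Fine-grained, spatially aware instruction: a purely visual drag-and-drop.
await agent.act('Drag "Use Magnitude" to the top of the "In Progress" column');
```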
It does feel a bit like magic. Now let's dig into that extract pillar a bit more.
The source implies it's more than just grabbing raw text. How does it get
structured data?
Ah, yes. This is where it really shines for serious use cases, moving beyond basic
scraping.
It guarantees structure by matching the content it finds visually against a predefined
Zod schema
you provide. Okay, Zod schema. For listeners maybe not deep into TypeScript
development,
can you break that down? Sounds a bit technical. Sure, absolutely. Think of the Zod
schema as just
a strict blueprint or maybe a contract for the data you want. You tell Magnitude
exactly what
pieces of information you're looking for, say a title, a date, a price, and
importantly what
format they should be in, like text, number, date. This forces the data pulled from
the website to
come out perfectly structured, predictable, ready to plug straight into another
system or database.
No messy cleanup needed. And what's really interesting, sometimes even insightful,
is that the agent can use this schema for more than just retrieving what's already
there.
The example given is defining a field in the schema called difficulty, expecting a number from one to five. So Magnitude is told: extract the task's title and description, which are on the page, and also rate the task's difficulty from one to five based on what you read, all according to the schema. It's interpreting the content and adding new structured insight.
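As a rough sketch of that schema-driven extraction, the snippet below pairs a plain Zod schema with an extract call whose exact signature is an assumption based on the description here, not a quote from the docs.

```typescript
import { z } from "zod";

// Assumed: an already-started Magnitude agent is in scope; the extract()
// signature below is an assumption based on the conversation's description.
declare const agent: {
  extract<T extends z.ZodTypeAny>(instruction: string, schema: T): Promise<z.infer<T>>;
};

// The schema is the "contract": titles and descriptions come from the page,
// while difficulty is a rating the agent infers from what it reads.
const taskSchema = z.array(
  z.object({
    title: z.string(),
    description: z.string(),
    difficulty: z.number().min(1).max(5), // not on the page; inferred by the agent
  })
);

const tasks = await agent.extract(
  "Extract each task's title and description, and rate each task's difficulty from 1 to 5",
  taskSchema
);

console.log(tasks); // structured, predictable data, ready to plug into another system
```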
That's incredible. Not just pulling data, but categorizing or interpreting it based on a template. That really does sound like what you'd need for complex, reliable flows. Which brings us neatly to problem
number two,
why typical agents often fail in real-world production scenarios. Yes, exactly. The
second
big weakness of traditional automation, besides brittleness, is often a lack of
real control and
predictability. Many agents, especially some simpler ones you see, kind of follow
this opaque
loop. You give it a high-level prompt, it uses some tools, and it just tries until
it thinks it's done.
That might look impressive in a quick demo video, right? But what happens when a
real website throws
up an unexpected pop-up, or loads slowly, or hits you with a CAPTCHA? Those
simple demo agents often
just fall over unpredictably. They lack the fine-grained control needed for
business critical
stuff. So, Magnitude tackles this by focusing on controllability and repeatability.
How does that
actually feel for the person writing the automation script? It gives the developer
choices, flexible
levels of abstraction. If you're feeling confident about a simple step, sure, give
the agent a high
level task like, complete the checkout, let it figure it out. But crucially, if you
need rock
solid reliability for a tricky part, you can break it down. You can tell it precisely: okay, first, fill in field A with this text. Now, wait until you visually see that the text is confirmed. Then, click the next-step button, which should be around these coordinates. Oh, and if you see an error message pop up, then try doing action Z instead.
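To give a feel for that style in code, here is a minimal sketch. It reuses only the act and extract calls discussed earlier plus a hypothetical true/false check; the real library may offer dedicated waiting and verification helpers, so the shape of the flow is the point, not the exact method names.

```typescript
import { z } from "zod";

// Assumed: an already-started Magnitude agent with the act()/extract() calls
// discussed earlier. Field names and messages here are hypothetical.
declare const agent: {
  act(instruction: string): Promise<void>;
  extract<T extends z.ZodTypeAny>(instruction: string, schema: T): Promise<z.infer<T>>;
};

await agent.act('Fill the email field with "user@example.com"');

// Visually confirm the text actually landed before moving on.
const confirmed = await agent.extract(
  'Is the email field now showing "user@example.com"? Answer true or false.',
  z.boolean()
);

if (confirmed) {
  await agent.act('Click the "Next step" button');
} else {
  // A predictable fallback instead of hoping the agent recovers on its own.
  await agent.act("Dismiss the error message and re-enter the email address");
}
```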
Ah, okay. So you're not just throwing a command into a black box and hoping for the best. You can guide it step by step when needed, and handle errors predictably. Exactly right. That detailed control is essential for
building
automation you can actually trust in a production system. You can build in proper
error handling,
logging, auditing, all the things you need for serious applications. And for true
repeatability,
thinking especially about automated testing, the source mentioned something about
deterministic
runs via caching. That sounds important. Oh, hugely important. It's noted as in
progress,
but that's potentially a game changer for test suites. If you run the same test a
thousand times,
you need the exact same result a thousand times, assuming the website hasn't
changed.
A native caching system would essentially stabilize certain visual interpretations
or navigation choices the AI makes, ensuring that for a given input and a website
state,
the outcome is perfectly predictable, truly deterministic. We should also touch on
performance. This isn't just a cool idea, right? It's been benchmarked. It has,
yeah. Magnitude
performs very well. It's considered state of the art, actually. It scored an
impressive 94% on the
WebVoyager benchmark. And WebVoyager isn't trivial. It tests agents across a
really wide
range of complex, real-world web tasks. Getting a score that high is a strong
signal that this
approach is robust. Okay, but there's a catch, isn't there? A technical requirement.
Being vision
first means it needs serious AI muscle behind it. It won't run on my laptop's basic
CPU.
That's correct. Since it fundamentally relies on seeing and interpreting the screen
visually,
it needs a large, powerful, visually grounded model to do that interpretation.
The documentation specifically recommends using Claude Sonnet 4 for the best results right now.
right now.
Seems like it gives the highest quality visual understanding. However, it's also
compatible with
open models, specifically mentioning Qwen 2.5-VL 72B. But yes, you need access to
one of these
quite sophisticated visual AI engines. That's what's doing the actual seeing.
Right. That makes sense. So for our listeners who are thinking,
okay, this sounds amazing. I want to try it. Maybe someone just starting out.
What's the easiest way to dip their toes in? They've actually made the getting
started
process really smooth. To just create your first basic automation script, there's a
simple command
npx create-magnitude-app. That one command sets up a new project, handles the configuration,
configuration,
and crucially, it drops in a working example script right away. So beginners get
something
tangible they can run and tinker with immediately. That's great. And what about
developers who
already have, say, a web app and want to use Magnitude for testing it? Maybe use
those visual
assertions. Yeah. If you're integrating it into an existing project, primarily for
testing,
the commands are slightly different. It's npm install --save-dev magnitude-test to install
it as a
development dependency, followed by npx magnitude init. That init command sets up
the necessary
configuration files so you can start writing those reliable vision-based tests
pretty much straight away.
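To give a sense of what one of those tests might look like, here is a hypothetical sketch; the test-runner syntax (the magnitude-test import, the callback shape, the check assertion) is assumed for illustration rather than taken from the source, so the real API in the docs may differ.

```typescript
// Hypothetical sketch only: the "magnitude-test" import, the test() callback
// shape, and the check() assertion are assumptions for illustration; consult
// the project's docs for the real syntax. The point is combining an
// end-to-end step with a visual assertion checked by the same vision model.
import { test } from "magnitude-test";

test("can create a task", async (agent) => {
  await agent.act("Create a task", {
    data: { title: "Use Magnitude" },
  });

  // Visual assertion: did the expected result actually appear on screen?
  await agent.check('The task "Use Magnitude" appears in the task list');
});
```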
Okay. We have covered a lot of ground. We've seen how Magnitude aims to
solve those two huge problems in browser automation. First, that brittleness issue
by swapping numbered boxes for pixel coordinates and actual vision. And second,
that lack of
production-ready reliability by focusing on fine-grained controllability and
ensuring
structured output using things like Zod schemas. Yeah, the whole vision-first open-source
approach
really does feel like it could change the game for how we think about reliable
automation and
maybe even system integration. So here's the final thought to leave you with. If a
tool can reliably
see, understand, and interact with any visual interface, web page, desktop, app,
whatever,
just using plain language, and it doesn't really care about the messy code
underneath,
what does that imply for the future? Does it maybe reduce the need for complex
custom-built
APIs for getting systems to talk to each other? If you can just tell an agent to
use the interface
like a human would, that's definitely something worth mulling over. Thank you so
much for joining
us on this deep dive. Remember, this show is supported by SafeServer. SafeServer
supports
your digital transformation needs and handles software hosting perfect for
technologies like Magnitude. Find out more at www.safeserver.de