Today's Deep-Dive: Magnitude
R. 326

Episode description

Magnitude is an open-source, vision-first browser agent that uses artificial intelligence to control web browsers with natural language. Unlike traditional automation tools that rely on fragile DOM structures, Magnitude employs a vision AI to “see” and understand web pages the way a human does, making automation more reliable and less prone to breaking when websites change. Its architecture is built around a visually grounded LLM that connects language commands with visual input, directing actions such as clicks via pixel coordinates rather than element IDs. The project is organized around four key capabilities: navigate, interact, extract, and verify, covering high-level planning, precise execution, structured data extraction using Zod schemas, and visual assertion-based testing. Magnitude addresses the brittleness and lack of control common in older automation tools by offering fine-grained controllability and deterministic runs through caching. While it requires significant AI processing power, typically using models like Claude Sonnet 4, it offers a streamlined setup process for beginners and integration options for existing projects. The vision-first approach has the potential to reshape web automation and system integration by enabling interaction with any visual interface through natural language, potentially reducing the need for custom APIs.
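
To make those capabilities concrete, here is a minimal TypeScript sketch of what driving Magnitude could look like. The agent.act call and the Zod-based extraction mirror the description above; the startBrowserAgent entry point, its options, and agent.stop are assumptions for illustration and should be checked against the Magnitude documentation.

import { z } from "zod";
import { startBrowserAgent } from "magnitude-core"; // assumed entry point

// Start an agent on a page (option names are assumptions for illustration).
const agent = await startBrowserAgent({ url: "https://example.com" });

// Navigate + interact: state the goal in natural language.
await agent.act("create a task", { data: { title: "use magnitude" } });

// Extract: pull structured data that must match a Zod schema.
const tasks = await agent.extract(
  "list the tasks shown on the board",
  z.array(z.object({ title: z.string() })),
);

await agent.stop(); // assumed cleanup call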

Gain digital sovereignty now and save costs

Let’s have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? What is the state of your backups and security updates?

Digital sovereignty is easily achieved with Open Source software (which usually costs far less, too). Our division Safeserver offers hosting, operation, and maintenance for countless Free and Open Source tools.

Try it now!

Transcript
Okay, let's unpack this. Today we are diving deep into something pretty exciting in web tech. It's about controlling web browsers using real artificial intelligence. We're looking at this fascinating open-source project you might have seen on GitHub, called Magnitude. It's a vision-first browser agent. Now, if you're someone who's constantly battling fragile web automation, maybe you're trying to scrape data or run integration tests or just automate some repetitive clicking, well, you're definitely going to want to listen in. Our mission today is really to understand how a tool like this can actually see and understand a web page well enough to handle complex tasks reliably. We want to get why this vision-first approach is apparently so much better. And importantly, we want to make sure that even if you're new to this, you get a clear idea of how you could start using this kind of power. But before we really get into the nuts and bolts, the architecture and all that, let's take a moment to thank our supporter.

This deep dive is made possible by SafeServer. SafeServer is all about hosting software and helping out with digital transformation. So if you're thinking about hosting solutions, especially for cutting-edge stuff like these browser agents, check them out. You can find out more at www.safeserver.de.

Right, so back to Magnitude. The core promise here sounds almost too good to be true: using just natural language, plain English, to control a browser and have it actually work reliably even as the site changes underneath.

That really is the core of it, yeah. And that reliability promise comes directly from how it's built, its whole philosophy. Moving beyond just the code.

Just for context, right? A browser agent is basically software that does web tasks for you. Think of it like your digital assistant for the web. People use them for all sorts of things, like running really complex tests from start to finish, or connecting online services that don't have a proper API to talk to each other.

Okay. And anyone who's tried, say, web scraping or running tests with the older tools, maybe Selenium or something similar, knows the pain points. It all depends on the website's code structure, right? The DOM. But tell me, why is relying on that DOM structure such a recipe for, well, headaches? This is really problem number one we need to tackle.

Yeah. It really boils down to just one word: brittleness. Traditional agents look at that hidden structure, the code, and they try to click on things or type into boxes by finding their specific name or ID in the DOM. They're basically drawing numbered boxes around things based on the underlying HTML. You can't see the boxes, but that's how the agent finds things. But here's the problem: modern websites are incredibly dynamic. They change all the time. A developer might run an A/B test, shift things around, update a tiny bit of code, and bam, the automation breaks instantly. It just doesn't generalize well because it's totally dependent on those hidden code details, not on what the user actually sees and interacts with.

So the automation script only works if the website is basically frozen in time, which, let's be honest, never happens.

Precisely. Magnitude just completely sidesteps that whole dependency. It uses a vision AI, think of it like artificial eyes, to actually see and understand the layout, the interface, just like you or I would. It doesn't really care what the code is doing underneath.

That is a massive shift. So we're not looking at element IDs or class names anymore. We're looking at the actual pixels, the visual arrangement on the screen. How does that work technically? How does it make it more robust?

Well, the architecture is centered around what's called a visually grounded LLM. That's a large language model, an AI that's been specifically trained to connect language commands like "click the checkout button" with visual input from the screen. And here's the absolute key detail: instead of trying to find some fragile code ID for that button, the LLM tells the system where to click using precise pixel coordinates, x and y, on the screen. So the agent sees the thing that looks like a checkout button in the right context, and it directs the mouse click right to that spot. The code behind it could change completely, but as long as the button looks like a button and is where you'd expect it, the action works.

Okay, got it. So if the button's ID changes from, I don't know, button-123 to button-abc, the old way breaks. But Magnitude just sees the button's shape and text and clicks in the right place.

Exactly. And if you think bigger picture for a second, because this whole approach relies purely on what's visually on the screen, it's kind of inherently future-proof, isn't it? I mean, you could potentially use this same idea for automating tasks inside desktop apps, or even controlling things inside a virtual machine where there's no DOM at all.

Wow. Yeah, the potential beyond just web browsers is huge.

Okay, let's get practical. So the architecture is the brain that sees. What about the arms and legs? How does it actually do things? The source material breaks it down into four key capabilities, or pillars.

Yeah, it's a really nice modular design, good for developers because things are clearly separated.

Okay, pillar one is navigate, the little compass icon.

Right, that's the high-level planner. It uses that visual understanding to figure out the steps needed to get from A to B based on your natural language goal. It understands the journey, so to speak.

Pillar two, interact, the mouse pointer icon. This is the action bit, right? Making the precise clicks, typing things in, even complex stuff like dragging and dropping.

Exactly, that's the execution layer. It does the clicking, the typing, moving the mouse precisely.

And pillar three, extract, the magnifying glass. This sounds crucial for anyone needing data. It's about pulling structured info out of that visual mess.

Yeah, intelligently grabbing the useful bits of structured data from the page.

And finally, pillar four, verify, the check mark. This sounds like it's for testing, making sure things actually worked.

That's right. It integrates a test runner with, and this is cool, powerful visual assertions. So you can automate a process and then use the same vision AI to check whether the visual result is what you expected. Like, did that green success message actually appear?

Okay, let's look at how flexible this is. The examples given show it can handle really broad goals, but also very specific, fiddly actions. Like, for a high-level goal, you could just say something like await agent.act("create a task") with data such as a title, "use magnitude", and a description, and Magnitude just figures it out. It finds the fields, clicks the button.

Pretty much, yes. It interprets "create a task" in the context of the screen and plans the necessary navigation and interaction steps itself.

But then if you need super fine control, it seems it can handle that too. Like the example: await agent.act("drag 'use magnitude' to the top of the in progress column"). That's not just finding an element, that's understanding spatial stuff, right? Columns, positions.

Exactly. That drag and drop, based purely on visual understanding and natural language, is pretty powerful. It really shows the level of comprehension.
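
Reconstructed as code, those two calls might look roughly like this, assuming an agent object has already been started. The shape of the data option follows the spoken example, and the description value is a placeholder the episode leaves open.

// High-level goal: Magnitude plans the navigation and interaction itself.
await agent.act("create a task", {
  data: {
    title: "use magnitude",
    description: "placeholder; the episode does not give a value here",
  },
});

// Fine-grained, spatially aware instruction against the same screen.
await agent.act('drag "use magnitude" to the top of the in progress column');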

It does feel a bit like magic. Now let's dig into that extract pillar a bit more. The source implies it's more than just grabbing raw text. How does it get structured data?

Ah, yes. This is where it really shines for serious use cases, moving beyond basic scraping. It guarantees structure by matching the content it finds visually against a predefined Zod schema you provide.

Okay, Zod schema. For listeners maybe not deep into TypeScript development, can you break that down? It sounds a bit technical.

Sure, absolutely. Think of the Zod schema as a strict blueprint, or maybe a contract, for the data you want. You tell Magnitude exactly what pieces of information you're looking for, say a title, a date, a price, and importantly what format they should be in, like text, number, date. This forces the data pulled from the website to come out perfectly structured, predictable, ready to plug straight into another system or database. No messy cleanup needed. And what's really interesting, sometimes even insightful, is that the agent can use this schema for more than just retrieving what's already there. The example given is defining a field in the schema called difficulty, expecting a number from one to five. So Magnitude is told: extract the task's title and description, which are on the page, and also rate the task's difficulty from one to five based on what you read, according to the schema. It's interpreting the content and adding new structured insight.

That's incredible. Not just pulling data, but categorizing or interpreting it based on a template. That really does sound like what you'd need for complex, reliable flows.
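
As a rough TypeScript sketch of that extraction pattern: the schema below follows the difficulty example from the episode, while the agent.extract call, its name and signature, is an assumption about how the extract pillar is exposed.

import { z } from "zod";

// Blueprint ("contract") for the data we want back from the page.
const taskSchema = z.object({
  title: z.string(),       // visible on the page
  description: z.string(), // visible on the page
  difficulty: z.number().min(1).max(5), // not on the page: the agent rates this itself
});

// Assumed call shape: a natural-language instruction plus the schema to satisfy.
const tasks = await agent.extract(
  "Extract each task's title and description, and rate its difficulty from 1 to 5",
  z.array(taskSchema),
);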

Which brings us neatly to problem number two: why typical agents often fail in real-world production scenarios.

Yes, exactly. The second big weakness of traditional automation, besides brittleness, is often a lack of real control and predictability. Many agents, especially some of the simpler ones you see, follow this opaque loop: you give it a high-level prompt, it uses some tools, and it just tries until it thinks it's done. That might look impressive in a quick demo video, right? But what happens when a real website throws up an unexpected pop-up, or loads slowly, or hits you with a CAPTCHA? Those simple demo agents often just fall over unpredictably. They lack the fine-grained control needed for business-critical stuff.

So Magnitude tackles this by focusing on controllability and repeatability. How does that actually feel for the person writing the automation script?

It gives the developer choices, flexible levels of abstraction. If you're feeling confident about a simple step, sure, give the agent a high-level task like "complete the checkout" and let it figure it out. But crucially, if you need rock-solid reliability for a tricky part, you can break it down. You can tell it precisely: okay, first, fill in field A with this text. Now, wait until you visually see that the text is confirmed. Then, click the next step button, which should be around these coordinates. Oh, and if you see an error message pop up, then try doing action Z instead.

Ah, okay. So you're not just throwing a command into a black box and hoping for the best. You can guide it step by step when needed, handle errors predictably.

Exactly right. That detailed control is essential for building automation you can actually trust in a production system. You can build in proper error handling, logging, auditing, all the things you need for serious applications.
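
In code, that kind of step-by-step guidance could look something like the sketch below. The individual instructions are illustrative, the check-style visual assertion is assumed to exist alongside act (as the verify pillar suggests), and the recovery branch assumes act rejects when it cannot complete a step.

// Guide the agent step by step instead of issuing one opaque high-level prompt.
await agent.act("fill in the email field with test@example.com"); // illustrative field and value

// Assumed visual assertion: confirm the entered text is actually shown on screen.
await agent.check("the email field shows test@example.com");

try {
  await agent.act("click the 'Next step' button");
} catch (error) {
  // Illustrative recovery: react to an unexpected error state instead of failing blindly.
  await agent.act("dismiss the error message and submit the form again");
}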

9:50

repeatability,

9:51

thinking especially about automated testing, the source mentioned something about

9:55

deterministic

9:56

runs via caching. That sounds important. Oh, hugely important. It's noted as in

10:01

progress,

10:02

but that's potentially a game changer for test suites. If you run the same test a

10:06

thousand times,

10:07

you need the exact same result a thousand times, assuming the website hasn't

10:10

changed.

10:11

A native caching system would essentially stabilize certain visual interpretations

10:16

or navigation choices the AI makes, ensuring that for a given input and a website

10:20

state,

10:21

the outcome is perfectly predictable, truly deterministic. We should also touch on

10:26

performance. This isn't just a cool idea, right? It's been benchmarked. It has,

10:30

yeah. Magnitude

10:31

performs very well. It's considered state of the art, actually. It scored an

10:34

impressive 94% on the

10:36

Web Voyager benchmark. And Web Voyager isn't trivial. It tests agents across a

10:40

really wide

10:41

range of complex, real-world web tasks. Getting a score that high is a strong

10:46

signal that this

10:47

approach is robust. Okay, but there's a catch, isn't there? A technical requirement.

10:51

Being vision

10:51

first means it needs serious AI muscle behind it. It won't run on my laptop's basic

10:56

CPU.

10:56

That's correct. Since it fundamentally relies on seeing and interpreting the screen

11:00

visually,

11:01

it needs a large, powerful, visually grounded model to do that interpretation.

11:05

The documentation specifically recommends using Cloud Sonnet 4 for the best results

11:10

right now.

11:11

Seems like it gives the highest quality visual understanding. However, it's also

11:15

compatible with

11:16

open models, specifically mentioning Quinn 2.5VL72B. But yes, you need access to

11:22

one of these

11:23

quite sophisticated visual AI engines. That's what's doing the actual seeing.

Right. That makes sense. So for our listeners who are thinking, okay, this sounds amazing, I want to try it, maybe someone just starting out, what's the easiest way to dip their toes in?

They've actually made the getting-started process really smooth. To create your first basic automation script, there's a simple command, npx create-magnitude-app. That one command sets up a new project, handles the configuration, and crucially, it drops in a working example script right away. So beginners get something tangible they can run and tinker with immediately.

That's great. And what about developers who already have, say, a web app and want to use Magnitude for testing it? Maybe use those visual assertions.

Yeah. If you're integrating it into an existing project, primarily for testing, the commands are slightly different. It's npm i --save-dev magnitude-test to install it as a development dependency, followed by npx magnitude init. That init command sets up the necessary configuration files so you can start writing those reliable vision-based tests pretty much straight away.
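
A vision-based test written against that setup might look roughly like this; the test helper import and the check method are assumptions based on how the verify pillar is described, so treat it as a sketch rather than the exact magnitude-test API.

import { test } from "magnitude-test";

test("user can create a task", async (agent) => {
  // Drive the UI in natural language, as in the earlier examples.
  await agent.act("create a task", { data: { title: "use magnitude" } });

  // Visual assertion: the vision model verifies the expected result on screen.
  await agent.check("a task titled 'use magnitude' appears in the list");
});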

Okay. We have covered a lot of ground. We've seen how Magnitude aims to solve those two huge problems in browser automation. First, that brittleness issue, by swapping numbered boxes for pixel coordinates and actual vision. And second, that lack of production-ready reliability, by focusing on fine-grained controllability and ensuring structured output using things like Zod schemas.

Yeah, the whole vision-first, open-source approach really does feel like it could change the game for how we think about reliable automation and maybe even system integration.

So here's the final thought to leave you with. If a tool can reliably see, understand, and interact with any visual interface, web page, desktop app, whatever, just using plain language, and it doesn't really care about the messy code underneath, what does that imply for the future? Does it maybe reduce the need for complex custom-built APIs for getting systems to talk to each other? If you can just tell an agent to use the interface like a human would, that's definitely something worth mulling over.

Thank you so much for joining us on this deep dive. Remember, this show is supported by SafeServer. SafeServer supports your digital transformation needs and handles software hosting, perfect for technologies like Magnitude. Find out more at www.safeserver.de.
