1 00:00:00,000 --> 00:00:04,320 Okay, let's unpack this. Today we are diving deep into something pretty exciting in 2 00:00:04,320 --> 00:00:05,280 web tech. 3 00:00:05,280 --> 00:00:11,200 It's about controlling web browsers using real artificial intelligence. We're 4 00:00:11,200 --> 00:00:11,760 looking at this 5 00:00:11,760 --> 00:00:15,740 fascinating open source project. You might have seen it on GitHub called Magnitude. 6 00:00:15,740 --> 00:00:16,240 It's a vision 7 00:00:16,240 --> 00:00:20,750 first browser agent. Now, if you're someone who's, you know, constantly battling 8 00:00:20,750 --> 00:00:21,520 fragile web 9 00:00:21,520 --> 00:00:26,900 automation, maybe you're trying to scrape data or run integration tests or just 10 00:00:26,900 --> 00:00:27,760 automate some 11 00:00:27,760 --> 00:00:31,260 repetitive clicking, well, you're definitely going to want to listen in. Our 12 00:00:31,260 --> 00:00:32,160 mission today is really 13 00:00:32,160 --> 00:00:36,610 to understand how a tool like this can actually see and understand a web page well 14 00:00:36,610 --> 00:00:37,040 enough to 15 00:00:37,040 --> 00:00:41,340 handle complex tasks reliably. We want to get why this vision first thing is 16 00:00:41,340 --> 00:00:42,480 apparently so much 17 00:00:42,480 --> 00:00:45,660 better. And importantly, we want to make sure that even if you're sort of new to 18 00:00:45,660 --> 00:00:46,400 this, you get a 19 00:00:46,400 --> 00:00:50,420 clear idea of how you could start using this kind of power. But before we really 20 00:00:50,420 --> 00:00:51,200 get into the nuts 21 00:00:51,200 --> 00:00:54,070 and bolts, the architecture and all that, let's just take a moment to thank our 22 00:00:54,070 --> 00:00:54,720 supporter. 23 00:00:55,680 --> 00:00:59,760 This deep dive is made possible by SafeServer. SafeServer is all about hosting 24 00:00:59,760 --> 00:01:00,240 software and 25 00:01:00,240 --> 00:01:02,880 helping out with digital transformation. 
So if you're thinking about hosting 26 00:01:02,880 --> 00:01:03,440 solutions, 27 00:01:03,440 --> 00:01:07,040 especially for cutting-edge stuff like these browser agents, check them out. 28 00:01:07,040 --> 00:01:13,770 You can find out more at www.safeserver.de. Right, so back to Magnitude. The 29 00:01:13,770 --> 00:01:14,880 core promise here, 30 00:01:14,880 --> 00:01:19,620 it sounds almost too good to be true. Using just natural language, plain English, to 31 00:01:19,620 --> 00:01:20,000 control a 32 00:01:20,000 --> 00:01:23,540 browser and have it actually work reliably even as the site changes underneath. 33 00:01:23,540 --> 00:01:24,160 That really is the 34 00:01:24,160 --> 00:01:27,780 core of it, yeah. And that reliability promise, it comes directly from how it's 35 00:01:27,780 --> 00:01:28,720 built, its whole 36 00:01:28,720 --> 00:01:33,490 philosophy. Moving beyond just the code. Just for context, right? A browser agent 37 00:01:33,490 --> 00:01:34,240 is basically 38 00:01:34,240 --> 00:01:39,180 software that does web tasks for you. Think of it like your digital assistant for 39 00:01:39,180 --> 00:01:39,920 the web. 40 00:01:39,920 --> 00:01:43,730 People use them for all sorts, like running really complex tests from start to 41 00:01:43,730 --> 00:01:44,080 finish, 42 00:01:44,080 --> 00:01:48,680 or maybe connecting to online services that don't have a proper API to talk to each 43 00:01:48,680 --> 00:01:49,360 other. 44 00:01:49,360 --> 00:01:54,190 Okay. And anyone who's tried, say, web scraping or running tests with the older 45 00:01:54,190 --> 00:01:54,640 tools, maybe 46 00:01:54,640 --> 00:01:58,320 Selenium or something similar, they know the pain points. It all depends on the 47 00:01:58,320 --> 00:01:58,720 website's 48 00:01:58,720 --> 00:02:02,880 code structure, right? The DOM.
But tell me, why is relying on that DOM structure 49 00:02:02,880 --> 00:02:03,920 such a recipe for, 50 00:02:03,920 --> 00:02:07,920 well, headaches? This is really problem number one we need to tackle. 51 00:02:07,920 --> 00:02:12,480 Yeah. It really boils down to just one word. Brittleness. Traditional agents, they 52 00:02:12,480 --> 00:02:12,720 look at 53 00:02:12,720 --> 00:02:17,350 that hidden structure, the code, and they try to click on things or type into boxes 54 00:02:17,350 --> 00:02:17,840 by finding 55 00:02:17,840 --> 00:02:22,150 their specific name or ID in that DOM. They're basically drawing numbered boxes 56 00:02:22,150 --> 00:02:22,880 around things 57 00:02:22,880 --> 00:02:26,690 based on the underlying HTML. You can't see the boxes, but that's how the agent 58 00:02:26,690 --> 00:02:27,680 finds things. 59 00:02:27,680 --> 00:02:32,800 But here's the problem. Modern websites are incredibly dynamic. They change all the 60 00:02:32,800 --> 00:02:33,280 time. 61 00:02:33,280 --> 00:02:37,520 A developer might run an A/B test, shift things around, update a tiny bit of code, 62 00:02:37,520 --> 00:02:38,560 and bam, 63 00:02:38,560 --> 00:02:41,700 the automation breaks instantly. It just doesn't generalize well because it's 64 00:02:41,700 --> 00:02:42,320 totally dependent 65 00:02:42,320 --> 00:02:46,880 on those hidden code details, not on what the user actually sees and interacts with. 66 00:02:46,880 --> 00:02:51,200 So the automation script only works if the website is basically frozen in time, 67 00:02:51,200 --> 00:02:55,600 which, let's be honest, never happens. Precisely. Magnitude just completely 68 00:02:55,600 --> 00:02:59,760 sidesteps that whole dependency. It uses a vision AI, think of it like artificial 69 00:02:59,760 --> 00:03:00,320 eyes, 70 00:03:00,320 --> 00:03:04,250 to actually see and understand the layout, the interface, just like you or I would.
71 00:03:04,250 --> 00:03:04,560 It doesn't 72 00:03:04,560 --> 00:03:08,320 really care what the code is doing underneath. That is a massive shift. So we're 73 00:03:08,320 --> 00:03:08,960 not looking at 74 00:03:08,960 --> 00:03:14,080 element IDs or class names anymore. We're looking at the actual pixels, the visual 75 00:03:14,080 --> 00:03:18,270 arrangement on the screen. How does that work technically? How does it make it more 76 00:03:18,270 --> 00:03:18,880 robust? 77 00:03:18,880 --> 00:03:23,680 Well, the architecture is centered around what's called a visually grounded LLM. 78 00:03:23,680 --> 00:03:28,070 That's a large language model, an AI that's been specifically trained to connect 79 00:03:28,070 --> 00:03:28,880 language commands 80 00:03:28,880 --> 00:03:33,590 like "click the checkout button" with visual input from the screen. And here's the 81 00:03:33,590 --> 00:03:34,480 absolute key 82 00:03:34,480 --> 00:03:40,030 detail. Instead of trying to find some fragile code ID for that button, the LLM 83 00:03:40,030 --> 00:03:41,120 tells the system 84 00:03:41,120 --> 00:03:46,120 where to click using precise pixel coordinates, X and Y, on the screen. So the agent 85 00:03:46,120 --> 00:03:46,720 sees the thing 86 00:03:46,720 --> 00:03:50,470 that looks like a checkout button in the right context, and it directs the mouse 87 00:03:50,470 --> 00:03:51,120 click right to 88 00:03:51,120 --> 00:03:54,670 that spot on the screen. The code behind it could change completely, but as long as 89 00:03:54,670 --> 00:03:55,360 the button looks 90 00:03:55,360 --> 00:03:59,060 like a button and is where you'd expect it, the action works. Okay, got it. So if 91 00:03:59,060 --> 00:03:59,760 the button's 92 00:03:59,760 --> 00:04:04,150 ID changes from, I don't know, button-123 to button-abc, the old way 93 00:04:04,150 --> 00:04:05,120 breaks.
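To make that contrast concrete, here is a minimal TypeScript sketch. None of it is Magnitude's own code: `UiElement`, `findById`, and `locateVisually` are hypothetical names, and the "visual" lookup is faked with a label match that stands in for what a visually grounded model would do from a screenshot.

```typescript
// Hypothetical model of a rendered button: a DOM id plus what a user sees.
interface UiElement {
  id: string;    // implementation detail, may change on any deploy
  label: string; // visible text
  x: number;     // centre of the element on screen, in pixels
  y: number;
}

interface ScreenPoint {
  x: number;
  y: number;
}

// DOM-style lookup: succeeds only while the id stays the same.
function findById(page: UiElement[], id: string): UiElement | undefined {
  return page.find((el) => el.id === id);
}

// Stand-in for a visually grounded model: it goes from what is visible
// (here, just the label) to pixel coordinates. A real model would work
// from a screenshot, not from a list of elements.
function locateVisually(page: UiElement[], visibleText: string): ScreenPoint | undefined {
  const hit = page.find((el) => el.label.toLowerCase() === visibleText.toLowerCase());
  return hit ? { x: hit.x, y: hit.y } : undefined;
}

// The same page before and after a refactor that renamed the button's id.
const pageBefore: UiElement[] = [{ id: "button-123", label: "Checkout", x: 640, y: 480 }];
const pageAfter: UiElement[] = [{ id: "button-abc", label: "Checkout", x: 640, y: 480 }];

const selectorBefore = findById(pageBefore, "button-123"); // found
const selectorAfter = findById(pageAfter, "button-123");   // undefined: the old script breaks
const clickBefore = locateVisually(pageBefore, "checkout");
const clickAfter = locateVisually(pageAfter, "checkout");  // same coordinates as before
```

The id-based lookup fails as soon as the id is renamed, while the coordinate lookup is unaffected because nothing the user can see has changed.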
But 94 00:04:05,120 --> 00:04:09,070 Magnitude just sees the button shape and text and clicks in the right place. 95 00:04:09,070 --> 00:04:09,920 Exactly. And if 96 00:04:09,920 --> 00:04:13,340 you think bigger picture for a second, because this whole approach relies purely on 97 00:04:13,340 --> 00:04:14,160 what's visually 98 00:04:14,160 --> 00:04:18,520 on the screen, it's kind of inherently future-proof, isn't it? I mean, you could 99 00:04:18,520 --> 00:04:19,440 potentially use this 100 00:04:19,440 --> 00:04:23,660 same idea for automating tasks inside desktop apps, or even controlling things 101 00:04:23,660 --> 00:04:24,560 inside a virtual 102 00:04:24,560 --> 00:04:28,280 machine where there's no DOM at all. Wow. Yeah, the potential beyond just web 103 00:04:28,280 --> 00:04:29,280 browsers is huge. 104 00:04:29,280 --> 00:04:34,710 Okay, let's get practical. So the architecture is the brain that sees. What about 105 00:04:34,710 --> 00:04:36,080 the arms and 106 00:04:36,080 --> 00:04:40,320 legs? How does it actually do things? The source material breaks it down into four 107 00:04:40,320 --> 00:04:41,440 key capabilities, 108 00:04:41,440 --> 00:04:45,280 or pillars. Yeah, it's a really nice modular design, good for developers because 109 00:04:45,280 --> 00:04:45,520 things 110 00:04:45,520 --> 00:04:50,480 are clearly separated. Okay, pillar one is navigate, the little compass icon. Right, 111 00:04:50,480 --> 00:04:54,420 that's the high-level planner. It uses that visual understanding to figure out the 112 00:04:54,420 --> 00:04:54,800 steps 113 00:04:54,800 --> 00:04:59,110 needed to get from A to B based on your natural language goal. It understands the 114 00:04:59,110 --> 00:04:59,760 journey, 115 00:04:59,760 --> 00:05:05,150 so to speak. Pillar two, interact, the mouse pointer icon. This is the action bit, 116 00:05:05,150 --> 00:05:06,080 right?
Making 117 00:05:06,080 --> 00:05:10,620 the precise clicks, typing things in, even complex stuff like dragging and dropping. 118 00:05:10,620 --> 00:05:11,200 Exactly, that's 119 00:05:11,200 --> 00:05:15,800 the execution layer. Does the clicking, the typing, moving the mouse precisely. And 120 00:05:15,800 --> 00:05:16,800 pillar three, 121 00:05:16,800 --> 00:05:22,270 extract, the magnifying glass. This sounds crucial for anyone needing data. It's 122 00:05:22,270 --> 00:05:22,880 about pulling 123 00:05:22,880 --> 00:05:27,440 structured info out of that visual mess. Yeah, intelligently grabbing the useful 124 00:05:27,440 --> 00:05:28,400 bits of structured 125 00:05:28,400 --> 00:05:35,360 data from the page. And finally, pillar four, verify, the check mark. This sounds 126 00:05:35,360 --> 00:05:35,680 like it's 127 00:05:35,680 --> 00:05:40,230 for testing, making sure things actually worked. That's right. It integrates a test 128 00:05:40,230 --> 00:05:40,800 runner with, 129 00:05:40,800 --> 00:05:45,360 and this is cool, powerful visual assertions. So you can automate a process and 130 00:05:45,360 --> 00:05:45,760 then use the 131 00:05:45,760 --> 00:05:49,940 same vision AI to check if the visual result is what you expected. Like, did that 132 00:05:49,940 --> 00:05:50,640 green success 133 00:05:50,640 --> 00:05:54,550 message actually appear? Okay, let's look at how flexible this is. The examples 134 00:05:54,550 --> 00:05:55,280 given show it can 135 00:05:55,280 --> 00:06:00,410 handle really broad goals, but also very specific, fiddly actions. Like, for a high-level 136 00:06:00,410 --> 00:06:00,800 goal, 137 00:06:00,800 --> 00:06:05,650 you could just say something like await agent.act, "Create a task", with data: a title, 138 00:06:05,650 --> 00:06:06,480 "Use Magnitude", 139 00:06:06,480 --> 00:06:11,420 and a description, and Magnitude just figures it out. Finds the fields, clicks the button.
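As a rough illustration of that high-level call, the sketch below models an agent with a single `act` method. The interface and the `RecordingAgent` class are assumptions made for this example, shaped after the call quoted in the episode rather than taken from the library, and the description string is placeholder text. The stand-in only records what it was asked to do; it does not drive a browser.

```typescript
// Hypothetical, pared-down shape of an agent: one method that takes a
// natural-language instruction plus optional data.
interface ActOptions {
  data?: Record<string, string>;
}

interface BrowserAgentLike {
  act(instruction: string, options?: ActOptions): Promise<void>;
}

// A stand-in that records requests instead of controlling a real page.
class RecordingAgent implements BrowserAgentLike {
  readonly calls: { instruction: string; data: Record<string, string> }[] = [];

  async act(instruction: string, options: ActOptions = {}): Promise<void> {
    this.calls.push({ instruction, data: options.data ?? {} });
  }
}

// High-level use: state the goal, hand over the data, and leave the
// individual clicks and keystrokes to the agent.
async function createTask(agent: BrowserAgentLike): Promise<void> {
  await agent.act("Create a task", {
    data: {
      title: "Use Magnitude",
      description: "Try a vision-first browser agent", // placeholder text
    },
  });
}

const recordingAgent = new RecordingAgent();
void createTask(recordingAgent);
```

The point of the shape is that the caller supplies intent and data only; which fields to fill and which button to press is left to the agent.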
140 00:06:11,420 --> 00:06:11,680 Pretty much, 141 00:06:11,680 --> 00:06:16,720 yes. It interprets "create a task" in the context of the screen and plans the 142 00:06:16,720 --> 00:06:17,920 necessary navigation 143 00:06:17,920 --> 00:06:22,880 and interaction steps itself. But then if you need super fine control, it seems it 144 00:06:22,880 --> 00:06:23,200 can handle 145 00:06:23,200 --> 00:06:27,670 that too. Like the example, await agent.act, drag "Use Magnitude" to the top of 146 00:06:27,670 --> 00:06:28,560 the in progress 147 00:06:28,560 --> 00:06:32,750 column. That's not just finding an element, that's understanding spatial stuff, 148 00:06:32,750 --> 00:06:34,080 right? Columns, 149 00:06:34,080 --> 00:06:38,160 positions. Exactly. That drag and drop, based purely on visual understanding and 150 00:06:38,160 --> 00:06:39,360 natural language, 151 00:06:39,360 --> 00:06:43,280 is pretty powerful. It really shows the level of comprehension. 152 00:06:43,280 --> 00:06:47,680 It does feel a bit like magic. Now let's dig into that extract pillar a bit more. 153 00:06:47,680 --> 00:06:51,570 The source implies it's more than just grabbing raw text. How does it get 154 00:06:51,570 --> 00:06:52,480 structured data? 155 00:06:52,480 --> 00:06:58,080 Ah, yes. This is where it really shines for serious use cases, moving beyond basic 156 00:06:58,080 --> 00:06:58,800 scraping. 157 00:06:58,800 --> 00:07:03,440 It guarantees structure by matching the content it finds visually against a predefined 158 00:07:03,440 --> 00:07:04,080 Zod schema 159 00:07:04,080 --> 00:07:08,160 you provide. Okay, Zod schema. For listeners maybe not deep into TypeScript 160 00:07:08,160 --> 00:07:08,480 development, 161 00:07:08,480 --> 00:07:12,580 can you break that down? Sounds a bit technical. Sure, absolutely. Think of the Zod 162 00:07:12,580 --> 00:07:13,760 schema as just 163 00:07:13,760 --> 00:07:18,320 a strict blueprint or maybe a contract for the data you want.
You tell Magnitude 164 00:07:18,320 --> 00:07:18,960 exactly what 165 00:07:18,960 --> 00:07:22,570 pieces of information you're looking for, say a title, a date, a price, and 166 00:07:22,570 --> 00:07:23,520 importantly what 167 00:07:23,520 --> 00:07:27,790 format they should be in, like text, number, date. This forces the data pulled from 168 00:07:27,790 --> 00:07:28,560 the website to 169 00:07:28,560 --> 00:07:32,750 come out perfectly structured, predictable, ready to plug straight into another 170 00:07:32,750 --> 00:07:33,920 system or database. 171 00:07:33,920 --> 00:07:38,000 No messy cleanup needed. And what's really interesting, sometimes even insightful, 172 00:07:38,000 --> 00:07:42,160 is that the agent can use this schema for more than just retrieving what's already 173 00:07:42,160 --> 00:07:42,800 there. 174 00:07:42,800 --> 00:07:46,960 The example given is defining a field in the schema called difficulty, expecting a 175 00:07:46,960 --> 00:07:47,280 number 176 00:07:47,280 --> 00:07:51,510 from one to five, so Magnitude is told: extract the task's title and description, 177 00:07:51,510 --> 00:07:52,000 which are on the 178 00:07:52,000 --> 00:07:56,300 page, and also rate the task's difficulty from one to five based on what you read, 179 00:07:56,300 --> 00:07:56,720 according to the 180 00:07:56,720 --> 00:08:00,820 schema. It's interpreting the content and adding new structured insight. That's 181 00:08:00,820 --> 00:08:02,160 incredible. Not just 182 00:08:02,160 --> 00:08:07,060 pulling data, but categorizing or interpreting it based on a template. That really 183 00:08:07,060 --> 00:08:07,520 does sound 184 00:08:07,520 --> 00:08:12,630 like what you'd need for complex, reliable flows. Which brings us neatly to problem 185 00:08:12,630 --> 00:08:13,280 number two, 186 00:08:14,240 --> 00:08:20,080 why typical agents often fail in real-world production scenarios. Yes, exactly.
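The schema-as-contract idea, including the inferred difficulty field, can be sketched in TypeScript. The episode names Zod; to keep this example free of dependencies it swaps in a plain interface and a hand-written check instead, so `ExtractedTask` and `parseTask` are hypothetical names and the sample values are made up.

```typescript
// The shape we want back for each task. With Zod this would be a schema
// object; here it is a plain type plus a hand-written check.
interface ExtractedTask {
  title: string;
  description: string;
  difficulty: number; // 1 to 5, judged by the model rather than read off the page
}

// Accepts unknown input (for example, parsed model output) and either
// returns a well-typed task or throws. That is the "contract" part.
function parseTask(raw: unknown): ExtractedTask {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("expected an object");
  }
  const { title, description, difficulty } = raw as Record<string, unknown>;
  if (typeof title !== "string" || typeof description !== "string") {
    throw new Error("title and description must be strings");
  }
  if (typeof difficulty !== "number" || !(difficulty >= 1 && difficulty <= 5)) {
    throw new Error("difficulty must be a number from 1 to 5");
  }
  return { title, description, difficulty };
}

// Made-up model output: two fields copied from the page, one inferred.
const parsedTask = parseTask({
  title: "Use Magnitude",
  description: "Set up a first automation script",
  difficulty: 2,
});

// Output that breaks the contract is rejected instead of leaking downstream.
let rejected = false;
try {
  parseTask({ title: "Use Magnitude", description: "No valid rating", difficulty: 9 });
} catch (err) {
  rejected = true;
}
```

Whatever passes the check has a known shape, which is why the result can be handed to a database or another system without cleanup.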
The 187 00:08:20,080 --> 00:08:20,400 second 188 00:08:20,400 --> 00:08:24,330 big weakness of traditional automation, besides brittleness, is often a lack of 189 00:08:24,330 --> 00:08:25,200 real control and 190 00:08:25,200 --> 00:08:29,390 predictability. Many agents, especially some simpler ones you see, kind of follow 191 00:08:29,390 --> 00:08:30,240 this opaque 192 00:08:30,240 --> 00:08:34,240 loop. You give it a high-level prompt, it uses some tools, and it just tries until 193 00:08:34,240 --> 00:08:35,120 it thinks it's done. 194 00:08:35,120 --> 00:08:38,720 That might look impressive in a quick demo video, right? But what happens when a 195 00:08:38,720 --> 00:08:39,760 real website throws 196 00:08:39,760 --> 00:08:43,870 up an unexpected pop-up, or loads slowly, or hits you with a CAPTCHA? Those 197 00:08:43,870 --> 00:08:45,040 simple demo agents often 198 00:08:45,040 --> 00:08:48,690 just fall over unpredictably. They lack the fine-grained control needed for 199 00:08:48,690 --> 00:08:49,600 business-critical 200 00:08:49,600 --> 00:08:54,180 stuff. So, Magnitude tackles this by focusing on controllability and repeatability. 201 00:08:54,180 --> 00:08:55,120 How does that 202 00:08:55,120 --> 00:08:58,540 actually feel for the person writing the automation script? It gives the developer 203 00:08:58,540 --> 00:08:59,600 choices, flexible 204 00:08:59,600 --> 00:09:04,100 levels of abstraction. If you're feeling confident about a simple step, sure, give 205 00:09:04,100 --> 00:09:04,960 the agent a high 206 00:09:04,960 --> 00:09:09,490 level task like "complete the checkout", let it figure it out. But crucially, if you 207 00:09:09,490 --> 00:09:10,080 need rock 208 00:09:10,080 --> 00:09:13,860 solid reliability for a tricky part, you can break it down. You can tell it 209 00:09:13,860 --> 00:09:15,760 precisely. Okay, first, 210 00:09:15,760 --> 00:09:20,730 fill in field A with this text.
Now, wait until you visually see that the text is 211 00:09:20,730 --> 00:09:21,520 confirmed. 212 00:09:21,520 --> 00:09:25,950 Then, click the next step button, which should be around these coordinates. Oh, and 213 00:09:25,950 --> 00:09:26,400 if you see an 214 00:09:26,400 --> 00:09:30,770 error message pop up, then try doing action Z instead. Ah, okay. So, you're not 215 00:09:30,770 --> 00:09:31,440 just throwing 216 00:09:31,440 --> 00:09:35,870 a command into a black box and hoping for the best. You can guide it step by step 217 00:09:35,870 --> 00:09:36,720 when needed, 218 00:09:36,720 --> 00:09:41,280 handle errors predictably. Exactly right. That detailed control is essential for 219 00:09:41,280 --> 00:09:41,600 building 220 00:09:41,600 --> 00:09:45,320 automation you can actually trust in a production system. You can build in proper 221 00:09:45,320 --> 00:09:46,160 error handling, 222 00:09:46,160 --> 00:09:50,160 logging, auditing, all the things you need for serious applications. And for true 223 00:09:50,160 --> 00:09:51,120 repeatability, 224 00:09:51,120 --> 00:09:55,200 thinking especially about automated testing, the source mentioned something about 225 00:09:55,200 --> 00:09:56,000 deterministic 226 00:09:56,000 --> 00:10:01,380 runs via caching. That sounds important. Oh, hugely important. It's noted as in 227 00:10:01,380 --> 00:10:02,080 progress, 228 00:10:02,080 --> 00:10:06,240 but that's potentially a game changer for test suites. If you run the same test a 229 00:10:06,240 --> 00:10:07,200 thousand times, 230 00:10:07,200 --> 00:10:10,780 you need the exact same result a thousand times, assuming the website hasn't 231 00:10:10,780 --> 00:10:11,440 changed. 
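The step-by-step style described a moment ago (fill a field, confirm it visually, click, fall back on an error) can be sketched as follows. This is a scripted simulation, not Magnitude's API: `StepwiseAgent`, `SimulatedAgent`, and `runCheckoutStep` are hypothetical, the calls are synchronous to keep the example short, and `sees` stands in for a visual check.

```typescript
// Hypothetical step-by-step flow over a simulated page, so the sketch
// runs without a browser or a model.
interface StepwiseAgent {
  act(instruction: string): void;
  sees(description: string): boolean; // stand-in for a visual check
}

class SimulatedAgent implements StepwiseAgent {
  readonly log: string[] = [];
  private visible: string[] = [];
  private readonly failOnNext: boolean;

  constructor(failOnNext: boolean) {
    this.failOnNext = failOnNext;
  }

  act(instruction: string): void {
    this.log.push(instruction);
    if (instruction === "Fill in field A") {
      this.visible = ["field A filled"];
    } else if (instruction === "Click the next step button") {
      this.visible = this.failOnNext ? ["error message"] : ["step two"];
    } else if (instruction === "Do action Z") {
      this.visible = ["step two"];
    }
  }

  sees(description: string): boolean {
    return this.visible.indexOf(description) !== -1;
  }
}

// Each step is explicit, so a failure is handled where it happens.
function runCheckoutStep(agent: StepwiseAgent): string {
  agent.act("Fill in field A");
  if (!agent.sees("field A filled")) {
    return "aborted: field A was not confirmed";
  }
  agent.act("Click the next step button");
  if (agent.sees("error message")) {
    agent.act("Do action Z"); // fallback path
  }
  return agent.sees("step two") ? "ok" : "failed";
}

const happyPath = new SimulatedAgent(false);
const happyResult = runCheckoutStep(happyPath);

const errorPath = new SimulatedAgent(true);
const errorResult = runCheckoutStep(errorPath);
```

Because every instruction and check is written out, the log shows exactly which path ran, which is the kind of record that error handling and auditing depend on.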
232 00:10:11,440 --> 00:10:16,000 A native caching system would essentially stabilize certain visual interpretations 233 00:10:16,000 --> 00:10:20,970 or navigation choices the AI makes, ensuring that for a given input and website 234 00:10:20,970 --> 00:10:21,440 state, 235 00:10:21,440 --> 00:10:26,320 the outcome is perfectly predictable, truly deterministic. We should also touch on 236 00:10:26,320 --> 00:10:30,170 performance. This isn't just a cool idea, right? It's been benchmarked. It has, 237 00:10:30,170 --> 00:10:31,040 yeah. Magnitude 238 00:10:31,040 --> 00:10:34,320 performs very well. It's considered state of the art, actually. It scored an 239 00:10:34,320 --> 00:10:36,160 impressive 94% on the 240 00:10:36,160 --> 00:10:40,410 WebVoyager benchmark. And WebVoyager isn't trivial. It tests agents across a 241 00:10:40,410 --> 00:10:41,280 really wide 242 00:10:41,280 --> 00:10:46,460 range of complex, real-world web tasks. Getting a score that high is a strong 243 00:10:46,460 --> 00:10:47,120 signal that this 244 00:10:47,120 --> 00:10:51,010 approach is robust. Okay, but there's a catch, isn't there? A technical requirement. 245 00:10:51,010 --> 00:10:51,520 Being vision 246 00:10:51,520 --> 00:10:56,080 first means it needs serious AI muscle behind it. It won't run on my laptop's basic 247 00:10:56,080 --> 00:10:56,560 CPU. 248 00:10:56,560 --> 00:11:00,640 That's correct. Since it fundamentally relies on seeing and interpreting the screen 249 00:11:00,640 --> 00:11:01,280 visually, 250 00:11:01,280 --> 00:11:05,680 it needs a large, powerful, visually grounded model to do that interpretation. 251 00:11:05,680 --> 00:11:10,570 The documentation specifically recommends using Claude Sonnet 4 for the best results 252 00:11:10,570 --> 00:11:11,520 right now. 253 00:11:11,520 --> 00:11:15,930 Seems like it gives the highest quality visual understanding.
However, it's also 254 00:11:15,930 --> 00:11:16,720 compatible with 255 00:11:16,720 --> 00:11:22,490 open models, specifically mentioning Qwen 2.5 VL 72B. But yes, you need access to 256 00:11:22,490 --> 00:11:23,520 one of these 257 00:11:23,520 --> 00:11:27,840 quite sophisticated visual AI engines. That's what's doing the actual seeing. 258 00:11:27,840 --> 00:11:30,800 Right. That makes sense. So for our listeners who are thinking, 259 00:11:30,800 --> 00:11:34,560 okay, this sounds amazing. I want to try it. Maybe someone just starting out. 260 00:11:34,560 --> 00:11:38,000 What's the easiest way to dip their toes in? They've actually made the getting 261 00:11:38,000 --> 00:11:38,400 started 262 00:11:38,400 --> 00:11:42,880 process really smooth. To just create your first basic automation script, there's a 263 00:11:42,880 --> 00:11:43,840 simple command, 264 00:11:43,840 --> 00:11:49,280 npx create-magnitude-app. That one command sets up a new project, handles the 265 00:11:49,280 --> 00:11:50,160 configuration, 266 00:11:50,160 --> 00:11:53,920 and crucially, it drops in a working example script right away. So beginners get 267 00:11:53,920 --> 00:11:54,240 something 268 00:11:54,240 --> 00:11:57,680 tangible they can run and tinker with immediately. That's great. And what about 269 00:11:57,680 --> 00:11:58,400 developers who 270 00:11:58,400 --> 00:12:04,320 already have, say, a web app and want to use Magnitude for testing it? Maybe use 271 00:12:04,320 --> 00:12:04,880 those visual 272 00:12:04,880 --> 00:12:08,530 assertions. Yeah. If you're integrating it into an existing project, primarily for 273 00:12:08,530 --> 00:12:09,040 testing, 274 00:12:09,040 --> 00:12:13,760 the commands are slightly different. It's npm i --save-dev magnitude-test to install 275 00:12:13,760 --> 00:12:14,080 it as a 276 00:12:14,080 --> 00:12:19,600 development dependency, followed by npx magnitude init.
That init command sets up 277 00:12:19,600 --> 00:12:20,240 the necessary 278 00:12:20,240 --> 00:12:24,630 configuration files so you can start writing those reliable vision-based tests 279 00:12:24,630 --> 00:12:24,880 pretty much 280 00:12:24,880 --> 00:12:28,560 straight away. Okay. We have covered a lot of ground. We've seen how Magnitude aims 281 00:12:28,560 --> 00:12:28,720 to 282 00:12:28,720 --> 00:12:34,000 solve those two huge problems in browser automation. First, that brittleness issue 283 00:12:34,000 --> 00:12:38,360 by swapping numbered boxes for pixel coordinates and actual vision. And second, 284 00:12:38,360 --> 00:12:38,880 that lack of 285 00:12:38,880 --> 00:12:43,200 production-ready reliability by focusing on fine-grained controllability and 286 00:12:43,200 --> 00:12:43,600 ensuring 287 00:12:43,600 --> 00:12:48,320 structured output using things like Zod schemas. Yeah, the whole vision-first open-source 288 00:12:48,320 --> 00:12:48,720 approach 289 00:12:48,720 --> 00:12:51,770 really does feel like it could change the game for how we think about reliable 290 00:12:51,770 --> 00:12:52,640 automation and 291 00:12:52,640 --> 00:12:56,640 maybe even system integration. So here's the final thought to leave you with. If a 292 00:12:56,640 --> 00:12:57,760 tool can reliably 293 00:12:57,760 --> 00:13:02,420 see, understand, and interact with any visual interface, web page, desktop app, 294 00:13:02,420 --> 00:13:03,040 whatever, 295 00:13:03,040 --> 00:13:06,810 just using plain language, and it doesn't really care about the messy code 296 00:13:06,810 --> 00:13:07,600 underneath, 297 00:13:07,600 --> 00:13:12,830 what does that imply for the future? Does it maybe reduce the need for complex 298 00:13:12,830 --> 00:13:13,440 custom-built 299 00:13:13,440 --> 00:13:17,690 APIs for getting systems to talk to each other,
if you can just tell an agent to 300 00:13:17,690 --> 00:13:18,480 use the interface 301 00:13:18,480 --> 00:13:22,040 like a human would? That's definitely something worth mulling over. Thank you so 302 00:13:22,040 --> 00:13:22,560 much for joining 303 00:13:22,560 --> 00:13:26,720 us on this deep dive. Remember, this show is supported by SafeServer. SafeServer 304 00:13:26,720 --> 00:13:27,040 supports 305 00:13:27,040 --> 00:13:30,670 your digital transformation needs and handles software hosting, perfect for 306 00:13:30,670 --> 00:13:31,440 deploying advanced 307 00:13:31,440 --> 00:13:38,960 technologies like Magnitude. Find out more at www.safeserver.de