1 00:00:00,000 --> 00:00:03,740 Welcome to the Deep Dive. Before we jump in today, a quick thank you to our 2 00:00:03,740 --> 00:00:05,040 supporter, Safe Server. 3 00:00:05,040 --> 00:00:09,470 Safe Server handles software hosting, and they're really focused on supporting your 4 00:00:09,470 --> 00:00:10,820 digital transformation. 5 00:00:10,820 --> 00:00:13,820 So if you're looking for reliable hosting, you can check them out at 6 00:00:13,820 --> 00:00:18,320 www.safeserver.de. Like I said, today 7 00:00:18,320 --> 00:00:22,480 we're embarking on a mission that feels a bit, 8 00:00:22,480 --> 00:00:27,420 well, a bit sci-fi, maybe. It leans that way. We're looking at Elato AI. 9 00:00:27,460 --> 00:00:32,650 Right, and the core idea is taking sophisticated conversational AI, like really 10 00:00:32,650 --> 00:00:34,980 advanced stuff, and getting it out of the screen, 11 00:00:34,980 --> 00:00:40,130 out of your phone or computer, and into physical things, specifically toys, plushies 12 00:00:40,130 --> 00:00:41,560 even. It sounds simple, 13 00:00:41,560 --> 00:00:45,420 but the sources suggest this is way beyond just a talking toy. Exactly. 14 00:00:45,420 --> 00:00:48,900 We're gonna break down how they're trying to merge the hardware, the software, and 15 00:00:48,900 --> 00:00:51,840 these really distinct AI personalities. 16 00:00:51,840 --> 00:00:56,540 The goal seems to be making these interactions feel hyper-realistic. That's the 17 00:00:56,540 --> 00:00:57,100 hook, isn't it? 18 00:00:57,100 --> 00:01:00,500 The source material literally says they're giving plushies voices that feel 19 00:01:00,500 --> 00:01:01,620 ridiculously real. 20 00:01:01,620 --> 00:01:05,580 Yeah, it made me think of that movie Ted, the talking teddy bear, right? 21 00:01:05,580 --> 00:01:10,620 But imagine that powered by actual live, cutting-edge AI.
That's kind of wild, and 22 00:01:10,620 --> 00:01:11,700 that's what we're diving into. 23 00:01:11,700 --> 00:01:15,210 It's sort of the next step for digital companions, moving them into the physical 24 00:01:15,210 --> 00:01:18,200 world. Okay, let's start with the basics, the hardware. 25 00:01:18,200 --> 00:01:23,220 For anyone maybe new to this kind of tech, what is the Elato device, 26 00:01:23,980 --> 00:01:29,810 physically? Well, at its heart it's a small gadget, an IoT client, technically. It's 27 00:01:29,810 --> 00:01:30,580 got the microphone, 28 00:01:30,580 --> 00:01:35,330 it's got the speaker, but the clever part is how it attaches. Okay. It uses two 29 00:01:35,330 --> 00:01:36,780 simple silicone straps, 30 00:01:36,780 --> 00:01:40,900 so you can clip it onto pretty much any toy you already have. Oh, right. 31 00:01:40,900 --> 00:01:45,460 So you don't need to buy their specific toy. Nope. That old teddy bear in the attic? 32 00:01:45,460 --> 00:01:49,400 Yeah, suddenly it can have, you know, a brain and a voice. That flexibility seems 33 00:01:49,400 --> 00:01:50,220 like a big deal. 34 00:01:50,220 --> 00:01:55,430 It really is. And the setup sounds incredibly simple, aimed at, well, anyone. No tech 35 00:01:55,430 --> 00:01:56,540 skills needed. 36 00:01:56,540 --> 00:02:00,980 How simple are we talking? Like, three steps simple. First, clip the device onto the 37 00:02:00,980 --> 00:02:01,460 toy. 38 00:02:01,460 --> 00:02:06,200 Okay. Second, connect it to your home Wi-Fi. It uses what's called a captive portal. 39 00:02:06,200 --> 00:02:09,760 Uh-huh, like when you connect at a hotel. Exactly that. It makes its own little 40 00:02:09,760 --> 00:02:10,100 network 41 00:02:10,100 --> 00:02:14,670 temporarily to guide you. Super easy. And third, pick a character personality from 42 00:02:14,670 --> 00:02:17,380 their list and just start talking to it. 43 00:02:17,380 --> 00:02:21,670 Wow, okay. Now, I saw there are actually two different products mentioned.
Yeah, they're 44 00:02:21,670 --> 00:02:22,900 catering to slightly different people. 45 00:02:22,900 --> 00:02:26,960 There's the main AI device. That's the consumer one, right? Pre-order price 46 00:02:26,960 --> 00:02:29,500 mentioned was $69. That gets you the device, 47 00:02:29,500 --> 00:02:34,080 access to all the AI characters, unlimited apparently, and a free month of their 48 00:02:34,080 --> 00:02:35,500 premium subscription. 49 00:02:35,500 --> 00:02:38,800 It's the plug-and-play version. And the other one, for tinkerers? 50 00:02:38,800 --> 00:02:42,980 That's the AI dev kit. A bit cheaper, $59 on pre-order. 51 00:02:42,980 --> 00:02:47,340 This one's really for developers, makers, people who want to mess around with it. How 52 00:02:47,340 --> 00:02:48,980 so? It has open source firmware, 53 00:02:48,980 --> 00:02:54,480 runs over a standard USB-C connection, and lets you load your own custom voices or 54 00:02:54,480 --> 00:02:55,900 even your own AI models 55 00:02:55,900 --> 00:02:59,950 if you want. Much more flexible if you're technically inclined. Gotcha. And practical 56 00:02:59,950 --> 00:03:00,340 things. 57 00:03:00,340 --> 00:03:04,980 Battery life. Is this thing always plugged in? Apparently not. They claim a week of 58 00:03:04,980 --> 00:03:08,100 battery life, which is pretty good. Makes it actually portable. 59 00:03:08,100 --> 00:03:12,310 Yeah, that's essential if it's meant to be a companion. And I saw something about 60 00:03:12,310 --> 00:03:15,420 community support. Uh-huh, over 1,200 stars, 61 00:03:15,420 --> 00:03:19,660 they said, which suggests there's already a decent buzz around it. People are 62 00:03:19,660 --> 00:03:22,300 interested. That early engagement is usually a good sign. 63 00:03:22,300 --> 00:03:26,090 Definitely shows people are intrigued by the idea. Yeah, and maybe even want to 64 00:03:26,090 --> 00:03:27,620 build on it themselves. Okay.
65 00:03:27,620 --> 00:03:32,650 Let's shift gears. The hardware is neat, but the sources really emphasize the 66 00:03:32,650 --> 00:03:34,320 personalities, the who. 67 00:03:34,980 --> 00:03:39,290 This seems to be where Elato really tries to stand out. Absolutely. This isn't 68 00:03:39,290 --> 00:03:40,620 just about making a toy talk. 69 00:03:40,620 --> 00:03:44,900 It's about giving it a very specific, often complex character. 70 00:03:44,900 --> 00:03:49,390 They mentioned over a hundred AI characters available. A hundred? And they're not 71 00:03:49,390 --> 00:03:50,860 just slight variations? 72 00:03:50,860 --> 00:03:54,940 Yeah, not at all. The examples they give are incredibly diverse. They seem to be 73 00:03:54,940 --> 00:03:57,860 leaning into strong personalities, even flawed ones. 74 00:03:57,860 --> 00:04:01,830 Not just a helpful assistant. Okay, give us some examples. What kind of range are we 75 00:04:01,830 --> 00:04:03,480 talking? Well, you've got the comforting, 76 00:04:04,180 --> 00:04:05,620 nostalgic types, 77 00:04:05,620 --> 00:04:10,730 like Dottie Mae. Dottie Mae? Described as a classic Southern diner waitress. Uses 78 00:04:10,730 --> 00:04:12,700 terms like hun, sweetie, 79 00:04:12,700 --> 00:04:15,220 gives unsolicited advice, 80 00:04:15,220 --> 00:04:17,260 recommends the pie. 81 00:04:17,260 --> 00:04:22,580 Pure comfort food in voice form, basically. Ah, okay, so that's one end. What about 82 00:04:22,580 --> 00:04:25,940 the other end?
Oh, they go there. Dramatic, flamboyant characters. 83 00:04:25,940 --> 00:04:30,220 There's Captain Star Flash, a super overconfident space captain who thinks lasers 84 00:04:30,220 --> 00:04:32,180 solve everything. Right. Or Dr. 85 00:04:32,180 --> 00:04:36,620 Voltanus, the classic mad scientist. Full of manic energy, apparently shouts catchphrases. 86 00:04:36,620 --> 00:04:38,640 Think loud thunder effect. 87 00:04:38,640 --> 00:04:42,860 So you could clip this onto, like, a superhero toy or something. Exactly. Or maybe 88 00:04:42,860 --> 00:04:45,740 something completely incongruous for comedic effect. 89 00:04:45,740 --> 00:04:48,540 And what about more thoughtful characters? Yep. 90 00:04:48,540 --> 00:04:53,020 They mentioned Paradox Pithius, an ancient Greek philosopher type. Sounds wise. Wise, 91 00:04:53,020 --> 00:04:54,660 but also apparently kind of smug. 92 00:04:54,660 --> 00:04:56,660 He answers your deep questions 93 00:04:56,960 --> 00:05:02,100 with even deeper, possibly more annoying questions. Makes you think, but maybe grinds 94 00:05:02,100 --> 00:05:03,560 your gears a little. Okay. 95 00:05:03,560 --> 00:05:07,540 This is where that uncensored aspect might come in too, right? The comedy. Take Sugar Plum. 96 00:05:07,540 --> 00:05:08,540 The description is fascinating. 97 00:05:08,540 --> 00:05:13,180 Speaks in a super sweet, bubbly, childlike voice. Sounds innocent. 98 00:05:13,180 --> 00:05:18,600 But apparently drops comments so dark it makes Satan clutch his pearls. Whoa. 99 00:05:18,600 --> 00:05:23,400 Okay, that's a choice. It's intentional friction, right? That contrast creates shock 100 00:05:23,400 --> 00:05:25,220 value, makes it memorable. 101 00:05:25,740 --> 00:05:29,650 It's not trying to be bland. And they seem to lean into existing pop culture stuff, 102 00:05:29,650 --> 00:05:30,940 too. I saw Ted mentioned. 103 00:05:30,940 --> 00:05:35,060 Yeah, Ted the inappropriate teddy.
Yeah, clearly referencing the movie character. 104 00:05:35,060 --> 00:05:37,000 Boston accent, barfly mouth. 105 00:05:37,000 --> 00:05:42,100 Can you imagine where that goes? Uncensored indeed. Any other specific types? Loads. 106 00:05:42,100 --> 00:05:47,660 They mentioned Mikey Sally Sullivan, hardcore Boston guy, swearing, rants. Then there's 107 00:05:47,660 --> 00:05:49,260 the proper British lad. 108 00:05:49,260 --> 00:05:53,350 What's his deal? Judges your tea-making skills, apologizes constantly if you bump 109 00:05:53,350 --> 00:05:55,660 into him. Very specific cultural niche. 110 00:05:55,660 --> 00:05:59,470 It seems like they're aiming for very defined archetypes. Totally. And it's not just 111 00:05:59,470 --> 00:06:00,420 comedy or stereotypes. 112 00:06:00,420 --> 00:06:04,870 They even list Zohran Mamdani, the political activist. Yeah, described as empathetic, 113 00:06:04,870 --> 00:06:07,260 focused on social equity and justice. 114 00:06:07,260 --> 00:06:11,620 So the range covers serious and specific viewpoints too, not just jokes. 115 00:06:11,620 --> 00:06:14,020 So the strategy isn't just make a friend. 116 00:06:14,020 --> 00:06:18,420 It's pick a very specific, memorable character. Exactly. Depth and distinctiveness 117 00:06:18,420 --> 00:06:21,520 over just being generally agreeable. 118 00:06:21,520 --> 00:06:26,210 You clip it on, you get that personality fully formed. Which brings us neatly to the 119 00:06:26,210 --> 00:06:28,880 how. We know the what, the device. We know 120 00:06:28,880 --> 00:06:31,140 these wild personalities. 121 00:06:31,140 --> 00:06:35,720 How does the tech actually pull this off in real time, making a toy have a 122 00:06:35,720 --> 00:06:38,620 continuous, natural conversation globally? 123 00:06:38,620 --> 00:06:43,320 That sounds hard. It is hard.
The key seems to be what the source calls real-time 124 00:06:43,320 --> 00:06:45,020 speech-to-speech conversion. 125 00:06:45,020 --> 00:06:50,180 We're talking potentially up to 15 minutes of uninterrupted chat. 15 minutes? Wow. 126 00:06:50,260 --> 00:06:54,770 How? They use what the source referred to as a brain trust. They're not relying on 127 00:06:54,770 --> 00:06:55,660 just one AI model. 128 00:06:55,660 --> 00:06:59,380 Oh, okay. So they're pulling from multiple sources. Which ones? Yeah, it's quite a 129 00:06:59,380 --> 00:07:00,700 list of the big names right now. 130 00:07:00,700 --> 00:07:02,680 OpenAI's Realtime API, 131 00:07:02,680 --> 00:07:09,040 Google's Gemini Live API, ElevenLabs AI agents, and also Hume AI EVI 4. Four different 132 00:07:09,040 --> 00:07:09,260 ones. 133 00:07:09,260 --> 00:07:13,490 Why so many? Wouldn't that be complicated? It probably is, but the idea is that each 134 00:07:13,490 --> 00:07:14,540 model has strengths. 135 00:07:14,540 --> 00:07:18,660 Maybe one is faster, one sounds more natural, one is better at catching emotional 136 00:07:18,660 --> 00:07:18,860 cues. 137 00:07:19,260 --> 00:07:22,940 By using several, they can kind of pick the best tool for the job for each part of 138 00:07:22,940 --> 00:07:25,160 the conversation, or blend them. 139 00:07:25,160 --> 00:07:30,180 It helps keep the latency low and the quality high. Like hedging your bets. That 140 00:07:30,180 --> 00:07:32,000 makes sense. Redundancy and optimization. 141 00:07:32,000 --> 00:07:36,500 Okay, for someone listening who isn't a developer, can we simplify the architecture?
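The "pick the best tool for the job" idea can be sketched roughly as below. This is only an illustration of choosing among several speech providers by priority, not how the project actually routes requests. The provider names, the `Provider` fields, and every number are hypothetical placeholders rather than measurements of any real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    latency_ms: int     # hypothetical typical response latency
    naturalness: float  # hypothetical 0..1 voice-quality score
    emotion: float      # hypothetical 0..1 emotional-awareness score


# Placeholder providers with made-up scores, for illustration only.
PROVIDERS = [
    Provider("provider_a", latency_ms=300, naturalness=0.70, emotion=0.60),
    Provider("provider_b", latency_ms=500, naturalness=0.95, emotion=0.50),
    Provider("provider_c", latency_ms=450, naturalness=0.80, emotion=0.90),
]


def pick_provider(priority: str, providers=PROVIDERS) -> Provider:
    """Return the provider that scores best on the requested priority."""
    if priority == "speed":
        return min(providers, key=lambda p: p.latency_ms)
    if priority == "naturalness":
        return max(providers, key=lambda p: p.naturalness)
    if priority == "emotion":
        return max(providers, key=lambda p: p.emotion)
    raise ValueError(f"unknown priority: {priority!r}")
```

A character that depends on comic timing might call `pick_provider("speed")`, while a comforting one might prefer `pick_provider("emotion")`.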
142 00:07:36,500 --> 00:07:37,860 You mentioned a triangle earlier. 143 00:07:37,860 --> 00:07:41,660 Yeah, think of it as three core pieces working together really fast. First, 144 00:07:41,660 --> 00:07:45,600 you've got the device itself, the IoT client, that ESP32 thing 145 00:07:45,600 --> 00:07:49,190 we talked about, clipped to the toy. It just captures your voice and plays the AI's 146 00:07:49,190 --> 00:07:52,180 voice. Sends the audio securely using WebSockets. 147 00:07:52,180 --> 00:07:57,380 Okay, piece one, the ears and mouth on the toy. Exactly. Piece two is the edge server. 148 00:07:57,380 --> 00:08:01,440 This runs on something called Deno. Think of it as the super fast traffic controller 149 00:08:01,440 --> 00:08:02,560 or router. By edge, 150 00:08:02,560 --> 00:08:06,990 it means it's located geographically close to you and also close to the big AI 151 00:08:06,990 --> 00:08:07,520 models. 152 00:08:07,520 --> 00:08:10,380 Its whole job is to grab the audio from the toy, 153 00:08:10,900 --> 00:08:15,430 instantly fire it off to the right AI service, like Gemini or ElevenLabs, get the 154 00:08:15,430 --> 00:08:18,480 response back, and zap it straight to the toy's speaker. 155 00:08:18,480 --> 00:08:25,150 Minimizes delay. Got it, the middleman ensuring speed. And the third piece? That's the 156 00:08:25,150 --> 00:08:25,680 front end. 157 00:08:25,680 --> 00:08:30,490 Basically the website or app you use, built with Next.js. This is where you choose 158 00:08:30,490 --> 00:08:31,120 your characters, 159 00:08:31,120 --> 00:08:34,490 maybe create custom ones, adjust the volume, that kind of thing. Ah, and I saw you 160 00:08:34,490 --> 00:08:35,420 can tweak the pitch. 161 00:08:35,440 --> 00:08:39,250 Yeah, the pitch factor. So you could take a serious character's voice and make it 162 00:08:39,250 --> 00:08:42,040 sound high-pitched and cartoonish if you wanted. More 163 00:08:42,040 --> 00:08:46,010 customization. Okay.
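The device, edge relay, and AI service flow described here can be sketched as a toy in-memory simulation. This is not the project's code: asyncio queues stand in for the WebSocket connection, and `fake_ai_service`, `edge_relay`, and `device_session` are hypothetical names invented for illustration.

```python
import asyncio


async def fake_ai_service(audio_chunk: bytes) -> bytes:
    """Stand-in for a speech-to-speech provider; it only tags the chunk."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return b"reply:" + audio_chunk


async def edge_relay(from_device: asyncio.Queue, to_device: asyncio.Queue) -> None:
    """Forward each chunk from the device to the AI service and relay the reply."""
    while True:
        chunk = await from_device.get()
        if chunk is None:  # sentinel: the device closed the connection
            await to_device.put(None)
            return
        await to_device.put(await fake_ai_service(chunk))


async def device_session(chunks: list) -> list:
    """Simulate the toy: send microphone chunks, collect speaker output."""
    uplink: asyncio.Queue = asyncio.Queue()    # device -> edge (assumed WebSocket)
    downlink: asyncio.Queue = asyncio.Queue()  # edge -> device
    relay = asyncio.create_task(edge_relay(uplink, downlink))
    for chunk in chunks:
        await uplink.put(chunk)
    await uplink.put(None)
    played = []
    while True:
        reply = await downlink.get()
        if reply is None:
            break
        played.append(reply)
    await relay
    return played
```

Running `asyncio.run(device_session([b"hello", b"world"]))` returns the relayed replies in order. In a real deployment the relay would sit close to both the user and the provider, which is where the latency benefit the hosts describe would come from.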
So the whole thing relies on speed. If there's a big delay, it 164 00:08:46,010 --> 00:08:48,040 ruins the illusion of conversation. 165 00:08:48,040 --> 00:08:51,570 What kind of performance are they claiming? The numbers are pretty impressive, 166 00:08:51,570 --> 00:08:52,800 especially for a global system. 167 00:08:52,800 --> 00:08:58,640 They're aiming for under two seconds round-trip latency. Under two seconds from you 168 00:08:58,640 --> 00:09:00,560 speaking to hearing the reply? 169 00:09:00,560 --> 00:09:05,070 Yeah, which is generally fast enough to feel pretty conversational, not like a walkie-talkie. 170 00:09:05,070 --> 00:09:07,900 And the audio quality? Does it sound clear? 171 00:09:07,900 --> 00:09:11,600 They mentioned using the Opus codec at 12 kbps, 172 00:09:11,600 --> 00:09:17,680 which in non-technical terms means it should sound pretty clear and crisp even 173 00:09:17,680 --> 00:09:19,560 though they're keeping the data rate low for speed. 174 00:09:19,560 --> 00:09:24,100 Okay, one more tech thing. How does it know when I've finished talking? Do I have 175 00:09:24,100 --> 00:09:26,420 to press a button? No, and that's crucial. 176 00:09:26,420 --> 00:09:31,640 They use something called server VAD, voice activity detection. Server VAD? 177 00:09:31,640 --> 00:09:36,360 Right, instead of the little device trying to guess, the powerful server analyzes 178 00:09:36,360 --> 00:09:38,360 the audio stream in real time. 179 00:09:38,360 --> 00:09:42,930 It figures out precisely when you've naturally paused or finished speaking. Ah, so 180 00:09:42,930 --> 00:09:44,600 it makes turn-taking much smoother. 181 00:09:44,600 --> 00:09:50,380 Exactly, less awkward silence, fewer interruptions, key for making it feel real. 182 00:09:50,380 --> 00:09:52,760 Plus they mentioned OTA updates. Over the air?
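To make the turn-taking idea concrete, here is a minimal sketch of end-of-turn detection. It uses a simple energy threshold as a stand-in; production server VAD is typically more sophisticated, and nothing here reflects the project's or any provider's actual algorithm. The function names, the threshold of 500, and the three-frame "hangover" are all assumptions chosen for illustration.

```python
import math
from typing import Optional, Sequence


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_of_turn_index(
    frames: Sequence[Sequence[int]],
    threshold: float = 500.0,  # assumed loudness cutoff for "speech"
    hangover_frames: int = 3,  # assumed quiet frames needed to end a turn
) -> Optional[int]:
    """Return the index of the frame where the turn is judged finished.

    The turn ends at the hangover_frames-th consecutive quiet frame after
    speech has been heard. Returns None if the turn has not ended yet.
    """
    heard_speech = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            heard_speech = True
            quiet_run = 0  # a short pause was not the end of the turn
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                return i
    return None
```

The hangover is what keeps a brief mid-sentence pause from being treated as the end of a turn. Running this on the server rather than on the device matches the hosts' point that the small client only has to stream audio.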
183 00:09:52,760 --> 00:09:56,620 Yeah, it means the software on the device can be updated automatically over Wi-Fi, 184 00:09:56,620 --> 00:10:00,540 so it can get better over time without you needing to plug it into a computer. Okay, 185 00:10:00,540 --> 00:10:01,320 so putting it all together, 186 00:10:01,320 --> 00:10:06,280 it's quite an ambitious project. Merging these very specific, sometimes wild 187 00:10:06,280 --> 00:10:09,320 personalities with hardware that enables smooth, fast 188 00:10:09,320 --> 00:10:14,140 conversation. It really is. The big takeaway seems to be shifting AI interaction 189 00:10:14,140 --> 00:10:15,920 away from just typing in a box. And 190 00:10:16,440 --> 00:10:21,620 into a physical object you can actually talk with. Like, really talk with. Whether 191 00:10:21,620 --> 00:10:24,240 you want that companion to be a nurturing waitress like Dottie Mae or 192 00:10:24,240 --> 00:10:29,900 a sarcastic philosopher or an inappropriate teddy bear. Right, it's that 193 00:10:29,900 --> 00:10:32,160 customization delivered through a physical form. 194 00:10:32,160 --> 00:10:37,380 So the final thought for you listening: the source emphasizes this device has no 195 00:10:37,380 --> 00:10:39,140 filters, no rules. 196 00:10:39,140 --> 00:10:44,180 We have the tech now to give an innocent-looking plushie a voice that could be, 197 00:10:44,180 --> 00:10:44,300 well, 198 00:10:44,680 --> 00:10:48,870 deliberately offensive like Ted, or shockingly dark like Sugar Plum, or maybe even 199 00:10:48,870 --> 00:10:50,140 politically charged. 200 00:10:50,140 --> 00:10:55,000 If digital companionship becomes totally personalized and unrestrained, what does 201 00:10:55,000 --> 00:10:55,480 that mean? 202 00:10:55,480 --> 00:10:59,300 What happens when we start designing companions not to be helpful or polite, but 203 00:10:59,300 --> 00:10:59,680 maybe 204 00:10:59,680 --> 00:11:01,680 unhinged? 205 00:11:01,680 --> 00:11:04,520 Provocative.
Something to think about as this tech develops. 206 00:11:04,520 --> 00:11:07,730 Well, that's all we have time for on this deep dive and thanks again to our 207 00:11:07,730 --> 00:11:08,840 supporters Safe Server. 208 00:11:08,840 --> 00:11:12,720 Remember, they handle software hosting and support digital transformation. 209 00:11:12,720 --> 00:11:18,230 You can find out more at www.safeserver.de. Until next time, keep digging into the 210 00:11:18,230 --> 00:11:18,680 sources.