1
00:00:00,000 --> 00:00:03,740
Welcome to the Deep Dive. Before we jump in today, a quick thank you to our

2
00:00:03,740 --> 00:00:05,040
supporter, Safe Server.

3
00:00:05,040 --> 00:00:09,470
Safe Server handles software hosting, and they're really focused on supporting your

4
00:00:09,470 --> 00:00:10,820
digital transformation.

5
00:00:10,820 --> 00:00:13,820
So if you're looking for reliable hosting, you can check them out at

6
00:00:13,820 --> 00:00:18,320
www.safeserver.de. Like I said, today

7
00:00:18,320 --> 00:00:22,480
We're embarking on a mission that feels a bit

8
00:00:22,480 --> 00:00:27,420
Well, a bit sci-fi maybe, doesn't it? It leans that way. We're looking at Elato AI.

9
00:00:27,460 --> 00:00:32,650
Right, and the core idea is taking sophisticated conversational AI, like really

10
00:00:32,650 --> 00:00:34,980
advanced stuff, and getting it out of the screen,

11
00:00:34,980 --> 00:00:40,130
out of your phone or computer and into physical things, specifically toys, plushies

12
00:00:40,130 --> 00:00:41,560
even. It sounds simple,

13
00:00:41,560 --> 00:00:45,420
but the sources suggest this is way beyond just a talking toy. Exactly.

14
00:00:45,420 --> 00:00:48,900
We're gonna break down how they're trying to merge the hardware, the software, and

15
00:00:48,900 --> 00:00:51,840
these really distinct AI personalities.

16
00:00:51,840 --> 00:00:56,540
The goal seems to be making these interactions feel hyper-realistic. That's the

17
00:00:56,540 --> 00:00:57,100
hook, isn't it?

18
00:00:57,100 --> 00:01:00,500
The source material literally says they're giving plushies voices that feel

19
00:01:00,500 --> 00:01:01,620
ridiculously real.

20
00:01:01,620 --> 00:01:05,580
Yeah, it made me think of that movie Ted, the talking teddy bear, right?

21
00:01:05,580 --> 00:01:10,620
But imagine that powered by actual, live, cutting-edge AI. That's kind of wild, and

22
00:01:10,620 --> 00:01:11,700
that's what we're diving into.

23
00:01:11,700 --> 00:01:15,210
It's sort of the next step for digital companions, moving them into the physical

24
00:01:15,210 --> 00:01:18,200
world. Okay, let's start with the basics: the hardware.

25
00:01:18,200 --> 00:01:23,220
For anyone maybe new to this kind of tech, what is the Elato device?

26
00:01:23,980 --> 00:01:29,810
Physically? Well, at its heart, it's a small gadget, an IoT client, technically. It's

27
00:01:29,810 --> 00:01:30,580
got the microphone,

28
00:01:30,580 --> 00:01:35,330
it's got the speaker, but the clever part is how it attaches. Okay. It uses two

29
00:01:35,330 --> 00:01:36,780
simple silicone straps,

30
00:01:36,780 --> 00:01:40,900
so you can clip it onto pretty much any toy you already have. Oh, right.

31
00:01:40,900 --> 00:01:45,460
So you don't need to buy their specific toy. Nope. That old teddy bear in the attic?

32
00:01:45,460 --> 00:01:49,400
Yeah, suddenly it can have, you know, a brain and a voice. That flexibility seems

33
00:01:49,400 --> 00:01:50,220
like a big deal.

34
00:01:50,220 --> 00:01:55,430
It really is, and the setup sounds incredibly simple, aimed at, well, anyone. No tech

35
00:01:55,430 --> 00:01:56,540
skills needed.

36
00:01:56,540 --> 00:02:00,980
How simple are we talking? Like, three-steps simple. First, clip the device onto the

37
00:02:00,980 --> 00:02:01,460
toy.

38
00:02:01,460 --> 00:02:06,200
Okay. Second, connect it to your home Wi-Fi. It uses what's called a captive portal.

39
00:02:06,200 --> 00:02:09,760
Uh-huh, like when you connect at a hotel? Exactly that. It makes its own little

40
00:02:09,760 --> 00:02:10,100
network

41
00:02:10,100 --> 00:02:14,670
temporarily to guide you. Super easy. And third, pick a character personality from

42
00:02:14,670 --> 00:02:17,380
their list and just start talking to it.

43
00:02:17,380 --> 00:02:21,670
Wow, okay. Now, I saw there are actually two different products mentioned. Yeah, they're

44
00:02:21,670 --> 00:02:22,900
catering to slightly different people.

45
00:02:22,900 --> 00:02:26,960
There's the main AI device. That's the consumer one, right? Pre-order price

46
00:02:26,960 --> 00:02:29,500
mentioned was $69. That gets you the device,

47
00:02:29,500 --> 00:02:34,080
access to all the AI characters, unlimited apparently, and a free month of their

48
00:02:34,080 --> 00:02:35,500
premium subscription.

49
00:02:35,500 --> 00:02:38,800
It's the plug-and-play version. And the other one, for tinkerers?

50
00:02:38,800 --> 00:02:42,980
That's the AI dev kit, a bit cheaper, $59 on pre-order.

51
00:02:42,980 --> 00:02:47,340
This one's really for developers, makers, people who want to mess around with it. How

52
00:02:47,340 --> 00:02:48,980
so? It has open-source firmware,

53
00:02:48,980 --> 00:02:54,480
runs over a standard USB-C connection, and lets you load your own custom voices or

54
00:02:54,480 --> 00:02:55,900
even your own AI models

55
00:02:55,900 --> 00:02:59,950
if you want. Much more flexible if you're technically inclined. Gotcha. And practical

56
00:02:59,950 --> 00:03:00,340
things:

57
00:03:00,340 --> 00:03:04,980
battery life. Is this thing always plugged in? Apparently not. They claim a week of

58
00:03:04,980 --> 00:03:08,100
battery life, which is pretty good. Makes it actually portable.

59
00:03:08,100 --> 00:03:12,310
Yeah, that's essential if it's meant to be a companion. And I saw something about

60
00:03:12,310 --> 00:03:15,420
community support. Uh-huh, over 1,200 stars,

61
00:03:15,420 --> 00:03:19,660
they said, which suggests there's already a decent buzz around it. People are

62
00:03:19,660 --> 00:03:22,300
interested. That early engagement is usually a good sign.

63
00:03:22,300 --> 00:03:26,090
Definitely shows people are intrigued by the idea, yeah, and maybe even want to

64
00:03:26,090 --> 00:03:27,620
build on it themselves. Okay,

65
00:03:27,620 --> 00:03:32,650
let's shift gears. The hardware is neat, but the sources really emphasize the

66
00:03:32,650 --> 00:03:34,320
personalities, the who.

67
00:03:34,980 --> 00:03:39,290
This seems to be where Elato really tries to stand out. Absolutely. This isn't

68
00:03:39,290 --> 00:03:40,620
just about making a toy talk.

69
00:03:40,620 --> 00:03:44,900
It's about giving it a very specific, often complex character.

70
00:03:44,900 --> 00:03:49,390
They mentioned over a hundred AI characters available. A hundred? And they're not

71
00:03:49,390 --> 00:03:50,860
just slight variations?

72
00:03:50,860 --> 00:03:54,940
Yeah, not at all. The examples they give are incredibly diverse. They seem to be

73
00:03:54,940 --> 00:03:57,860
leaning into strong personalities, even flawed ones,

74
00:03:57,860 --> 00:04:01,830
not just helpful assistants. Okay, give us some examples. What kind of range are we

75
00:04:01,830 --> 00:04:03,480
talking? Well, you've got the comforting,

76
00:04:04,180 --> 00:04:05,620
nostalgic types,

77
00:04:05,620 --> 00:04:10,730
like Dottie Mae. Dottie Mae? Described as a classic Southern diner waitress. Uses

78
00:04:10,730 --> 00:04:12,700
terms like "hun," "sweetie,"

79
00:04:12,700 --> 00:04:15,220
gives unsolicited advice,

80
00:04:15,220 --> 00:04:17,260
recommends the pie.

81
00:04:17,260 --> 00:04:22,580
Pure comfort food in voice form, basically. Ah, okay, so that's one end. What about

82
00:04:22,580 --> 00:04:25,940
the other end? Oh, they go there. Dramatic, flamboyant characters.

83
00:04:25,940 --> 00:04:30,220
There's Captain Star Flash, a super overconfident space captain who thinks lasers

84
00:04:30,220 --> 00:04:32,180
solve everything. Right. Or Dr.

85
00:04:32,180 --> 00:04:36,620
Voltanus, the classic mad scientist, full of manic energy. Apparently shouts catchphrases,

86
00:04:36,620 --> 00:04:38,640
think loud thunder effects.

87
00:04:38,640 --> 00:04:42,860
So you could clip this onto, like, a superhero toy or something? Exactly. Or maybe

88
00:04:42,860 --> 00:04:45,740
something completely incongruous, for comedic effect.

89
00:04:45,740 --> 00:04:48,540
And what about more thoughtful characters? Yep.

90
00:04:48,540 --> 00:04:53,020
They mentioned Paradox Pithius, an ancient Greek philosopher type. Sounds wise. Wise,

91
00:04:53,020 --> 00:04:54,660
but also apparently kind of smug.

92
00:04:54,660 --> 00:04:56,660
He answers your deep questions

93
00:04:56,960 --> 00:05:02,100
with even deeper, possibly more annoying questions. Makes you think, but maybe grinds

94
00:05:02,100 --> 00:05:03,560
your gears a little. Okay.

95
00:05:03,560 --> 00:05:07,540
This is where that uncensored aspect might come in too, right? The comedy. Take Sugar Plum.

96
00:05:07,540 --> 00:05:08,540
The description is fascinating.

97
00:05:08,540 --> 00:05:13,180
Speaks in a super sweet, bubbly, childlike voice. Sounds innocent,

98
00:05:13,180 --> 00:05:18,600
but apparently drops comments so dark it makes Satan clutch his pearls. Whoa.

99
00:05:18,600 --> 00:05:23,400
Okay, that's a choice. It's intentional friction, right? That contrast creates shock

100
00:05:23,400 --> 00:05:25,220
value, makes it memorable.

101
00:05:25,740 --> 00:05:29,650
It's not trying to be bland. And they seem to lean into existing pop culture stuff,

102
00:05:29,650 --> 00:05:30,940
too. I saw Ted mentioned.

103
00:05:30,940 --> 00:05:35,060
Yeah, Ted the inappropriate teddy. Yeah, clearly referencing the movie character.

104
00:05:35,060 --> 00:05:37,000
Boston accent, barfly mouth.

105
00:05:37,000 --> 00:05:42,100
Can you imagine where that goes? Uncensored indeed. Any other specific types? Loads.

106
00:05:42,100 --> 00:05:47,660
They mentioned Mikey "Sully" Sullivan, hardcore Boston guy, swearing, rants. Then there's

107
00:05:47,660 --> 00:05:49,260
the proper British lad.

108
00:05:49,260 --> 00:05:53,350
What's his deal? Judges your tea-making skills, apologizes constantly if you bump

109
00:05:53,350 --> 00:05:55,660
into him. Very specific cultural niche.

110
00:05:55,660 --> 00:05:59,470
It seems like they're aiming for very defined archetypes. Totally. And it's not just

111
00:05:59,470 --> 00:06:00,420
comedy or stereotypes.

112
00:06:00,420 --> 00:06:04,870
They even list Zohran Mamdani, the political activist. Yeah, described as empathetic,

113
00:06:04,870 --> 00:06:07,260
focused on social equity and justice.

114
00:06:07,260 --> 00:06:11,620
So the range covers serious and specific viewpoints too, not just jokes.

115
00:06:11,620 --> 00:06:14,020
So the strategy isn't just "make a friend."

116
00:06:14,020 --> 00:06:18,420
It's "pick a very specific, memorable character." Exactly. Depth and distinctiveness

117
00:06:18,420 --> 00:06:21,520
over just being generally agreeable.

118
00:06:21,520 --> 00:06:26,210
You clip it on, you get that personality fully formed. Which brings us neatly to the

119
00:06:26,210 --> 00:06:28,880
how. We know the what, the device. We know

120
00:06:28,880 --> 00:06:31,140
these wild personalities.

121
00:06:31,140 --> 00:06:35,720
How does the tech actually pull this off in real time, making a toy have a

122
00:06:35,720 --> 00:06:38,620
continuous, natural conversation, globally?

123
00:06:38,620 --> 00:06:43,320
That sounds hard. It is hard. The key seems to be what the source calls real-time

124
00:06:43,320 --> 00:06:45,020
speech-to-speech conversion.

125
00:06:45,020 --> 00:06:50,180
We're talking potentially up to 15 minutes of uninterrupted chat. 15 minutes? Wow.

126
00:06:50,260 --> 00:06:54,770
How? They use what the source referred to as a brain trust. They're not relying on

127
00:06:54,770 --> 00:06:55,660
just one AI model.

128
00:06:55,660 --> 00:06:59,380
Oh, okay. So they're pulling from multiple sources. Which ones? Yeah, it's quite a

129
00:06:59,380 --> 00:07:00,700
list of the big names right now:

130
00:07:00,700 --> 00:07:02,680
OpenAI's Realtime API,

131
00:07:02,680 --> 00:07:09,040
Google's Gemini Live API, ElevenLabs AI agents, and also Hume AI's EVI 4. Four different

132
00:07:09,040 --> 00:07:09,260
ones?

133
00:07:09,260 --> 00:07:13,490
Why so many? Wouldn't that be complicated? It probably is, but the idea is that each

134
00:07:13,490 --> 00:07:14,540
model has strengths.

135
00:07:14,540 --> 00:07:18,660
Maybe one is faster, one sounds more natural, one is better at catching emotional

136
00:07:18,660 --> 00:07:18,860
cues.

137
00:07:19,260 --> 00:07:22,940
By using several, they can kind of pick the best tool for the job for each part of

138
00:07:22,940 --> 00:07:25,160
the conversation, or blend them.

139
00:07:25,160 --> 00:07:30,180
It helps keep the latency low and the quality high. Like hedging your bets. That

140
00:07:30,180 --> 00:07:32,000
makes sense. Redundancy and optimization.

141
00:07:32,000 --> 00:07:36,500
Okay, for someone listening who isn't a developer, can we simplify the architecture?

142
00:07:36,500 --> 00:07:37,860
You mentioned a triangle earlier.

143
00:07:37,860 --> 00:07:41,660
Yeah, think of it as three core pieces working together really fast. First,

144
00:07:41,660 --> 00:07:45,600
you've got the device itself, the IoT client, that ESP32 thing

145
00:07:45,600 --> 00:07:49,190
we talked about, clipped to the toy. It just captures your voice and plays the AI's

146
00:07:49,190 --> 00:07:52,180
voice. It sends the audio securely using WebSockets.

147
00:07:52,180 --> 00:07:57,380
Okay, piece one, the ears and mouth on the toy. Exactly. Piece two is the edge server.

148
00:07:57,380 --> 00:08:01,440
This runs on something called Deno. Think of it as the super fast traffic controller

149
00:08:01,440 --> 00:08:02,560
or router. Why edge?

150
00:08:02,560 --> 00:08:06,990
It means it's located geographically close to you and also close to the big AI

151
00:08:06,990 --> 00:08:07,520
models.

152
00:08:07,520 --> 00:08:10,380
Its whole job is to grab the audio from the toy,

153
00:08:10,900 --> 00:08:15,430
instantly fire it off to the right AI service, like Gemini or ElevenLabs, get the

154
00:08:15,430 --> 00:08:18,480
response back, and zap it straight to the toy's speaker.

155
00:08:18,480 --> 00:08:25,150
Minimizes delay. Got it, the middleman ensuring speed. And the third piece? That's the

156
00:08:25,150 --> 00:08:25,680
front end.

157
00:08:25,680 --> 00:08:30,490
Basically the website or app you use, built with Next.js. This is where you choose

158
00:08:30,490 --> 00:08:31,120
your characters,

159
00:08:31,120 --> 00:08:34,490
maybe create custom ones, adjust the volume, that kind of thing. Ah, and I saw you

160
00:08:34,490 --> 00:08:35,420
can tweak the pitch.

161
00:08:35,440 --> 00:08:39,250
Yeah, the pitch factor. So you could take a serious character's voice and make it

162
00:08:39,250 --> 00:08:42,040
sound high-pitched and cartoonish if you wanted. More

163
00:08:42,040 --> 00:08:46,010
customization. Okay. So the whole thing relies on speed. If there's a big delay, it

164
00:08:46,010 --> 00:08:48,040
ruins the illusion of conversation.

165
00:08:48,040 --> 00:08:51,570
What kind of performance are they claiming? The numbers are pretty impressive,

166
00:08:51,570 --> 00:08:52,800
especially for a global system.

167
00:08:52,800 --> 00:08:58,640
They're aiming for under two seconds round-trip latency. Under two seconds, from you

168
00:08:58,640 --> 00:09:00,560
speaking to hearing the reply?

169
00:09:00,560 --> 00:09:05,070
Yeah, which is generally fast enough to feel pretty conversational, not like a walkie-talkie.

170
00:09:05,070 --> 00:09:07,900
And the audio quality? Does it sound clear?

171
00:09:07,900 --> 00:09:11,600
They mentioned using the Opus codec at 12 kbps,

172
00:09:11,600 --> 00:09:17,680
which in non-technical terms means it should sound pretty clear and crisp even

173
00:09:17,680 --> 00:09:19,560
though they're keeping the data rate low for speed.

174
00:09:19,560 --> 00:09:24,100
Okay, one more tech thing. How does it know when I've finished talking? Do I have

175
00:09:24,100 --> 00:09:26,420
to press a button? No, and that's crucial.

176
00:09:26,420 --> 00:09:31,640
They use something called server VAD, voice activity detection. Server VAD?

177
00:09:31,640 --> 00:09:36,360
Right, instead of the little device trying to guess, the powerful server analyzes

178
00:09:36,360 --> 00:09:38,360
the audio stream in real time.

179
00:09:38,360 --> 00:09:42,930
It figures out precisely when you've naturally paused or finished speaking. Ah, so

180
00:09:42,930 --> 00:09:44,600
it makes turn-taking much smoother.

181
00:09:44,600 --> 00:09:50,380
Exactly, less awkward silence, fewer interruptions, key for making it feel real.

182
00:09:50,380 --> 00:09:52,760
Plus they mentioned OTA updates. Over the air?

183
00:09:52,760 --> 00:09:56,620
Yeah, it means the software on the device can be updated automatically over Wi-Fi.

184
00:09:56,620 --> 00:10:00,540
So it can get better over time without you needing to plug it into a computer. Okay,

185
00:10:00,540 --> 00:10:01,320
so putting it all together,

186
00:10:01,320 --> 00:10:06,280
it's quite an ambitious project, merging these very specific, sometimes wild

187
00:10:06,280 --> 00:10:09,320
personalities with hardware that enables smooth, fast

188
00:10:09,320 --> 00:10:14,140
conversation. It really is. The big takeaway seems to be shifting AI interaction

189
00:10:14,140 --> 00:10:15,920
away from just typing in a box and

190
00:10:16,440 --> 00:10:21,620
into a physical object you can actually talk with. Like, really talk with. Whether

191
00:10:21,620 --> 00:10:24,240
you want that companion to be a nurturing waitress like Dottie Mae or

192
00:10:24,240 --> 00:10:29,900
a sarcastic philosopher or an inappropriate teddy bear. Right, it's that

193
00:10:29,900 --> 00:10:32,160
customization delivered through a physical form.

194
00:10:32,160 --> 00:10:37,380
So the final thought for you listening, the source emphasizes this device has no

195
00:10:37,380 --> 00:10:39,140
filters, no rules.

196
00:10:39,140 --> 00:10:44,180
We have the tech now to give an innocent-looking plushie a voice that could be,

197
00:10:44,180 --> 00:10:44,300
well,

198
00:10:44,680 --> 00:10:48,870
deliberately offensive like Ted, or shockingly dark like Sugar Plum, or maybe even

199
00:10:48,870 --> 00:10:50,140
politically charged.

200
00:10:50,140 --> 00:10:55,000
If digital companionship becomes totally personalized and unrestrained, what does

201
00:10:55,000 --> 00:10:55,480
that mean?

202
00:10:55,480 --> 00:10:59,300
What happens when we start designing companions not to be helpful or polite, but

203
00:10:59,300 --> 00:10:59,680
maybe

204
00:10:59,680 --> 00:11:01,680
unhinged?

205
00:11:01,680 --> 00:11:04,520
Provocative. Something to think about as this tech develops.

206
00:11:04,520 --> 00:11:07,730
Well, that's all we have time for on this deep dive, and thanks again to our

207
00:11:07,730 --> 00:11:08,840
supporter, Safe Server.

208
00:11:08,840 --> 00:11:12,720
Remember, they handle software hosting and support digital transformation.

209
00:11:12,720 --> 00:11:18,230
You can find out more at www.safeserver.de. Until next time, keep digging into the

210
00:11:18,230 --> 00:11:18,680
sources.