Welcome back to the deep dive.
If you've been watching fields like computer vision
or multimodal machine learning grow,
you quickly realize something.
It's often not the algorithms holding things back.
It's the data, like managing it, cleaning it up.
Yeah, and figuring out what's actually worth labeling.
That's a huge one.
Exactly.
So today, we're doing a deep dive into a platform built
specifically for that problem, Lightly Studio.
We've gathered a bunch of sources
to really unpack this open source tool.
Our goal here is pretty simple.
Break down what Lightly Studio is, who it's for,
and walk through the key ideas that, well,
aim to turn messy data wrangling into something more automated.
Try to make it approachable, even if you're just
getting started in this space.
But before we jump into the data pipelines themselves,
we really want to thank the supporter of this deep dive,
SafeServer.
SafeServer handles hosting for exactly this kind
of specialized software, and they support
your digital transformation.
So when you need to deploy tools like Lightly Studio,
they provide that crucial infrastructure.
You can find out more and start your own journey
at www.safeserver.de.
And that infrastructure piece is really key,
because scaling these modern ML projects,
it often hits a wall when the data just gets out of control.
You've got to curate it, label it right,
keep track of every version, every change.
It's a lot.
Lightly Studio really pitches itself
as that unified data platform for multimodal ML.
The idea is consolidating all those tricky, separate steps
into one place, make it manageable.
OK, so let's start right there.
What is this tool, fundamentally?
We know it's from Lightly, it's open source.
But what are those main tasks it's trying to unify?
Right, it's about unifying that whole workflow, curation,
annotation, and management.
It was really built by ML engineers
for ML engineers and organizations
that are trying to scale up computer vision work.
They recognize that, well, speed and flexibility are paramount.
And speed, you mentioned, how does it achieve that?
Well, the sources point to a pretty key architectural choice.
It's built using Rust.
Rust, OK.
Yeah, and Rust is known for performance, right?
And memory safety.
So what that means for someone using it is efficiency.
You can actually handle massive data sets, like COCO or ImageNet
scale, and do some pretty heavy processing,
even on, say, standard hardware.
Not necessarily a giant server farm.
Like what kind of standard hardware?
I think like a decent laptop, an M1 MacBook Pro with maybe 16 gigs
of RAM, that kind of thing.
OK, that's impressive.
So it's got this powerful engine.
What are the actual functions?
What does the platform do day to day?
Yeah, there are really four core functions
that kind of cover the whole data lifecycle, representing
its main value.
First up, you've got label and QA,
so built-in tools for annotating images and videos.
Pretty essential.
Second, it helps you understand and visualize your data.
So you can filter it, automatically find exact duplicates,
which honestly saves so much time,
spot those really important edge cases,
and also catch data drift
as your real-world conditions change.
Okay, makes sense.
Third, it lets you intelligently curate data.
This means automatically selecting the samples
that are actually the most valuable for training
or fine-tuning your model.
We'll probably dig into that more later.
Yeah, definitely want to circle back to that.
And the last one.
And finally, after you've done all that work,
you need to export and deploy that curated data set.
And it lets you do that whether you're running Lightly Studio
on your own machines, on-prem, using a hybrid cloud setup,
or fully in the cloud.
Gotcha, flexible deployment.
Exactly, and you know, we keep saying multimodal ML.
We should probably clarify that a bit.
Good point.
It means the platform isn't just for like standard photos.
It really supports a wide range of data types.
Images, sure, but also video clips, audio files, text,
and importantly, even specialized formats like DICOM data.
Oh, the medical imaging format.
That's the one, yeah, for X-rays, MRIs, that kind of thing.
So having that breadth really makes it a genuinely unified
hub for different data types.
OK, so it handles a lot of data types.
Let's unpack who actually uses this.
Who benefits from this unification?
Yeah.
Because the sources mention two main groups that
don't always talk to each other.
Yeah, that's a good way to put it.
It tries to bridge that gap.
So on one side, you've got the ML engineers, data scientists,
the infrastructure folks.
Yeah.
What's in it for them specifically?
Right.
For the engineers, it's really all
about integration and automation.
It's built with SDKs and API support, all based
on open source standards.
So the idea is it slots into their existing ML stacks
pretty easily without needing a total rewrite of everything.
Less disruption.
Exactly.
And automation is key, right?
That's handled mainly through the Python SDK.
So they can script everything, importing data, managing it.
They can pull data straight from where they usually
keep it, like local folders or cloud storage, like S3 or GCS.
The big ones from Amazon and Google Cloud, right?
Yeah, the standard object storage.
And crucially, once data is in, it's
not like it's locked forever.
You can keep adding new data to existing data
sets as you get it, which is, well,
vital for any kind of research or iterative development.
Absolutely.
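Just to make that concrete for anyone scripting along, here's a minimal sketch of what that import workflow might look like. Big caveat: the package and method names below are our assumptions for illustration, not verified SDK calls, so check the Lightly Studio docs for the real API.

```python
# Hypothetical sketch -- names below are assumptions, not the verified SDK API.
import lightly_studio as ls  # assumed package name

# Open (or create) a dataset backed by a local database file (assumed constructor).
dataset = ls.Dataset("my_project.db")

# Pull data straight from where it already lives (assumed method name).
dataset.add_samples("/data/images/")                 # local folder
dataset.add_samples("s3://my-bucket/new-captures/")  # cloud object storage

# Later, as new data arrives, keep appending to the same dataset.
dataset.add_samples("/data/images_batch_2/")
```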
OK, so that's the engineers.
Then on the other side, you mentioned labelers and project
managers.
These might be less technical users.
Often, yeah.
They're focused on the quality assurance, managing
large teams, logistics.
For them, the platform emphasizes
more intuitive workflows.
So a user-friendly GUI, tools for collaboration,
and really critical features, like data set versioning.
Oh, versioning is huge.
Anyone who's tried to reproduce an old result knows that pain.
Totally.
Knowing exactly which version of the labels
you used six months ago, yeah, it's crucial.
Plus, things like role-based permissions
to manage who can do what in the annotation team
make sense for project managers.
So it's bridging these two worlds.
And interestingly, it seems very focused
on making it easy to switch to Lightly Studio.
The sources actually call out that it simplifies migrating
data from other tools.
Oh, like competitors.
Yeah, they mention names like Encord, Voxel51, Ultralytics,
V7 Labs, Roboflow, popular tools in the space.
It seems like they actively want to be that central data hub,
reducing the friction if the team decides, OK,
we need to consolidate onto one platform.
OK, that makes strategic sense.
Now, here's where I think it gets really interesting,
especially for folks wanting to automate things.
The Python interface.
We don't need to become Python experts here.
But understanding the basic concepts
seems key to unlocking that automation power.
What are the main building blocks for a beginner?
Yeah, definitely.
You can think of it like setting up
a big physical filing system makes it easier to grasp.
So the first main concept is the data set.
If you're using the Python interface,
this is sort of your top level thing.
Think of the data set like your main binder
or the whole filing cabinet.
You use it to set up your data, connect to the database file
where everything's stored, kick off the visual interface
if you need it.
And critically, it's what you use
to run your queries and selections
to find specific data.
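In code, that "filing cabinet" entry point might look something like this, again a hedged sketch with assumed names rather than the documented API:

```python
# Hypothetical sketch -- package, constructor, and method names are assumptions.
import lightly_studio as ls  # assumed package name

dataset = ls.Dataset("project.db")  # connect to the database file where everything lives
dataset.launch_gui()                # kick off the visual interface (assumed method name)
```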
OK, so data set is the container.
Then what's inside?
Inside, you have the sample.
So if the data set is the binder,
the sample is like a single page or a single file inside it.
It's just one data instance.
Could be one image, one audio clip, whatever.
And what does it know about itself?
It holds all the key info, the unique ID,
like a serial number, the file name
where it lives on the disk, the absolute path,
and importantly, a list of descriptive tags.
Tags could be anything like reviewed,
needs labeling, nighttime, simple labels you attach.
And it also gives you access to the metadata.
That's all the other descriptive stuff about the file: image
resolution, when it was captured,
maybe GPS coordinates, depends on the data.
Dataset holds samples, samples hold info and tags.
What's the third piece?
The third piece is the real power move, data set queries.
This is how you find very specific subsets of your data
without looking through potentially millions of files
manually.
Queries let you combine filtering, sorting, slicing,
using standard logic, those Boolean expressions:
AND, OR, and NOT.
So is this just like filtering columns in a spreadsheet,
or is it more powerful with this kind of data?
It's way more powerful because you
can query based on the tags and the metadata at the same time.
So for instance, you could build a query like,
find all samples that are tagged needs labeling, OR
samples where the image width is less than 500 pixels
AND they have not been tagged as reviewed yet.
OK, so you can get really specific to find
potential problems or gaps.
Exactly.
It finds that precise set of data
that maybe slipped through your initial checks
or needs special attention.
That feels like the big win right there,
turning data prep from this manual slog
into something you can script.
Precisely.
And once you run that query, the really useful part
is you can then immediately do something with that subset,
like apply a new tag to all of them, say, needs review.
Then later in the visual interface,
they're super easy to find and work on just
by filtering for that tag.
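Sketched as code, with made-up query-builder syntax, the real SDK may express this quite differently, that whole round trip might look like:

```python
# Hypothetical sketch -- the query syntax here is an assumption, not the real API.
needs_attention = dataset.query(
    ls.tag("needs labeling")
    | ((ls.metadata("width") < 500) & ~ls.tag("reviewed"))
)

# Immediately act on the subset: tag it so it's easy to find in the GUI later.
needs_attention.add_tag("needs review")  # assumed method name
```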
Nice.
And for beginners, getting data in
seems pretty straightforward, too.
The sources mention easy ways to load data
from common formats, like YOLO.
That's for object detection, right, bounding boxes.
Yeah.
And COCO, which is often used for instance segmentation
or image captions. You just use simple Python functions,
like add_samples_from_yolo or add_samples_from_coco.
Keeping the barrier to entry low for common formats.
Good.
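Those two loader names come straight from the discussion; the parameters below are our guesses at what they would take, so verify against the docs:

```python
# Function names from the discussion; parameter names are assumptions.
dataset.add_samples_from_yolo(
    data_yaml="/data/yolo/data.yaml",  # assumed: YOLO config listing images and classes
)
dataset.add_samples_from_coco(
    annotations_json="/data/coco/instances.json",  # assumed: COCO annotation file
    images_dir="/data/coco/images/",               # assumed: image directory
)
```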
Now, this feels like a good transition
to that feature you mentioned earlier,
the one that really shows off the platform's advanced side.
Selection.
You said this is where it saves real money and time.
Yeah.
This is arguably the core IP, the really smart bit.
Selection is basically automated data selection.
And the purpose is simple, but huge.
Save potentially massive labeling costs
and cut down training time while actually improving
your final model quality.
How does that work?
If I have, say, a million images,
but only budget to label a few hundred,
how does it pick the best hundred?
That sounds tricky.
It is tricky.
That's why automation helps.
It avoids human bias and, frankly,
the tediousness of trying to ensure variety manually.
The mechanism works by automatically picking
the samples considered most useful.
And it does this by balancing two key factors that models
need to be robust.
OK, what are the two factors?
First, you need representative samples.
This is your core data, the typical stuff
your model will see 95% of the time, the normal cases.
Right, the bread and butter.
But if you only train on that, your model
falls apart the moment something slightly unusual happens.
So second, you need diverse samples.
These are the crucial edge cases,
the novel or rare examples, the stuff the model hasn't really
seen before but needs to handle.
OK, so it's like, if I'm training a self-driving car
model, I need lots of pictures of normal daytime driving.
That's your representative data.
But I also absolutely need examples
of driving in heavy rain at night, maybe with a weird object
on the road.
Those are your diverse edge case samples.
Exactly.
Selection aims to pick a subset that
intelligently balances both.
So if I just labeled 100 pictures
of the same boring highway on a sunny day,
I've kind of wasted 99 labels.
Selection stops that.
That's the idea.
It forces variety into your labeled set.
And you, the user, get to control this balance.
You can use different strategies.
For example, a metadata weighting strategy.
Maybe you tell it to prioritize samples tagged nighttime
or rainy, because you know those are hard cases for your model.
OK, using the tags we talked about earlier.
Right.
Or you could use something like an embedding diversity
strategy.
This is more AI driven.
It looks at the actual visual content
using embeddings, numerical representations of the images,
and picks samples that are mathematically distant
or different from what's already been selected,
even if a human didn't explicitly tag it as diverse.
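To show the intuition behind that embedding-diversity half, here's a small self-contained example, deliberately independent of the Lightly Studio SDK, using greedy farthest-point sampling over embeddings. The platform's actual selection algorithm is presumably more sophisticated; this just illustrates picking mathematically distant samples.

```python
# Self-contained illustration of the embedding-diversity idea (not the SDK):
# greedily pick samples whose embeddings are farthest from everything
# already selected (farthest-point sampling).
import numpy as np

def select_diverse(embeddings: np.ndarray, n_select: int) -> list[int]:
    """Greedy farthest-point selection over (num_samples, dim) embeddings."""
    selected = [0]  # seed with an arbitrary first sample
    # Distance from every sample to its nearest selected sample so far.
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < n_select:
        next_idx = int(np.argmax(dists))      # most "novel" remaining sample
        selected.append(next_idx)
        new_dists = np.linalg.norm(embeddings - embeddings[next_idx], axis=1)
        dists = np.minimum(dists, new_dists)  # update nearest-selected distances
    return selected

# Toy usage: 1000 samples with 128-dim embeddings, pick the 100 most diverse.
emb = np.random.rand(1000, 128)
picked = select_diverse(emb, 100)
```

Each pick is the sample farthest from everything already chosen, which is exactly the novelty pressure that keeps a labeled set from collapsing into 100 copies of the same sunny highway.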
Wow, OK, that sounds powerful.
It leads to some pretty impressive results,
according to the sources.
They cite things like up to an 80% cut in annotation costs.
80%.
Yeah.
And model iteration cycles getting three times faster.
Plus, actual accuracy bumps, sometimes 10%,
sometimes even as high as 36%.
That's significant.
It really underlines that focusing on data quality
over just raw quantity pays off.
Absolutely.
The return on investment for curating your data smartly
is undeniable.
So just to wrap up the main idea for the listener,
Lightly Studio positions itself as that central data hub.
It brings together the management, the labeling,
and this intelligent automated curation using selection.
And it also links up with their other tools,
like LightlyTrain for pre-training models
and LightlyEdge for optimizing data collection out
in the field.
So you get the open source flexibility and cost benefits,
but paired with enterprise level security rigor.
They mentioned being ISO 27001 certified,
which is important for businesses.
All right, that security aspect is key for adoption.
Okay, that level of control over the data
brings us to our final thought for you,
the listener, to maybe chew on.
We tend to get really fixated on picking
the perfect model architecture,
tweaking the training parameters,
but maybe the real leverage, the real power,
actually lies in data curation.
So here's the question.
If you could only afford to label, say,
100 images for your next big computer vision project,
how confident are you right now
that those specific 100 images would be the absolute best,
most useful, most diverse set possible
to train your model effectively?
Tools like Lightly Studio are fundamentally designed
to take the guesswork out of that question
and give you some actual measurable certainty,
something to think about.
Well, thank you for joining us for this deep dive
into data management for multimodal ML.
And just one final reminder
that this deep dive was supported by Safe Server.
Safe Server supports your digital transformation
and handles hosting for this kind of software.
We'll see you next time on the deep dive.