Today's Deep-Dive: Lightly Studio
Ep. 308

Episode description

Lightly Studio is an open-source platform designed to streamline data management, curation, and annotation for multimodal machine learning projects. It addresses the common bottleneck in AI development, where data quality and management, rather than algorithms, often hinder progress. The platform consolidates crucial steps like data curation, annotation, and management into a unified workflow, aiming to automate the often-tedious process of data wrangling. Built with Rust for high performance, Lightly Studio can handle massive datasets on standard hardware, making it efficient and accessible.

Its core functions include label and quality assurance tools, data understanding and visualization for identifying duplicates and edge cases, intelligent data curation, and flexible export options for deployment. The platform supports a wide range of data types, including images, video, audio, text, and specialized formats like DICOM, serving as a versatile hub for diverse data needs. Lightly Studio bridges the gap between ML engineers, who benefit from its SDKs, APIs, and automation capabilities, and less technical users like labelers and project managers, who are supported by an intuitive GUI and collaboration features.

A key feature is its automated data selection capability, which intelligently balances representative and diverse samples to reduce labeling costs by up to 80%, accelerate training cycles, and improve model accuracy. This focus on smart data curation, rather than just quantity, offers a significant return on investment, positioning Lightly Studio as a central data hub that enhances flexibility, cost-effectiveness, and enterprise-level security.

Gain digital sovereignty now and save costs

Let’s have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? What is the state of your backups and security updates?

Digital sovereignty is easily achieved with open-source software (which usually costs way less, too). Our division SafeServer offers hosting, operation, and maintenance for countless Free and Open Source tools.

Try it now!

Download transcript (.srt)
0:00

Welcome back to the deep dive.

0:02

If you've been watching fields like computer vision

0:04

or multimodal machine learning grow,

0:08

you quickly realize something.

0:10

It's often not the algorithms holding things back.

0:13

It's the data, like managing it, cleaning it up.

0:16

Yeah, and figuring out what's actually worth labeling.

0:18

That's a huge one.

0:19

Exactly.

0:20

So today, we're doing a deep dive into a platform built

0:24

specifically for that problem, Lightly Studio.

0:26

We've gathered a bunch of sources

0:28

to really unpack this open source tool.

0:30

Our goal here is pretty simple.

0:32

Break down what Lightly Studio is, who it's for,

0:35

and walk through the key ideas that, well,

0:37

aim to turn messy data wrangling into something more automated.

0:40

Try to make it approachable, even if you're just

0:42

getting started in this space.

0:44

But before we jump into the data pipelines themselves,

0:47

we really want to thank the supporter of this deep dive,

0:49

SafeServer.

0:50

SafeServer handles hosting for exactly this kind

0:53

of specialized software, and they support

0:55

your digital transformation.

0:57

So when you need to deploy tools like Lightly Studio,

0:59

they provide that crucial infrastructure.

1:02

You can find out more and start your own journey

1:04

at www.safeserver.de.

1:06

And that infrastructure piece is really key,

1:09

because scaling these modern ML projects,

1:12

it often hits a wall when the data just gets out of control.

1:15

You've got to curate it, label it right,

1:16

keep track of every version, every change.

1:19

It's a lot.

1:20

Lightly Studio really pitches itself

1:22

as that unified data platform for multimodal ML.

1:26

The idea is consolidating all those tricky, separate steps

1:29

into one place, make it manageable.

1:31

OK, so let's start right there.

1:33

What is this tool, fundamentally?

1:35

We know it's from Lightly, it's open source.

1:37

But what are those main tasks it's trying to unify?

1:39

Right, it's about unifying that whole workflow, curation,

1:43

annotation, and management.

1:46

It was really built by ML engineers

1:47

for ML engineers and organizations

1:49

that are trying to scale up computer vision work.

1:51

They recognize that, well, speed and flexibility are paramount.

1:54

And speed, you mentioned. How does it achieve that?

1:56

Well, the sources point to a pretty key architectural choice.

1:59

It's built using Rust.

2:02

Rust, OK.

2:03

Yeah, and Rust is known for performance, right?

2:06

And memory safety.

2:07

So what that means for someone using it is efficiency.

2:10

You can actually handle massive data sets at COCO or ImageNet

2:13

scale and do some pretty heavy processing,

2:17

even on, say, standard hardware.

2:19

Not necessarily a giant server farm.

2:21

Like what kind of standard hardware?

2:23

I think like a decent laptop, an M1 MacBook Pro with maybe 16 gigs

2:28

of RAM, that kind of thing.

2:29

OK, that's impressive.

2:30

So it's got this powerful engine.

2:32

What are the actual functions?

2:34

What does the platform do day to day?

2:36

Yeah, there are really four core functions

2:38

that kind of cover the whole data lifecycle representing

2:41

its main value.

2:42

First up, you've got label and QA,

2:44

so built-in tools for annotating images and videos.

2:46

Pretty essential.

2:48

Second, it helps you understand and visualize your data.

2:51

So you can filter it, automatically find exact duplicates,

2:53

which honestly saves so much time,

2:56

spot those really important edge cases,

2:57

and also catch data drift

3:00

as your real-world conditions change.

3:02

Okay, makes sense.

3:03

Third, it lets you intelligently curate data.

3:06

This means automatically selecting the samples

3:08

that are actually the most valuable for training

3:11

or fine-tuning your model.

3:13

We'll probably dig into that more later.

3:14

Yeah, definitely want to circle back to that.

3:16

And the last one.

3:17

And finally, after you've done all that work,

3:19

you need to export and deploy that curated data set.

3:22

And it lets you do that whether you're running Lightly Studio

3:24

on your own machines, on-prem, using a hybrid cloud setup,

3:28

or fully in the cloud.

3:30

Gotcha, flexible deployment.

3:31

Exactly, and you know, we keep saying multimodal ML.

3:35

We should probably clarify that a bit.

3:36

Good point.

3:37

It means the platform isn't just for like standard photos.

3:40

It really supports a wide range of data types.

3:43

Images, sure, but also video clips, audio files, text,

3:48

and importantly, even specialized formats like DICOM data.

3:51

Oh, the medical imaging format.

3:53

That's the one, yeah, for X-rays, MRIs, that kind of thing.

3:56

So having that breadth really makes it a genuinely unified

4:00

hub for different data types.

4:03

OK, so it handles a lot of data types.

4:05

Let's unpack who actually uses this.

4:08

Who benefits from this unification?

4:10

Yeah.

4:10

Because the sources mention, like, two main groups that

4:12

don't always talk to each other.

4:14

Yeah, that's a good way to put it.

4:15

It tries to bridge that gap.

4:16

So on one side, you've got the ML engineers, data scientists,

4:21

the infrastructure folks.

4:22

Yeah.

4:23

What's in it for them specifically?

4:25

Right.

4:25

For the engineers, it's really all

4:27

about integration and automation.

4:29

It's built with SDKs and API support, all based

4:33

on open source standards.

4:35

So the idea is it slots into their existing ML stacks

4:38

pretty easily without needing a total rewrite of everything.

4:42

Less disruption.

4:42

Exactly.

4:43

And automation is key, right?

4:44

That's handled mainly through the Python SDK.

4:47

So they can script everything, importing data, managing it.

4:51

They can pull data straight from where they usually

4:53

keep it, like local folders or cloud storage, like S3 or GCS.

4:58

Those are the big ones from Amazon and Google Cloud, right?

5:00

Yeah, the standard object storage.

5:02

And crucially, once data is in, it's

5:04

not like it's locked forever.

5:05

You can keep adding new data to existing data

5:08

sets as you get it, which is, well,

5:11

vital for any kind of research or iterative development.

5:14

Absolutely.
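
As a rough sketch of that scripted starting point, here is a small Python example that gathers image paths from a local folder and lists objects in an S3 bucket via boto3. The folder, bucket, and prefix names are placeholders, and it stops short of the actual Lightly Studio import call, which isn't shown here.

```python
from pathlib import Path

import boto3  # standard AWS SDK for Python

# Local folder: collect image files that a scripted import could start from.
local_images = sorted(Path("data/images").glob("*.jpg"))

# S3: list object keys under a prefix (bucket and prefix are placeholder names).
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-training-data", Prefix="raw/images/")
s3_keys = [obj["Key"] for obj in response.get("Contents", [])]

print(f"{len(local_images)} local files, {len(s3_keys)} objects in S3")
```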

5:14

OK, so that's the engineers.

5:16

Then on the other side, you mentioned labelers and project

5:18

managers.

5:19

These might be less technical users.

5:21

Often, yeah.

5:22

They're focused on the quality assurance, managing

5:24

large teams, logistics.

5:27

For them, the platform emphasizes

5:30

more intuitive workflows.

5:32

So a user-friendly GUI, tools for collaboration,

5:36

and really critical features, like data set versioning.

5:38

Oh, versioning is huge.

5:39

Anyone who's tried to reproduce an old result knows that pain.

5:42

Totally.

5:43

Knowing exactly which version of the labels

5:45

you used six months ago, yeah, it's crucial.

5:48

Plus, things like role-based permissions

5:50

to manage who can do what in the annotation team

5:52

make sense for project managers.

5:54

So it's bridging these two worlds.

5:55

And interestingly, it seems very focused

5:58

on making it easy to switch to Lightly Studio.

6:00

The sources actually call out that it simplifies migrating

6:03

data from other tools.

6:04

Oh, like competitors.

6:05

Yeah, they mention names like Encord, Voxel51, Ultralytics,

6:09

V7 Labs, Roboflow, popular tools in the space.

6:13

It seems like they actively want to be that central data hub,

6:16

reducing the friction if the team decides, OK,

6:19

we need to consolidate onto one platform.

6:21

OK, that makes strategic sense.

6:22

Now, here's where I think it gets really interesting,

6:25

especially for folks wanting to automate things.

6:27

The Python interface.

6:29

We don't need to become Python experts here.

6:31

But understanding the basic concepts

6:33

seems key to unlocking that automation power.

6:37

What are the main building blocks for a beginner?

6:39

Yeah, definitely.

6:40

You can think of it like setting up

6:42

a big physical filing system; that makes it easier to grasp.

6:45

So the first main concept is the data set.

6:47

If you're using the Python interface,

6:49

this is sort of your top level thing.

6:51

Think of the data set like your main binder

6:52

or the whole filing cabinet.

6:54

You use it to set up your data, connect to the database file

6:56

where everything's stored, kick off the visual interface

6:59

if you need it.

7:00

And critically, it's what you use

7:02

to run your queries and selections

7:03

to find specific data.

7:05

OK, so data set is the container.

7:07

Then what's inside?

7:08

Inside, you have the sample.

7:10

So if the data set is the binder,

7:11

the sample is like a single page or a single file inside it.

7:15

It's just one data instance.

7:17

Could be one image, one audio clip, whatever.

7:20

And what does it know about itself?

7:22

It holds all the key info, the unique ID,

7:25

like a serial number, the file name

7:27

where it lives on the disk, the absolute path,

7:29

and importantly, a list of descriptive tags.

7:32

Tags could be anything like reviewed,

7:34

needs labeling, nighttime, simple labels you attach.

7:38

And it also gives you access to the metadata.

7:40

That's all the other descriptive stuff about the file image

7:43

resolution when it was captured.

7:45

Maybe GPS coordinates, depends on the data.
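
To keep that filing-cabinet picture concrete, here is a minimal, self-contained Python sketch of the two ideas so far: a sample carrying an ID, file name, path, tags, and metadata, and a dataset as the container around it. The class names and fields are illustrative stand-ins, not the actual Lightly Studio SDK.

```python
from dataclasses import dataclass, field


# Illustrative stand-ins for the concepts discussed above,
# not the real Lightly Studio SDK classes.
@dataclass
class Sample:
    sample_id: str                                 # unique ID, like a serial number
    file_name: str                                 # file name on disk
    file_path: str                                 # absolute path to the file
    tags: set[str] = field(default_factory=set)    # e.g. {"reviewed", "nighttime"}
    metadata: dict = field(default_factory=dict)   # e.g. {"width": 1920, "gps": ...}


@dataclass
class Dataset:
    name: str
    samples: list[Sample] = field(default_factory=list)


# One "page in the binder": a single image with its tags and metadata.
dataset = Dataset(name="street-scenes")
dataset.samples.append(
    Sample(
        sample_id="img-0001",
        file_name="frame_0001.jpg",
        file_path="/data/street/frame_0001.jpg",
        tags={"needs labeling", "nighttime"},
        metadata={"width": 480, "height": 360},
    )
)
```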

7:48

Dataset holds samples, samples hold info and tags.

7:50

What's the third piece?

7:51

The third piece is the real power move, data set queries.

7:55

This is how you find very specific subsets of your data

7:58

without looking through potentially millions of files

8:00

manually.

8:01

Queries let you combine filtering, sorting, slicing,

8:05

using standard logic, those Boolean expressions,

8:08

AND, OR, and NOT.

8:09

So is this just like filtering columns in a spreadsheet,

8:12

or is it more powerful with this kind of data?

8:15

It's way more powerful because you

8:16

can query based on the tags and the metadata at the same time.

8:21

So for instance, you could build a query like,

8:23

find all samples that are tagged needs labeling, or,

8:27

find samples where the image width is less than 500 pixels

8:30

and they have not been tagged as reviewed yet.

8:34

OK, so you can get really specific to find

8:36

potential problems or gaps.

8:38

Exactly.

8:38

It finds that precise set of data

8:40

that maybe slipped through your initial checks

8:42

or needs special attention.

8:43

That feels like the big win right there,

8:45

turning data prep from this manual slog

8:48

into something you can script.

8:49

Precisely.

8:50

And once you run that query, the really useful part

8:53

is you can then immediately do something with that subset,

8:55

like apply a new tag to all of them, say, needs review.

8:59

Then later in the visual interface,

9:01

they're super easy to find and work on just

9:03

by filtering for that tag.
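
Continuing the illustrative sketch from above (again, not the real SDK), the query-then-tag step described here boils down to something like this:

```python
# Uses the Dataset/Sample sketch defined earlier.
# Conceptual version of the example query: samples tagged "needs labeling",
# OR samples narrower than 500 px that are not yet tagged "reviewed".
matches = [
    s for s in dataset.samples
    if "needs labeling" in s.tags
    or (s.metadata.get("width", 0) < 500 and "reviewed" not in s.tags)
]

# Immediately act on that subset: attach a tag so the samples are easy
# to find later in the visual interface by filtering on it.
for sample in matches:
    sample.tags.add("needs review")

print(f"Tagged {len(matches)} samples for review")
```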

9:04

Nice.

9:05

And for beginners, getting data in

9:06

seems pretty straightforward, too.

9:08

The sources mention easy ways to load data

9:10

from common formats, like YOLO.

9:12

That's for object detection, right, bounding boxes.

9:14

Yeah.

9:15

And COCO, which is often used for instance segmentation

9:19

or image captions, you just use simple Python functions,

9:22

like add samples from YOLO or add samples from COCO.

9:26

Keeping the barrier to entry low for common formats.
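
For a sense of what such an importer deals with under the hood, here is a short, self-contained sketch that parses a single YOLO-format label file, where each line holds a class ID followed by a normalized center x, center y, width, and height. The file path is a placeholder, and this illustrates the format rather than Lightly Studio's actual loader.

```python
from pathlib import Path


def parse_yolo_labels(label_file: Path) -> list[dict]:
    """Parse one YOLO .txt label file into a list of bounding boxes.

    Each line has the form: <class_id> <x_center> <y_center> <width> <height>,
    with coordinates normalized to the 0..1 range.
    """
    boxes = []
    for line in label_file.read_text().splitlines():
        if not line.strip():
            continue
        class_id, cx, cy, w, h = line.split()
        boxes.append({
            "class_id": int(class_id),
            "x_center": float(cx),
            "y_center": float(cy),
            "width": float(w),
            "height": float(h),
        })
    return boxes


# Placeholder path; a real importer would walk the whole labels/ directory.
print(parse_yolo_labels(Path("labels/frame_0001.txt")))
```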

9:28

Good.

9:29

Now, this feels like a good transition

9:30

to that feature you mentioned earlier,

9:32

the one that really shows off the platform's advanced side.

9:34

Selection.

9:35

You said this is where it saves real money and time.

9:37

Yeah.

9:38

This is arguably the core IP, the really smart bit.

9:41

Selection is basically automated data selection.

9:44

And the purpose is simple, but huge.

9:47

Save potentially massive labeling costs

9:49

and cut down training time while actually improving

9:52

your final model quality.

9:54

How does that work?

9:55

If I have, say, a million images,

9:57

but only the budget to label a few hundred,

9:59

how does it pick the best hundred?

10:00

That sounds tricky.

10:01

It is tricky.

10:02

That's why automation helps.

10:03

It avoids human bias and, frankly,

10:05

the tediousness of trying to ensure variety manually.

10:08

The mechanism works by automatically picking

10:10

the samples considered most useful.

10:13

And it does this by balancing two key factors that models

10:16

need to be robust.

10:17

OK, what are the two factors?

10:18

First, you need representative samples.

10:21

This is your core data, the typical stuff

10:23

your model will see 95% of the time, the normal cases.

10:26

Right, the bread and butter.

10:27

But if you only train on that, your model

10:29

falls apart the moment something slightly unusual happens.

10:33

So second, you need diverse samples.

10:35

These are the crucial edge cases,

10:37

the novel or rare examples, the stuff the model hasn't really

10:40

seen before but needs to handle.

10:42

OK, so it's like, if I'm training a self-driving car

10:44

model, I need lots of pictures of normal daytime driving.

10:47

That's your representative data.

10:49

But I also absolutely need examples

10:53

of driving in heavy rain at night, maybe with a weird object

10:56

on the road.

10:57

Those are your diverse edge case samples.

10:59

Exactly.

10:59

Selection aims to pick a subset that

11:01

intelligently balances both.

11:03

So if I just labeled 100 pictures

11:05

of the same boring highway on a sunny day,

11:07

I've kind of wasted 99 labels.

11:09

Selection stops that.

11:11

That's the idea.

11:11

It forces variety into your labeled set.

11:14

And you, the user, get to control this balance.

11:18

You can use different strategies.

11:19

For example, a metadata weighting strategy.

11:23

Maybe you tell it to prioritize samples tagged nighttime

11:26

or rainy, because you know those are hard cases for your model.

11:29

OK, using the tags we talked about earlier.

11:30

Right.

11:31

Or you could use something like an embedding diversity

11:33

strategy.

11:34

This is more AI driven.

11:36

It looks at the actual visual content

11:38

using embeddings, numerical representations of the images,

11:42

and picks samples that are mathematically distant

11:45

or different from what's already been selected,

11:48

even if a human didn't explicitly tag it as diverse.
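
To make the embedding-diversity idea tangible, here is a simplified greedy farthest-point sketch in Python and NumPy: it repeatedly picks the sample whose embedding sits farthest from everything already chosen, with an optional score boost for samples carrying a prioritized tag (the metadata-weighting idea). This is a generic textbook approach on assumed inputs, not Lightly Studio's actual selection algorithm.

```python
import numpy as np


def select_diverse(embeddings: np.ndarray, tags: list, budget: int,
                   boost_tags: frozenset = frozenset(), boost: float = 0.5) -> list:
    """Greedy farthest-point selection over per-sample embeddings.

    embeddings: (n_samples, dim) array of embeddings.
    tags:       one set of tags per sample; samples with a tag in
                boost_tags get their distance score multiplied by (1 + boost).
    Returns the indices of the selected samples.
    """
    n = len(embeddings)
    selected = [0]  # seed with an arbitrary first sample
    # Distance from every sample to its nearest already-selected sample.
    min_dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < min(budget, n):
        score = min_dist.copy()
        for i in range(n):  # metadata weighting: prefer known hard cases
            if tags[i] & boost_tags:
                score[i] *= (1.0 + boost)
        score[selected] = -np.inf  # never re-pick an already-selected sample
        pick = int(np.argmax(score))
        selected.append(pick)
        dist_to_pick = np.linalg.norm(embeddings - embeddings[pick], axis=1)
        min_dist = np.minimum(min_dist, dist_to_pick)
    return selected


# Tiny demo with random embeddings and a few "nighttime" samples.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))
sample_tags = [{"nighttime"} if i % 50 == 0 else set() for i in range(200)]
print(select_diverse(emb, sample_tags, budget=10, boost_tags=frozenset({"nighttime"})))
```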

11:51

Wow, OK, that sounds powerful.

11:53

It leads to some pretty impressive results,

11:54

according to the sources.

11:56

They cite things like up to an 80% cut in annotation costs.

11:59

80%.

12:00

Yeah.

12:00

And model iteration cycles getting three times faster.

12:03

Plus, actual accuracy bumps, sometimes 10%.

12:05

Sometimes even as high as 36%.

12:08

That's significant.

12:09

It really underlines that focusing on data quality

12:11

over just raw quantity pays off.

12:14

Absolutely.

12:15

The return on investment for curating your data smartly

12:18

is undeniable.

12:19

So just to wrap up the main idea for the listener,

12:22

Lightly Studio positions itself as that central data hub.

12:24

It brings together the management, the labeling,

12:27

and this intelligent automated curation using selection.

12:31

And it also links up with their other tools,

12:33

like LightlyTrain for pre-training models

12:35

and LightlyEdge for optimizing data collection out

12:37

in the field.

12:38

So you get the open source flexibility and cost benefits,

12:40

but paired with enterprise level security rigor.

12:43

They mentioned being ISO 27001 certified,

12:46

which is important for businesses.

12:48

All right, that security aspect is key for adoption.

12:51

Okay, that level of control over the data

12:53

brings us to our final thought for you,

12:54

the listener, to maybe chew on.

12:56

We tend to get really fixated on picking

12:58

the perfect model architecture,

12:59

tweaking the training parameters,

13:01

but maybe the real leverage, the real power,

13:03

actually lies in data curation.

13:05

So here's the question.

13:07

If you could only afford to label, say,

13:10

100 images for your next big computer vision project,

13:13

how confident are you right now

13:16

that those specific 100 images would be the absolute best,

13:19

most useful, most diverse set possible

13:22

to train your model effectively?

13:24

Tools like Lightly Studio are fundamentally designed

13:26

to take the guesswork out of that question

13:28

and give you some actual measurable certainty,

13:30

something to think about.

13:32

Well, thank you for joining us for this deep dive

13:33

into data management for multimodal ML.

13:36

And just one final reminder

13:37

that this deep dive was supported by Safe Server.

13:39

Safe Server supports your digital transformation

13:41

and handles hosting for this kind of software.

13:43

We'll see you next time on the deep dive.
