Okay, so get this. Imagine you're so deep into Kubernetes.
Like you're giving a KubeCon talk about your setup.
You've got it handling millions of users.
But then you leave it.
That's what we're diving into today.
Why Gitpod decided Kubernetes,
specifically for their developer environments,
wasn't working.
Yeah, it's really interesting
because they're not saying Kubernetes is bad, right?
They're saying it's not the right tool
when it comes to developer environments.
Exactly, we're looking at their blog post
from October 31st, 2024.
It's this like six year saga of them trying to make it work,
hitting all these roadblocks.
They came up with some pretty interesting workarounds.
Oh yeah.
You almost feel bad for them,
but you learn a lot whether you're deep into Kubernetes
or just curious about developer tools.
Yeah, and it just shows that even teams
with tons of experience, even huge teams,
sometimes have to take a step back and look at their tools.
Right.
You've got to pick the right tool for the job.
Right.
It doesn't have to be the popular one.
Yeah, okay.
So Gitpod's main argument is running applications
in production, that's where Kubernetes shines.
Yeah.
But developer environments, that's a whole different beast.
Totally.
And the blog post breaks down why.
They highlight these four characteristics of developer environments.
The first being that they're super stateful
and interactive.
So you've got gigabytes of source code,
you've got build caches, you've got containers running.
All that is constantly changing.
It's not like a stateless app.
Your developer environment is basically an extension of you.
Yeah, it's like the difference between a pristine server room
and your desk.
Your desk has projects everywhere and coffee mugs.
Exactly.
And that mess is really valuable to developers.
So you can imagine it's a huge pain
if they lose changes or get interrupted.
And that leads us to the second characteristic,
unpredictable resource usage.
So you might be coding along, and suddenly, bam,
you need tons of CPU for compilation.
Or memory usage might spike.
Yeah, and Kubernetes isn't really
known for loving surprises.
Not really, no.
Gitpod talks about all the struggles
they had with CPU throttling.
Your terminal's lagging because your IDE is fighting
some random process for resources.
They did all kinds of stuff.
Custom controllers, messing with process priorities,
even tweaking cgroups v2.
Yeah, and for those who don't know,
cgroups v2 is how the Linux kernel organizes processes
into these hierarchical groups.
It's for controlling and monitoring
things like CPU, memory, and disk I/O.
It's very fine-grained control, but it's complex.
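To make that a bit more concrete, here's a minimal sketch of the kind of knob cgroups v2 exposes, written in Go with just the standard library. It creates a cgroup, writes a CPU quota into cpu.max, and moves a process into it. The paths, names, and values are illustrative assumptions, not Gitpod's actual controller, and it needs a Linux host with cgroup v2 mounted at /sys/fs/cgroup plus root privileges.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Illustrative only: create a cgroup for one workspace and cap its CPU.
	// Assumes cgroup v2 is mounted at /sys/fs/cgroup and we run as root.
	cg := "/sys/fs/cgroup/workspace-demo"
	if err := os.MkdirAll(cg, 0o755); err != nil {
		panic(err)
	}

	// "200000 100000" means: at most 200ms of CPU time per 100ms period,
	// i.e. roughly two CPU cores' worth of compute for this group.
	if err := os.WriteFile(filepath.Join(cg, "cpu.max"), []byte("200000 100000"), 0o644); err != nil {
		panic(err)
	}

	// Move the current process into the new cgroup by writing its PID
	// into cgroup.procs. Anything it spawns (builds, compilers) inherits the cap.
	pid := fmt.Sprintf("%d", os.Getpid())
	if err := os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(pid), 0o644); err != nil {
		panic(err)
	}

	fmt.Println("workspace cgroup created with a 2-CPU cap")
}
```

The hard part they describe wasn't writing a value like this once, it was their custom controllers adjusting these limits dynamically based on process priorities, which is where the complexity piles up.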
Yeah, it sounds like they went really deep.
Deep down the rabbit hole.
And remember, this is all happening
inside a single container because that's
the way Kubernetes works.
So all these processes crammed together,
it just makes resource usage a total guessing game.
Right.
OK, so then there's memory management.
Apparently, until swap space became available in Kubernetes
version 1.22, overbooking memory was a pretty big risk.
Like, you could end up killing essential processes, you can imagine.
Developer rage.
Yeah.
I mean, this just shows that even mature technologies
like Kubernetes can have limitations, especially
for specific use cases, right?
It's really important to evaluate
whether a tool's strengths really fit what you need it for.
They must've been, I mean, can you imagine?
Pulling their hair out.
Yeah.
Yeah.
OK, so then we have storage performance.
Gitpod really hammers on about how much this matters,
not just for how fast your environment starts up,
but your whole experience inside the environment.
Yeah, because if you're waiting for files to load
or for builds to finish, it just kills your flow.
Totally.
And they tried everything.
SSD RAID 0 for speed, a little risky.
Then block storage for availability,
but they hit a wall with persistent volume claims, or PVCs.
For those who aren't deep into Kubernetes,
explain why PVCs were such a pain.
Sure.
So PVCs, they're like this abstraction layer
that lets you request storage.
You don't have to worry about the underlying hardware,
so it's flexible.
But in practice, when these PVCs would attach or detach,
it was unpredictable, and that messed with their attempts
to make workspace startups super fast.
They also ran into some reliability issues,
especially on Google Cloud.
So you're a developer, you're ready to code,
and your whole environment just crashes.
Yeah.
Not a good look.
Talk about a buzzkill.
And then there's backing up and restoring these environments.
They can get huge, right?
Right.
So moving them around became this balancing act
of I-O, network bandwidth, and CPU.
Wow.
They even had to use cgroup-based I/O limiters
to prevent one workspace from hogging all the resources
and then starving the others.
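Same idea as the CPU cap above, but for the io controller. Here's a tiny sketch of what a cgroup-based I/O limit boils down to; the device major:minor numbers and byte values are placeholders for illustration, not Gitpod's settings.

```go
package main

import "os"

func main() {
	// Illustrative cgroup v2 I/O limit: cap one workspace's reads and writes
	// to ~100 MiB/s on the block device with major:minor 259:0.
	// Device numbers and limits are placeholders; needs root and cgroup v2.
	rule := "259:0 rbps=104857600 wbps=104857600\n"
	err := os.WriteFile("/sys/fs/cgroup/workspace-demo/io.max", []byte(rule), 0o644)
	if err != nil {
		panic(err)
	}
}
```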
It's crazy how these things that sound simple get so complex.
Totally.
Speaking of complex, another challenge?
Autoscaling and startup time.
Yeah.
They were obsessed with minimizing
that initial wait time.
Of course, yeah.
But that clashed with their desire
to use their machines as efficiently as possible.
Yeah, I mean Kubernetes by design
has this inherent lower limit on startup time, right?
Right.
Because of all the steps involved,
moving content around, spinning up containers.
So they started off thinking, let's just
run multiple workspaces on one node
to leverage shared caches.
But that didn't really work out.
Didn't quite work out, no.
So they tried some creative solutions.
They tried something they called ghost workspaces.
Ghost workspaces.
Yeah.
So these were preemptible pods that would just
sit there to hold space so they could scale in advance.
They're like phantom developers taking up space.
That's a good way to put it.
Clever, but too slow and unreliable.
Then they tried ballast pods.
So these were entire nodes filled with dummy pods
just to ensure that they had enough capacity.
Kind of like renting out an empty apartment building
just in case you might need it later.
Pretty much not efficient.
Finally, they landed on cluster autoscaler plugins,
which is a much more elegant solution.
But it took a while to get there.
They even implemented proportional autoscaling,
which basically controls the rate of scale up.
It's based on how quickly devs are starting new environments.
So if there's a sudden rush, they
can add capacity quickly without overshooting.
It's all about finding that balance
between being responsive and making
the most of your resources.
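Just to give a feel for the idea, here's a toy version of proportional scale-up logic in Go. It is not Gitpod's cluster-autoscaler plugin, just a sketch under assumed numbers: watch how many environments started recently and keep spare capacity proportional to that rate, with a cap so you don't overshoot.

```go
package main

import (
	"fmt"
	"time"
)

// desiredHeadroom returns how many spare workspace slots we want,
// proportional to the observed start rate (starts per minute).
// factor and maxHeadroom are illustrative tuning knobs.
func desiredHeadroom(startsPerMinute float64) int {
	const factor = 1.5     // keep 1.5x the recent start rate in reserve
	const maxHeadroom = 50 // never pre-provision more than 50 slots
	h := int(startsPerMinute * factor)
	if h > maxHeadroom {
		h = maxHeadroom
	}
	return h
}

func main() {
	// Pretend stream of "environments started in the last minute" samples.
	samples := []float64{2, 3, 10, 25, 40, 12, 4}

	for _, rate := range samples {
		headroom := desiredHeadroom(rate)
		// A real plugin would translate headroom into extra nodes and
		// hand that to the cluster autoscaler; here we just print it.
		fmt.Printf("start rate %.0f/min -> keep %d spare slots\n", rate, headroom)
		time.Sleep(100 * time.Millisecond)
	}
}
```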
My brain's hurting.
Anyone else?
OK.
Image pulls, another headache.
Workspace container images can be huge.
We're talking like 10 gigabytes or more.
And that impacts performance when you have to download and extract
that much data for every workspace.
Yeah, it's like downloading the entire Library of Congress
every time you want to read a book.
Right.
So they tried pre-pulling images with DaemonSets, which
are basically agents on every node making sure the images are ready.
Then they tried building their own custom images
to maximize layer reuse, even baking images directly
into the node disk image.
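The pre-pull agent itself doesn't have to be fancy. Here's a minimal sketch of what such an agent could do, just shelling out to docker pull for a list of hot images. The image names are placeholders, and this ignores everything the real approach has to handle, like disk pressure, registry auth, and choosing which images are actually hot.

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Hypothetical list of workspace base images to keep warm on this node.
	images := []string{
		"ubuntu:22.04",
		"example.com/workspace-base:latest", // placeholder registry/image
	}

	for _, img := range images {
		log.Printf("pre-pulling %s", img)
		cmd := exec.Command("docker", "pull", img)
		if out, err := cmd.CombinedOutput(); err != nil {
			// Keep going; a failed pull just means a slower first start.
			log.Printf("pull of %s failed: %v\n%s", img, err, out)
		}
	}
}
```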
Yeah, each of those came with their own trade-offs, right?
Increased complexity, higher costs, limits
on what images devs could use.
Again, another example of how something seemingly simple
can get really complicated at scale.
Yeah, and they even built their own registry facade.
They integrated it with IPFS, the Interplanetary File System,
that decentralized way to store and share files.
They were so proud of it.
They gave a whole KubeCon talk about it.
But in the end, the best solution
was just encouraging everyone to use similar base images,
making caching a lot more effective.
Sometimes the simplest answer really is the best one.
But getting there takes some effort.
OK, buckle up.
We're going into the world of networking in Kubernetes.
And this is where it gets a little technical.
This is where the conflict between what Kubernetes assumes
and what developer environments need becomes really clear.
Yeah, you've got the issue of access control.
You want each environment to be its own little island.
So walled gardens for every developer.
Exactly.
So no peeking at your neighbor's code.
And you need to control who can access what.
Kubernetes has these things called network policies.
They're for defining fine-grained rules
about what traffic can flow within the cluster.
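As a concrete picture of what "fine-grained rules" means here, this is a small Go sketch that builds a NetworkPolicy allowing workspace pods to receive traffic only from a proxy, and prints it as YAML. The labels, names, and namespace are invented for illustration; it shows the shape of the Kubernetes API object, not Gitpod's actual policy, and it assumes the k8s.io/api, k8s.io/apimachinery, and sigs.k8s.io/yaml modules are available.

```go
package main

import (
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Hypothetical policy: workspace pods only accept ingress from the proxy.
	policy := networkingv1.NetworkPolicy{
		TypeMeta:   metav1.TypeMeta{APIVersion: "networking.k8s.io/v1", Kind: "NetworkPolicy"},
		ObjectMeta: metav1.ObjectMeta{Name: "isolate-workspaces", Namespace: "workspaces"},
		Spec: networkingv1.NetworkPolicySpec{
			// Applies to every pod labelled as a workspace.
			PodSelector: metav1.LabelSelector{MatchLabels: map[string]string{"component": "workspace"}},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					// Only the ingress proxy may talk to workspace pods.
					PodSelector: &metav1.LabelSelector{MatchLabels: map[string]string{"component": "ws-proxy"}},
				}},
			}},
		},
	}

	out, err := yaml.Marshal(policy)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```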
Sounds great, but even those cause headaches for Gitpod.
Of course they did.
So what was their initial approach?
So they started using Kubernetes services and an ingress proxy.
It's to manage access to individual environment ports.
Think your IDE or services running within the workspace.
But as they scaled, this approach became unreliable.
Because more users equals more complexity equals more things
that can go wrong.
Exactly.
With thousands of environments running simultaneously,
name resolution started failing.
Sometimes, it even crashed entire workspaces.
Even established Kubernetes features
have their limits when you push them to the extreme.
It's a good reminder that scaling isn't just
about making things bigger.
No.
It's about making sure they can handle all the complexity that
comes with size.
OK, so resource constraints, another area
where Gitpod faced challenges: network bandwidth sharing.
It's like having multiple apartments sharing
the same internet connection, and everyone
wants to stream movies at the same time.
Yeah, just like CPU and memory, you've
got multiple workspaces on a node, all competing
for that same network pipe.
Some container network interfaces, or CNIs,
have features for network shaping,
but that adds even more complexity.
And then there's the question of fairness.
How do you divide up that bandwidth
so everyone gets a decent slice?
It's a never-ending battle.
Balancing performance, security, making
the most of your resources.
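To make "network shaping" less abstract, here's the kind of thing a CNI plugin or node agent might do under the hood, sketched as a Go wrapper around the tc command: cap egress on an interface with a token bucket filter. The interface name and the rate, burst, and latency numbers are placeholders, and real CNIs apply this per pod with far more care.

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Illustrative only: cap egress on one interface to 100 Mbit/s using a
	// token bucket filter (tbf). Interface and numbers are placeholders;
	// needs root and the iproute2 tools installed.
	args := []string{
		"qdisc", "add", "dev", "eth0", "root",
		"tbf", "rate", "100mbit", "burst", "64kb", "latency", "400ms",
	}
	if out, err := exec.Command("tc", args...).CombinedOutput(); err != nil {
		log.Fatalf("tc failed: %v\n%s", err, out)
	}
	log.Println("egress on eth0 capped at 100mbit")
}
```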
And that brings us to, I think, one of the hairiest topics.
Security.
Specifically in the context of developer environments.
How do you give developers the freedom
they need without creating a security nightmare?
This is where the tension between flexibility and control
really comes in.
It gets complicated.
So they start by highlighting this naive approach.
Just give everyone root access to their containers.
Seems simple, right?
Yeah, just give everyone the keys to the kingdom.
What could go wrong?
Well, aside from being a security disaster waiting
to happen, giving users root in their containers
basically gives them root on the node itself.
That means they can potentially snoop around
in other environments that are running on the same node.
They could mess with the infrastructure.
Yeah, not good.
Not exactly what you want.
Not stable.
So they needed something more sophisticated.
Enter user namespaces.
So this is a Linux kernel feature
that lets you map user and group IDs inside containers.
So you can basically make a user feel
like they have root privileges within their environment,
but without actually giving them control over the host system.
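Here's a tiny Go illustration of that mapping trick, not Gitpod's implementation: it launches a shell in a new user namespace where your unprivileged host UID shows up as UID 0 inside, which is exactly the "feels like root, isn't root on the host" idea. Linux only.

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Start a shell in a fresh user namespace (Linux only).
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER,
		// Map UID/GID 0 inside the namespace to our unprivileged IDs outside.
		UidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getuid(), Size: 1}},
		GidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getgid(), Size: 1}},
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
	// Inside the shell, `id` reports uid=0, but on the host you're still you.
}
```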
OK, that sounds clever, but I bet it wasn't easy to set up.
You bet it wasn't.
Kubernetes did eventually add support for user namespaces
in version 1.25, but Gitpod had already
started their own implementation with version 1.22.
And let me tell you, their solution
involves some serious technical gymnastics.
Give us the highlights.
What kind of gymnastics?
Well, for starters, they had to implement something
called file system UID shifting.
This ensures that files that are created inside the container
are mapped correctly to user IDs on the host system.
So it prevents any security bypasses.
They tried a bunch of different approaches,
like shiftfs, FUSE overlayfs, even idmapped mounts.
Each of those had their own quirks
in terms of performance and compatibility.
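To picture what UID shifting is solving, here's the naive, brute-force version of the idea in Go: walk a workspace directory and offset every file's owner by a fixed amount so container UIDs line up with host UIDs. The path and offset are made up, and the mount-time approaches mentioned above (shiftfs, FUSE overlayfs, idmapped mounts) exist precisely because chown-ing gigabytes of source like this is slow.

```go
package main

import (
	"io/fs"
	"log"
	"os"
	"path/filepath"
	"syscall"
)

func main() {
	// Naive illustration only: shift ownership of everything under a
	// workspace by a fixed offset, e.g. container UID 0 -> host UID 100000.
	// Needs root (or CAP_CHOWN) and a Linux host; the path is a placeholder.
	const offset = 100000
	root := "/workspaces/demo"

	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		st, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			return nil
		}
		// Shift both UID and GID; Lchown so symlinks aren't followed.
		return os.Lchown(path, int(st.Uid)+offset, int(st.Gid)+offset)
	})
	if err != nil {
		log.Fatal(err)
	}
}
```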
It sounds like they were really pushing the limits of what
Kubernetes could do, trying to fit a square peg
into a round hole.
Exactly.
And then there was a challenge of mounting
what they call a masked proc file system.
So usually when a container starts up, it mounts proc.
This gives it access to information
about the host system.
But for Gitpod's security model, proc
had to be hidden to prevent vulnerabilities.
So they had to create this custom masked proc
and then carefully move it into the right mount
namespace for each container.
And they did this using seccomp notify,
which is like a super low level way to intercept and modify
system calls.
Pretty hardcore stuff.
Wow, it's like they're doing brain surgery on Kubernetes
to make it work.
Pretty much.
But wait, there's more.
They also needed to add support for FUSE,
the filesystem-in-userspace interface.
Yeah.
A lot of developer tools rely on that.
So this involved messing with the container's eBPF device
filter, another low level tweak.
And then there's the issue of network capabilities.
Right.
So as root, you have these powerful capabilities
like CAP_NET_ADMIN and CAP_NET_RAW.
They let you control networking.
Right.
So giving those to a container would totally
break their security model.
Yeah.
So how did they get around that?
Well, they ended up creating another network namespace,
but this time inside the Kubernetes container.
Initially, they used slirp4netns.
And then they switched to veth pairs and custom nftables
rules.
It's like they were building a secure little networking
sandbox within another sandbox.
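For a rough idea of what building that inner sandbox looks like, here's a sketch that shells out to iproute2 to create a network namespace and a veth pair bridging into it. The names and addresses are placeholders, and Gitpod's real setup did this inside the container and layered custom nftables rules on top.

```go
package main

import (
	"log"
	"os/exec"
)

func run(args ...string) {
	if out, err := exec.Command(args[0], args[1:]...).CombinedOutput(); err != nil {
		log.Fatalf("%v failed: %v\n%s", args, err, out)
	}
}

func main() {
	// Illustrative only; needs root. Creates namespace "ws0" with a veth
	// pair: veth-host stays outside, veth-ws moves inside the namespace.
	run("ip", "netns", "add", "ws0")
	run("ip", "link", "add", "veth-host", "type", "veth", "peer", "name", "veth-ws")
	run("ip", "link", "set", "veth-ws", "netns", "ws0")

	run("ip", "addr", "add", "10.100.0.1/24", "dev", "veth-host")
	run("ip", "link", "set", "veth-host", "up")

	run("ip", "netns", "exec", "ws0", "ip", "addr", "add", "10.100.0.2/24", "dev", "veth-ws")
	run("ip", "netns", "exec", "ws0", "ip", "link", "set", "veth-ws", "up")
	run("ip", "netns", "exec", "ws0", "ip", "link", "set", "lo", "up")

	log.Println("namespace ws0 wired up; nftables rules would police its traffic")
}
```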
It's amazing how much work they put into making this all work.
It really is.
But all this complexity comes with a price, right?
You've got performance hits, especially
with the earlier solutions.
You've got compatibility issues with certain tools.
And then the never-ending struggle
to keep up with Kubernetes updates.
So you can see why they started looking for alternatives.
And that's where their exploration of micro VMs comes
in.
But we're going to save that for part two.
Stay tuned, folks.
Things get really interesting.
Welcome back.
If you're just tuning in, we're talking
about Gitpod's journey, how they went from Kubernetes fans
to creating their own system for developer environments.
Yeah, it got to the point where they were willing to try
anything, even something completely
different from Kubernetes.
Right.
So that's where micro VMs come in.
Now, for those of us who aren't living
in the infrastructure world, can you give us a micro VMs 101?
What are they?
And why was Gitpod so interested?
Sure.
So think of micro VMs like tiny specialized virtual machines,
right?
Strip down to just the essentials.
They boot up super fast, small footprint,
and security is kind of baked into their design.
Gitpod was looking at technologies
like Firecracker, Cloud Hypervisor, QEMU.
So what was it about micro VMs that they were so excited about?
What problems were they hoping to solve that Kubernetes just
wasn't cutting it for?
Well, first and foremost, better resource isolation.
Unlike containers, which share the host's kernel, micro VMs,
they get their own dedicated kernel.
So that means less chance of one environment interfering
with another, more predictable performance overall.
So no more laggy terminal, because your IDE is fighting
some compiler process for CPU.
Exactly.
Another big plus, memory snapshots, near instant resume.
With something like Firecracker, you
can take a snapshot of the entire VM's memory state,
and that includes everything that's running.
You can restore it in an instant.
Wait, so you're saying you could literally
pause your whole developer environment, mid-debug
session, coffee break, whatever, and come back to it
exactly as you left it.
That's the power of micro VMs.
Imagine the productivity boost, especially
for large projects, complex projects,
where restarting everything can take forever.
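To make the snapshot idea concrete, here's roughly what that looks like against Firecracker's API socket, sketched in Go: pause the microVM, then ask it to write a full snapshot (VM state plus guest memory) to disk. The socket and file paths are placeholders, and this is based on Firecracker's documented snapshot endpoints rather than anything Gitpod published.

```go
package main

import (
	"bytes"
	"context"
	"log"
	"net"
	"net/http"
)

// client speaks HTTP over Firecracker's unix API socket.
func client(socket string) *http.Client {
	return &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return net.Dial("unix", socket)
		},
	}}
}

func send(c *http.Client, method, path, body string) {
	req, err := http.NewRequest(method, "http://localhost"+path, bytes.NewBufferString(body))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Printf("%s %s -> %s", method, path, resp.Status)
}

func main() {
	c := client("/tmp/firecracker.sock") // placeholder socket path

	// Pause the microVM, then write a full snapshot: VM state + guest memory.
	send(c, "PATCH", "/vm", `{"state": "Paused"}`)
	send(c, "PUT", "/snapshot/create", `{
		"snapshot_type": "Full",
		"snapshot_path": "/snapshots/ws-demo.vmstate",
		"mem_file_path": "/snapshots/ws-demo.mem"
	}`)
}
```

Resuming later is the mirror image: boot a fresh Firecracker process and load the snapshot, which is what makes the "come back exactly where you left off" experience possible.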
Yeah, that's a feature I think a lot of developers would love.
For sure.
But I'm guessing there were some downsides, right?
Otherwise, Gitpod would have just switched over
and called it a day.
Of course, no technology is perfect.
One challenge was overhead.
Even though micro VMs are lightweight
compared to like traditional VMs,
they still add more overhead than containers.
And that impacts performance, resource utilization,
which for a platform like Gitpod is a huge deal.
Right, because they're running thousands, if not millions,
of these environments.
Exactly.
Every little bit of efficiency matters.
Another hurdle was image conversion.
Most developer tools, they come packaged as container images
using the OCI standard.
Kubernetes loves that.
But to use those images in a micro VM,
you have to convert them to a format
that the micro VM understands, that adds complexity
and slows down startup.
Right, so it's not just as simple
as swapping out Kubernetes and plugging in micro VMs.
No, it's a whole translation process.
And then there are some limitations
that are specific to micro VM technologies themselves.
For example, Firecracker, which is known for its speed
and its snapshotting.
Well, at the time, it didn't support
GPUs, which is a deal breaker if you're
working on graphics intensive applications.
OK, so even cutting edge technology
has its limitations.
What else did they run into?
Well, data movement became a much bigger problem.
With micro VMs, you're dealing with whole VM images,
including those memory snapshots, which
can be pretty large.
Moving them around, whether it's for backups or scheduling,
gets more complex and it takes more time.
And I bet storage, which was already a pain point,
became even more of a headache.
You got it.
They tried attaching EBS volumes,
that's elastic block storage, from AWS to their micro VMs,
thinking that they could improve startup times
and reduce network strain by keeping the workspace
data local.
But then you run into all these performance quotas, latency
issues, and just the challenge of scaling that approach
across a huge platform.
So kind of swapping one set of problems for another.
In a way.
But the micro VM detour, it wasn't a dead end at all.
It was really a turning point in their thinking.
First, it really solidified their commitment
to things like full workspace backup
and being able to suspend and resume environments.
So that became a must have.
Exactly.
It was non-negotiable.
But maybe more importantly, this experiment
made them really consider moving away from Kubernetes.
Trying to shoehorn these micro VMs into the Kubernetes world
made them realize that there might be a better way.
A way where they weren't constantly fighting
the limitations of the platform.
So it's like, those micro VMs were the gateway drug
to their Kubernetes exodus.
I like that analogy.
It's perfect.
They got a taste of something different.
And they realized maybe they didn't need Kubernetes
after all.
OK, so after all that experimenting,
what was their final move?
Did they find the solution they were searching for?
They did.
They built their own system called Gitpod Flex.
It's designed from the ground up to be like the perfect home
for developer environments.
Taking the best of what they learned
and leaving the Kubernetes baggage behind.
All right, so this is where it gets really interesting.
Tell me more about Gitpod Flex.
What makes it so special?
Well, it's not a complete rejection of Kubernetes, right?
They kept some of the core principles.
For example, declarative APIs are still
a core part of Gitpod Flex.
Remember all those YAML files in Kubernetes?
Yeah.
Defining your infrastructure as code.
Well, that's still there.
OK.
But in a more streamlined and targeted way.
So you still get those benefits of infrastructure as code
without all the complexity.
Right.
And they also kept the use of control theory
for resource management.
This basically means they're using fancy algorithms
to automatically adjust resource allocation based on what's
happening in real time.
OK.
Kind of like Kubernetes auto scaling,
but tailored for how developer environments actually behave.
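As a stylized picture of what "control theory for resource management" can mean, here's a tiny proportional controller in Go: watch observed CPU usage and nudge the allocation toward a target utilization each tick. The gains and numbers are invented for illustration; the real system is obviously more involved than this.

```go
package main

import "fmt"

// step nudges the CPU allocation so observed usage sits near the target
// utilization. A pure proportional controller: correction = gain * error.
func step(allocated, used, targetUtil, gain float64) float64 {
	errTerm := used/allocated - targetUtil
	next := allocated * (1 + gain*errTerm)
	if next < 0.5 { // never shrink below half a core (illustrative floor)
		next = 0.5
	}
	return next
}

func main() {
	alloc := 2.0                                // cores currently allocated
	usage := []float64{0.3, 0.5, 1.8, 3.5, 1.0} // observed cores used per tick

	for i, u := range usage {
		alloc = step(alloc, u, 0.7, 0.5)
		fmt.Printf("tick %d: used %.1f cores -> allocate %.2f cores\n", i, u, alloc)
	}
}
```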
Right.
So even though it sounds complex under the hood,
what does this mean for developers
who are using Gitpod Flex?
What's the experience like?
Well, one big plus is the seamless integration
with dev containers.
These are like pre-configured, self-contained developer
environments, all the tools, libraries, dependencies,
all bundled up for specific projects.
So it's like a recipe for your perfect developer environment,
just add code.
Exactly.
And Gitpod Flex makes it super easy to spin those up.
They've also really doubled down on self-hosting.
So remember, Gitpod used to offer a cloud and a self-managed
version.
And they said that the self-managed version, which
was heavily Kubernetes-based, was a real pain to support.
Right.
Well, with Gitpod Flex, self-hosting is super easy.
You can have it up and running in less than three minutes
on pretty much any infrastructure.
Three minutes?
That's faster than it takes to order a pizza.
It really is.
And that opens up a lot of possibilities.
Companies can now run their developer environments closer
to their data, even on premises if they need to.
Gives them more control over security, compliance, all
that stuff.
So flexibility and control are really key here.
But what about performance?
All those Kubernetes headaches, the CPU throttling, storage
bottlenecks, all those things.
Have they managed to get rid of those with Gitpod Flex?
That was one of their main goals.
And from what they've said, it seems
like they made a lot of progress.
By moving away from that shared kernel model of containers
and giving each environment its own dedicated resources,
they've managed to smooth out a lot of those performance
hiccups.
So each environment gets its own slice of the pie.
Exactly.
Now what about that memory snapshot feature
that they were so keen on with micro VMs?
Did that make it into Gitpod Flex?
So they haven't specifically said,
but knowing how much they care about making developer
environments stateful friendly, I
wouldn't be surprised if they're working on it.
Fingers crossed.
Right, because it fits perfectly with their vision.
OK, let's talk about security.
We know they put a ton of effort into securing
their Kubernetes setup.
Oh, yeah.
But it always felt like they were swimming upstream.
Right.
What's the story with Gitpod Flex?
Did they manage to make it simpler but also more secure?
Well, security is kind of baked into Gitpod Flex
from the very beginning.
They went all in on a zero trust architecture.
That basically means no user, no device, no request
is automatically trusted.
Everything has to be authenticated, authorized,
every step of the way.
Fort Knox for code.
Exactly.
This approach kind of avoids a lot of the vulnerabilities
they were dealing with in Kubernetes.
Right.
No more messing around with user namespaces or containers
breaking out of their isolation.
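Here's the flavor of that "nothing is trusted by default" rule as a toy Go HTTP middleware: every single request has to present a valid token before it reaches anything, with no network-location exceptions. It's an illustration of the principle, not Gitpod Flex's actual auth stack, and the token check is a placeholder where a real system would verify a signed identity token against an identity provider.

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

// requireAuth rejects any request without a valid bearer token.
// In a real zero-trust setup this would verify a signed identity token
// against your identity provider; here we check a placeholder value.
func requireAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		if token != "demo-token" { // placeholder check only
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/environments", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("your environments\n"))
	})

	// Every route goes through the auth check; nothing is implicitly trusted.
	log.Fatal(http.ListenAndServe(":8080", requireAuth(mux)))
}
```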
So more secure and easier to manage.
That's the goal.
That's the dream.
Right.
And they've also made it much easier for companies
to apply their own security policies within Gitpod Flex.
So they can hook it into their existing identity management
systems.
They can really control who has access to what.
And they can monitor everything.
So they really put security front and center
from the beginning.
They did.
And it just shows how Gitpod Flex is really built for this.
It's not just about running code.
It's about creating this space where developers
can be productive, collaborative, and secure.
So after this whole journey, what's
the big takeaway here?
What can we learn from their experience?
Welcome back to the Deep Dive.
We've been talking all about Gitpod's journey,
from Kubernetes lovers to creating Gitpod Flex,
their own custom system.
Yeah, it shows that sometimes the most popular solution
isn't always the right one.
They realized Kubernetes just wasn't the right tool
for what they needed.
And they had the guts to go and do their own thing.
Exactly.
So in this final part, let's kind
of dig into what makes Gitpod Flex tick.
What were some of the architectural decisions
they made?
What are the features that really set it apart?
So one of the first things to understand
is that it's not a total rejection of Kubernetes.
They kept some of the core principles.
For example, declarative APIs are still
a big part of Gitpod Flex.
Remember all that YAML configuration
we talked about in Kubernetes?
That approach is still there, but it's a lot more streamlined,
more focused.
So you're still defining your infrastructure as code
without all that Kubernetes baggage.
Exactly.
And they also kept the use of control theory
for resource management.
Basically, this means that they're using these smart
algorithms to automatically adjust resource allocation
based on what's needed in real time,
kind of like Kubernetes auto-scaling, but again,
tailored for developer environments.
Right.
So even though it might sound kind of complex under the hood,
what does it mean for developers who are actually using Gitpod
Flex?
Well, one big benefit is the seamless integration
with dev containers.
These are basically like pre-configured, self-contained
developer environments.
You've got all your tools, libraries, dependencies,
all bundled together for specific projects.
So it's like a recipe for your perfect developer environment.
You just add code.
Exactly.
And Gitpod Flex makes it super easy to just spin those up.
And remember how they were struggling
with self-hosting their platform on Kubernetes?
Yeah.
With Gitpod Flex, self-hosting is incredibly easy.
You can have it up and running in under three minutes
on pretty much any infrastructure.
Three minutes.
That's faster than making a cup of coffee.
Pretty much.
And that opens up a lot of possibilities.
Companies can run their developer environments
closer to their data, even on premises, if they need to.
Gives them more control over security, compliance,
all that good stuff.
So flexibility and control are key here.
What about performance?
They had all those struggles with Kubernetes, CPU
throttling, storage bottlenecks, all those things.
Did they manage to fix those with Gitpod Flex?
That was definitely a top priority for them.
And it seems like they've made some major progress.
By ditching the whole shared kernel model of containers
and giving each environment its own dedicated resources,
they've managed to smooth out a lot of those performance issues.
So no more fighting over resources.
Right.
Every environment gets its own slice of the pie.
Now, what about that memory snapshot feature
that they were so excited about during the micro VM phase?
You know, the one where you could just pause and resume
your entire environment in a snap?
Did that make it into Gitpod Flex?
They haven't explicitly said, but I
wouldn't be surprised if they found a way to make it work.
It really aligns with their goal of making a system that's
truly developer friendly.
Fingers crossed.
OK, let's talk about security.
We know that they put a ton of effort
into securing their Kubernetes setup,
but it felt like they were constantly
fighting an uphill battle.
What's the security story with Gitpod Flex?
Well, security is a core part of Gitpod Flex.
They decided to go all in on a zero trust architecture, which
means that nothing is automatically trusted.
Every user, every device, every request
has to be authenticated and authorized
every step of the way.
So it's like Fort Knox for your code.
Exactly.
And this approach kind of eliminates
a lot of those vulnerabilities that they were always
struggling with in Kubernetes.
No more complex user namespaces or containers breaking out
of their isolation.
So more secure and easier to manage.
It sounds almost too good to be true.
Well, it shows what's possible when
you build a system that's designed for these requirements
from the ground up.
They've also made it a lot easier for companies
to integrate their own security policies into Gitpod Flex,
connecting it with their existing identity management
systems, setting fine grained access controls,
monitoring everything in real time.
So they're giving companies the tools
they need to make sure that everything's locked down.
Exactly.
And this really highlights what Gitpod Flex is all about.
It's not just a platform to run code.
It's an environment that's built to support developers.
A place where they can be productive,
they can be collaborative, and most importantly, secure.
So after this whole journey, what's the big takeaway?
What can we learn from their experience?
I think it's a reminder that sometimes you
have to go against the grain.
The most popular solution isn't always the best, right?
It's about understanding what you need, what your goals are,
and then finding the tools that fit,
even if it means building something yourself.
It's a story about challenging assumptions
and being willing to experiment and having the courage
to try something new when the old way just isn't working.
It really is.
And it makes you wonder, in our own work,
are we forcing tools into roles they weren't meant for?
Are there other systems out there
that could benefit from a similar rethink,
like what Gitpod did?
That's a great question for all of us to think about.
This has been a really interesting deep dive exploring
developer environments and how Gitpod
built this innovative solution.
In this world of technology that's always changing,
being willing to adapt, to experiment,
to break away from the norm, well,
that can lead to some amazing breakthroughs.
Thanks for joining us on the deep dive.