Today's Deep-Dive: Checkmk
Ep. 260

Episode description

This deep dive discusses the latest release of Checkmk, version 2.4, focusing on its capabilities in simplifying IT monitoring for modern, complex, and often cloud-based infrastructures. The update introduces features like Quick Setup for easy cloud service monitoring and dynamic host management to handle the constant changes in IT environments. It also includes OpenTelemetry for deep application insights and synthetic monitoring for user experience verification. Checkmk RAW, a free and open-source version, is highlighted as an ideal starting point for those new to serious monitoring. The document emphasizes the shift in IT roles from operational tasks to analytical ones, enabled by the automation features of Checkmk 2.4.

0:00

Welcome back to the deep dive. We're here again taking a big stack of sources on

0:03

some pretty complex tech and really trying to boil it down to what you actually

0:07

need to know today. We are diving into the world of IT monitoring, which, let's be

0:11

honest, for a lot of folks is maybe the biggest source of operational anxiety out

0:15

there.

0:16

We're zooming in on the latest release of a major system called Checkmk,

0:19

specifically version 2.4. And our mission really is to figure out how this update

0:24

makes monitoring modern, complex, often cloud-based infrastructures, well, simpler.

0:29

Especially if you're kind of new to this whole area, because monitoring today's IT,

0:33

I mean, things change constantly, hourly sometimes. It can feel like trying to map

0:36

a river delta while the water's still carving new paths. So the goal is

0:40

accessibility. How does Checkmk 2.4 actually help IT pros stay in control?

0:46

You know, control of cloud containers, microservices, all that stuff without just

0:49

drowning in configuration details.

0:50

Okay, let's unpack this. But just before we get into the weeds of dynamic

0:54

environments and all that, a quick word about the folks who make this deep dive

0:56

possible.

0:57

This program gets support from Safeserver.

0:58

They specialize in hosting software, and really importantly, helping companies with

1:02

their digital transformation.

1:04

So if you're looking to deploy or manage or maybe scale up your monitoring solution,

1:09

or if you just need some expert advice on moving your infrastructure, they're the

1:12

people to talk to.

1:13

You can find out more about what they offer at www.safeserver.de. Again, that's www.safeserver.de.

1:21

Yeah, and historically, Checkmk really built its name as this comprehensive and

1:26

super reliable IT monitoring system.

1:28

People like it because it scales really well, it's flexible, and, maybe a huge plus

1:32

for anyone running big systems,

1:33

it uses very few resources, low consumption, and it covers everything, you know,

1:38

from the physical servers or virtual machines right up to the applications

1:42

running on top. But like you said, the big challenge now isn't just monitoring

1:46

static stuff, it's managing constant change, this volatility.

1:49

The IT world is just fast and fluid now. So version 2.4, it's really engineered

1:54

specifically to tackle those modern headaches, speed and complexity.

1:58

The idea is to streamline things, so you spend less time messing with the tool and

2:02

more time actually looking at what the data means for the business.

2:05

OK, let's start with that immediate pain point, especially for someone new to

2:09

monitoring or maybe just new to the cloud itself. Setup.

2:12

If you've ever tried setting up monitoring for AWS or Azure or GCP, it just feels

2:17

like this maze of manual steps.

2:19

You're setting permissions, wrestling with service accounts. It's often where

2:23

beginners just get completely stuck.

2:25

So if I'm listening right now, maybe dealing with cloud scale, what's the instant

2:30

relief Checkmk 2.4 offers for that setup nightmare?

2:34

Right. It's a feature called Quick Setup. And honestly, it completely changes the

2:38

game for onboarding cloud services.

2:40

Instead of spending, you know, hours digging through docs and manually clicking

2:44

through dozens of settings, Quick Setup basically turns that whole complex mess

2:48

into something you can configure in, like, minutes.

2:51

Yeah. It's a huge step forward for simplification because it automatically sets up

2:56

monitoring for some really complex cloud services, the ones that are usually a

3:00

massive headache.

3:01

We're talking things like Azure SQL databases, serverless functions, or those

3:04

managed Kubernetes services, you know, AKS on Azure, EKS on Amazon, GKE on Google.

3:09

Whoa, hold on a sec. You just rattled off AKS, EKS and GKE. For someone who knows

3:15

basic servers, but maybe isn't deep into cloud native stuff yet, why is monitoring

3:21

those specifically so tricky and how does Quick Setup actually handle that?

3:25

Well, the tricky part is the abstraction, right? With a managed service, you don't

3:29

see the servers underneath. You can't just pop an agent on there. You have to talk

3:33

to these specific cloud APIs to get performance data out and each cloud provider,

3:38

each service does it a bit differently.

3:40

Quick Setup basically acts as the translator in the configuration wizard. It

3:45

automatically knows how to fetch the right API keys, apply the correct permissions,

3:49

find the right metric endpoints. It handles all that behind the scenes.

3:53

And crucially, this is really important, it has built-in connection tests and validations.

3:57

So when you click save in Quick Setup, you're not just hoping it works. The system

4:02

actually confirms the connection is live and that data is flowing properly. It gets

4:06

rid of that horrible uncertainty you usually have with initial cloud setups. You

4:09

get reliable visibility like right away.

4:12

Okay, that tackles the getting started complexity. That's huge. But monitoring isn't

4:17

just set it and forget it, is it? What about the volatility? We see these

4:21

environments now, maybe heavily virtualized, or definitely containers and Kubernetes,

4:25

where hosts are just constantly popping up, scaling, disappearing. If your

4:30

monitoring needs someone to manually update it every single time a new service

4:33

spins up, well, you're just always playing catch up. So what's the answer for

4:37

managing that constant real time change?

4:39

Yeah, this is where, let's say, the core architecture of 2.4 really comes into

4:44

its own through its enhanced and expanded dynamic host management. This feature is

4:48

absolutely key to handling that volatility, making sure your monitoring map

4:53

actually reflects reality second by second.

4:56

Put simply, Checkmk just automatically adapts to your infrastructure in real

5:00

time. It hooks into things like Kubernetes or vCenter or cloud APIs, and it detects

5:05

changes the moment they happen. So if a Kubernetes pod scales up or a new VM gets

5:10

provisioned, Checkmk just automatically adds it to monitoring. And just as

5:14

importantly, when that host vanishes or gets decommissioned, Checkmk reliably

5:18

removes it. It basically eliminates the need for a human to step in.
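
To make that add-and-remove cycle concrete, here is a minimal Python sketch of the general reconciliation idea: compare what the infrastructure source reports with what is currently monitored, then compute the difference. This is not Checkmk's implementation or API; the function name `reconcile` and the host names are hypothetical and exist only for illustration.

```python
def reconcile(discovered: set[str], monitored: set[str]) -> tuple[set[str], set[str]]:
    """Return (hosts to add, hosts to remove) for one sync cycle.

    `discovered` stands in for whatever a source such as Kubernetes or
    vCenter currently reports; `monitored` is what the monitoring system
    already tracks. Both are hypothetical inputs for this sketch.
    """
    to_add = discovered - monitored      # new hosts that are not monitored yet
    to_remove = monitored - discovered   # hosts that no longer exist
    return to_add, to_remove


if __name__ == "__main__":
    discovered = {"web-1", "web-2", "db-1"}   # illustrative host names
    monitored = {"web-1", "db-1", "old-vm"}
    to_add, to_remove = reconcile(discovered, monitored)
    print("add:", sorted(to_add))
    print("remove:", sorted(to_remove))
```

Running a comparison like this on every sync keeps the monitored set aligned with reality without anyone editing host lists by hand.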

5:22

OK, that sounds brilliant for cutting down on alert noise, not monitoring hosts

5:27

that aren't even there anymore.

5:28

That gives you a much cleaner picture.

5:30

But hang on, if we're talking thousands of hosts changing per minute, which is

5:33

totally realistic in big cloud setups,

5:35

how does Checkmk deal with that essential problem of configuration drift?

5:39

How do you make sure the monitoring settings stay consistent across all those

5:42

temporary hosts?

5:43

That's a really good and necessary question.

5:46

It works using automation templates or rules.

5:49

You define rules based on host characteristics, maybe something like all hosts

5:53

tagged Kubernetes front-end

5:54

or all Windows servers in production.

5:56

And Checkmk then ensures that the right monitoring checks and configurations get

6:01

applied automatically

6:02

and consistently the instant that host is detected.

6:05

It's this templating engine that allows it to handle that massive scale you

6:08

mentioned,

6:09

thousands of changes per minute, while keeping everything stable and consistent,

6:12

something you just couldn't possibly do manually.
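
The rule idea described here can be sketched in a few lines of Python. This is an illustration of rule matching on host characteristics in general, not Checkmk's rule engine; the `Rule` class, the label keys, and the check names are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    conditions: dict[str, str]   # labels a host must carry for the rule to apply
    checks: tuple[str, ...]      # checks to attach when the rule matches


def checks_for(host_labels: dict[str, str], rules: list[Rule]) -> list[str]:
    """Collect the checks of every rule whose conditions all match the host."""
    selected: list[str] = []
    for rule in rules:
        if all(host_labels.get(key) == value for key, value in rule.conditions.items()):
            for check in rule.checks:
                if check not in selected:   # avoid duplicates across rules
                    selected.append(check)
    return selected


# Hypothetical rules mirroring the two examples mentioned in the conversation.
RULES = [
    Rule({"role": "kubernetes-frontend"}, ("http_latency", "pod_restarts")),
    Rule({"os": "windows", "env": "production"}, ("windows_services", "disk_space")),
]

if __name__ == "__main__":
    new_host = {"role": "kubernetes-frontend", "env": "production"}
    print(checks_for(new_host, RULES))
```

Because the rules are evaluated whenever a host appears, every short-lived host with the same labels gets the same checks, which is what keeps configuration from drifting.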

6:15

And the impact is pretty profound, right?

6:17

By automating all that host lifecycle stuff, your IT team stops being constantly

6:22

reactive,

6:23

you know, chasing down config errors, manually adding things.

6:26

They become proactive, focusing just on the critical performance data the system

6:30

brings to the surface.

6:31

OK, so we know the server or the container or whatever is there, and we know it's

6:35

running.

6:36

But that's often not enough these days, is it?

6:38

Here's where it gets really interesting.

6:40

We need to shift the conversation, right?

6:42

From just infrastructure health, like, is the CPU OK, to actual application health.

6:48

Is the service doing what the user expects?

6:51

Is the customer happy?

6:53

And Checkmk 2.4 seems to tackle this with two main approaches.

6:56

OpenTelemetry for that deep internal view and synthetic monitoring for the user's

7:01

perspective.

7:02

Yeah, let's unpack OpenTelemetry first, or OTEL, as people call it.

7:06

For a beginner, maybe think of OTEL as a kind of standardized way for applications

7:10

to talk about themselves.

7:11

It's about getting really deep, granular visibility into the inner workings of your

7:15

software.

7:16

Sometimes right down to individual function calls or database queries.

7:20

So a lot of modern applications are now built to basically export OTEL data, or

7:24

maybe Prometheus metrics, which are similar.

7:26

Checkmk 2.4 includes an OTEL collector that can grab all this data, performance

7:30

metrics, traces, logs,

7:31

and it pulls it into the monitoring system, makes sense of it, and ties it back to

7:35

the right host or service.

7:36

And the benefit there is?

7:38

The benefit is huge. Instead of just getting a generic alert like "Service X is down,"

7:42

OTEL data can help you pinpoint exactly where the problem is.

7:45

Maybe a latency spike happened in one specific microservice or when calling an

7:50

external API.

7:51

You're monitoring performance, reliability, even root causes using data coming

7:55

directly from inside the application code itself.

7:58

That can slash troubleshooting time dramatically.
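
As a rough illustration of how trace data narrows a problem down, the Python sketch below takes a handful of spans from one request and finds the one with the most "self time", meaning time not accounted for by its child spans. The span dictionaries use a simplified, made-up shape rather than the real OpenTelemetry data model, and the sketch assumes child spans run one after another without overlapping.

```python
from collections import defaultdict


def self_times(spans: list[dict]) -> dict[str, int]:
    """Map span id to its own duration minus the durations of its direct children."""
    child_total: dict[str, int] = defaultdict(int)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["duration_ms"]
    return {span["id"]: span["duration_ms"] - child_total[span["id"]] for span in spans}


def slowest_span(spans: list[dict]) -> dict:
    """Return the span that spent the most time in its own work."""
    times = self_times(spans)
    worst_id = max(times, key=times.get)
    return next(span for span in spans if span["id"] == worst_id)


# Hypothetical trace of one checkout request across several services.
TRACE = [
    {"id": "a", "parent": None, "service": "frontend", "duration_ms": 1200},
    {"id": "b", "parent": "a", "service": "cart-service", "duration_ms": 150},
    {"id": "c", "parent": "a", "service": "payment-service", "duration_ms": 980},
    {"id": "d", "parent": "c", "service": "external-payment-api", "duration_ms": 900},
]

if __name__ == "__main__":
    worst = slowest_span(TRACE)
    print(worst["service"], self_times(TRACE)[worst["id"]], "ms")
```

In this made-up trace the frontend span is the longest overall, but almost all of that time belongs to the external API call, which is the kind of distinction a plain up/down alert cannot make.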

8:02

Now, there is a really crucial point here, a bit of a caveat we absolutely have to

8:06

mention.

8:06

For anyone thinking about rolling this out right now, the integrated OpenTelemetry

8:11

Collector in Checkmk 2.4,

8:14

it's currently marked as an experimental beta feature.

8:16

Oh, OK. Good to know.

8:17

Yeah. So while you should definitely test it, play with it, use it for evaluation,

8:22

maybe in non-critical spots,

8:23

it's not officially supported yet for like mission-critical production systems you

8:27

rely on heavily.

8:28

It's just about managing expectations while the feature matures.

8:31

Still very powerful, but beta means beta.

8:34

Got it. Transparency is key there.

8:36

So that internal view is great for developers and ops.

8:41

But as we all know, sometimes what the application thinks is happening and what the

8:46

user is actually experiencing,

8:49

well, they can be worlds apart.

8:51

How do we make sure things are functionally correct from the outside in from the

8:55

user's viewpoint?

8:56

Right. And that's exactly where synthetic monitoring comes in.

8:59

This system uses automated scripts. People often call them robots that act like

9:03

real users.

9:04

They simulate critical business processes, things like logging into a website,

9:08

searching the product catalog, adding something to a cart, maybe completing the

9:11

checkout.

9:12

It's essentially real end-to-end testing that verifies the application actually

9:16

works as expected under real world conditions.

9:18

Setting up those test robots, deploying them across different machines, that can

9:22

sometimes be a bit clunky, right?

9:23

Especially if you have Windows and Linux systems. Did 2.4 make that easier?

9:27

Yes, definitely. Checkmk 2.4 really streamlines that whole deployment pipeline.

9:32

Users can now upload their automated test scripts, maybe Robot Framework scripts,

9:36

or similar directly through the main Checkmk web interface.

9:40

So management is centralized and deploying them across all your different systems,

9:44

Linux and Windows, is handled really easily using something called the Agent Bakery.

9:49

The Agent Bakery, what's that?

9:50

The Agent Bakery is basically Checkmk's really powerful tool for building

9:55

customized monitoring agents.

9:57

Think of it like this. It packages up the standard Checkmk agent software, but it

10:01

also bundles in any custom configurations or

10:03

plugins you need, in this case, all the necessary bits and pieces to actually run

10:07

those synthetic monitoring test scripts.

10:09

It ensures you get a consistent deployment everywhere without manual installs on

10:14

each machine.

10:14

And this setup also solves a really common problem in, say, highly secure or

10:18

isolated environments like internal dev networks or test labs that might not have

10:23

Internet access.

10:23

Oh, interesting. How so?

10:25

Well, Checkmk supports running these synthetic tests on isolated nodes in a couple

10:29

of ways.

10:30

You can either package the whole test environment up as a ZIP file and upload that,

10:35

or you can manage the test execution using something called an RCC server.

10:40

The RCC server is basically a self-contained runtime environment provided by the

10:44

monitoring system.

10:45

It makes sure those test scripts run reliably, even if the machine they're on is

10:49

completely disconnected from the outside world.

10:51

OK, that's clever. And I find the way it handles the results really compelling.

10:55

You mentioned being able to track specific steps within a test,

10:58

like verifying if a support ticket was submitted successfully and turning that

11:02

specific step into a status indicator on a dashboard.

11:05

Exactly. That's a fundamental shift in how you report on this stuff.

11:09

Instead of just getting a vague alert saying login script failed,

11:13

you can configure Checkmk to look for specific keywords or actions within the

11:17

script's output.

11:19

So if the script successfully completes the add item to cart step, maybe that step

11:23

becomes a green light on your dashboard.

11:25

But if it fails later at the process payment stage, that specific action turns red.

11:30

It translates these complex script-based outcomes into really simple, easy to

11:35

understand status lights.

11:37

Something management can grasp instantly. It's very powerful.
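
The step-to-status-light idea can be sketched in Python as follows. The output format (`step name: PASS` or `step name: FAIL`, one per line) is a hypothetical one invented for this example, not the actual Robot Framework or Checkmk result format, and the `OK`/`CRIT` labels are simply used here as green and red.

```python
def parse_steps(output: str) -> dict[str, str]:
    """Turn 'step name: PASS/FAIL' lines into per-step status values."""
    states: dict[str, str] = {}
    for line in output.splitlines():
        name, separator, verdict = line.rpartition(":")
        if not separator:
            continue  # skip lines that do not describe a step result
        states[name.strip()] = "OK" if verdict.strip().upper() == "PASS" else "CRIT"
    return states


def overall(states: dict[str, str]) -> str:
    """The whole test is red as soon as any single step is red."""
    return "CRIT" if "CRIT" in states.values() else "OK"


# Hypothetical output of one synthetic test run.
SAMPLE_OUTPUT = """\
login: PASS
add item to cart: PASS
process payment: FAIL
"""

if __name__ == "__main__":
    states = parse_steps(SAMPLE_OUTPUT)
    for step, state in states.items():
        print(f"{step}: {state}")
    print("overall:", overall(states))
```

Reporting each step separately is what lets a dashboard show a green light for the cart and a red one for payment instead of a single failed script.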

11:40

OK, so let's recap the big innovations we've had, especially for someone starting

11:44

out.

11:44

We've got minute-level cloud setup with Quick Setup.

11:47

We've got automated dynamic host management handling all that infrastructure churn.

11:51

And then deep end-to-end application visibility using both OpenTelemetry and

11:55

synthetic monitoring.

11:57

So what does this all mean if I'm new to Checkmk, maybe new to serious monitoring

12:01

in general,

12:01

and I want to start using this tech? How do I pick my starting point? What are the

12:04

options?

12:05

Well, the best news probably for anyone looking for an entry point is Checkmk RAW.

12:10

And we really need to emphasize this. Checkmk RAW is completely free and open source.

12:15

It's under the GNU GPL v2 license. This is absolutely your foundation, the ideal

12:20

place to start.

12:21

Free and open source. So what do you get with that?

12:24

You get a lot. RAW includes the core Checkmk monitoring engine, the full web

12:28

interface we've been talking about,

12:30

support for both agent-based monitoring and agentless methods like SNMP or API

12:35

checks,

12:36

and you get immediate access to hundreds, literally hundreds, of official and

12:40

community-created plugins

12:41

for monitoring all sorts of devices and applications.

12:43

For smaller setups, maybe internal labs, or if your requirements are fairly basic,

12:48

RAW gives you incredible monitoring power for absolutely zero cost.

12:51

Okay, so RAW is the free starting point.

12:52

Yeah.

12:53

What about the other editions then, just for context?

12:55

Right. So the commercial editions, Checkmk Enterprise and Checkmk Cloud, they

13:00

basically build on top of RAW.

13:02

They add features aimed more at large scale, maybe regulated industries or complex

13:06

organizational needs.

13:08

For instance, if you need true distributed monitoring, managing multiple Checkmk

13:12

sites across different locations from one central point,

13:15

or if you need specific dashboards tailored for business managers or guaranteed

13:19

professional support,

13:20

that's where you'd look at Enterprise.

13:22

And the Cloud edition, which you can run self-hosted or get as a SaaS offering,

13:26

is specifically tuned for those super dynamic, ephemeral cloud-native environments

13:31

we talked about earlier.

13:32

It has specialized features for monitoring things at extreme scale.

13:35

So the choice usually comes down to complexity, scale, and your need for support.

13:39

But that solid foundation of reliable monitoring, that starts with RAW, which

13:43

anyone can just download and install today.

13:46

And installation itself is pretty flexible, too.

13:48

Standard Linux packages for various distros, Docker containers, even ready-to-go

13:52

virtual or physical appliances.

13:54

That flexibility really lowers the barrier to entry, doesn't it?

13:57

You can try the RAW edition, free and open source, or I think they even have online

14:00

demos you can play with

14:01

without installing anything at all, just to kick the tires.

14:04

So today, we really dove into Checkmk 2.4.

14:08

We saw how its focus on simplification aims to tackle some of the biggest

14:11

challenges in modern IT.

14:12

You get that super fast cloud setup, automatic management of dynamic infrastructure,

14:17

and that crucial end-to-end view of application health.

14:21

Yeah, and maybe just to leave you with a final thought to chew on, something

14:24

provocative, perhaps.

14:25

Given the level of automation we're talking about now, you know,

14:29

Checkmk 2.4 handling potentially thousands of infrastructure changes every minute,

14:35

automatically.

14:36

Just consider what that really means for the day-to-day work of an IT professional.

14:40

This kind of automation fundamentally shifts the role away from tedious, repetitive

14:44

tasks.

14:45

Things like manually registering hosts, updating configurations, chasing down minor

14:49

changes.

14:50

The human value gets elevated.

14:51

You're freed up to focus almost entirely on interpreting the critical performance

14:55

trends,

14:56

analyzing the data, and using those insights to drive business efficiency,

14:59

rather than just, you know, keeping the lights on.

15:01

That transition from being primarily an operator to becoming more of an analyst.

15:05

That, I think, is the true underlying impact of what 2.4 enables.

15:08

That's a really powerful perspective on the future of the IT role.

15:12

Excellent point.

15:13

Well, thank you for guiding us through that deep dive today.

15:15

And we absolutely want to thank our sponsor one last time, Safeserver, for making

15:19

this program possible.

15:21

Remember, you can find out more about their hosting and digital transformation

15:25

services at

15:25

www.safeserver.de.
