Today's Deep-Dive: Checkmk

0:00

Welcome back to the deep dive. We're here again taking a big stack of sources on

0:03

some pretty complex tech and really trying to boil it down to what you actually

0:07

need to know. Today, we are diving into the world of IT monitoring, which, let's be

0:11

honest, for a lot of folks is maybe the biggest source of operational anxiety out

0:15

there.

0:16

We're zooming in on the latest release of a major system called Checkmk,

0:19

specifically version 2.4. And our mission really is to figure out how this update

0:24

makes monitoring modern, complex, often cloud-based infrastructures, well, simpler.

0:29

Especially if you're kind of new to this whole area, because monitoring today's IT,

0:33

I mean, things change constantly, hourly sometimes. It can feel like trying to map

0:36

a river delta while the water's still carving new paths. So the goal is

0:40

accessibility. How does Checkmk 2.4 actually help IT pros stay in control?

0:46

You know, control of cloud, containers, microservices, all that stuff, without just

0:49

drowning in configuration details.

0:50

Okay, let's unpack this. But just before we get into the weeds of dynamic

0:54

environments and all that, a quick word about the folks who make this deep dive

0:56

possible.

0:57

This program gets support from Safe Server.

0:58

They specialize in hosting software, and really importantly, helping companies with

1:02

their digital transformation.

1:04

So if you're looking to deploy or manage or maybe scale up your monitoring solution,

1:09

or if you just need some expert advice on moving your infrastructure, they're the

1:12

people to talk to.

1:13

You can find out more about what they offer at www.safeserver.de. Again, that's www.safeserver.de.

1:21

Yeah, and historically, Checkmk really built its name as this comprehensive and

1:26

super reliable IT monitoring system.

1:28

People like it because it scales really well, it's flexible, and, maybe a huge plus

1:32

for anyone running big systems,

1:33

it uses very few resources, low consumption, and it covers everything, you know,

1:38

from the physical servers or virtual machines right up to the applications

1:42

running on top. But like you said, the big challenge now isn't just monitoring

1:46

static stuff, it's managing constant change, this volatility.

1:49

The IT world is just fast and fluid now. So version 2.4, it's really engineered

1:54

specifically to tackle those modern headaches, speed and complexity.

1:58

The idea is to streamline things, so you spend less time messing with the tool and

2:02

more time actually looking at what the data means for the business.

2:05

OK, let's start with that immediate pain point, especially for someone new to

2:09

monitoring or maybe just new to the cloud itself. Setup.

2:12

If you've ever tried setting up monitoring for AWS or Azure or GCP, it just feels

2:17

like this maze of manual steps.

2:19

You're setting permissions, wrestling with service accounts. It's often where

2:23

beginners just get completely stuck.

2:25

So if I'm listening right now, maybe dealing with cloud scale, what's the instant

2:30

relief Checkmk 2.4 offers for that setup nightmare?

2:34

Right. It's a feature called Quick Setup. And honestly, it completely changes the

2:38

game for onboarding cloud services.

2:40

Instead of spending, you know, hours digging through docs and manually clicking

2:44

through dozens of settings, Quick Setup basically turns that whole complex mess

2:48

into something you can configure in, like, minutes. In minutes?

2:51

Yeah. It's a huge step forward for simplification because it automatically sets up

2:56

monitoring for some really complex cloud services, the ones that are usually a

3:00

massive headache.

3:01

We're talking things like Azure SQL databases, serverless functions, or those

3:04

managed Kubernetes services, you know, AKS on Azure, EKS on Amazon, GKE on Google.

3:09

Whoa, hold on a sec. You just rattled off AKS, EKS and GKE. For someone who knows

3:15

basic servers, but maybe isn't deep into cloud native stuff yet, why is monitoring

3:21

those specifically so tricky, and how does Quick Setup actually handle that?

3:25

Well, the tricky part is the abstraction, right? With a managed service, you don't

3:29

see the servers underneath. You can't just pop an agent on there. You have to talk

3:33

to these specific cloud APIs to get performance data out and each cloud provider,

3:38

each service does it a bit differently.

3:40

Quick Setup basically acts as the translator in the configuration wizard. It

3:45

automatically knows how to fetch the right API keys, apply the correct permissions,

3:49

find the right metric endpoints. It handles all that behind the scenes.

3:53

And crucially, and this is really important, it has built-in connection tests and validations.

3:57

So when you click save in Quick Setup, you're not just hoping it works. The system

4:02

actually confirms the connection is live and that data is flowing properly. It gets

4:06

rid of that horrible uncertainty you usually have with initial cloud setups. You

4:09

get reliable visibility like right away.

4:12

Okay, that tackles the getting started complexity. That's huge. But monitoring isn't

4:17

just set it and forget it, is it? What about the volatility? We see these

4:21

environments now, maybe heavily virtualized, or definitely containers and Kubernetes,

4:25

where hosts are just constantly popping up, scaling, disappearing. If your

4:30

monitoring needs someone to manually update it every single time a new service

4:33

spins up, well, you're just always playing catch up. So what's the answer for

4:37

managing that constant real time change?

4:39

Yeah, this is where, let's say, the core architecture of 2.4 really comes into

4:44

its own through its enhanced and expanded dynamic host management. This feature is

4:48

absolutely key to handling that volatility, making sure your monitoring map

4:53

actually reflects reality second by second.

4:56

Put simply, Checkmk just automatically adapts to your infrastructure in real

5:00

time. It hooks into things like Kubernetes or vCenter or cloud APIs, and it detects

5:05

changes the moment they happen. So if a Kubernetes pod scales up or a new VM gets

5:10

provisioned, Checkmk just automatically adds it to monitoring. And just as

5:14

importantly, when that host vanishes or gets decommissioned, Checkmk reliably

5:18

removes it. It basically eliminates the need for a human to step in.
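
The add-and-remove behavior described here is, at its core, a reconciliation between what a platform reports and what is currently monitored. Here is a minimal sketch of that idea, not Checkmk's actual implementation: the function name `reconcile` and the host names are hypothetical, and the discovered set stands in for whatever a Kubernetes, vCenter, or cloud API query would return.

```python
from __future__ import annotations


def reconcile(monitored: set[str], discovered: set[str]) -> tuple[set[str], set[str]]:
    """Return (hosts to add, hosts to remove) so monitoring matches discovery."""
    to_add = discovered - monitored      # new hosts the platform reports
    to_remove = monitored - discovered   # hosts that no longer exist
    return to_add, to_remove


# Hypothetical data: what we monitor now vs. what the platform reports.
monitored = {"web-1", "web-2", "db-1"}
discovered = {"web-2", "web-3", "db-1"}

to_add, to_remove = reconcile(monitored, discovered)
print(sorted(to_add), sorted(to_remove))
```

Running this on a schedule, or on change events, is what keeps the monitored set from drifting away from reality.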

5:22

OK, that sounds brilliant for cutting down on alert noise, not monitoring hosts

5:27

that aren't even there anymore.

5:28

That gives you a much cleaner picture.

5:30

But hang on, if we're talking thousands of hosts changing per minute, which is

5:33

totally realistic in big cloud setups,

5:35

how does Checkmk deal with that essential problem of configuration drift?

5:39

How do you make sure the monitoring settings stay consistent across all those

5:42

temporary hosts?

5:43

That's a really good and necessary question.

5:46

It works using automation templates or rules.

5:49

You define rules based on host characteristics, maybe something like all hosts

5:53

tagged Kubernetes front-end

5:54

or all Windows servers in production.

5:56

And Checkmk then ensures that the right monitoring checks and configurations get

6:01

applied automatically

6:02

and consistently the instant that host is detected.

6:05

It's this templating engine that allows it to handle that massive scale you

6:08

mentioned,

6:09

thousands of changes per minute, while keeping everything stable and consistent,

6:12

something you just couldn't possibly do manually.
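
To make the rule idea concrete, here is a small sketch of matching hosts to checks by their characteristics. It assumes a simplified model where each host carries key-value labels; the rule list, label names, and check names are all hypothetical and are not Checkmk's real rule format.

```python
from __future__ import annotations

# Hypothetical rule set: each rule pairs label conditions with checks to apply.
RULES: list[tuple[dict[str, str], list[str]]] = [
    ({"platform": "kubernetes", "tier": "front-end"}, ["http_latency", "pod_restarts"]),
    ({"os": "windows", "env": "production"}, ["windows_services", "disk_space"]),
    ({}, ["ping"]),  # an empty condition matches every host
]


def checks_for(labels: dict[str, str]) -> list[str]:
    """Collect the checks of every rule whose conditions all match the host labels."""
    checks: list[str] = []
    for conditions, rule_checks in RULES:
        if all(labels.get(key) == value for key, value in conditions.items()):
            checks.extend(rule_checks)
    return checks


# A newly detected host gets its configuration purely from its labels.
print(checks_for({"platform": "kubernetes", "tier": "front-end", "env": "production"}))
```

Because the configuration is derived from the rules each time, two hosts with the same labels always end up with the same checks, which is what keeps drift out.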

6:15

And the impact is pretty profound, right?

6:17

By automating all that host lifecycle stuff, your IT team stops being constantly

6:22

reactive,

6:23

you know, chasing down config errors, manually adding things.

6:26

They become proactive, focusing just on the critical performance data the system

6:30

brings to the surface.

6:31

OK, so we know the server or the container or whatever is there, and we know it's

6:35

running.

6:36

But that's often not enough these days, is it?

6:38

Here's where it gets really interesting.

6:40

We need to shift the conversation, right?

6:42

From just infrastructure health, like, is the CPU OK, to actual application health.

6:48

Is the service doing what the user expects?

6:51

Is the customer happy?

6:53

And Checkmk 2.4 seems to tackle this with two main approaches.

6:56

OpenTelemetry for that deep internal view and synthetic monitoring for the user's

7:01

perspective.

7:02

Yeah, let's unpack OpenTelemetry first, or OTel, as people call it.

7:06

For a beginner, maybe think of OTel as a kind of standardized way for applications

7:10

to talk about themselves.

7:11

It's about getting really deep, granular visibility into the inner workings of your

7:15

software.

7:16

Sometimes right down to individual function calls or database queries.

7:20

So a lot of modern applications are now built to basically export OTel data, or

7:24

maybe Prometheus metrics, which are similar.

7:26

Checkmk 2.4 includes an OTel collector that can grab all this data, performance

7:30

metrics, traces, logs,

7:31

and it pulls it into the monitoring system, makes sense of it, and ties it back to

7:35

the right host or service.
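
As a rough illustration of "tying it back to the right host or service", the sketch below groups incoming data points by two resource attributes from the OpenTelemetry semantic conventions, `host.name` and `service.name`. The flat dictionary shape of each point is a simplifying assumption for this example, not the real OTLP wire format and not how the Checkmk collector works internally.

```python
from __future__ import annotations

from collections import defaultdict


def group_by_target(points: list[dict]) -> dict[tuple[str, str], dict[str, float]]:
    """Group metric values under the (host, service) pair named in their resource."""
    grouped: dict[tuple[str, str], dict[str, float]] = defaultdict(dict)
    for point in points:
        resource = point["resource"]
        key = (
            resource.get("host.name", "unknown"),
            resource.get("service.name", "unknown"),
        )
        grouped[key][point["metric"]] = point["value"]
    return dict(grouped)


# Hypothetical data points as an application might export them.
points = [
    {"resource": {"host.name": "web-3", "service.name": "checkout"},
     "metric": "http.server.duration_ms", "value": 182.0},
    {"resource": {"host.name": "web-3", "service.name": "checkout"},
     "metric": "db.query.duration_ms", "value": 95.5},
    {"resource": {"service.name": "search"},
     "metric": "http.server.duration_ms", "value": 40.0},
]

for target, metrics in group_by_target(points).items():
    print(target, metrics)
```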

7:36

And the benefit there is?

7:38

The benefit is huge. Instead of just getting a generic alert like ServiceX is down,

7:42

OTel data can help you pinpoint exactly where the problem is.

7:45

Maybe a latency spike happened in one specific microservice or when calling an

7:50

external API.

7:51

You're monitoring performance, reliability, even root causes using data coming

7:55

directly from inside the application code itself.

7:58

That can slash troubleshooting time dramatically.

8:02

Now, there is a really crucial point here, a bit of a caveat we absolutely have to

8:06

mention.

8:06

For anyone thinking about rolling this out right now, the integrated OpenTelemetry

8:11

Collector in Checkmk 2.4,

8:14

it's currently marked as an experimental beta feature.

8:16

Oh, OK. Good to know.

8:17

Yeah. So while you should definitely test it, play with it, use it for evaluation,

8:22

maybe in non-critical spots,

8:23

it's not officially supported yet for like mission-critical production systems you

8:27

rely on heavily.

8:28

It's just about managing expectations while the feature matures.

8:31

Still very powerful, but beta means beta.

8:34

Got it. Transparency is key there.

8:36

So that internal view is great for developers and ops.

8:41

But as we all know, sometimes what the application thinks is happening and what the

8:46

user is actually experiencing,

8:49

well, they can be worlds apart.

8:51

How do we make sure things are functionally correct from the outside in, from the

8:55

user's viewpoint?

8:56

Right. And that's exactly where synthetic monitoring comes in.

8:59

This system uses automated scripts, people often call them robots, that act like

9:03

real users.

9:04

They simulate critical business processes, things like logging into a website,

9:08

searching the product catalog, adding something to a cart, maybe completing the

9:11

checkout.

9:12

It's essentially real end-to-end testing that verifies the application actually

9:16

works as expected under real world conditions.

9:18

Setting up those test robots, deploying them across different machines, that can

9:22

sometimes be a bit clunky, right?

9:23

Especially if you have Windows and Linux systems. Did 2.4 make that easier?

9:27

Yes, definitely. Checkmk 2.4 really streamlines that whole deployment pipeline.

9:32

Users can now upload their automated test scripts, maybe Robot Framework scripts

9:36

or similar, directly through the main Checkmk web interface.

9:40

So management is centralized and deploying them across all your different systems,

9:44

Linux and Windows, is handled really easily using something called the Agent Bakery.

9:49

The Agent Bakery, what's that?

9:50

The Agent Bakery is basically Checkmk's really powerful tool for building

9:55

customized monitoring agents.

9:57

Think of it like this. It packages up the standard Checkmk agent software, but it

10:01

also bundles in any custom configurations,

10:03

plugins you need, in this case, all the necessary bits and pieces to actually run

10:07

those synthetic monitoring test scripts.

10:09

It ensures you get a consistent deployment everywhere without manual installs on

10:14

each machine.

10:14

In this setup, it also solves a really common problem in, say, highly secure or

10:18

isolated environments like internal dev networks or test labs that might not have

10:23

Internet access.

10:23

Oh, interesting. How so?

10:25

Well, Checkmk supports running these synthetic tests on isolated nodes in a couple

10:29

of ways.

10:30

You can either package the whole test environment up as a ZIP file and upload that,

10:35

or you can manage the test execution using something called an RCC server.

10:40

The RCC server is basically a self-contained runtime environment provided by the

10:44

monitoring system.

10:45

It makes sure those test scripts run reliably, even if the machine they're on is

10:49

completely disconnected from the outside world.

10:51

OK, that's clever. And I find the way it handles the results really compelling.

10:55

You mentioned being able to track specific steps within a test,

10:58

like verifying if a support ticket was submitted successfully and turning that

11:02

specific step into a status indicator on a dashboard.

11:05

Exactly. That's a fundamental shift in how you report on this stuff.

11:09

Instead of just getting a vague alert saying login script failed,

11:13

you can configure Checkmk to look for specific keywords or actions within the

11:17

script's output.

11:19

So if the script successfully completes the "add item to cart" step, maybe that step

11:23

becomes a green light on your dashboard.

11:25

But if it fails later at the "process payment" stage, that specific action turns red.

11:30

It translates these complex script-based outcomes into really simple, easy to

11:35

understand status lights.

11:37

Something management can grasp instantly. It's very powerful.
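
The step-to-status-light translation can be sketched in a few lines. This assumes each test step reports a Robot Framework style outcome of `PASS` or `FAIL`, and it maps those onto the usual monitoring state numbers (0 for OK, 2 for CRIT, 3 for UNKNOWN). The function name and step names are hypothetical; this is an illustration, not Checkmk's own result parser.

```python
from __future__ import annotations

# Conventional monitoring states: OK is green, CRIT is red.
OK, CRIT, UNKNOWN = 0, 2, 3

OUTCOME_TO_STATE = {"PASS": OK, "FAIL": CRIT}


def step_states(steps: list[tuple[str, str]]) -> dict[str, int]:
    """Map each (step name, outcome) pair to a monitoring state."""
    return {name: OUTCOME_TO_STATE.get(outcome, UNKNOWN) for name, outcome in steps}


# Hypothetical result of one synthetic test run.
run = [
    ("login", "PASS"),
    ("add item to cart", "PASS"),
    ("process payment", "FAIL"),
]

for step, state in step_states(run).items():
    light = {OK: "green", CRIT: "red"}.get(state, "grey")
    print(f"{step}: {light}")
```

Reporting one state per step, rather than one for the whole script, is what lets a dashboard show exactly where in the business process things broke.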

11:40

OK, so let's recap the big innovations we've had, especially for someone starting

11:44

out.

11:44

We've got minute-level cloud setup with Quick Setup.

11:47

We've got automated dynamic host management handling all that infrastructure churn.

11:51

And then deep end-to-end application visibility using both OpenTelemetry and

11:55

synthetic monitoring.

11:57

So what does this all mean if I'm new to Checkmk, maybe new to serious monitoring

12:01

in general,

12:01

and I want to start using this tech? How do I pick my starting point? What are the

12:04

options?

12:05

Well, the best news probably for anyone looking for an entry point is Checkmk Raw.

12:10

And we really need to emphasize this. Checkmk Raw is completely free and open source.

12:15

It's under the GNU GPL v2 license. This is absolutely your foundation, the ideal

12:20

place to start.

12:21

Free and open source. So what do you get with that?

12:24

You get a lot. Raw includes the core Checkmk monitoring engine, the full web

12:28

interface we've been talking about,

12:30

support for both agent-based monitoring and agentless methods like SNMP or API

12:35

checks,

12:36

and you get immediate access to hundreds, literally hundreds, of official and

12:40

community-created plugins

12:41

for monitoring all sorts of devices and applications.

12:43

For smaller setups, maybe internal labs, or if your requirements are fairly basic,

12:48

Raw gives you incredible monitoring power for absolutely zero cost.

12:51

Okay, so Raw is the free starting point.

12:52

Yeah.

12:53

What about the other editions then, just for context?

12:55

Right. So the commercial editions, Checkmk Enterprise and Checkmk Cloud, they

13:00

basically build on top of Raw.

13:02

They add features aimed more at large scale, maybe regulated industries or complex

13:06

organizational needs.

13:08

For instance, if you need true distributed monitoring, managing multiple Checkmk

13:12

sites across different locations from one central point,

13:15

or if you need specific dashboards tailored for business managers or guaranteed

13:19

professional support,

13:20

that's where you'd look at Enterprise.

13:22

And the Cloud edition, which you can run self-hosted or get as a SaaS offering,

13:26

is specifically tuned for those super dynamic, ephemeral cloud-native environments

13:31

we talked about earlier.

13:32

It has specialized features for monitoring things at extreme scale.

13:35

So the choice usually comes down to complexity, scale, and your need for support.

13:39

But that solid foundation of reliable monitoring, that starts with Raw, which

13:43

anyone can just download and install today.

13:46

And installation itself is pretty flexible, too.

13:48

Standard Linux packages for various distros, Docker containers, even ready-to-go

13:52

virtual or physical appliances.

13:54

That flexibility really lowers the barrier to entry, doesn't it?

13:57

You can try the Raw edition, free and open source, or I think they even have online

14:00

demos you can play with

14:01

without installing anything at all, just to kick the tires.

14:04

So today, we really dove into Checkmk 2.4.

14:08

We saw how its focus on simplification aims to tackle some of the biggest

14:11

challenges in modern IT.

14:12

You get that super fast cloud setup, automatic management of dynamic infrastructure,

14:17

and that crucial end-to-end view of application health.

14:21

Yeah, and maybe just to leave you with a final thought to chew on, something

14:24

provocative, perhaps.

14:25

Given the level of automation we're talking about now, you know,

14:29

Checkmk 2.4 handling potentially thousands of infrastructure changes every minute,

14:35

automatically.

14:36

Just consider what that really means for the day-to-day work of an IT professional.

14:40

This kind of automation fundamentally shifts the role away from tedious, repetitive

14:44

tasks.

14:45

Things like manually registering hosts, updating configurations, chasing down minor

14:49

changes.

14:50

The human value gets elevated.

14:51

You're freed up to focus almost entirely on interpreting the critical performance

14:55

trends,

14:56

analyzing the data, and using those insights to drive business efficiency,

14:59

rather than just, you know, keeping the lights on.

15:01

That transition from being primarily an operator to becoming more of an analyst.

15:05

That, I think, is the true underlying impact of what 2.4 enables.

15:08

That's a really powerful perspective on the future of the IT role.

15:12

Excellent point.

15:13

Well, thank you for guiding us through that deep dive today.

15:15

And we absolutely want to thank our sponsor one last time, Safe Server, for making

15:19

this program possible.

15:21

Remember, you can find out more about their hosting and digital transformation

15:25

services at

15:25

www.safeserver.de.

