Today's Deep-Dive: Healthchecks

0:00

You know that specific feeling of dread

0:03

when you realize a system has been failing

0:05

but completely behind the scenes?

0:07

Not a big catastrophic crash

0:09

that throws up error messages everywhere,

0:11

but that slow, invisible rot.

0:14

A silent failure.

0:15

It's just devastating because nothing screams at you.

0:17

There are no red lights flashing.

0:19

Exactly.

0:19

So maybe that nightly database backup

0:22

just stopped running two weeks ago.

0:24

Or your script that calculates critical business metrics,

0:28

it just choked on some bad data and died.

0:32

Silently.

0:32

And that crucial task goes unnoticed for days,

0:35

weeks, maybe even months.

0:38

And then you discover you've lost weeks of data

0:40

or even worse, your SSL certificate has silently expired.

0:44

And it takes your whole website

0:45

down in the middle of the night.

0:46

We rely so heavily on these scheduled tasks,

0:48

cron jobs, system timers, you name it,

0:51

but we sort of operate on this dangerous assumption

0:53

that they're all still running happily.

0:55

We set them and we forget them.

0:57

We do.

0:57

So today we are tackling that vulnerability head on.

1:01

And our mission in this deep dive

1:02

is really for anyone who manages these kinds of tasks.

1:05

It doesn't matter if you're a beginner developer

1:07

or a seasoned sysadmin.

1:09

We're gonna extract the core concepts

1:12

behind proper background job monitoring.

1:15

And we're gonna use the architecture

1:17

of a service called healthchecks.io to do it,

1:20

making this whole thing really accessible.

1:21

The fundamental shift is turning a passive forgotten job

1:25

into an actively monitored system

1:28

instead of waiting for something bad to happen.

1:30

You make the job itself report its success.

1:33

And if that report doesn't show up on time,

1:35

that's when you sound the alarm.

1:36

Exactly.

1:37

But first, before we get into the nuts and bolts,

1:40

let's take a moment to thank the supporter

1:42

of this deep dive.

1:43

Absolutely.

1:43

Safe Server supports the hosting of this kind of software.

1:46

It can really assist with your digital transformation.

1:49

If you wanna know more about how they can help you,

1:51

you can find more information at www.safeserver.de.

1:56

A huge thank you to them.

1:58

Okay, so let's unpack this core idea,

2:00

this monitoring by exception,

2:01

or what they call the ping model.

2:04

It's so simple, and I think that's its biggest strength.

2:06

It really is.

2:07

It basically boils down to just three steps.

2:09

First, you generate a unique ping URL

2:13

for every single background job you care about.

2:16

So it's like a personalized digital doorbell

2:18

for that one specific task.

2:21

That's a great way to think about it.

2:22

And step two is where you go into your actual job,

2:25

your script, your code, whatever it is,

2:27

and you make the very last thing it does,

2:29

assuming everything ran successfully.

2:31

Is send a little HTTP request, a ping, to that unique URL.

2:36

Right, it's calling home and saying,

2:38

hey, I finished, everything's fine.

2:40

I'm okay.

2:41

And then step three is the monitoring system's part.

2:43

It's just sitting there waiting for that unique ping

2:45

to arrive within the time it expects.

2:47

If it does not get that ping on time, that's the exception.

2:50

And it knows something's wrong.

2:51

Either the job didn't start, it crashed,

2:53

it couldn't reach the internet,

2:55

whatever the reason, it sends you an alert.

2:57

And it's checking for silence,

2:58

not for specific error messages buried in log files.
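
The three steps described above can be sketched in a few lines of Python. This is a minimal sketch, not the service's official client: `PING_URL` is a placeholder for the unique URL generated for a check, and `run_and_report` and `send_ping` are hypothetical helper names introduced here for illustration.

```python
import urllib.request

# Placeholder: the real value is the unique ping URL generated for one check.
PING_URL = "https://hc-ping.com/your-uuid-here"


def send_ping(url: str, timeout: float = 10.0) -> None:
    """Send a small HTTP GET request to the check's ping URL."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def run_and_report(job, ping=send_ping, url: str = PING_URL) -> bool:
    """Run job() and ping only if it finished without raising."""
    try:
        job()
    except Exception:
        # No ping is sent. The monitor notices the silence and alerts.
        return False
    # The ping is the very last step, after the job succeeded.
    ping(url)
    return True
```

Passing `ping` as a parameter keeps the sketch testable without network access. In a real cron job you would call `run_and_report(my_job)` and let the default `send_ping` do the HTTP request.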

3:02

Okay, but hold on.

3:03

What if my job runs perfectly,

3:05

but then my server's network connection drops

3:08

right before it sends the ping?

3:10

Isn't that just creating a false alarm?

3:12

That's an excellent, really critical question.

3:15

The failure we're catching here

3:16

is the job failing to complete

3:18

its entire operational lifecycle.

3:20

And that includes communication.

3:23

Ah.

3:23

So it ensures the whole chain is intact.

3:25

I mean, if the job runs,

3:27

but the server can't reach the outside world,

3:29

that is still a failure that demands your attention.

3:32

Right, because the next critical job

3:34

might also need to reach an update server or something.

3:36

Exactly.

3:37

It confirms your operational readiness end to end.

3:40

That makes a lot of sense.

3:41

And it makes this model just incredibly versatile.

3:43

We're talking about everything from, I don't know,

3:45

simple DNS updates to really complex metric calculations.

3:50

And what's great, especially for beginners or small teams,

3:52

is the low barrier to entry.

3:55

This platform offers a pretty generous free tier.

3:58

You can monitor 20 cron jobs for free.

4:00

You don't even need a credit card.

4:02

So you can get immediate control over your tasks.

4:05

OK, moving on from the basic ping.

4:07

Once you have your job calling home,

4:09

you need to configure when the system should actually

4:11

start to worry.

4:12

Right, you get this live updating dashboard.

4:14

You can name and tag all your checks

4:16

to keep things organized.

4:17

But the real insight comes from mastering

4:21

the two main parameters that govern the system's patience.

4:24

And those are the period and the grace time.

4:26

That's right.

4:27

So let's break those down.

4:28

The period seems pretty straightforward.

4:30

It's just the expected time between successful pings.

4:34

If my job runs every Tuesday at noon,

4:36

the period is seven days.

4:37

Exactly.

4:39

But in the real world, things can run a little late.

4:42

And that's where the grace time comes in.

4:43

It's the buffer.

4:44

It's the extra time you allow.

4:46

So if my payroll calculation normally takes, say, 30 minutes,

4:50

I might set the grace time to 90 minutes

4:52

just to account for, I don't know, high database

4:55

load or something.

4:55

Precisely.

4:56

You always want to set it slightly

4:57

above the longest you'd ever expect that job to take.

5:00

So the system isn't just panicking

5:02

the second the period expires.

5:04

It gives the job some room to breathe.

5:06

And these two parameters, they work together

5:08

to define four key states, which is, I think,

5:11

crucial for preventing alert fatigue.

5:13

Oh, absolutely.

5:14

So first, you have new.

5:16

The check was just created.

5:17

It hasn't heard anything yet.

5:18

It's full enough.

5:19

Then you have up, which means the last ping arrived

5:22

within the period.

5:23

Everything is healthy.

5:24

Everything is on schedule.

5:25

And then things get interesting.

5:26

We hit the pre-alert stage: late.

5:29

So the time since the last ping has gone past the period,

5:33

but it's not yet past the period plus the grace time.

5:35

Correct.

5:36

It's delayed, but it's still inside that acceptable buffer

5:39

you defined.

5:39

And then finally, the alert state: down.

5:42

The time has now exceeded the period plus the grace time.

5:46

And we should emphasize, the notification

5:48

is sent specifically when that check transitions

5:51

from late to down.

5:52

That one specific moment.

5:53

Yes.

5:54

And that mechanism is a lifesaver.

5:56

It means you only get paged when the delay is officially

5:59

a mission-critical failure.

6:01

It saves you from so many midnight alerts.

6:03

That is a fantastic way to maintain sanity.

6:06

It's not just it's late.

6:07

It is now officially too late.
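
The four states follow directly from the period and the grace time. The function below is a minimal sketch of that rule as it is described in this conversation; `check_status` is a hypothetical name, and the hosted service may handle edge cases differently.

```python
from datetime import datetime, timedelta
from typing import Optional


def check_status(last_ping: Optional[datetime], now: datetime,
                 period: timedelta, grace: timedelta) -> str:
    """Classify a check as new, up, late, or down."""
    if last_ping is None:
        # The check was created but has never received a ping.
        return "new"
    elapsed = now - last_ping
    if elapsed <= period:
        return "up"
    if elapsed <= period + grace:
        # Past the period, but still inside the buffer: no alert yet.
        return "late"
    # Past period plus grace: this is the point where an alert is due.
    return "down"
```

With the weekly job from the example (period of seven days, grace time of 90 minutes), a ping that is 30 minutes overdue reports `late`, and one that is 91 minutes overdue reports `down`.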

6:09

Right.

6:10

Now as an alternative to just a simple period and grace time,

6:14

you can also use cron expression syntax.

6:16

Right.

6:16

Yes.

6:17

And you'd use that for more complex schedules.

6:19

Say you have a job that runs on the first Monday

6:22

of every quarter.

6:23

You could never define that with just period.

6:26

It would be impossible.

6:27

So cron expression syntax lets you define those irregular

6:29

schedules very precisely, telling the service

6:32

the exact times it should expect to hear from that job.
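
To see why a fixed period cannot describe "the first Monday of every quarter", it helps to compute the dates. The sketch below does not attempt to write the cron expression itself; it only lists the days on which a ping would be expected. `first_monday_of_quarters` is a hypothetical helper name.

```python
from datetime import date, timedelta
from typing import List


def first_monday_of_quarters(year: int) -> List[date]:
    """Return the first Monday of January, April, July, and October."""
    mondays = []
    for month in (1, 4, 7, 10):
        first = date(year, month, 1)
        # weekday() is 0 for Monday, so this is the gap to the next Monday.
        offset = (7 - first.weekday()) % 7
        mondays.append(first + timedelta(days=offset))
    return mondays
```

For 2024 the gaps between consecutive dates are 91, 91, and 98 days, so no single period value fits the schedule.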

6:35

And beyond all the configuration,

6:37

you get a lot of transparency.

6:39

You can see a detailed event log of every ping

6:41

that's come in, every down notification that's gone out.

6:44

And they also have these things called status badges.

6:46

They're little graphics with these hard-to-guess URLs

6:49

that you can embed in, say, a project README file

6:53

or a public status page.

6:55

To show the live health of your tasks.

6:57

Yeah.

6:57

Pretty cool.

6:58

OK, so we've really nailed down the when and the logic.

7:02

Now let's talk about the what, the scope.

7:04

What can you actually monitor with this model?

7:06

It's so much broader than just the traditional Linux

7:08

cron system.

7:09

I mean, it's a perfect fit for a huge range

7:12

of scheduled environments.

7:13

So we're talking modern stuff, like Kubernetes cron jobs.

7:15

Yep.

7:16

And older systems like Windows scheduled tasks,

7:19

build pipelines in Jenkins, Heroku scheduler,

7:22

even WordPress's WP-Cron, which

7:24

is famous for failing silently and just wrecking sites.

7:28

So if it runs periodically, it's a target.

7:30

It is.

7:31

And the practical use cases are things that really

7:33

need guaranteed uptime.

7:34

We're talking file system and database backups, generating

7:37

those daily or weekly report emails.

7:39

The really essential SSL certificate renewals.

7:42

Oh, absolutely.

7:43

And those vital business data import and sync jobs,

7:47

if any of those fail, the business

7:49

faces real consequences.

7:51

But it seems like the utility goes even beyond just

7:54

scheduled software tasks.

7:56

You can use this for really lightweight server health

7:58

checks, like a heartbeat for your infrastructure.

8:01

This is where the versatility of a simple HTTP ping really shines.

8:05

Instead of installing some massive monitoring

8:07

agent on a server, you can just write a tiny shell script.

8:10

And that script checks a condition.

8:12

Is a certain Docker container running?

8:14

Or do we have enough free disk space?

8:16

Exactly.

8:17

And if that check succeeds, the script just pings the URL.

8:20

If the whole server dies, well, the ping stops,

8:23

and you get an alert.

8:24

So you could check if an application process is running,

8:27

you could monitor database replication lag,

8:30

or even just send simple I'm alive pings from a NAS box

8:34

or a Raspberry Pi.
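
A heartbeat script of this kind can be very small. The sketch below checks free disk space and pings only when the condition holds. It is written in Python rather than shell, and `PING_URL`, `MIN_FREE_BYTES`, `has_enough_disk`, and `heartbeat` are all assumptions made for illustration.

```python
import shutil
import urllib.request

# Placeholder: the real value is the unique ping URL generated for one check.
PING_URL = "https://hc-ping.com/your-uuid-here"
# Assumed threshold: require at least 5 GiB free.
MIN_FREE_BYTES = 5 * 1024**3


def send_ping(url: str, timeout: float = 10.0) -> None:
    """Send a small HTTP GET request to the check's ping URL."""
    with urllib.request.urlopen(url, timeout=timeout):
        pass


def has_enough_disk(path: str = "/", min_free: int = MIN_FREE_BYTES,
                    usage=shutil.disk_usage) -> bool:
    """Return True when the filesystem at path has at least min_free bytes free."""
    return usage(path).free >= min_free


def heartbeat(check=has_enough_disk, ping=send_ping, url: str = PING_URL) -> bool:
    """Ping only if the condition holds; otherwise stay silent."""
    if not check():
        # Silence is the signal: the monitor alerts once the ping is overdue.
        return False
    ping(url)
    return True
```

Run from cron every few minutes, this covers both cases mentioned above: a failed condition and a dead server each result in a missing ping.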

8:35

Yeah, it gives you an easy central place

8:37

to see the health of things that might be scattered

8:40

all over the place physically.

8:41

And the real world impact of that is so clear.

8:43

I saw a great testimonial calling the service

8:46

an absolute lifesaver.

8:49

Oh, the IoT gateway one?

8:50

Yeah.

8:51

Someone was using it to monitor an IoT gateway.

8:53

And because they got a quick heads up

8:55

that it had gone offline, they were

8:57

able to save the device from literally being fried.

9:00

Right, because it had been accidentally

9:01

placed on top of a hot router while someone was cleaning.

9:03

That immediate notification prevented

9:06

a physical piece of hardware from being destroyed.

9:08

And that anecdote just perfectly illustrates the value, right?

9:11

Proactive monitoring over waiting for something

9:14

to actually break or catch fire.

9:16

For sure.

9:17

OK, so let's pivot to the ecosystem.

9:18

How do you make sure those alerts actually

9:20

turn into action?

9:22

That's all about integrations.

9:24

A notification is useless

9:25

if it just lands in an inbox you never check.

9:28

And this is where the platform really excels.

9:30

It has more than 25 integrations for different notification

9:32

channels.

9:34

The whole point is to make sure the alert finds

9:36

the right person at the right time.

9:37

So if my nightly backup fails, I probably

9:40

want that alert to pop up directly in the Slack channel

9:43

where my DevOps team lives, or maybe Microsoft Teams.

9:46

But if it's a really high priority incident,

9:48

you want to integrate it with something like PagerDuty

9:51

or Opsgenie.

9:52

That's guaranteed incident escalation.

9:54

Yeah, something that can actually trigger phone calls

9:56

or SMS messages.

9:57

Exactly.

9:59

The versatility ensures the alert becomes

10:01

more than just a notification.

10:03

It becomes an actionable ticket or a documented event.

10:06

You can send it to Telegram, Signal,

10:08

or just use generic webhooks.

10:10

The goal is to make sure your alert doesn't just get lost.

10:13

And speaking of what this is all built on,

10:15

let's look under the hood.

10:16

The fact that this is all open source is a huge benefit.

10:19

It's a massive benefit.

10:21

It's written primarily in Python and Django,

10:24

pretty modern versions, too.

10:26

And it's licensed under the BSD 3-Clause license.

10:29

You can see its popularity on GitHub.

10:31

It's got almost 10,000 stars.

10:33

And that open source foundation gives you a choice.

10:35

You can use the hosted service for convenience, zero setup.

10:39

Or they provide a reference Dockerfile and pre-built Docker

10:43

images so you can self-host the entire thing.

10:45

And if you do go down that path, that path of control

10:48

and self-hosting, you suddenly inherit all the maintenance

10:51

jobs, right?

10:52

The architecture has these specialized management commands

10:54

that you have to run.

10:55

You do.

10:55

For instance, you need a command called sendalerts

10:58

running constantly.

10:59

It's what polls the database and actually sends out

11:02

the notifications when a check goes to down.
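
Conceptually, an alert sender of this kind is a loop that looks for checks that have just become overdue and notifies once per outage. The sketch below illustrates that idea only. It is not the project's actual implementation, and `Check`, `send_due_alerts`, and the `alerted` flag are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class Check:
    name: str
    last_ping: Optional[datetime]
    period: timedelta
    grace: timedelta
    alerted: bool = False  # True once a "down" notification has gone out


def send_due_alerts(checks: List[Check], now: datetime,
                    notify: Callable[[Check], None]) -> int:
    """Notify once for each check that has newly gone down; return the count."""
    sent = 0
    for check in checks:
        if check.last_ping is None:
            continue  # a new check has nothing to be late for yet
        overdue = now > check.last_ping + check.period + check.grace
        if overdue and not check.alerted:
            notify(check)
            check.alerted = True
            sent += 1
        elif not overdue:
            # A fresh ping arrived, so a future outage should alert again.
            check.alerted = False
    return sent
```

A long-running process would call `send_due_alerts` repeatedly against the stored checks, which is why the command has to be kept running on a self-hosted instance.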

11:04

And there's another one for email, right?

11:06

An SMTP listener.

11:07

Yeah, that's a really neat feature.

11:09

It allows the system to receive pings not just over HTTP,

11:13

but also as email messages sent to a check's unique email

11:16

address.

11:17

And what about that maintenance image?

11:19

What kind of cleanup are we talking about?

11:20

Well, if you decide to use external object storage,

11:23

like Amazon S3, to store large ping bodies,

11:27

maybe you want to attach a full log file for debugging.

11:30

OK, yeah.

11:31

You have to run a dedicated cleanup command called pruneobjects

11:33

all the time.

11:35

If you don't, you'll just be paying to store

11:37

ancient, useless data forever.

11:39

And your storage costs will go through the roof.

11:41

You have to actively manage it.

11:43

So wrapping this all up, what's the big takeaway

11:45

here for you, for the listener?

11:47

I think the fundamental shift this architecture gives you

11:49

is just profound.

11:51

It is.

11:51

It transforms that risk of silent failure

11:54

in all your scheduled tasks, whether they're

11:56

on a Raspberry Pi or in Kubernetes.

11:58

It transforms it into a guaranteed, immediate alert

12:01

when something goes AWOL.

12:03

It's just peace of mind.

12:04

And that actually leads us to a bit of a provocative thought

12:07

for you to consider.

12:08

While that hosted service provides simplicity,

12:10

the open source option gives you control.

12:13

But that control comes with some significant operational

12:17

complexity.

12:18

Right, especially if you're trying to run this in, say,

12:19

a large enterprise environment.

12:21

You start facing some pretty complex decisions

12:23

around security and integration.

12:25

Exactly.

12:26

I mean, think about it.

12:27

If you enable external authentication using HTTP

12:30

headers to integrate with your company's single sign-on,

12:35

you are implicitly trusting those headers.

12:38

And if an attacker can compromise the proxy that

12:40

sends those headers.

12:41

They can impersonate any user they want.
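
The trust problem with header-based authentication can be made concrete. The sketch below accepts the user header only from a known proxy address. It is an illustration under stated assumptions, not how any particular product implements it: `USER_HEADER`, `authenticated_user`, and the addresses are hypothetical.

```python
from typing import Mapping, Optional, Set

# Hypothetical header name set by the single sign-on proxy.
USER_HEADER = "X-Remote-User"


def authenticated_user(headers: Mapping[str, str], remote_addr: str,
                       trusted_proxies: Set[str]) -> Optional[str]:
    """Return the user named in the header, but only for trusted proxies."""
    if remote_addr not in trusted_proxies:
        # A client that bypasses the proxy could set the header itself.
        return None
    user = headers.get(USER_HEADER, "").strip()
    return user or None
```

This narrows who may send the header, but it does not remove the risk described above: a compromised proxy is still trusted and can still name any user.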

12:43

And similarly, with external object storage,

12:45

that requires extremely careful credential management.

12:48

And like we said, you have to run those cleanup commands.

12:51

So this choice of convenience versus control,

12:54

it forces you to grapple with full-scale infrastructure

12:57

and security challenges that go way

12:59

beyond just basic monitoring.

13:01

It's a serious trade-off to consider.

13:03

A really powerful thought to chew on

13:05

as you design your own monitoring systems.

13:07

How much control are you really willing to take on?

13:10

And just a quick final reminder that this deep dive

13:13

was brought to you by Safe Server.

13:14

You can find out how Safe Server supports hosting

13:16

and digital transformation at www.safeserver.de.

13:20

Thanks again to them.

13:21

So go forth, stop fearing those silent failures,

13:24

We will catch you on the next deep dive.
