Today's Deep-Dive: Healthchecks
Ep. 369

Episode description

Silent failures are one of the most dangerous risks in modern systems - when critical jobs stop running and no one notices until it’s too late. In this episode, we explore healthchecks.io, an elegant open-source solution that turns background tasks into actively monitored systems.

At the core is the “ping model”: every scheduled job sends a simple HTTP request when it completes successfully. If that ping doesn’t arrive within an expected timeframe, the system assumes failure and triggers an alert. This shifts monitoring from reactive log-checking to proactive detection of missing signals.

We break down how to configure effective monitoring using key concepts like period (expected run interval) and grace time (buffer for delays), and how these combine to prevent false alarms while still catching real failures. The system’s state model - up, late, and down - ensures alerts are meaningful and reduces notification fatigue.

Beyond cron jobs, healthchecks.io can monitor a wide range of systems, from Kubernetes jobs and CI pipelines to IoT devices and simple server health checks. Its flexible integrations - Slack, PagerDuty, email, webhooks, and more - ensure alerts reach the right place at the right time.

Finally, we explore the trade-offs between using the hosted service and self-hosting the open-source version, where greater control comes with added responsibility for security, maintenance, and infrastructure management.

If you rely on scheduled tasks, this deep dive shows how a simple concept - monitoring by absence instead of presence - can eliminate one of the most costly and invisible failure modes in software systems.

Gain digital sovereignty now and save costs

Let’s have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? How is the state of backups and security updates?

Digital sovereignty is easily achieved with open source software (which usually costs far less, too). Our division Safeserver offers hosting, operation and maintenance for countless Free and Open Source tools.

0:00

You know that specific feeling of dread

0:03

when you realize a system has been failing

0:05

but completely behind the scenes?

0:07

Not a big catastrophic crash

0:09

that throws up error messages everywhere,

0:11

but that slow, invisible rot.

0:14

A silent failure.

0:15

It's just devastating because nothing screams at you.

0:17

There are no red lights flashing.

0:19

Exactly.

0:19

So maybe that nightly database backup

0:22

just stopped running two weeks ago.

0:24

Or your script that calculates critical business metrics,

0:28

it just choked on some bad data and died.

0:32

Silently.

0:32

And that crucial task goes unnoticed for days,

0:35

weeks, maybe even months.

0:38

And then you discover you've lost weeks of data

0:40

or even worse, your SSL certificate has silently expired.

0:44

And it takes your whole website

0:45

down in the middle of the night.

0:46

We rely so heavily on these scheduled tasks,

0:48

cron jobs, system timers, you name it,

0:51

but we sort of operate on this dangerous assumption

0:53

that they're all still running happily.

0:55

We set them and we forget them.

0:57

We do.

0:57

So today we are tackling that vulnerability head on.

1:01

And our mission in this deep dive

1:02

is really for anyone who manages these kinds of tasks.

1:05

It doesn't matter if you're a beginner developer

1:07

or a seasoned sysadmin.

1:09

We're gonna extract the core concepts

1:12

behind proper background job monitoring.

1:15

And we're gonna use the architecture

1:17

of a service called healthchecks.io to do it,

1:20

making this whole thing really accessible.

1:21

The fundamental shift is turning a passive forgotten job

1:25

into an actively monitored system

1:28

instead of waiting for something bad to happen.

1:30

You make the job itself report its success.

1:33

And if that report doesn't show up on time,

1:35

that's when you sound the alarm.

1:36

Exactly.

1:37

But first, before we get into the nuts and bolts,

1:40

let's take a moment to thank the supporter

1:42

of this deep dive.

1:43

Absolutely.

1:43

Safeserver supports the hosting of this kind of software.

1:46

It can really assist with your digital transformation.

1:49

If you wanna know more about how they can help you,

1:51

you can find more information at www.safeserver.de.

1:56

A huge thank you to them.

1:58

Okay, so let's unpack this core idea,

2:00

this monitoring by exception,

2:01

or what they call the ping model.

2:04

It's so simple, and I think that's its biggest strength.

2:06

It really is.

2:07

It basically boils down to just three steps.

2:09

First, you generate a unique ping URL

2:13

for every single background job you care about.

2:16

So it's like a personalized digital doorbell

2:18

for that one specific task.

2:21

That's a great way to think about it.

2:22

And step two is where you go into your actual job,

2:25

your script, your code, whatever it is,

2:27

and you make the very last thing it does,

2:29

assuming everything ran successfully.

2:31

Is send a little HTTP request, a ping, to that unique URL.

2:36

Right, it's calling home and saying,

2:38

hey, I finished, everything's fine.

2:40

I'm okay.

2:41

And then step three is the monitoring system's part.

2:43

It's just sitting there waiting for that unique ping

2:45

to arrive within the time it expects.

2:47

If it does not get that ping on time, that's the exception.

2:50

And it knows something's wrong.

2:51

Either the job didn't start, it crashed,

2:53

it couldn't reach the internet,

2:55

whatever the reason, it sends you an alert.

2:57

And it's checking for silence,

2:58

not for specific error messages buried in log files.
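
The three steps described above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the service's official client: `run_and_ping` is a hypothetical helper name, the URL in the usage comment is a placeholder for the unique ping URL generated for your check, and the `opener` parameter exists only so the HTTP call can be swapped out when testing.

```python
import urllib.request


def run_and_ping(job, ping_url, opener=urllib.request.urlopen):
    """Run `job`; send the success ping only if it returns without raising."""
    job()  # any exception propagates, so the ping below is never sent
    try:
        # Last step of a successful run: "call home" with a plain HTTP GET.
        opener(ping_url, timeout=10)
        return True
    except OSError:
        # Network trouble: the monitor hears only silence and will alert.
        return False


# Usage (placeholder URL, assumed for illustration):
# run_and_ping(nightly_backup, "https://example.com/ping/your-unique-id")
```

The design point is that the job never reports its own failure. A crash, a job that never starts, or a lost connection all look the same to the monitor: no ping.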

3:02

Okay, but hold on.

3:03

What if my job runs perfectly,

3:05

but then my server's network connection drops

3:08

right before it sends the ping?

3:10

Isn't that just creating a false alarm?

3:12

That's an excellent, really critical question.

3:15

The failure we're catching here

3:16

is the job failing to complete

3:18

its entire operational lifecycle.

3:20

And that includes communication.

3:23

Ah.

3:23

So it ensures the whole chain is intact.

3:25

I mean, if the job runs,

3:27

but the server can't reach the outside world,

3:29

that is still a failure that demands your attention.

3:32

Right, because the next critical job

3:34

might also need to reach an update server or something.

3:36

Exactly.

3:37

It confirms your operational readiness end to end.

3:40

That makes a lot of sense.

3:41

And it makes this model just incredibly versatile.

3:43

We're talking about everything from, I don't know,

3:45

simple DNS updates to really complex metric calculations.

3:50

And what's great, especially for beginners or small teams,

3:52

is the low barrier to entry.

3:55

This platform offers a pretty generous free tier.

3:58

You can monitor 20 cron jobs for free.

4:00

You don't even need a credit card.

4:02

So you can get immediate control over your tasks.

4:05

OK, moving on from the basic ping.

4:07

Once you have your job calling home,

4:09

you need to configure when the system should actually

4:11

start to worry.

4:12

Right, you get this live updating dashboard.

4:14

You can name and tag all your checks

4:16

to keep things organized.

4:17

But the real insight comes from mastering

4:21

the two main parameters that govern the system's patience.

4:24

And those are the period and the grace time.

4:26

That's right.

4:27

So let's break those down.

4:28

The period seems pretty straightforward.

4:30

It's just the expected time between successful pings.

4:34

If my job runs every Tuesday at noon,

4:36

the period is seven days.

4:37

Exactly.

4:39

But in the real world, things can run a little late.

4:42

And that's where the grace time comes in.

4:43

It's the buffer.

4:44

It's the extra time you allow.

4:46

So if my payroll calculation normally takes, say, 30 minutes,

4:50

I might set the grace time to 90 minutes

4:52

just to account for, I don't know, high database

4:55

load or something.

4:55

Precisely.

4:56

You always want to set it slightly

4:57

above the longest you'd ever expect that job to take.

5:00

So the system isn't just panicking

5:02

the second the period expires.

5:04

It gives the job some room to breathe.

5:06

And these two parameters, they work together

5:08

to define four key states, which is, I think,

5:11

crucial for preventing alert fatigue.

5:13

Oh, absolutely.

5:14

So first, you have new.

5:16

The check was just created.

5:17

It hasn't heard anything yet.

5:18

Fair enough.

5:19

Then you have up, which means the last ping arrived

5:22

within the period.

5:23

Everything is healthy.

5:24

Everything is on schedule.

5:25

And then things get interesting.

5:26

We hit the pre-alert stage: late.

5:29

So the time since the last ping has gone past the period,

5:33

but it's not yet past the period plus the grace time.

5:35

Correct.

5:36

It's delayed, but it's still inside that acceptable buffer

5:39

you defined.

5:39

And then finally, the alert state: down.

5:42

The time has now exceeded the period plus the grace time.

5:46

And we should emphasize, the notification

5:48

is sent specifically when that check transitions

5:51

from late to down.

5:52

That one specific moment.

5:53

Yes.
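
The four states and the single alerting moment discussed above can be written down as a small Python sketch. It is an illustration of the rules as described in this episode, not the service's actual code: the function names are hypothetical, and the handling of the exact boundary (using `<=`) is an assumption.

```python
from datetime import datetime, timedelta


def check_status(last_ping, now, period, grace):
    """Classify a check as 'new', 'up', 'late', or 'down'."""
    if last_ping is None:
        return "new"  # created, but no ping received yet
    elapsed = now - last_ping
    if elapsed <= period:
        return "up"  # last ping arrived within the period
    if elapsed <= period + grace:
        return "late"  # past the period, still inside the grace buffer
    return "down"  # past period + grace


def should_notify(previous, current):
    """Alert only on the transition into 'down', not on every evaluation."""
    return current == "down" and previous != "down"
```

With a period of seven days and a grace time of 90 minutes, a check whose last ping is seven days and 30 minutes old is "late" and stays quiet; at seven days and two hours it is "down", and the notification goes out once.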

5:54

And that mechanism is a lifesaver.

5:56

It means you only get paged when the delay is officially

5:59

a mission-critical failure.

6:01

It saves you from so many midnight alerts.

6:03

That is a fantastic way to maintain sanity.

6:06

It's not just "it's late."

6:07

It is now officially too late.

6:09

Right.

6:10

Now as an alternative to just a simple period and grace time,

6:14

you can also use cron expression syntax.

6:16

Right.

6:16

Yes.

6:17

And you'd use that for more complex schedules.

6:19

Say you have a job that runs on the first Monday

6:22

of every quarter.

6:23

You could never define that with just period.

6:26

It would be impossible.

6:27

So cron expression syntax lets you define those irregular

6:29

schedules very precisely, telling the service

6:32

the exact times it should expect to hear from that job.
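
To see why a fixed period cannot describe a schedule like "the first Monday of every quarter", it helps to compute the run dates. The Python sketch below does that with the standard library only; the function names are hypothetical, and it deliberately computes dates directly rather than parsing a cron expression.

```python
from datetime import date, timedelta


def first_monday(year, month):
    """Return the date of the first Monday in the given month."""
    first = date(year, month, 1)
    # weekday(): Monday is 0, so this is the number of days until the next Monday.
    return first + timedelta(days=(7 - first.weekday()) % 7)


def quarterly_runs(year):
    """First Monday of January, April, July, and October."""
    return [first_monday(year, month) for month in (1, 4, 7, 10)]


runs = quarterly_runs(2024)
gaps = [(later - earlier).days for earlier, later in zip(runs, runs[1:])]
print(runs)
print(gaps)
```

For 2024 the gaps between consecutive runs come out as 91, 91, and 98 days, so no single period value fits every interval. A schedule-based check sidesteps this by telling the service the exact expected times.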

6:35

And beyond all the configuration,

6:37

you get a lot of transparency.

6:39

You can see a detailed event log of every ping

6:41

that's come in, every down notification that's gone out.

6:44

And they also have these things called status badges.

6:46

They're little graphics with these hard-to-guess URLs

6:49

that you can embed in, say, a project README file

6:53

or a public status page.

6:55

To show the live health of your tasks.

6:57

Yeah.

6:57

Pretty cool.

6:58

OK, so we've really nailed down the when and the logic.

7:02

Now let's talk about the what, the scope.

7:04

What can you actually monitor with this model?

7:06

It's so much broader than just the traditional Linux

7:08

cron system.

7:09

I mean, it's a perfect fit for a huge range

7:12

of scheduled environments.

7:13

So we're talking modern stuff, like Kubernetes cron jobs.

7:15

Yep.

7:16

And older systems like Windows scheduled tasks,

7:19

build pipelines in Jenkins, Heroku scheduler,

7:22

even WordPress's WP-Cron, which

7:24

is famous for failing silently and just wrecking sites.

7:28

So if it runs periodically, it's a target.

7:30

It is.

7:31

And the practical use cases are things that really

7:33

need guaranteed uptime.

7:34

We're talking file system and database backups, generating

7:37

those daily or weekly report emails.

7:39

The really essential SSL certificate renewals.

7:42

Oh, absolutely.

7:43

And those vital business data import and sync jobs,

7:47

if any of those fail, the business

7:49

faces real consequences.

7:51

But it seems like the utility goes even beyond just

7:54

scheduled software tasks.

7:56

You can use this for really lightweight server health

7:58

checks, like a heartbeat for your infrastructure.

8:01

This is where the versatility of a simple HTTP request really shines.

8:05

Instead of installing some massive monitoring

8:07

agent on a server, you can just write a tiny shell script.

8:10

And that script checks a condition.

8:12

Is a certain Docker container running?

8:14

Or do we have enough free disk space?

8:16

Exactly.

8:17

And if that check succeeds, the script just pings the URL.

8:20

If the whole server dies, well, the ping stops,

8:23

and you get an alert.
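
The "tiny script" idea can be sketched in Python as well. This is a hedged illustration, not a recommended agent: `has_enough_space` and `heartbeat` are hypothetical names, the 10 percent threshold is an arbitrary assumption, and the `usage` and `opener` parameters exist only so the disk lookup and the HTTP call can be replaced in tests.

```python
import shutil
import urllib.request


def has_enough_space(path, min_free_fraction, usage=shutil.disk_usage):
    """True if at least `min_free_fraction` of the disk holding `path` is free."""
    stats = usage(path)  # provides total, used, and free in bytes
    return stats.free / stats.total >= min_free_fraction


def heartbeat(ping_url, path="/", min_free_fraction=0.10,
              usage=shutil.disk_usage, opener=urllib.request.urlopen):
    """Ping only when the condition holds; otherwise stay silent."""
    if not has_enough_space(path, min_free_fraction, usage):
        return False  # no ping: the check will go late, then down
    opener(ping_url, timeout=10)
    return True
```

Run from a scheduler every few minutes, this covers two failures at once: the disk filling up (the condition fails, so no ping) and the whole machine dying (nothing runs, so no ping).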

8:24

So you could check if an application process is running,

8:27

you could monitor database replication lag,

8:30

or even just send simple I'm alive pings from a NAS box

8:34

or a Raspberry Pi.

8:35

Yeah, it gives you an easy central place

8:37

to see the health of things that might be scattered

8:40

all over the place physically.

8:41

And the real world impact of that is so clear.

8:43

I saw a great testimonial calling the service

8:46

an absolute lifesaver.

8:49

Oh, the IoT gateway one?

8:50

Yeah.

8:51

Someone was using it to monitor an IoT gateway.

8:53

And because they got a quick heads up

8:55

that it had gone offline, they were

8:57

able to save the device from literally being fried.

9:00

Right, because it had been accidentally

9:01

placed on top of a hot router while someone was cleaning.

9:03

That immediate notification prevented

9:06

a physical piece of hardware from being destroyed.

9:08

And that anecdote just perfectly illustrates the value, right?

9:11

Proactive monitoring over waiting for something

9:14

to actually break or catch fire.

9:16

For sure.

9:17

OK, so let's pivot to the ecosystem.

9:18

How do you make sure those alerts actually

9:20

turn into action?

9:22

That's all about integrations.

9:24

A notification is useless

9:25

if it just lands in an inbox you never check.

9:28

And this is where the platform really excels.

9:30

It has more than 25 integrations for different notification

9:32

channels.

9:34

The whole point is to make sure the alert finds

9:36

the right person at the right time.

9:37

So if my nightly backup fails, I probably

9:40

want that alert to pop up directly in the Slack channel

9:43

where my DevOps team lives, or maybe Microsoft Teams.

9:46

But if it's a really high priority incident,

9:48

you want to integrate it with something like PagerDuty

9:51

or Opsgenie.

9:52

That guaranteed incident escalation.

9:54

Yeah, something that can actually trigger phone calls

9:56

or SMS messages.

9:57

Exactly.

9:59

The versatility ensures the alert becomes

10:01

more than just a notification.

10:03

It becomes an actionable ticket or a documented event.

10:06

You can send it to Telegram, Signal,

10:08

or just use generic webhooks.

10:10

The goal is to make sure your alert doesn't just get lost.

10:13

And speaking of what this is all built on,

10:15

let's look under the hood.

10:16

The fact that this is all open source is a huge benefit.

10:19

It's a massive benefit.

10:21

It's written primarily in Python and Django,

10:24

pretty modern versions, too.

10:26

And it's licensed under the BSD 3-Clause license.

10:29

You can see its popularity on GitHub.

10:31

It's got almost 10,000 stars.

10:33

And that open source foundation gives you a choice.

10:35

You can use the hosted service for convenience, zero setup.

10:39

Or they provide a reference Dockerfile and pre-built Docker

10:43

images so you can self-host the entire thing.

10:45

And if you do go down that path, that path of control

10:48

and self-hosting, you suddenly inherit all the maintenance

10:51

jobs, right?

10:52

The architecture has these specialized management commands

10:54

that you have to run.

10:55

You do.

10:55

For instance, you need a command called sendalerts

10:58

running constantly.

10:59

It's what polls the database and actually sends out

11:02

the notifications when a check goes to down.

11:04

And there's another one for email, right?

11:06

An smtpd listener.

11:07

Yeah, that's a really neat feature.

11:09

It allows the system to receive pings not just over HTTP,

11:13

but also as email messages sent to a check's unique email

11:16

address.

11:17

And what about that maintenance image?

11:19

What kind of cleanup are we talking about?

11:20

Well, if you decide to use external object storage,

11:23

like Amazon S3, to store large ping bodies,

11:27

maybe you want to attach a full log file for debugging.

11:30

OK, yeah.

11:31

You have to run a dedicated cleanup command called pruneobjects

11:33

all the time.

11:35

If you don't, you'll just be paying to store

11:37

ancient, useless data forever.

11:39

And your storage costs will go through the roof.

11:41

You have to actively manage it.

11:43

So wrapping this all up, what's the big takeaway

11:45

here for you, for the listener?

11:47

I think the fundamental shift this architecture gives you

11:49

is just profound.

11:51

It is.

11:51

It transforms that risk of silent failure

11:54

in all your scheduled tasks, whether they're

11:56

on a Raspberry Pi or in Kubernetes.

11:58

It transforms it into a guaranteed, immediate alert

12:01

when something goes AWOL.

12:03

It's just peace of mind.

12:04

And that actually leads us to a bit of a provocative thought

12:07

for you to consider.

12:08

While that hosted service provides simplicity,

12:10

the open source option gives you control.

12:13

But that control comes with some significant operational

12:17

complexity.

12:18

Right, especially if you're trying to run this in, say,

12:19

a large enterprise environment.

12:21

You start facing some pretty complex decisions

12:23

around security and integration.

12:25

Exactly.

12:26

I mean, think about it.

12:27

If you enable external authentication using HTTP

12:30

headers to integrate with your company's single sign-on,

12:35

you are implicitly trusting those headers.

12:38

And if an attacker can compromise the proxy that

12:40

sends those headers.

12:41

They can impersonate any user they want.
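
The trust problem with header-based authentication can be made concrete with a short Python sketch. It is purely illustrative and not taken from the project's code: the header name, the proxy address, and the function name are all hypothetical.

```python
TRUSTED_PROXIES = {"10.0.0.5"}  # hypothetical address of the SSO proxy
USER_HEADER = "X-Remote-User"   # hypothetical header name


def authenticated_user(headers, remote_addr, trusted=TRUSTED_PROXIES):
    """Return the user named in the header, but only for trusted senders."""
    if remote_addr not in trusted:
        return None  # ignore the header: anyone can set it on a direct request
    # Whoever controls a trusted proxy controls this value, and so the identity.
    return headers.get(USER_HEADER)
```

Restricting which senders are believed narrows the exposure, but it does not remove it: the application still accepts whatever identity the trusted proxy asserts, which is exactly the risk raised here.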

12:43

And similarly, with external object storage,

12:45

that requires extremely careful credential management.

12:48

And like we said, you have to run those cleanup commands.

12:51

So this choice of convenience versus control,

12:54

it forces you to grapple with full-scale infrastructure

12:57

and security challenges that go way

12:59

beyond just basic monitoring.

13:01

It's a serious trade-off to consider.

13:03

A really powerful thought to chew on

13:05

as you design your own monitoring systems.

13:07

How much control are you really willing to take on?

13:10

And just a quick final reminder that this deep dive

13:13

was brought to you by Safeserver.

13:14

You can find out how Safeserver supports hosting

13:16

and digital transformation at www.safeserver.de.

13:20

Thanks again to them.

13:21

So go forth, stop fearing those silent failures.

13:24

We will catch you on the next deep dive.
