Today's Deep-Dive: Prometheus
Ep. 132

Episode description

The discussion explores the importance of monitoring systems in the digital world, focusing on Prometheus, an open-source tool for systems and service monitoring. Prometheus functions as both a monitoring system and a time series database, recording metrics over time, much like a running health check for software. It uses a multi-dimensional data model that attaches contextual labels to data points for richer insights. Users query this data with the Prometheus Query Language (PromQL), which supports detailed analysis and visualization of trends. Prometheus collects data using an HTTP pull model, actively retrieving measurements from monitored systems, though short-lived processes can instead push their data to a gateway that Prometheus then scrapes.

Service discovery lets Prometheus automatically identify new systems to monitor, while static configuration provides a manual option. For visualization, Prometheus includes an Expression Browser and integrates with Grafana for advanced dashboarding. Storage is designed to be efficient, and each server runs independently, so the others keep operating even if one fails. Alerts are defined with PromQL and routed to teams by Alertmanager. Overall, Prometheus is a critical tool for maintaining the health of digital services, and its open-source nature makes it accessible to anyone interested in monitoring their software systems.

0:00

Ever take a look at your car's dashboard?

0:02

You know, just to check that everything's running

0:04

how it should be.

0:05

Yeah, definitely.

0:07

Well, today we're doing kind of the same thing,

0:09

but for the digital world,

0:11

we're peeking under the hood at these systems

0:13

that make sure all our favorite websites and apps

0:15

are running smooth and staying healthy.

0:17

Makes sense.

0:18

Yeah, like, ever had a website just freeze up on you?

0:21

Or maybe an app that just crashes out of nowhere?

0:24

Happens all the time.

0:25

Well, chances are there's a monitoring system

0:28

working behind the scenes,

0:29

trying to figure out what went wrong.

0:30

And that's exactly what we're gonna explore today.

0:32

Sounds interesting.

0:33

Specifically, we're gonna do a deep dive into Prometheus.

0:36

Prometheus.

0:37

Yeah, super popular.

0:38

It's this open source tool that a lot of folks use for,

0:42

well, this kind of monitoring.

0:43

And for this deep dive, we went straight to the source.

0:46

We got all this great info directly

0:48

from the Prometheus project on GitHub

0:51

and their official website too.

0:53

So, straight from the horse's mouth.

0:55

Exactly.

0:56

But before we really get started,

0:58

I want to give a big shout out to Safe Server.

1:00

They're the ones who made this whole deep dive possible.

1:03

They provide amazing hosting for software,

1:06

and they even offer some really expert advice

1:08

on digital transformation too.

1:10

So, if you're interested, check them out.

1:12

www.safe-server.de.

1:15

I'll have to take a look.

1:16

So, today's goal is pretty simple.

1:18

We want to give you a clear, easy-to-understand

1:20

introduction to Prometheus.

1:22

What it is, how it works, and why

1:25

it's become so important for, well, all things software.

1:28

Sounds good to me.

1:29

Yeah.

1:29

We're really aiming to make this accessible to everyone,

1:32

even if you're just kind of curious about what

1:33

goes on behind the scenes in the digital world.

1:35

Right, right.

1:36

OK, so let's jump in.

1:38

Right off the bat, the Prometheus GitHub page calls it,

1:41

and I'm quoting here, a systems and service monitoring system

1:45

and a time series database.

1:50

Yeah, so what does that actually mean in plain English?

1:53

Well, a monitoring system is, well,

1:55

pretty much what it sounds like.

1:56

It's a way to keep a close eye on all those digital tools

1:59

we use, making sure they're running the way they should,

2:01

kind of like a doctor, constantly checking

2:03

a patient's vital signs.

2:05

And then there's the whole time series database part.

2:08

Think of it like a diary, but for your software.

2:10

It's constantly recording different measurements,

2:12

like how busy the software is or how fast it's responding,

2:15

and it does this at specific points in time.

2:18

So you can actually see how things change over days, hours,

2:21

even minutes.

2:22

Oh, interesting.

2:22

So it's like tracking the ups and downs.

2:24

Exactly.

2:25

And the way Prometheus organizes all that info in this diary,

2:29

it's pretty neat.

2:30

Instead of just having simple entries,

2:32

it uses something called a multi-dimensional data model.

2:34

Multi-dimensional?

2:35

Yeah, it basically means that every bit of info

2:38

Prometheus records has a specific name.

2:41

We call that the metric name.

2:42

It could be something like website visits or even

2:45

server temperature.

2:46

OK, so far so good.

2:47

But here's the cool part.

2:49

It also has these extra labels attached to it.

2:52

Those are the key value dimensions.

2:54

Think of it like this.

2:55

Instead of just saying temperature, 25 degrees,

2:59

Prometheus might record something like temperature.

3:01

And then in curly brackets, room equals living room,

3:05

sensor equals one, and then 25 degrees.

3:08

So it's adding context to the numbers.

3:10

Exactly.

3:10

Those extra labels like room and sensor,

3:13

they give you a much richer picture of what's going on.
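A minimal Python sketch of the labeled sample described here. It writes the temperature example as a metric name, a set of labels, and a value, in the style of the Prometheus text exposition format. The helper name `format_sample` is made up for illustration and is not part of any Prometheus library.

```python
def format_sample(name, labels, value):
    """Render one sample as: name{label="value",...} value"""
    def escape(text):
        # Label values escape backslashes, double quotes, and newlines.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if not labels:
        return f"{name} {value}"
    pairs = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"


line = format_sample("temperature", {"room": "living room", "sensor": "1"}, 25)
print(line)  # temperature{room="living room",sensor="1"} 25
```

Each distinct combination of label values is its own time series, which is what makes the later "living room only" style of question possible.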

3:16

Then you can ask some really specific questions like,

3:18

what was the temperature in the living room for the past hour?

3:22

That makes sense.

3:23

Yeah.

3:23

And to actually ask those specific questions,

3:26

Prometheus has this special language.

3:28

It's called PromQL.

3:30

PromQL?

3:31

Stands for Prometheus Query Language.

3:33

OK, another acronym.

3:35

I know, right?

3:35

But query language might sound kind of intimidating,

3:39

but it's really just a way to search and analyze

3:41

all that data in your software's diary.

3:44

So filtering out the noise?

3:46

Exactly.

3:47

Imagine you have a giant spreadsheet

3:49

with all these measurements.

3:50

PromQL is like the super-powered search bar,

3:53

letting you pull out exactly the info you need.

3:56

You can use it to create graphs, spot trends, even

3:59

set up alerts.

4:01

For example, you could use PromQL to say,

4:03

show me a graph of how many people logged

4:05

into the website in the last hour.

4:07

Or maybe something like, tell me if the server's memory usage

4:09

has been too high for more than five minutes.
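As a rough sketch, the two questions just mentioned could be written as PromQL expressions and sent to the server's HTTP query endpoint (`/api/v1/query`). Several things here are assumptions: the server address is the common local default, `logins_total` is a hypothetical counter, the memory metrics are the names exposed by the node exporter, and the 90% threshold is arbitrary. The "for more than five minutes" part belongs to an alerting rule, not to the query itself.

```python
import urllib.parse

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local server on the default port


def instant_query_url(expr, base=PROMETHEUS_URL):
    """Build the URL for an instant query against the HTTP API."""
    query_string = urllib.parse.urlencode({"query": expr})
    return f"{base}/api/v1/query?{query_string}"


# Hypothetical counter: how many logins happened over the last hour.
logins_last_hour = "sum(increase(logins_total[1h]))"

# Fraction of memory in use above 90% (node exporter metric names).
memory_high = "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9"

for expr in (logins_last_hour, memory_high):
    print(instant_query_url(expr))
```

The same expressions can be typed straight into a graphing tool; the URL form is only there to show that a query is ordinary text sent over HTTP.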

4:11

That's pretty powerful.

4:12

Oh, yeah.

4:13

That's the power of PromQL.

4:15

It helps you turn all that raw data into actual useful insights.

4:19

Makes sense.

4:21

So Prometheus is recording all this data,

4:23

but how does it actually collect it?

4:25

Well, the way it collects data is pretty interesting, too.

4:28

It uses what's known as an HTTP pull model.

4:31

An HTTP pull model.

4:33

Yeah, so basically, instead of your software

4:35

just sending its measurements to Prometheus automatically,

4:38

Prometheus actually reaches out and asks

4:40

for the latest readings.

4:41

And it does this at regular intervals.
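The pull idea can be illustrated with a small, self-contained Python sketch: a stand-in "target" exposes its readings at `/metrics`, and a `scrape` function plays the monitoring side by fetching them over HTTP. This is only a toy using the standard library, not how Prometheus itself is implemented; a real server repeats the request on a configured scrape interval.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    """Stand-in target that exposes one sample at /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = b'temperature{room="living room",sensor="1"} 25\n'
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def scrape(url):
    """One scrape: an HTTP GET initiated by the monitoring side."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # ignore proxies
    with opener.open(url, timeout=5) as response:
        return response.read().decode()


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 picks any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

scraped = scrape(f"http://127.0.0.1:{server.server_port}/metrics")
print(scraped, end="")

server.shutdown()
server.server_close()
```

Note that the target never needs to know where the monitoring server lives; it only has to answer when asked.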

4:43

So it's actively checking in.

4:45

Exactly.

4:45

Kind of like a health inspector.

4:46

They go around to different restaurants

4:48

and check on things instead of waiting for the restaurants

4:50

to report in.

4:51

That's a good analogy.

4:52

Yeah, this whole polling approach,

4:53

it gives you more control and can be way more

4:56

reliable in some cases.

4:57

Especially if the systems being monitored

5:00

have spotty network connections.

5:03

While Prometheus has to actively go and get the data,

5:06

it's actually more robust than relying

5:08

on each individual system to constantly push

5:11

the data its way.

5:12

I see.

5:12

So Prometheus is actively going out and asking for this info.

5:16

But what happens if you have a process that

5:19

only runs for a short time?

5:21

Like a script that just does something once a day

5:24

and then shuts down.

5:25

It wouldn't even be there for Prometheus to check in with.

5:28

Oh, that's a great point.

5:29

So for those kind of short-lived batch jobs,

5:33

as they're called, Prometheus can also handle pushing data.

5:37

And it does this through something called a gateway.

5:39

A gateway.

5:40

Yeah, so instead of waiting around to be asked,

5:43

the batch job can send its data directly to this gateway

5:46

when it finishes up its work.

5:48

Then Prometheus can come along later

5:50

and pull the data from there.
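A hedged sketch of the push side, assuming the gateway is the Prometheus Pushgateway running locally on its usual port 9091. The job name and metric are hypothetical. The code only builds the HTTP request so it can run without a gateway present; a real batch job would send it once its work is done.

```python
import urllib.parse
import urllib.request

PUSHGATEWAY_URL = "http://localhost:9091"  # assumption: local Pushgateway, default port


def build_push_request(job, metrics_text, base=PUSHGATEWAY_URL):
    """Build a PUT request that replaces the metrics stored for one job."""
    job_path = urllib.parse.quote(job, safe="")
    return urllib.request.Request(
        url=f"{base}/metrics/job/{job_path}",
        data=metrics_text.encode(),
        method="PUT",
    )


# Hypothetical metric: when the nightly backup last finished successfully.
payload = "nightly_backup_last_success_timestamp_seconds 1700000000\n"
request = build_push_request("nightly_backup", payload)
print(request.get_method(), request.full_url)

# A real job would now send it, for example:
# urllib.request.urlopen(request, timeout=5)
```

The gateway holds on to the last pushed values, and Prometheus scrapes the gateway like any other target.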

5:51

Like if you need to send a message,

5:53

but the other person's not always available,

5:55

so you just leave it at a central mailbox

5:56

for them to pick up later.

5:57

Makes sense.

5:58

So Prometheus is gathering all this data,

6:00

but it needs to know where to go to get it

6:02

in the first place, right?

6:03

How does it figure out what to actually monitor?

6:06

I saw something about service discovery and static configuration.

6:10

Yeah.

6:11

So service discovery is basically Prometheus

6:14

being really smart and automatically finding

6:16

the things it should be keeping an eye on.

6:18

In today's software world, things are constantly changing.

6:21

New servers popping up, old ones shutting down.

6:24

But Prometheus can actually connect

6:26

with systems that manage all these changes,

6:28

like Kubernetes or different cloud platforms,

6:32

so it automatically knows when something new pops up.

6:35

And it just starts monitoring it without you

6:37

having to lift a finger.

6:38

Wow, that's convenient.

6:39

Right.

6:39

And then there's static configuration.

6:41

That one's a bit simpler.

6:43

You basically just give Prometheus

6:44

a list, like a list of all the specific addresses

6:47

of the systems you want it to monitor.

6:49

And you put this list in its configuration file.

6:52

Both methods just make sure Prometheus

6:54

knows where to find the data it needs.
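One middle ground between a hand-written list and full service discovery is file-based discovery, where Prometheus reads target lists from JSON files that other tools can regenerate. The Python sketch below produces such a document. The addresses and the `env` label are made up, and it assumes the server configuration already points a `file_sd_configs` entry at the resulting file.

```python
import json


def file_sd_targets(addresses, labels):
    """Return one target group in the JSON shape used by file-based discovery."""
    return [{"targets": list(addresses), "labels": dict(labels)}]


groups = file_sd_targets(
    ["10.0.0.5:9100", "10.0.0.6:9100"],  # hypothetical host:port addresses
    {"env": "demo"},                      # hypothetical label attached to every target
)
document = json.dumps(groups, indent=2)
print(document)
```

Writing this output to the watched file is enough for new targets to be picked up, with no manual edit of the main configuration.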

6:57

So we're collecting data.

6:58

We got this awesome language to ask all sorts of questions

7:01

about it.

7:01

But how do we actually see what's going on?

7:04

Well.

7:04

I saw something in the documentation about graphing

7:06

and dashboarding support, it even

7:08

mentioned a built-in expression browser and integration

7:11

with Grafana.

7:12

Yes.

7:12

Seeing all that data visually is super important.

7:15

It helps you understand trends and spot problems quickly.

7:18

Prometheus actually has a basic tool built right in.

7:21

It's called the Expression Browser.

7:24

You can type in your PromQL queries right there

7:26

and see the results show up as graphs or tables.

7:29

Pretty handy.

7:30

So neat.

7:30

But for something more sophisticated,

7:32

something that'll give you a really good overview,

7:34

Prometheus often works with another open source

7:36

tool called Grafana.

7:38

Grafana can hook right into Prometheus

7:40

and use it as a data source.

7:42

Then you can build these really rich customizable dashboards.

7:46

It's great for visualizing all your important monitoring data

7:49

all in one place.

7:50

That sounds way better for getting a quick overview.

7:52

Now think about it.

7:53

All this time-based data we're talking about,

7:55

it's got to take up a lot of storage, right?

7:57

Oh, for sure.

7:58

Efficient storage is critical for any system that's

8:01

dealing with this much data over time.

8:03

But Prometheus is pretty clever about it.

8:06

It stores its data in this special format

8:08

on the local disk of the server it's running on.

8:11

This format is designed specifically

8:13

for time series data, making it super efficient to store

8:16

and query the info.

8:18

And to speed things up even more,

8:19

it keeps some of the most recent data in memory.

8:22

So it's optimized for speed.

8:24

Exactly.

8:24

And another important thing is that each Prometheus server

8:27

is kind of like its own little island.

8:29

It manages its own data and doesn't really

8:32

rely on other servers.

8:33

This makes it more reliable.

8:35

Because even if one server crashes,

8:37

the others can keep on trucking.

8:39

For larger setups, you might have multiple Prometheus

8:41

instances running, each one keeping

8:43

an eye on a different part of your infrastructure.

8:45

So it's designed for redundancy, too.

8:47

That's great.

8:48

But speaking of things going wrong,

8:50

how does Prometheus actually let you know when there's a problem?

8:53

I think I read something about precise alerting

8:55

based on that PromQL language we talked about.

8:58

Ah, yeah, this is where Prometheus gets really proactive.

9:01

With PromQL, you can set up what are called alerting rules.

9:05

Alerting rules.

9:06

Yeah.

9:07

These rules are basically like instructions.

9:10

They say something like, if this specific thing happens

9:13

in our data, send out an alert.

9:15

Could be something like website response times getting too slow,

9:19

or a server running out of memory,

9:21

whatever you define as a potential problem.

9:23

And because these rules are based on PromQL,

9:26

they can be really specific.

9:28

Exactly.

9:29

They can even factor in those multidimensional labels

9:31

we talked about earlier.

9:32

So you can get super granular with your alerts.
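To make the "too high for more than five minutes" idea concrete, here is a small Python simulation of how a rule with a hold duration might move between the inactive, pending, and firing states. It is a simplified model for illustration, not Prometheus's actual rule evaluator, and the rule name, threshold, and sample values are invented.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    threshold: float
    hold_seconds: int  # plays the role of the rule's "for" duration

    def state_at(self, samples):
        """samples: list of (timestamp_seconds, value), oldest first.

        Returns 'inactive', 'pending', or 'firing' as of the last sample.
        """
        breach_start = None
        for timestamp, value in samples:
            if value > self.threshold:
                if breach_start is None:
                    breach_start = timestamp  # condition just became true
            else:
                breach_start = None  # condition cleared, start over
        if breach_start is None:
            return "inactive"
        held_for = samples[-1][0] - breach_start
        return "firing" if held_for >= self.hold_seconds else "pending"


rule = AlertRule(name="HighMemoryUsage", threshold=0.9, hold_seconds=300)
samples = [(0, 0.85), (60, 0.93), (120, 0.95), (360, 0.97)]

print(rule.state_at(samples[:3]))  # pending: above threshold for only 60 seconds
print(rule.state_at(samples))      # firing: above threshold for 300 seconds
```

The hold duration is what keeps a brief spike from paging anyone.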

9:35

Now when an alert is triggered, Prometheus

9:37

doesn't actually send the notification itself.

9:40

It hands it off to another tool called Alertmanager.

9:42

Alertmanager.

9:43

What's that do?

9:44

Well, Alertmanager is the one that

9:45

handles all the notifications.

9:47

It groups similar alerts together, silences them

9:49

if needed, and makes sure they get to the right people

9:52

through channels like email, Slack, or even text messages.
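The grouping step can be pictured with a short Python sketch: alerts that share the same values for a chosen set of labels end up in one notification. This is a toy model of the idea only. In Alertmanager itself, grouping is set up in its routing configuration, and the alert names, teams, and instances below are invented.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert["labels"]["instance"])
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighMemoryUsage", "team": "web", "instance": "web-1"}},
    {"labels": {"alertname": "HighMemoryUsage", "team": "web", "instance": "web-2"}},
    {"labels": {"alertname": "DiskAlmostFull", "team": "db", "instance": "db-1"}},
]

grouped = group_alerts(alerts, ["alertname", "team"])
for key, instances in grouped.items():
    print(key, "->", instances)
```

Here the web team would get one message covering both instances instead of two separate ones.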

9:55

So it's like the messenger.

9:56

Exactly.

9:58

This whole system helps teams respond to issues super quickly

10:01

before they turn into major headaches.

10:03

That sounds incredibly valuable.

10:04

Now, for the folks out there who are actually building these software

10:07

services, how easy is it for them to make their apps

10:10

talk to Prometheus and share all these metrics?

10:13

I saw that the documentation mentioned client libraries.

10:17

Oh, yeah, the client libraries.

10:18

This is a huge advantage of using Prometheus.

10:21

It's got these libraries available for instrumenting code

10:24

in over 10 popular programming languages, Python, Java, Go,

10:30

you name it.

10:31

Instrument, what's that mean?

10:32

It basically means add little snippets of code

10:35

to your application so it can expose

10:37

all those internal metrics in a format

10:39

that Prometheus understands.
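To give a feel for what instrumenting means, here is a hand-rolled Python stand-in for the kind of labeled counter a client library provides. It is deliberately a toy and not the official client library's API; the metric name and labels are hypothetical.

```python
import threading


class Counter:
    """Toy labeled counter; the official client libraries provide a real one."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, amount=1, **labels):
        if amount < 0:
            raise ValueError("counters can only go up")
        key = tuple(labels[name] for name in self.label_names)
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        """Return the counter in text exposition style."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"


requests_total = Counter("http_requests_total", "Requests handled.", ["method", "path"])
requests_total.inc(method="GET", path="/")
requests_total.inc(method="GET", path="/")
requests_total.inc(method="POST", path="/login")
print(requests_total.render(), end="")
```

In an instrumented application, a call like `inc` sits inside the request handler, and the rendered text is what gets served when the metrics endpoint is scraped.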

10:40

So it's like speaking the same language.

10:42

Exactly.

10:43

These libraries make it super easy for developers

10:45

to keep track of all sorts of things,

10:47

like how many requests their app is handling,

10:49

how long those requests are taking to process,

10:51

how much memory the app is using, all that good stuff.

10:53

It's like building sensors right into your software

10:55

so you can get a clear reading of its vital signs.

10:58

That's a great way to put it.

10:59

Now, what about systems or applications

11:01

that weren't built using these fancy client libraries,

11:05

like existing third-party software, or maybe even

11:09

hardware devices?

11:10

That's where exporters come in.

11:12

Exporters.

11:13

Yeah, think of them like translators.

11:15

They bridge the gap between different systems.

11:17

There's a ton of exporters out there

11:19

that can collect metrics from all sorts of third-party stuff,

11:22

like your operating system, Docker containers, databases,

11:26

web servers, all that.

11:27

And then they present those metrics in a way

11:29

that Prometheus can understand.
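The translator role can be sketched in a few lines of Python. Assume some third-party application reports its status as JSON; the exporter reads that and re-presents the numeric fields as metric lines. The status fields and the `demo_app` prefix are invented for illustration, and real exporters do considerably more (naming conventions, labels, types).

```python
import json


def export_metrics(status_json, prefix="demo_app"):
    """Translate a JSON status document into plain metric lines."""
    status = json.loads(status_json)
    lines = []
    for key in sorted(status):
        value = status[key]
        # Samples are numeric, so skip strings, booleans, and nested data.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        lines.append(f"{prefix}_{key} {value}")
    return "\n".join(lines) + "\n"


raw_status = '{"connections": 12, "uptime_seconds": 3600, "version": "1.4"}'
print(export_metrics(raw_status), end="")
```

The application being monitored stays untouched; only the exporter knows both formats.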

11:30

So basically, you can integrate Prometheus

11:32

with a huge range of technologies

11:34

without having to actually modify those systems directly.

11:37

That's incredibly flexible.

11:38

It is.

11:39

So for anyone listening who might

11:41

be interested in trying Prometheus out for themselves,

11:44

what's the best way to get started?

11:45

The GitHub page mentioned it's open source

11:48

and part of the Cloud Native Computing Foundation.

11:50

They also talked about different ways to install it.

11:52

You got it.

11:53

Prometheus is 100% open source, so it's

11:55

free to use and modify to your heart's content.

11:58

That's awesome.

11:59

It is.

12:00

And it's also a graduated project

12:01

under the Cloud Native Computing Foundation.

12:04

That's a pretty big deal, actually.

12:06

It means that it's a mature, stable, and widely used

12:09

technology within the Cloud Native world.

12:11

And getting started is pretty straightforward, too.

12:14

Like you mentioned, you can just download pre-compiled versions

12:16

for different operating systems straight

12:18

from the Prometheus website.

12:20

That's usually the quickest and easiest way

12:22

to get it up and running.

12:23

Makes sense.

12:24

But if you're comfortable with containers,

12:26

there's also official Docker images available.

12:30

And if you're more technically inclined,

12:32

or maybe you want to contribute to the project itself,

12:35

you can even build it directly from the source code.

12:37

The Prometheus website, prometheus.io,

12:40

has detailed instructions for all these methods.

12:43

So there's really an option for everyone.

12:45

Exactly.

12:46

So to quickly recap our deep dive into Prometheus,

12:48

it's this amazing open source monitoring system

12:51

and time series database that helps you understand

12:54

how healthy and how well your software

12:56

and services are performing.

12:58

Uses a multi-dimensional data model,

13:00

has this powerful query language called PromQL,

13:03

and uses a pull-based approach to gather metrics.

13:06

Right.

13:07

And it offers precise alerting,

13:10

integrates with visualization tools like Grafana,

13:13

and has a whole ecosystem of client libraries

13:15

and exporters to make things easier.

13:18

It's honestly a fundamental tool

13:19

for keeping the digital world running smoothly.

13:21

Couldn't agree more.

13:23

And one final thought for you,

13:24

even if you don't work directly with software,

13:27

just think about how much you rely on digital services

13:29

every single day.

13:31

Behind the scenes, tools like Prometheus

13:33

are constantly working hard to make sure those services

13:36

are available and working the way they should.

13:38

True. Yeah.

13:39

Just understanding the basics

13:40

of how these systems are monitored,

13:42

it can really give you a new appreciation

13:44

for the complexity and the effort

13:46

that goes into keeping our connected world running.

13:49

And if you're interested in digging deeper,

13:51

I highly encourage you to visit

13:52

the official Prometheus website, prometheus.io.

13:56

It's a fantastic resource, trust me.

13:58

I'll have to check it out.

13:59

Definitely do.

14:00

Well, that was our deep dive into Prometheus.

14:02

Thanks for joining us.

14:03

It was fun.

14:04

And a big thanks once again to Safe Server

14:06

for making this whole thing possible.

14:08

If you're looking for reliable software hosting

14:10

or expert advice on all things digital transformation,

14:13

be sure to visit their website at www.safeserver.de.

14:18

They're great.

14:18

They really are.

14:19

All right, that's it for today.

14:21

See ya.

14:21

See ya.