Today's Deep-Dive: Apache Druid
R. 322


Section description

Apache Druid is a high-performance, real-time analytics database designed for sub-second query responses on massive datasets, both streaming and historical. Unlike traditional databases optimized for transactions or batch reporting, Druid excels at interactive, ad-hoc analysis, making it ideal for powering dashboards and applications requiring immediate insights. Its architecture prioritizes speed and concurrency, handling hundreds of thousands of queries per second with millisecond response times, even on trillions of rows. This performance is achieved through columnar storage, time indexing, and advanced compression techniques, along with a scatter-gather query approach that minimizes data movement. Druid’s stream-native ingestion capabilities allow it to query data as it arrives, eliminating traditional ETL delays and enabling analysis of live events alongside historical data. The database also boasts an elastic, distributed architecture for independent scaling of components, ensuring reliability and high availability through features like automated recovery and data replication. For developers and analysts, Druid offers accessibility through a familiar SQL API, schema auto-discovery, and a user-friendly web console for management and query prototyping. This focus on resource efficiency and cost-effectiveness, combined with its powerful real-time analytics capabilities, positions Druid as a critical tool in the evolving big data landscape, challenging the competitiveness of traditional architectures for operational applications.

Gain digital sovereignty now and save costs

Let’s have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? What is the state of your backups and security updates?

Digital sovereignty is easily achieved with Open Source software (which usually costs far less, too). Our division Safeserver offers hosting, operation, and maintenance for countless Free and Open Source tools.

Try it now!

Download transcript (.srt)
0:00

Welcome back to the deep dive. This is where we take a piece of complex tech

0:04

We peel back all that jargon and we just give you the essentials and today we are

0:08

plunging right into the core of big data

0:11

Focusing on something that's just critical for handling these massive volumes of

0:15

data that are arriving, you know right now

0:17

We're talking about Apache Druid

0:19

So if you're out there dealing with just mountains of streaming data or maybe you're

0:23

trying to power a dashboard that has to answer questions

0:26

Instantly, I mean not in minutes but in milliseconds then this deep dive is for you.

0:32

Our mission is pretty simple

0:33

we're going to break down Apache Druid, this powerful real-time analytics database

0:38

and just give you a really clear entry point a way to

0:40

Quickly grasp what it is how it works and why it's becoming so important in the big

0:44

data world

0:44

We've been digging through the core descriptions the feature lists all the

0:48

community docs and we're ready to well distill that for you

0:51

But first I just want to mention that this deep dive is supported by Safeserver

0:55

Safeserver handles the hosting for this type

0:57

Of software and supports you in your digital transformation. You can find out more

1:00

at www.safeserver.de

1:03

That's www.safeserver.de. Right. So let's just start with the fundamental

1:07

definition because Apache Druid isn't a general-purpose tool

1:10

It's very very specific. It's a high-performance

1:13

real-time analytics database and its design goal is incredibly focused

1:19

It's all about delivering sub-second queries on just immense amounts of data sub-second

1:25

Okay. Yeah, and that's on both streaming data that just arrived and you know

1:29

Petabytes of historical data all at scale and under huge load. Think of it this way

1:34

It's an open-source alternative to your traditional data warehouse

1:37

But it's been optimized specifically for that high-speed interactive analysis

1:42

OLAP, right, online analytical processing. Exactly. It's built to answer those

1:47

complex analytical questions instantly

1:48

Okay, let's unpack that OLAP thing immediately because I mean there are countless

1:52

databases out there

1:53

You've got relational, NoSQL, data warehouses. Why do we need a different kind of

1:57

database like druid just for analytics?

1:59

What's the specific problem here that those other tools can't solve?

2:02

The problem is what you could call the time to insight or you know time to action

2:07

Traditional systems, especially the ones built for transactions OLTP. They're great

2:12

for say recording a single sale

2:13

Right one record at a time exactly and even traditional data warehouses, which are

2:19

good for big complex reports

2:21

They're often optimized for batch loading. That means query times can be slow when

2:25

you're asking ad hoc real-time questions

2:27

Druid is designed for that exact moment when you need a live answer

2:31

Right now so for instance you run a gaming platform and you need to know the total

2:36

spend of users in Berlin in the last minute

2:39

Filtered by their device type Wow. Okay, that's specific it is and it requires

2:43

instantaneous analysis across

2:45

Maybe billions of events druid steps in to make those ultra-fast highly dimensional

2:50

workflows possible
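To make that example concrete, here is a minimal sketch of what such a query could look like through Druid's SQL-over-HTTP API. The datasource and column names (game_events, city, device_type, spend) are hypothetical, and it assumes a Druid router on the default port 8888; adjust both to your own setup.

```python
import json
import urllib.request

# Hypothetical datasource/columns; __time is Druid's built-in timestamp column.
sql = """
SELECT device_type, SUM(spend) AS total_spend
FROM game_events
WHERE city = 'Berlin'
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' MINUTE
GROUP BY device_type
"""

# Druid exposes SQL over HTTP; the response is a JSON array of result rows.
req = urllib.request.Request(
    "http://localhost:8888/druid/v2/sql",
    data=json.dumps({"query": sql}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for row in json.load(resp):
        print(row)
```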

2:52

So give me the wow metric here if I'm used to my reports taking I don't know 30

2:56

seconds to run

2:57

What kind of speed are we really talking about with druid? We are talking about

3:00

true millisecond response times millisecond the documentation

3:04

It consistently highlights its ability to run these really complex high-dimensional

3:09

OLAP queries in

3:10

Milliseconds and that's even against datasets that range from billions to get this

3:14

trillions of rows trillions

3:16

Okay, and critically it does this without you having to pre-aggregate the data or

3:20

pre-define the exact query

3:22

You're gonna ask you can just ask anything instantly that speed is impressive

3:26

But that usually implies a single highly optimized query what happens when?

3:32

Say a thousand users hit that same data all at once

3:35

Does the performance just tank? And that's where its concurrency power is

3:40

so essential. Speed is pretty meaningless if the system just buckles under load

3:43

Druid's architecture, well, it's fundamentally built to handle that kind of

3:47

immense concurrent load

3:48

We're talking about supporting anywhere from hundreds to even hundreds of thousands

3:52

of queries per second

3:53

QPS a hundred thousand queries per second maintaining that consistent low latency

3:58

performance

3:59

This is precisely why it's used to power consumer-facing UIs or internal dashboards

4:04

that you know

4:05

Hundreds of employees are using all day every click is a new query every single

4:08

click on a filter a

4:10

Visualization that requires a fresh lightning fast query

4:14

Druid is built for that exact operational load

4:17

Okay, but achieving that kind of performance usually means you have to throw

4:21

massive expensive hardware at the problem

4:23

Does druid manage to be cost-efficient? It absolutely does and it addresses that

4:28

cost question mainly through resource efficiency

4:31

Because the architecture is so optimized and we'll get into the compression and

4:34

indexing in a bit. It just

4:36

Significantly reduces the total hardware footprint you need compared to other

4:40

solutions. Yeah compared to many competing data warehouses

4:44

it's really designed for massive scale, but with

4:46

Resource conservation built right into its DNA. It makes the total cost of

4:51

ownership much much lower over time

4:53

Okay, so this is where it gets really interesting for me the real-time aspect

4:56

We're in a streaming world now data from logs sensors clicks. It never stops

5:01

So how does druid deal with data that showed up one second ago? This is its core

5:05

differentiator

5:06

True stream ingestion. Druid is, you could say, stream-native. If you've ever worked

5:12

with traditional ETL processes

5:14

You know, there's always a delay the data lands. It gets transformed then it's

5:18

loaded that can take minutes

5:20

Sometimes hours the dreaded batch window right druid just eliminates that wait time

5:25

It has native baked in integration with the big streaming platforms like Apache

5:30

Kafka

5:30

Amazon Kinesis. This allows it to support query on arrival. It ingests and

5:36

Immediately makes millions of events per second available for query immediately

5:40

with guaranteed consistency and ultra low latency
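As a rough, hedged sketch of what that stream-native setup involves: a Kafka ingestion supervisor is described as JSON and submitted to Druid over HTTP. The topic, datasource name, and broker address below are invented, and the exact spec fields should be checked against the Druid documentation for your version.

```python
import json
import urllib.request

# Sketch of a Kafka supervisor spec (field names per the Kafka ingestion docs; verify
# against your Druid version). Names like "game-events" and "kafka:9092" are made up.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "game_events",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            # Schema auto-discovery: let Druid detect dimension names and types on the fly.
            "dimensionsSpec": {"useSchemaDiscovery": True},
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "none"},
        },
        "ioConfig": {
            "topic": "game-events",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submit the supervisor; once it is running, events become queryable as they arrive.
req = urllib.request.Request(
    "http://localhost:8888/druid/indexer/v1/supervisor",
    data=json.dumps(supervisor_spec).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```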

5:43

So if a critical event happens like a security alert or a sudden drop in sales

5:48

I can query that event and act on it within a second not wait for some nightly

5:53

batch process

5:54

That is exactly the use case

5:56

You are querying the live stream at the same time as all the historical data

5:59

You can instantly combine the hot fresh data

6:02

What happened the last minute with the massive cold data from the last five years all

6:06

in a single unified

6:08

Subsecond query and that gives you the complete picture a truly complete up to the

6:12

second operational picture
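As a small illustration of that unified view, a single Druid SQL statement can aggregate the freshest minute and years of history in one pass; the sketch below reuses the hypothetical game_events datasource from earlier and uses a filtered aggregation for the hot slice.

```python
# One query, one datasource: the FILTER clause carves out the "hot" last minute while the
# outer WHERE spans the "cold" history back to an arbitrary cutoff (hypothetical names).
sql = """
SELECT
  SUM(spend) FILTER (WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' MINUTE) AS last_minute_spend,
  SUM(spend) AS total_spend_since_2019
FROM game_events
WHERE __time >= TIMESTAMP '2019-01-01 00:00:00'
"""
# Submit it via the same /druid/v2/sql endpoint shown earlier, or paste it into the
# web console's query workbench mentioned later in the episode.
```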

6:14

It's essential for things like monitoring and alerting apps. Okay, we've

6:17

established. It's ridiculously fast. It's scalable

6:20

It's stream-native. Now I want to look under the hood. How does it sustain

6:26

millisecond speed on

6:29

trillions of rows

6:31

What's the secret sauce in the architecture?

6:33

Well, the speed's no accident. It starts the very moment the data is ingested the

6:38

second data touches druid

6:39

It's automatically transformed into what they call an optimized data format

6:43

It's a multi-layered process, but we can simplify it down to say three key

6:47

principles

6:48

Okay, first the data is columnarized. This is a huge departure from traditional row

6:52

based databases

6:53

Imagine your data is a physical library. A row-based database stores every book's

6:58

info author title date subject all

6:59

Bundled together on the shelf. That's a row

7:02

So if I just want to know all the subjects I have to pull every single book record

7:05

off the shelf precisely

7:07

You have to read through everything. It takes time, it's a wasteful operation

7:09

Columnar storage is like taking that library and putting all the authors in one

7:13

aisle all the titles in another and all the subjects in

7:16

A third so if you only need to run a query on the city column

7:20

You only read the city aisle you skip all the other massive amounts of unrelated

7:24

data

7:24

It just dramatically cuts down the IO. Okay, so columnar storage is step one for

7:28

efficiency
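A toy sketch of that library analogy (this is just illustrative Python, not Druid's actual segment format): the same three records laid out row-wise and column-wise, where a query that only needs one field reads one small array.

```python
# Row-oriented: every record keeps all of its fields bundled together.
rows = [
    {"city": "Berlin", "device": "ios", "spend": 4.99},
    {"city": "Paris", "device": "android", "spend": 1.99},
    {"city": "Berlin", "device": "android", "spend": 0.99},
]

# Column-oriented: each field lives in its own contiguous "aisle".
columns = {
    "city": ["Berlin", "Paris", "Berlin"],
    "device": ["ios", "android", "android"],
    "spend": [4.99, 1.99, 0.99],
}

# A question about cities touches only the city column, never the other fields.
berlin_rows = sum(1 for city in columns["city"] if city == "Berlin")
print(berlin_rows)  # 2
```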

7:29

What's next? Step two is that the data is profoundly optimized for time series

7:34

analysis

7:36

Specifically through time indexing since most data in druid is time series data

7:41

events happening over time

7:42

It immediately indexes everything based on time. It's like installing a master

7:47

clock in your data

7:48

So you can jump to a specific time range instantly instantly the last week the

7:53

third quarter of 2021

7:54

Whatever without scanning anything irrelevant and the third principle is all about

7:57

compression and encoding

7:58

This is the digital shorthand the data gets highly optimized using techniques like

8:02

dictionary encoding and bitmap indexing

8:05

Okay, that sounds like serious jargon. Can you make that a bit more accessible?

8:08

What's dictionary encoding actually doing for me?

8:10

Of course. Think of dictionary encoding as creating a little lookup table for any

8:15

repetitive values

8:16

So if you have a column with a million rows that only contain five city names

8:20

New York, London, Paris, Tokyo, Sydney. Druid

8:24

Doesn't store the full-text New York a million times. That would be crazy

8:29

It would be. Instead it assigns it a tiny numerical code like 1. It stores that

8:33

dictionary mapping just once and then stores

8:36

Millions of tiny ones in the actual data segment

8:39

Ah, so you save a massive amount of storage space and because the index is so much

8:44

smaller the query engine can search it way faster

8:47

It's looking for tiny numbers not long strings. Exactly. It's an ultra-efficient

8:51

data reduction that directly boosts speed
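Here is a toy Python sketch of those two ideas together, dictionary encoding and a bitmap-style index. It is purely illustrative, not Druid's real on-disk layout.

```python
cities = ["New York", "London", "New York", "Paris", "New York", "London"]

# Dictionary encoding: store each distinct string once, keep only small integer codes.
dictionary = {}          # value -> code
codes = []
for value in cities:
    code = dictionary.setdefault(value, len(dictionary))
    codes.append(code)
# dictionary == {"New York": 0, "London": 1, "Paris": 2}
# codes      == [0, 1, 0, 2, 0, 1]  -- tiny integers instead of repeated strings

# Bitmap indexing: one bit per row, per value, saying "this row contains that value".
bitmaps = {
    value: [1 if c == code else 0 for c in codes]
    for value, code in dictionary.items()
}
# bitmaps["New York"] == [1, 0, 1, 0, 1, 0]; filters become cheap bitmap scans and ANDs.
```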

8:53

The result of all this is that every piece of data is basically a pre-optimized

8:57

ready-to-run package from the second

8:59

It's ingested. Okay, so that's the data structure

9:01

How does the query engine actually use that to get to sub second speeds?

9:04

So the engine uses what's called a scatter/gather approach

9:07

Yeah, and the core philosophy is don't move the data. Don't move the data. Data

9:11

movement is slow. It's expensive

9:13

So Druid makes sure those optimized data segments are loaded directly into memory or

9:19

local

9:19

SSDs on the specific nodes where they live when you run a query the system

9:25

Intelligently figures out which nodes hold the data you need it scatters the work

9:29

to those nodes

9:29

They process it locally using their own CPU and memory and then they only gather

9:33

the small result sets back

9:35

That dramatically cuts down on network latency

9:38

It does it avoids reading anything extra and keeps the network traffic to an

9:42

absolute minimum
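A toy sketch of the scatter/gather idea in Python (greatly simplified, with made-up node names): each node aggregates the segments it holds locally, and only the tiny partial results travel back to be merged.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data nodes, each owning the segments (here: lists of spend values) it serves.
node_segments = {
    "node-1": [4.99, 1.99, 0.99],
    "node-2": [2.49, 9.99],
    "node-3": [0.49],
}

def local_aggregate(segments):
    # Runs where the data lives: scan locally, return only a small partial result.
    return sum(segments)

with ThreadPoolExecutor() as pool:
    # Scatter: fan the work out to the nodes that hold relevant segments.
    partials = list(pool.map(local_aggregate, node_segments.values()))

# Gather: merge the small partials into the final answer; no raw data crossed the network.
print(sum(partials))
```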

9:43

So instead of hauling the entire library to a central desk for processing

9:48

You just send little instruction cards to the specific aisles and they send back

9:51

only the chapter summaries you asked for that's a perfect analogy

9:55

And that localized processing is just crucial for maintaining performance under all

9:59

that concurrency

10:00

we talked about. Speaking of concurrency

10:02

Applications scale and you know, they often scale unpredictably if my user base triples

10:07

tomorrow

10:07

Can I scale the system quickly and reliably? That brings us to its elastic

10:12

architecture

10:13

Druid was designed from the ground up to be distributed all its components ingestion

10:18

queries orchestration deep storage

10:20

They're all loosely coupled and why does loose coupling matter?

10:24

It means you can scale the query processing nodes completely independently from the

10:28

ingestion nodes or from the storage layer

10:30

So if you suddenly need to handle ten times the query traffic

10:33

You could just provision and add ten times the query nodes in minutes

10:37

You never have to interrupt the system or stop data ingestion. It's incredibly

10:40

flexible scaling both vertically and horizontally

10:43

Yes, and what about when things go wrong? If a node goes down does my whole

10:48

dashboard crash? Absolutely not

10:50

Reliability is a cornerstone druid has non-stop reliability features built in

10:55

things like continuous backup to deep storage like S3

10:59

Or HDFS, automated recovery, and multi-node replication of data segments

11:03

So there's always a copy always if a query node fails

11:06

Another node that already holds a copy of that data just automatically steps in it

11:10

ensures high availability and durability

11:13

This is all very powerful

11:14

But for the user who needs to adopt this the learning curve can be steep

11:18

How accessible is druid for developers or analysts who are used to more traditional

11:22

tools?

11:23

Accessibility has been a huge focus in its recent development

11:27

I mean for one developers and analysts can use the familiar SQL API for everything

11:32

for everything not just querying for all end-to-end

11:34

Data operations you can use standard SQL for querying sure

11:38

But also for defining ingestion tasks and performing data transformations
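To give a hedged flavor of SQL-driven ingestion: in recent Druid versions a single SQL statement can both point at external data and transform it into a table. The source URL and columns below are invented, and the exact EXTERN syntax should be checked against the SQL-based ingestion docs for your version.

```python
# Druid SQL for batch ingestion (sketch): EXTERN names the input source, its format, and
# its column signature; PARTITIONED BY sets the time partitioning of the target table.
ingest_sql = """
REPLACE INTO game_events OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS __time,
  city,
  device_type,
  spend
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/events.json"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "city", "type": "string"},
      {"name": "device_type", "type": "string"}, {"name": "spend", "type": "double"}]'
  )
)
PARTITIONED BY DAY
"""
# This can be prototyped in the web console's query workbench discussed below.
```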

11:43

If you know SQL you have a massive head start. That lowers the barrier to entry

11:47

significantly

11:47

You don't have to learn some proprietary query language

11:50

What about defining the schema the structure of the incoming data is that a rigid

11:55

manual process?

11:56

That's another big accessibility win. Schema auto-discovery gives you the ease of a

12:01

schemaless system

12:02

but with the performance benefits of a strict schema. As

12:05

data streams in, Druid can automatically detect, define, and update column names and

12:10

their data types

12:11

You don't have to stop everything and manually define a hundred columns before you

12:15

can query your data

12:16

It just handles that on the fly. So if I were a new user trying to load my first

12:20

data stream

12:21

What's that actual user experience like? Am I writing YAML scripts from day one? Not

12:27

at all

12:27

There's a really practical built-in web console. It's designed for easy interaction

12:33

Through that console you get a point-and-click wizard for setting up ingestion

12:37

whether you're loading a huge historical batch file or

12:40

Configuring a continuous stream from Kafka. So it guides you through it

12:43

It does and you can also manage the entire cluster from that console viewing your

12:48

data sources monitoring tasks

12:50

Checking the health of all your services. So management and prototyping are all in

12:54

one place

12:54

Exactly, and the console also has a query workbench. You can prototype and refine

12:59

your druid SQL queries or native queries

13:01

Interactively, it's the perfect little sandbox to see how your data will perform

13:05

before you push those queries into your actual application

13:09

This has been a fascinating deep dive

13:10

so to summarize the key takeaways for you listening Apache druid delivers high

13:14

performance real-time analytics by combining two key things first an

13:19

Optimized data storage. It's always columnar. It's time indexed and it's highly

13:23

compressed

13:23

Second it has a stream-native elastic architecture that enables that scatter/gather

13:28

query approach and the result of that combination is just

13:32

Massive concurrency and sub second query performance on enormous data sets

13:37

and if we connect this back to the bigger picture the whole architecture of druid

13:42

prioritizing query on arrival and

13:43

millisecond response times it really raises an important question in a world that is

13:48

demanding instantaneous data and high-concurrency

13:50

Dashboards, how much longer will traditional database architectures the ones that

13:55

rely on lengthy ETL processes or pre-caching?

13:57

How much longer will they remain competitive for operational apps and dynamic UIs

14:02

so the very definition of fast enough is changing

14:04

It's changing very rapidly a truly provocative thought to mull over if you want to

14:08

explore this any further

14:09

You can check out the quick start the FAQ and all the documentation through the

14:13

Apache druid project

14:15

We really encourage you to dive deeper. Thank you for joining us for this deep dive

14:19

and once again a huge

14:20

Thanks to our sponsor Safeserver

14:22

Remember Safeserver cares for the hosting of this type of software and supports

14:26

you in your digital transformation

14:27

www.safeserver.de

14:27

www.safeserver.de