Welcome back to the deep dive. This is where we take a piece of complex tech, peel back all that jargon, and just give you the essentials. Today we are plunging right into the core of big data, focusing on something that's critical for handling these massive volumes of data that are arriving, you know, right now. We're talking about Apache Druid.
So if you're out there dealing with mountains of streaming data, or maybe you're trying to power a dashboard that has to answer questions instantly, I mean not in minutes but in milliseconds, then this deep dive is for you.
Our mission is pretty simple: we're going to break down Apache Druid, this powerful real-time analytics database, and give you a really clear entry point, a way to quickly grasp what it is, how it works, and why it's becoming so important in the big data world. We've been digging through the core descriptions, the feature lists, all the community docs, and we're ready to, well, distill that for you.
But first, I just want to mention that this deep dive is supported by SafeServer. SafeServer handles the hosting for this type of software and supports you in your digital transformation. You can find out more at www.safeserver.de. That's www.safeserver.de. Right. So let's just start with the fundamental definition, because Apache Druid isn't a general-purpose tool. It's very, very specific. It's a high-performance, real-time analytics database, and its design goal is incredibly focused.
It's all about delivering sub-second queries on just immense amounts of data. Sub-second? Okay. Yeah, and that's on both streaming data that just arrived and, you know, petabytes of historical data, all at scale and under huge load. Think of it this way: it's an open-source alternative to your traditional data warehouse, but it's been optimized specifically for that high-speed, interactive analysis. OLAP, right, online analytical processing. Exactly. It's built to answer those complex analytical questions instantly.
Okay, let's unpack that OLAP thing immediately, because, I mean, there are countless databases out there. You've got relational, NoSQL, data warehouses. Why do we need a different kind of database like Druid just for analytics? What's the specific problem here that those other tools can't solve?
The problem is what you could call the time to insight, or, you know, time to action. Traditional systems, especially the ones built for transactions, OLTP, they're great for, say, recording a single sale. Right, one record at a time. Exactly, and even traditional data warehouses, which are good for big, complex reports, they're often optimized for batch loading. That means query times can be slow when you're asking ad hoc, real-time questions. Druid is designed for that exact moment when you need a live answer.
Right now. So, for instance, you run a gaming platform and you need to know the total spend of users in Berlin in the last minute, filtered by their device type. Wow. Okay, that's specific. It is, and it requires instantaneous analysis across maybe billions of events. Druid steps in to make those ultra-fast, highly dimensional workflows possible.
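Just to make that concrete, here's a minimal sketch of what that kind of question can look like as a Druid SQL query sent over Druid's HTTP SQL API. The datasource name, columns, and the router address on localhost:8888 are all assumptions for illustration, not anything from the transcript.

```python
import requests

# Hypothetical datasource and column names; adjust to your own schema.
query = """
SELECT device_type, SUM(spend) AS total_spend
FROM game_purchases
WHERE city = 'Berlin'
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' MINUTE
GROUP BY device_type
"""

# Druid exposes a SQL endpoint, commonly reachable via the router on port 8888.
resp = requests.post(
    "http://localhost:8888/druid/v2/sql",
    json={"query": query, "resultFormat": "object"},
    timeout=10,
)
resp.raise_for_status()
for row in resp.json():
    print(row)
```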
So give me the wow metric here. If I'm used to my reports taking, I don't know, 30 seconds to run, what kind of speed are we really talking about with Druid? We are talking about true millisecond response times. Millisecond? The documentation consistently highlights its ability to run these really complex, high-dimensional OLAP queries in milliseconds, and that's even against datasets that range from billions to, get this, trillions of rows. Trillions. Okay, and critically, it does this without you having to pre-aggregate the data or pre-define the exact query you're going to ask. You can just ask anything instantly. That speed is impressive.
But that usually implies a single, highly optimized query. What happens when, say, a thousand users hit that same data all at once? Does the performance just tank? And that's where its concurrency power is so essential. Speed is pretty meaningless if the system just buckles under load. Druid's architecture, well, it's fundamentally built to handle that kind of immense concurrent load. We're talking about supporting anywhere from hundreds to even a hundred thousand queries per second. QPS. A hundred thousand queries per second, maintaining that consistent, low-latency performance.
This is precisely why it's used to power consumer-facing UIs or internal dashboards that, you know, hundreds of employees are using all day. Every click is a new query. Every single click on a filter, a visualization, that requires a fresh, lightning-fast query. Druid is built for that exact operational load.
Okay, but achieving that kind of performance usually means you have to throw massive, expensive hardware at the problem. Does Druid manage to be cost-efficient? It absolutely does, and it addresses that cost question mainly through resource efficiency. Because the architecture is so optimized, and we'll get into the compression and indexing in a bit, it just significantly reduces the total hardware footprint you need compared to other solutions. Yeah, compared to many competing data warehouses, it's really designed for massive scale, but with resource conservation built right into its DNA. It makes the total cost of ownership much, much lower over time.
Okay, so this is where it gets really interesting for me: the real-time aspect. We're in a streaming world now, data from logs, sensors, clicks. It never stops. So how does Druid deal with data that showed up one second ago? This is its core differentiator: true stream ingestion. Druid is, you could say, stream-native. If you've ever worked with traditional ETL processes, you know there's always a delay. The data lands, it gets transformed, then it's loaded. That can take minutes, sometimes hours. The dreaded batch window. Right, Druid just eliminates that wait time. It has native, baked-in integration with the big streaming platforms like Apache Kafka and Amazon Kinesis. This allows it to support query on arrival: it ingests and makes millions of events per second available for query. Immediately, with guaranteed consistency and ultra-low latency.
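To give a rough feel for that, here's a hedged sketch of starting Kafka ingestion by submitting a supervisor spec to Druid's supervisor API. The topic, broker address, datasource, and column names are made up for illustration, and the exact spec fields depend on your Druid version.

```python
import requests

# Hypothetical Kafka ingestion supervisor spec; all values are illustrative.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "clickstream",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        },
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "city", "device_type"]},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submitting the spec starts a supervisor that ingests the stream continuously;
# rows become queryable as soon as they are ingested.
resp = requests.post(
    "http://localhost:8888/druid/indexer/v1/supervisor",
    json=supervisor_spec,
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```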
So if a critical event happens, like a security alert or a sudden drop in sales, I can query that event and act on it within a second, not wait for some nightly batch process. That is exactly the use case. You are querying the live stream at the same time as all the historical data. You can instantly combine the hot, fresh data, what happened in the last minute, with the massive cold data from the last five years, all in a single, unified, sub-second query. And that gives you the complete picture. A truly complete, up-to-the-second operational picture. It's essential for things like monitoring and alerting apps. Okay, we've established it's ridiculously fast, it's scalable, it's stream-native. Now I want to look under the hood. How, how does it sustain millisecond speed on trillions of rows? What's the secret sauce in the architecture?
Well, the speed's no accident. It starts the very moment the data is ingested. The second data touches Druid, it's automatically transformed into what they call an optimized data format. It's a multi-layered process, but we can simplify it down to, say, three key principles. Okay. First, the data is columnarized. This is a huge departure from traditional row-based databases. Imagine your data is a physical library. A row-based database stores every book's info, author, title, date, subject, all bundled together on the shelf. That's a row. So if I just want to know all the subjects, I have to pull every single book record off the shelf. Precisely.
You have to read through everything. It takes time, it's a wasteful operation. Columnar storage is like taking that library and putting all the authors in one aisle, all the titles in another, and all the subjects in a third. So if you only need to run a query on the city column, you only read the city aisle. You skip all the other massive amounts of unrelated data. It just dramatically cuts down the I/O.
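As a toy illustration of that difference, not Druid's actual on-disk format, here's a tiny Python sketch comparing a row-oriented and a column-oriented layout of the same events.

```python
# Toy illustration only: Druid's real segment format is far more sophisticated.
rows = [
    {"city": "Berlin", "device": "ios", "spend": 4.0},
    {"city": "Paris", "device": "android", "spend": 2.5},
    {"city": "Berlin", "device": "android", "spend": 1.0},
]

# Row-oriented: answering "which cities appear?" touches every field of every row.
cities_from_rows = {row["city"] for row in rows}

# Column-oriented: the same data stored as one list per column.
columns = {
    "city": ["Berlin", "Paris", "Berlin"],
    "device": ["ios", "android", "android"],
    "spend": [4.0, 2.5, 1.0],
}

# Answering the same question now reads only the "city" column and nothing else.
cities_from_columns = set(columns["city"])

assert cities_from_rows == cities_from_columns
```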
Okay, so columnar storage is step one for efficiency. What's next? Step two is that the data is profoundly optimized for time-series analysis, specifically through time indexing. Since most data in Druid is time-series data, events happening over time, it immediately indexes everything based on time. It's like installing a master clock in your data. So you can jump to a specific time range instantly. Instantly. The last week, the third quarter of 2021, whatever, without scanning anything irrelevant. And the third principle is all about compression and encoding. This is the digital shorthand. The data gets highly optimized using techniques like dictionary encoding and bitmap indexing.
Okay, that sounds like serious jargon. Can you make that a bit more accessible? What's dictionary encoding actually doing for me? Of course. Think of dictionary encoding as creating a little lookup table for any repetitive values. So if you have a column with a million rows that only contains five city names, New York, London, Paris, Tokyo, Sydney, Druid doesn't store the full text "New York" a million times. That would be crazy. It would. Instead it assigns it a tiny numerical code, like 1, stores that dictionary mapping just once, and then stores millions of tiny 1s in the actual data segment. Ah, so you save a massive amount of storage space, and because the index is so much smaller, the query engine can search it way faster. It's looking for tiny numbers, not long strings. Exactly. It's an ultra-efficient data reduction that directly boosts speed.
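Just as a rough mental model of that idea, not Druid's actual internals, here's a tiny Python sketch of dictionary encoding plus a simple bitmap-style index over the encoded column.

```python
# Toy sketch of dictionary encoding; Druid's real segments use compressed,
# memory-mapped structures, but the principle is the same.
city_column = ["New York", "London", "New York", "Paris", "New York"]

# Build the dictionary once: each distinct value gets a small integer code.
dictionary = {}
encoded = []
for value in city_column:
    code = dictionary.setdefault(value, len(dictionary))
    encoded.append(code)

# A simple bitmap-style index: for each code, which row positions contain it?
bitmaps = {code: set() for code in dictionary.values()}
for position, code in enumerate(encoded):
    bitmaps[code].add(position)

# Filtering on city = 'New York' now means looking up one small integer
# and reading a precomputed set of row positions.
new_york_rows = bitmaps[dictionary["New York"]]
print(sorted(new_york_rows))  # [0, 2, 4]
```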
The result of all this is that every piece of data is basically a pre-optimized, ready-to-run package from the second it's ingested. Okay, so that's the data structure. How does the query engine actually use that to get to sub-second speeds? So the engine uses what's called a scatter/gather approach. Yeah, and the core philosophy is: don't move the data. Data movement is slow, it's expensive.
So Druid makes sure those optimized data segments are loaded directly into memory or onto local SSDs on the specific nodes where they live. When you run a query, the system intelligently figures out which nodes hold the data you need. It scatters the work to those nodes, they process it locally using their own CPU and memory, and then they only gather the small result sets back. That dramatically cuts down on network latency. It does. It avoids reading anything extra and keeps the network traffic to an absolute minimum. So instead of hauling the entire library to a central desk for processing, you just send little instruction cards to the specific aisles, and they send back only the chapter summaries you asked for. That's a perfect analogy. And that localized processing is just crucial for maintaining performance under all that concurrency we talked about.
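For intuition only, here's a tiny Python sketch of that scatter/gather pattern, with a couple of in-process "nodes" standing in for real Druid data nodes. It's a mental model of the idea, not how Druid is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "nodes": each holds its own local segments (here, spend values per city).
node_segments = [
    {"Berlin": [4.0, 1.0], "Paris": [2.5]},   # node 1
    {"Berlin": [3.0], "Tokyo": [7.0, 2.0]},   # node 2
]

def local_query(segments, city):
    # Each node aggregates only its own local data and returns a tiny partial result.
    return sum(segments.get(city, []))

def scatter_gather(city):
    # Scatter: send the query to every node in parallel.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda seg: local_query(seg, city), node_segments)
    # Gather: merge the small partial results into the final answer.
    return sum(partials)

print(scatter_gather("Berlin"))  # 8.0
```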
Speaking of concurrency, applications scale, and, you know, they often scale unpredictably. If my user base triples tomorrow, can I scale the system quickly and reliably? That brings us to its elastic architecture.
Druid was designed from the ground up to be distributed. All its components, ingestion, queries, orchestration, deep storage, they're all loosely coupled. And why does loose coupling matter? It means you can scale the query-processing nodes completely independently from the ingestion nodes or from the storage layer. So if you suddenly need to handle ten times the query traffic, you could just provision and add ten times the query nodes in minutes. You never have to interrupt the system or stop data ingestion. It's incredibly flexible, scaling both vertically and horizontally.
Yes. And what about when things go wrong? If a node goes down, does my whole dashboard crash? Absolutely not. Reliability is a cornerstone. Druid has non-stop reliability features built in, things like continuous backup to deep storage like S3 or HDFS, automated recovery, and multi-node replication of data segments. So there's always a copy. Always. If a query node fails, another node that already holds a copy of that data just automatically steps in. It ensures high availability and durability.
This is all very powerful, but for the user who needs to adopt this, the learning curve can be steep. How accessible is Druid for developers or analysts who are used to more traditional tools? Accessibility has been a huge focus in its recent development. I mean, for one, developers and analysts can use the familiar SQL API for everything. For everything? Not just querying? For all end-to-end data operations. You can use standard SQL for querying, sure, but also for defining ingestion tasks and performing data transformations. If you know SQL, you have a massive head start. That lowers the barrier to entry significantly. You don't have to learn some proprietary query language.
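For a flavor of what SQL-driven ingestion can look like, here's a hedged sketch that submits a SQL ingestion statement to Druid's SQL task endpoint. The table name, file URL, and columns are invented for illustration, and the exact syntax and endpoint availability depend on your Druid version and whether the SQL-based ingestion engine is enabled.

```python
import requests

# Hypothetical example: load a JSON file over HTTP into an "events" table
# using a SQL ingestion statement.
ingest_sql = """
INSERT INTO events
SELECT
  TIME_PARSE("timestamp") AS __time,
  city,
  spend
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/events.json"]}',
    '{"type":"json"}',
    '[{"name":"timestamp","type":"string"},{"name":"city","type":"string"},{"name":"spend","type":"double"}]'
  )
)
PARTITIONED BY DAY
"""

# SQL-based ingestion statements go to the SQL task endpoint rather than /druid/v2/sql.
resp = requests.post(
    "http://localhost:8888/druid/v2/sql/task",
    json={"query": ingest_sql},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # returns a task id you can then monitor in the web console
```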
What about defining the schema, the structure of the incoming data? Is that a rigid, manual process? That's another big accessibility win. Schema auto-discovery gives you the ease of a schemaless system, but with the performance benefits of a strict schema. As data streams in, Druid can automatically detect, define, and update column names and their data types. You don't have to stop everything and manually define a hundred columns before you can query your data. It just handles that on the fly.
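As a rough idea of how that shows up in practice, and assuming a reasonably recent Druid version, enabling schema auto-discovery is essentially one flag in the dimensions section of an ingestion spec. The surrounding field values here are illustrative.

```python
# Illustrative fragment of an ingestion spec's dataSchema section.
# With schema auto-discovery enabled, you don't list dimensions by hand;
# Druid detects column names and types as the data streams in.
data_schema = {
    "dataSource": "clickstream",
    "timestampSpec": {"column": "timestamp", "format": "iso"},
    "dimensionsSpec": {
        # Ask Druid to discover and type the columns automatically.
        "useSchemaDiscovery": True,
    },
}
```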
So if I were a new user trying to load my first data stream, what's that actual user experience like? Am I writing YAML scripts from day one? Not at all. There's a really practical, built-in web console. It's designed for easy interaction. Through that console you get a point-and-click wizard for setting up ingestion, whether you're loading a huge historical batch file or configuring a continuous stream from Kafka. So it guides you through it. It does, and you can also manage the entire cluster from that console, viewing your data sources, monitoring tasks, checking the health of all your services. So management and prototyping are all in one place. Exactly, and the console also has a query workbench. You can prototype and refine your Druid SQL queries or native queries interactively. It's the perfect little sandbox to see how your data will perform before you push those queries into your actual application.
This has been a fascinating deep dive. So, to summarize the key takeaways for you listening: Apache Druid delivers high-performance, real-time analytics by combining two key things. First, an optimized data format: it's always columnar, it's time-indexed, and it's highly compressed. Second, it has a stream-native, elastic architecture that enables that scatter/gather query approach. And the result of that combination is just massive concurrency and sub-second query performance on enormous data sets. And if we connect this back to the bigger picture, the whole architecture of Druid, prioritizing query on arrival and millisecond response times, it really raises an important question. In a world that's demanding instantaneous data and high-concurrency dashboards, how much longer will traditional database architectures, the ones that rely on lengthy ETL processes or pre-caching, how much longer will they remain competitive for operational apps and dynamic UIs? So the very definition of fast enough is changing.
It's changing very rapidly. A truly provocative thought to mull over. If you want to explore this any further, you can check out the quick start, the FAQ, and all the documentation through the Apache Druid project. We really encourage you to dive deeper. Thank you for joining us for this deep dive, and once again, a huge thanks to our sponsor SafeServer. Remember, SafeServer cares for the hosting of this type of software and supports you in your digital transformation. www.safeserver.de. That's www.safeserver.de.