Welcome back to the deep dive. This is where we take a piece of complex tech, peel back all that jargon, and just give you the essentials. Today we are plunging right into the core of big data, focusing on something that's critical for handling these massive volumes of data that are arriving, you know, right now. We're talking about Apache Druid.
So if you're out there dealing with mountains of streaming data, or maybe you're trying to power a dashboard that has to answer questions instantly, I mean not in minutes but in milliseconds, then this deep dive is for you.
Our mission is pretty simple: we're going to break down Apache Druid, this powerful real-time analytics database, and give you a really clear entry point, a way to quickly grasp what it is, how it works, and why it's becoming so important in the big data world. We've been digging through the core descriptions, the feature lists, all the community docs, and we're ready to, well, distill that for you.
But first, I just want to mention that this deep dive is supported by SafeServer. SafeServer handles the hosting for this type of software and supports you in your digital transformation. You can find out more at www.safeserver.de. That's www.safeserver.de. Right. So let's just start with the fundamental definition, because Apache Druid isn't a general-purpose tool. It's very, very specific. It's a high-performance, real-time analytics database, and its design goal is incredibly focused.
It's all about delivering sub-second queries on just immense amounts of data. Sub-second? Okay. Yeah, and that's on both streaming data that just arrived and, you know, petabytes of historical data, all at scale and under huge load. Think of it this way: it's an open-source alternative to your traditional data warehouse, but it's been optimized specifically for that high-speed, interactive analysis. OLAP, right, online analytical processing. Exactly. It's built to answer those complex analytical questions instantly.
Okay, let's unpack that OLAP thing immediately, because, I mean, there are countless databases out there. You've got relational, NoSQL, data warehouses. Why do we need a different kind of database like Druid just for analytics? What's the specific problem here that those other tools can't solve?
The problem is what you could call the time to insight, or, you know, time to action. Traditional systems, especially the ones built for transactions, OLTP, they're great for, say, recording a single sale. Right, one record at a time. Exactly, and even traditional data warehouses, which are good for big, complex reports, they're often optimized for batch loading. That means query times can be slow when you're asking ad hoc, real-time questions. Druid is designed for that exact moment when you need a live answer.
Right now. So, for instance, you run a gaming platform and you need to know the total spend of users in Berlin in the last minute, filtered by their device type. Wow. Okay, that's specific. It is, and it requires instantaneous analysis across maybe billions of events. Druid steps in to make those ultra-fast, highly dimensional workflows possible.
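Just to make that concrete, here's a minimal sketch of what that kind of question can look like as a Druid SQL query sent over Druid's HTTP SQL API. The datasource name, columns, and the router address on localhost:8888 are all assumptions for illustration, not anything from the transcript.

```python
import requests

# Hypothetical datasource and column names; adjust to your own schema.
query = """
SELECT device_type, SUM(spend) AS total_spend
FROM game_purchases
WHERE city = 'Berlin'
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' MINUTE
GROUP BY device_type
"""

# Druid exposes a SQL endpoint, commonly reachable via the router on port 8888.
resp = requests.post(
    "http://localhost:8888/druid/v2/sql",
    json={"query": query, "resultFormat": "object"},
    timeout=10,
)
resp.raise_for_status()
for row in resp.json():
    print(row)
```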
So give me the wow metric here. If I'm used to my reports taking, I don't know, 30 seconds to run, what kind of speed are we really talking about with Druid? We are talking about true millisecond response times. Millisecond? The documentation consistently highlights its ability to run these really complex, high-dimensional OLAP queries in milliseconds, and that's even against datasets that range from billions to, get this, trillions of rows. Trillions. Okay, and critically, it does this without you having to pre-aggregate the data or pre-define the exact query you're going to ask. You can just ask anything instantly. That speed is impressive.
But that usually implies a single, highly optimized query. What happens when, say, a thousand users hit that same data all at once? Does the performance just tank? And that's where its concurrency power is so essential. Speed is pretty meaningless if the system just buckles under load. Druid's architecture, well, it's fundamentally built to handle that kind of immense concurrent load. We're talking about supporting anywhere from hundreds to even a hundred thousand queries per second. QPS. A hundred thousand queries per second, maintaining that consistent, low-latency performance.
This is precisely why it's used to power consumer-facing UIs or internal dashboards that, you know, hundreds of employees are using all day. Every click is a new query. Every single click on a filter, a visualization, that requires a fresh, lightning-fast query. Druid is built for that exact operational load.
Okay, but achieving that kind of performance usually means you have to throw massive, expensive hardware at the problem. Does Druid manage to be cost-efficient? It absolutely does, and it addresses that cost question mainly through resource efficiency. Because the architecture is so optimized, and we'll get into the compression and indexing in a bit, it just significantly reduces the total hardware footprint you need compared to other solutions. Yeah, compared to many competing data warehouses, it's really designed for massive scale, but with resource conservation built right into its DNA. It makes the total cost of ownership much, much lower over time.
Okay, so this is where it gets really interesting for me: the real-time aspect. We're in a streaming world now, data from logs, sensors, clicks. It never stops. So how does Druid deal with data that showed up one second ago? This is its core differentiator: true stream ingestion. Druid is, you could say, stream-native. If you've ever worked with traditional ETL processes, you know there's always a delay. The data lands, it gets transformed, then it's loaded. That can take minutes, sometimes hours. The dreaded batch window. Right, Druid just eliminates that wait time. It has native, baked-in integration with the big streaming platforms like Apache Kafka and Amazon Kinesis. This allows it to support query on arrival: it ingests and makes millions of events per second available for query. Immediately, with guaranteed consistency and ultra-low latency.
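To give a rough feel for that, here's a hedged sketch of starting Kafka ingestion by submitting a supervisor spec to Druid's supervisor API. The topic, broker address, datasource, and column names are made up for illustration, and the exact spec fields depend on your Druid version.

```python
import requests

# Hypothetical Kafka ingestion supervisor spec; all values are illustrative.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "clickstream",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        },
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "city", "device_type"]},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submitting the spec starts a supervisor that ingests the stream continuously;
# rows become queryable as soon as they are ingested.
resp = requests.post(
    "http://localhost:8888/druid/indexer/v1/supervisor",
    json=supervisor_spec,
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```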
So if a critical event happens, like a security alert or a sudden drop in sales, I can query that event and act on it within a second, not wait for some nightly batch process. That is exactly the use case. You are querying the live stream at the same time as all the historical data. You can instantly combine the hot, fresh data, what happened in the last minute, with the massive cold data from the last five years, all in a single, unified, sub-second query. And that gives you the complete picture. A truly complete, up-to-the-second operational picture. It's essential for things like monitoring and alerting apps. Okay, we've established it's ridiculously fast, it's scalable, it's stream-native. Now I want to look under the hood. How, how does it sustain millisecond speed on trillions of rows? What's the secret sauce in the architecture?
Well, the speed's no accident. It starts the very moment the data is ingested. The second data touches Druid, it's automatically transformed into what they call an optimized data format. It's a multi-layered process, but we can simplify it down to, say, three key principles. Okay. First, the data is columnarized. This is a huge departure from traditional row-based databases. Imagine your data is a physical library. A row-based database stores every book's info, author, title, date, subject, all bundled together on the shelf. That's a row. So if I just want to know all the subjects, I have to pull every single book record off the shelf. Precisely.
You have to read through everything. It takes time, it's a wasteful operation. Columnar storage is like taking that library and putting all the authors in one aisle, all the titles in another, and all the subjects in a third. So if you only need to run a query on the city column, you only read the city aisle. You skip all the other massive amounts of unrelated data. It just dramatically cuts down the I/O.
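As a toy illustration of that difference, not Druid's actual on-disk format, here's a tiny Python sketch comparing a row-oriented and a column-oriented layout of the same events.

```python
# Toy illustration only: Druid's real segment format is far more sophisticated.
rows = [
    {"city": "Berlin", "device": "ios", "spend": 4.0},
    {"city": "Paris", "device": "android", "spend": 2.5},
    {"city": "Berlin", "device": "android", "spend": 1.0},
]

# Row-oriented: answering "which cities appear?" touches every field of every row.
cities_from_rows = {row["city"] for row in rows}

# Column-oriented: the same data stored as one list per column.
columns = {
    "city": ["Berlin", "Paris", "Berlin"],
    "device": ["ios", "android", "android"],
    "spend": [4.0, 2.5, 1.0],
}

# Answering the same question now reads only the "city" column and nothing else.
cities_from_columns = set(columns["city"])

assert cities_from_rows == cities_from_columns
```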
Okay, so columnar storage is step one for efficiency. What's next? Step two is that the data is profoundly optimized for time-series analysis, specifically through time indexing. Since most data in Druid is time-series data, events happening over time, it immediately indexes everything based on time. It's like installing a master clock in your data. So you can jump to a specific time range instantly. Instantly. The last week, the third quarter of 2021, whatever, without scanning anything irrelevant. And the third principle is all about compression and encoding. This is the digital shorthand. The data gets highly optimized using techniques like dictionary encoding and bitmap indexing.
Okay, that sounds like serious jargon. Can you make that a bit more accessible? What's dictionary encoding actually doing for me? Of course. Think of dictionary encoding as creating a little lookup table for any repetitive values. So if you have a column with a million rows that only contains five city names, New York, London, Paris, Tokyo, Sydney, Druid doesn't store the full text "New York" a million times. That would be crazy. It would. Instead it assigns it a tiny numerical code, like 1, stores that dictionary mapping just once, and then stores millions of tiny 1s in the actual data segment. Ah, so you save a massive amount of storage space, and because the index is so much smaller, the query engine can search it way faster. It's looking for tiny numbers, not long strings. Exactly. It's an ultra-efficient data reduction that directly boosts speed.
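Just as a rough mental model of that idea, not Druid's actual internals, here's a tiny Python sketch of dictionary encoding plus a simple bitmap-style index over the encoded column.

```python
# Toy sketch of dictionary encoding; Druid's real segments use compressed,
# memory-mapped structures, but the principle is the same.
city_column = ["New York", "London", "New York", "Paris", "New York"]

# Build the dictionary once: each distinct value gets a small integer code.
dictionary = {}
encoded = []
for value in city_column:
    code = dictionary.setdefault(value, len(dictionary))
    encoded.append(code)

# A simple bitmap-style index: for each code, which row positions contain it?
bitmaps = {code: set() for code in dictionary.values()}
for position, code in enumerate(encoded):
    bitmaps[code].add(position)

# Filtering on city = 'New York' now means looking up one small integer
# and reading a precomputed set of row positions.
new_york_rows = bitmaps[dictionary["New York"]]
print(sorted(new_york_rows))  # [0, 2, 4]
```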
The result of all this is that every piece of data is basically a pre-optimized, ready-to-run package from the second it's ingested. Okay, so that's the data structure. How does the query engine actually use that to get to sub-second speeds? So the engine uses what's called a scatter/gather approach. Yeah, and the core philosophy is: don't move the data. Data movement is slow, it's expensive.
So Druid makes sure those optimized data segments are loaded directly into memory or onto local SSDs on the specific nodes where they live. When you run a query, the system intelligently figures out which nodes hold the data you need. It scatters the work to those nodes, they process it locally using their own CPU and memory, and then they only gather the small result sets back. That dramatically cuts down on network latency. It does. It avoids reading anything extra and keeps the network traffic to an absolute minimum. So instead of hauling the entire library to a central desk for processing, you just send little instruction cards to the specific aisles, and they send back only the chapter summaries you asked for. That's a perfect analogy. And that localized processing is just crucial for maintaining performance under all that concurrency we talked about.
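For intuition only, here's a tiny Python sketch of that scatter/gather pattern, with a couple of in-process "nodes" standing in for real Druid data nodes. It's a mental model of the idea, not how Druid is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "nodes": each holds its own local segments (here, spend values per city).
node_segments = [
    {"Berlin": [4.0, 1.0], "Paris": [2.5]},   # node 1
    {"Berlin": [3.0], "Tokyo": [7.0, 2.0]},   # node 2
]

def local_query(segments, city):
    # Each node aggregates only its own local data and returns a tiny partial result.
    return sum(segments.get(city, []))

def scatter_gather(city):
    # Scatter: send the query to every node in parallel.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda seg: local_query(seg, city), node_segments)
    # Gather: merge the small partial results into the final answer.
    return sum(partials)

print(scatter_gather("Berlin"))  # 8.0
```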
Speaking of concurrency, applications scale, and, you know, they often scale unpredictably. If my user base triples tomorrow, can I scale the system quickly and reliably? That brings us to its elastic architecture.
Druid was designed from the ground up to be distributed. All its components, ingestion, queries, orchestration, deep storage, they're all loosely coupled. And why does loose coupling matter? It means you can scale the query-processing nodes completely independently from the ingestion nodes or from the storage layer. So if you suddenly need to handle ten times the query traffic, you could just provision and add ten times the query nodes in minutes. You never have to interrupt the system or stop data ingestion. It's incredibly flexible, scaling both vertically and horizontally.
Yes. And what about when things go wrong? If a node goes down, does my whole dashboard crash? Absolutely not. Reliability is a cornerstone. Druid has non-stop reliability features built in, things like continuous backup to deep storage like S3 or HDFS, automated recovery, and multi-node replication of data segments. So there's always a copy. Always. If a query node fails, another node that already holds a copy of that data just automatically steps in. It ensures high availability and durability.
This is all very powerful, but for the user who needs to adopt this, the learning curve can be steep. How accessible is Druid for developers or analysts who are used to more traditional tools? Accessibility has been a huge focus in its recent development. I mean, for one, developers and analysts can use the familiar SQL API for everything. For everything? Not just querying? For all end-to-end data operations. You can use standard SQL for querying, sure, but also for defining ingestion tasks and performing data transformations. If you know SQL, you have a massive head start. That lowers the barrier to entry significantly. You don't have to learn some proprietary query language.
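For a flavor of what SQL-driven ingestion can look like, here's a hedged sketch that submits a SQL ingestion statement to Druid's SQL task endpoint. The table name, file URL, and columns are invented for illustration, and the exact syntax and endpoint availability depend on your Druid version and whether the SQL-based ingestion engine is enabled.

```python
import requests

# Hypothetical example: load a JSON file over HTTP into an "events" table
# using a SQL ingestion statement.
ingest_sql = """
INSERT INTO events
SELECT
  TIME_PARSE("timestamp") AS __time,
  city,
  spend
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/events.json"]}',
    '{"type":"json"}',
    '[{"name":"timestamp","type":"string"},{"name":"city","type":"string"},{"name":"spend","type":"double"}]'
  )
)
PARTITIONED BY DAY
"""

# SQL-based ingestion statements go to the SQL task endpoint rather than /druid/v2/sql.
resp = requests.post(
    "http://localhost:8888/druid/v2/sql/task",
    json={"query": ingest_sql},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # returns a task id you can then monitor in the web console
```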
What about defining the schema, the structure of the incoming data? Is that a rigid, manual process? That's another big accessibility win. Schema auto-discovery gives you the ease of a schemaless system, but with the performance benefits of a strict schema. As data streams in, Druid can automatically detect, define, and update column names and their data types. You don't have to stop everything and manually define a hundred columns before you can query your data. It just handles that on the fly.
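As a rough idea of how that shows up in practice, and assuming a reasonably recent Druid version, enabling schema auto-discovery is essentially one flag in the dimensions section of an ingestion spec. The surrounding field values here are illustrative.

```python
# Illustrative fragment of an ingestion spec's dataSchema section.
# With schema auto-discovery enabled, you don't list dimensions by hand;
# Druid detects column names and types as the data streams in.
data_schema = {
    "dataSource": "clickstream",
    "timestampSpec": {"column": "timestamp", "format": "iso"},
    "dimensionsSpec": {
        # Ask Druid to discover and type the columns automatically.
        "useSchemaDiscovery": True,
    },
}
```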
So if I were a new user trying to load my first data stream, what's that actual user experience like? Am I writing YAML scripts from day one? Not at all. There's a really practical, built-in web console. It's designed for easy interaction. Through that console you get a point-and-click wizard for setting up ingestion, whether you're loading a huge historical batch file or configuring a continuous stream from Kafka. So it guides you through it. It does, and you can also manage the entire cluster from that console, viewing your data sources, monitoring tasks, checking the health of all your services. So management and prototyping are all in one place. Exactly, and the console also has a query workbench. You can prototype and refine your Druid SQL queries or native queries interactively. It's the perfect little sandbox to see how your data will perform before you push those queries into your actual application.
This has been a fascinating deep dive. So, to summarize the key takeaways for you listening: Apache Druid delivers high-performance, real-time analytics by combining two key things. First, an optimized data format: it's always columnar, it's time-indexed, and it's highly compressed. Second, it has a stream-native, elastic architecture that enables that scatter/gather query approach. And the result of that combination is just massive concurrency and sub-second query performance on enormous data sets. And if we connect this back to the bigger picture, the whole architecture of Druid, prioritizing query on arrival and millisecond response times, it really raises an important question. In a world that's demanding instantaneous data and high-concurrency dashboards, how much longer will traditional database architectures, the ones that rely on lengthy ETL processes or pre-caching, how much longer will they remain competitive for operational apps and dynamic UIs? So the very definition of fast enough is changing.
It's changing very rapidly. A truly provocative thought to mull over. If you want to explore this any further, you can check out the quick start, the FAQ, and all the documentation through the Apache Druid project. We really encourage you to dive deeper. Thank you for joining us for this deep dive, and once again, a huge thanks to our sponsor SafeServer. Remember, SafeServer cares for the hosting of this type of software and supports you in your digital transformation. www.safeserver.de. That's www.safeserver.de.