Introducing incremental encoding for Apache Druid dictionary encoded columns
Sep 27, 2023
Clint Wylie
Query performance at scale is one of the most important areas of Apache Druid to work on: it is what allows you to build interactive UIs that slice and dice billions to trillions of rows in real time. It is a primary focus for any change to query processing, where shaving milliseconds here and there off of processing time can add up to quite a large improvement. It's very satisfying work, though always a bit risky, since experiments do not always pay off as we hope.
In this post I am going to deep dive on a recent engineering effort: incremental encoding of STRING columns. In preliminary testing, it has proven quite promising, significantly reducing the size of segments and making more effective use of the same amount of hardware, at little to no performance cost.
A quick introduction to segments
Segment files
At ingestion time, Druid partitions table data into files called segments, first based on time, and then optionally by ranges or hashes of values contained in the input rows.
Segments are distributed among data nodes for query processing ahead of time, where they are memory mapped to provide an efficient way for the query processing engine to access data during execution.
Memory mapping
Memory mapping in the host operating system ‘maps’ the bytes of a file to a contiguous virtual memory range, making the file appear as if it were just another chunk of memory. When data is read, it is ‘paged’ into physical memory in a cache that resides in the otherwise free space. It stays there so it can be reused if it is read again, and it is automatically evicted when the space is needed, either because other segments are read or because other processes need the memory.
The effects are particularly noticeable when there are repeated reads of the same data: parts of the files are retained in the page cache, and memory read speeds are typically at least an order of magnitude faster than disk read speeds. The ‘ideal’ historical process would have as much free memory available as has been allocated for segment cache, so that every segment it serves could be held in the page cache.
In practice this isn’t usually feasible, so the ratio of free space to disk cache is typically chosen to best fit the use case and desired performance characteristics of the cluster.
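To make this concrete, here is a minimal Java sketch of memory-mapping a file using plain JDK APIs; the file path is hypothetical and this is not Druid's own segment-loading code, just an illustration of the mechanism:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapExample
{
  public static void main(String[] args) throws IOException
  {
    // Map a (hypothetical) segment file read-only; the file now looks like a byte buffer.
    try (FileChannel channel = FileChannel.open(Path.of("/tmp/segment.bin"), StandardOpenOption.READ)) {
      MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
      // Touching a position faults that page into the OS page cache;
      // reading it again later is a memory read rather than a disk read.
      byte firstByte = mapped.get(0);
      System.out.println(firstByte);
    }
  }
}
```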
Dictionary encoding
As a columnar format, segment files consist primarily of contiguous regions corresponding to the parts of individual columns.
A string-type column typically consists of 3 parts.
Since string value sizes vary from row to row, dictionary encoding is employed, where each distinct value is placed into a sorted dictionary and assigned an integer identifier representing its position in the sorted set of values. The actual column part we store in the segment then consists of a series of integers that map to the values in the dictionary, rather than the values themselves.
Besides the integer dictionary id column and the value dictionary, the third part consists of bitmap value indexes ordered identically to the value dictionary, where the set bits of the bitmap correspond to which rows of the integer column contain that dictionary identifier.
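As a toy illustration of these three parts, consider a column whose rows are "fr", "en", "fr", "de". The sketch below uses plain Java collections rather than Druid's actual on-disk structures:

```java
import java.util.List;
import java.util.Map;

// Toy illustration of the three parts of a STRING column for the rows
// ["fr", "en", "fr", "de"]; values and "bitmaps" here are plain Java objects.
public class DictionaryEncodingSketch
{
  public static void main(String[] args)
  {
    // 1. sorted value dictionary: id -> value
    List<String> dictionary = List.of("de", "en", "fr");
    // 2. integer id column, one entry per row
    int[] idColumn = {2, 1, 2, 0};
    // 3. value indexes, one per dictionary entry: which rows contain that value
    Map<String, List<Integer>> bitmaps = Map.of(
        "de", List.of(3),
        "en", List.of(1),
        "fr", List.of(0, 2)
    );
    System.out.println(dictionary.get(idColumn[0])); // prints "fr"
    System.out.println(bitmaps.get("fr"));           // prints "[0, 2]"
  }
}
```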
We store things like this for a few reasons. First, using dictionary encoding makes random access of individual rows an efficient computation. Integers are a fixed width, so we can compute the offset in the integer column for a specific row in code simply by multiplying the number of bytes per value by the row number.
The integer column parts are stored in fixed size compressed blocks and use a byte packing strategy to use the smallest number of bytes which can represent all values, so we can also compute which block a particular row belongs to in order to decompress it, and then seek to the desired row.
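A minimal sketch of this arithmetic, with illustrative constants that are not Druid's actual values:

```java
// Why fixed-width ids make random access cheap: offsets and block numbers are
// simple arithmetic. Names and constants here are illustrative only.
public class FixedWidthAccessSketch
{
  static final int BYTES_PER_VALUE = 2;      // e.g. 2 bytes covers up to 65,536 distinct values
  static final int VALUES_PER_BLOCK = 4096;  // hypothetical fixed block size

  // Offset of a row inside a plain, uncompressed id buffer.
  static long offsetInBuffer(long row)
  {
    return row * BYTES_PER_VALUE;
  }

  // With fixed-size compressed blocks, we can also compute which block to
  // decompress and where the row sits inside the decompressed block.
  static int blockNumber(long row)
  {
    return (int) (row / VALUES_PER_BLOCK);
  }

  static int offsetInBlock(long row)
  {
    return (int) (row % VALUES_PER_BLOCK) * BYTES_PER_VALUE;
  }

  public static void main(String[] args)
  {
    long row = 10_000;
    System.out.println(offsetInBuffer(row));                            // 20000
    System.out.println(blockNumber(row) + " / " + offsetInBlock(row));  // 2 / 3616
  }
}
```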
With this in mind, the value dictionary itself also needs to have three important characteristics to perform well for both value lookup and filtering:
Random ordered value lookup of the integer identifiers must be fast. This happens when the column itself is being selected and will appear in the query results, where the integer identifiers are replaced with their actual value in row order.
Druid needs to be able to quickly reverse lookup the dictionary identifier from an actual value, since filtering a column needs to be able to rapidly check if the value is contained within the dictionary. Certain types of filters, such as ranges, can use the fact that the dictionary is sorted to quickly find all of the values which fall into the desired range to select their associated value bitmap indexes. Since the values are sorted, we can do this with a binary search.
The entire set of dictionary values must be efficient to iterate over. Filters that cannot exploit the sorted order instead iterate over every value, testing each one against the filter predicate; the set of matching values then determines which bitmap indexes to use to prune the overall set of rows to be scanned. A minimal sketch of these three access patterns follows this list.
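The sketch below shows the three access patterns over a plain sorted list; Druid's real Indexed implementations operate on memory-mapped buffers, but the shape of the operations is the same:

```java
import java.util.Collections;
import java.util.List;

// Forward lookup, reverse lookup via binary search, and full iteration,
// sketched over a plain sorted list of values.
public class SortedDictionarySketch
{
  static final List<String> DICTIONARY = List.of("app", "apple", "banana", "cherry");

  public static void main(String[] args)
  {
    // 1. forward lookup: dictionary id -> value
    System.out.println(DICTIONARY.get(2)); // "banana"

    // 2. reverse lookup: value -> id, via binary search; a negative result
    //    encodes the insertion point, which range filters use to find bounds
    System.out.println(Collections.binarySearch(DICTIONARY, "banana")); // 2
    System.out.println(Collections.binarySearch(DICTIONARY, "axe"));    // -3, i.e. would insert at index 2

    // 3. full iteration, for predicate-based filters
    for (String value : DICTIONARY) {
      if (value.startsWith("app")) {
        System.out.println(value);
      }
    }
  }
}
```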
A new approach: front coding
Maximum performance for segment processing is achieved when a large number of segment files are kept ‘hot’ in the page cache in order to avoid reading from disk. Thus, nearly anything we can do to shrink overall segment size can have sizable macro effects: with smaller segments, a larger number of rows can reside in the same amount of page cache.
Druid already does a lot of work to make string columns small. It compresses the dictionary id integer column using LZ4 and compresses the bitmap indexes using Roaring; the dictionary values themselves, however, are stored as plain uncompressed UTF-8 strings.
The introduction of nested columns to Druid, which use value dictionaries that are shared between the nested fields, was the catalyst to investigate strategies to shrink the value dictionary overall. Could we add a technique to save some extra space?
Because of this shared nature, the dictionaries of nested columns can have significantly higher cardinality than regular strings, making them even more likely to take advantage of any size improvements we might be able to uncover. This topic is worth a full discussion of its own, but I’ll save that for another day.
Back in 2017, Roman Leventov proposed using incremental encoding to store string value dictionaries, so I started there. Looking into the issue and reading the associated papers, I set off to explore using an incremental encoding strategy called ‘front-coding’, which involves tracking the length of common prefixes of values so that only the suffix needs to be stored.
To retain the dictionary properties discussed earlier, we divide the values into buckets that each contain a fixed number of values. Fixed-size buckets let Druid quickly determine which bucket holds a given dictionary id, while still supporting binary search: Druid searches among the first values of each bucket and then performs a linear scan to locate a value within the bucket. The whole structure also remains easy to iterate.
Implementation details
My first prototype of a front-coded value dictionary (‘V0’) in 2022 deviated a bit from the standard representation. Values were stored in each bucket as a prefix of the first value of the bucket, rather than of the immediately preceding value, because it was a lot easier to implement and I needed to prove whether performance was competitive with the standard UTF-8 encoded implementation. After proving feasibility, a better ‘V1’ was added that switched to storing the values in the more typical manner, where each value is stored using the prefix of the preceding value.
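To illustrate the ‘V1’ style, here is a toy example over Java Strings; the real FrontCodedIndexed operates on UTF-8 bytes and packs the results into buckets, but the idea of storing (shared prefix length, suffix) pairs is the same:

```java
import java.util.List;

// Front coding within one bucket: each value after the first is represented as
// (length of prefix shared with the previous value, remaining suffix).
public class FrontCodingSketch
{
  static int commonPrefixLength(String a, String b)
  {
    int max = Math.min(a.length(), b.length());
    int i = 0;
    while (i < max && a.charAt(i) == b.charAt(i)) {
      i++;
    }
    return i;
  }

  public static void main(String[] args)
  {
    List<String> bucket = List.of(
        "https://druid.apache.org/docs/latest/design/",
        "https://druid.apache.org/docs/latest/ingestion/",
        "https://druid.apache.org/docs/latest/querying/"
    );
    // the first value in the bucket is stored whole
    System.out.println("0|" + bucket.get(0));
    for (int i = 1; i < bucket.size(); i++) {
      int prefix = commonPrefixLength(bucket.get(i - 1), bucket.get(i));
      // e.g. prints "37|ingestion/" instead of repeating the shared prefix
      System.out.println(prefix + "|" + bucket.get(i).substring(prefix));
    }
  }
}
```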
Digging down to the code level, Druid defines an interface called Indexed which is used as the definition for value dictionary implementations (among other things). So, the primary code change was adding a new FrontCodedIndexed, containing the methods for reading and writing the buckets of values. We adapted ‘variable byte’ encoding for integer values from JavaFastPFOR to save a little extra space when writing both string value lengths and prefix lengths. Since these integers are embedded in the bucket values and never need random access, their variable size is not a problem.
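For reference, here is a sketch of the general variable-byte technique; it is not the exact variant adapted from JavaFastPFOR, but it shows why small integers shrink to a single byte:

```java
import java.nio.ByteBuffer;

// Generic variable-byte ("vbyte") integer encoding: 7 bits of payload per byte,
// with the high bit marking the final byte of a value.
public class VByteSketch
{
  static void writeVByte(ByteBuffer buffer, int value)
  {
    while ((value & ~0x7F) != 0) {
      buffer.put((byte) (value & 0x7F)); // more bytes follow
      value >>>= 7;
    }
    buffer.put((byte) (value | 0x80));   // final byte, high bit set
  }

  static int readVByte(ByteBuffer buffer)
  {
    int value = 0;
    int shift = 0;
    while (true) {
      byte b = buffer.get();
      value |= (b & 0x7F) << shift;
      if ((b & 0x80) != 0) {
        return value;
      }
      shift += 7;
    }
  }

  public static void main(String[] args)
  {
    ByteBuffer buffer = ByteBuffer.allocate(16);
    writeVByte(buffer, 5);       // 1 byte instead of 4
    writeVByte(buffer, 300_000); // 3 bytes instead of 4
    buffer.flip();
    System.out.println(readVByte(buffer) + ", " + readVByte(buffer)); // 5, 300000
  }
}
```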
Front coded indexed layout:

| version | bucket size | has null? | number of values | size of “offsets” + “buckets” | “offsets” | “buckets” |
| ------- | ----------- | --------- | ---------------- | ----------------------------- | --------- | --------- |
| byte | byte | byte | vbyte int | vbyte int | int[] | bucket[] |
The ‘offsets’ store the starting location of all buckets beyond the first bucket (whose offset is known to be the end of the ‘offsets’ position). The ‘offsets’ are stored as plain integer values instead of vbyte encoded to allow for fast access of the bucket positions, but are probably a good candidate for delta encoded byte packing to further decrease their size.
bucket layout:

| first value | prefix length | fragment | … | prefix length | fragment |
| ----------- | ------------- | -------- | - | ------------- | -------- |
| blob | vbyte int | blob | … | vbyte int | blob |
blob layout:

| blob length | blob bytes |
| ----------- | ---------- |
| vbyte int | byte[] |
Getting a value first picks the appropriate bucket and finds its offset in the underlying buffer, then scans the bucket values to seek to the position of the value within the bucket, and finally reconstructs the value from the prefix lengths, working backwards through the bucket until the value is complete.
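Here is a simplified sketch of that reconstruction, assuming the bucket has already been decoded into (prefix length, fragment) pairs; the real implementation reads these pairs directly from the mapped buffer and avoids materializing intermediate values:

```java
import java.util.List;

// Reconstruct the value at position `index` within one bucket of a
// front-coded dictionary (toy version over Java Strings).
public class BucketReadSketch
{
  record Entry(int prefixLength, String fragment) {}

  static String getValue(List<Entry> bucket, int index)
  {
    // the first entry always has prefixLength 0 and stores the whole value
    String previous = bucket.get(0).fragment();
    for (int i = 1; i <= index; i++) {
      Entry entry = bucket.get(i);
      previous = previous.substring(0, entry.prefixLength()) + entry.fragment();
    }
    return previous;
  }

  public static void main(String[] args)
  {
    List<Entry> bucket = List.of(
        new Entry(0, "druid.apache.org/docs/"),
        new Entry(22, "design/"),
        new Entry(22, "index.html"),
        new Entry(24, "gestion/")
    );
    System.out.println(getValue(bucket, 3)); // druid.apache.org/docs/ingestion/
  }
}
```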
Finding the index of a value involves binary searching the first values of each bucket to find the correct bucket, then a linear scan within the bucket to find the matching value (or -(insertion point) - 1 for values that are not present). As mentioned above, this behavior of returning either the actual position or the theoretical insertion point is used to find values which fall within a range, for use with range-based filters.
The value iterator reads an entire bucket at a time, reconstructing the values into an array to iterate within the bucket before moving on to the next bucket as the iterator is consumed.
Performance and storage efficiency analysis
Since front-coding is a ‘value aware’ encoding, the size difference can vary quite a bit depending on the underlying source data. Data with large common prefixes, such as URLs and file paths, can have very dramatic space savings: up to 40% in some cases.
The benefit varies with other kinds of data. UUIDs and other very unique data sets see as little as 1% difference. Very low cardinality columns also see only minor benefits, and when the cardinality is lower than the bucket count, it can be even worse than using plain UTF8.
Using the classic Druid example of Wikipedia edits, the size savings of even version 0 of the format were immediately obvious.
The step up from version 0 to version 1 gave small but noticeable improvements too, even in this very small example data set.
In terms of read performance, the difference between ‘UTF8’ and ‘frontCoded’ encoding is negligible, down to a handful of nanoseconds depending on the operation. In some cases things are even faster due to the smaller overall size, not to mention the time spent optimizing this implementation to minimize the amount of data that actually needs to be read: the front-coded implementation defers translating the raw bytes into Java strings until it is absolutely necessary.
The complete set of benchmarks is available in the PR descriptions for v0 and v1. These focus on the micro level, measuring raw speeds for individual queries in the ‘hot’ case where data is cached in memory. What these benchmarks do not adequately capture are the macro effects of smaller overall segment sizes: more rows fit in the same amount of memory, which should reduce overall disk reads, and smaller segments also let us fit more data on the same amount of compute hardware.
Trying it for yourself
A new property stringDictionaryEncoding is available in the tuningConfig’s indexSpec.
Here are some examples of setting stringDictionaryEncoding, using ‘version 1’ with bucket sizes 4 or 16 respectively:
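(These snippets are sketches based on the Druid indexSpec documentation rather than verbatim configs from a running cluster; the field names type, bucketSize, and formatVersion come from the docs, so verify them against the documentation for your Druid version.)

```json
"indexSpec": {
  "stringDictionaryEncoding": {
    "type": "frontCoded",
    "bucketSize": 4,
    "formatVersion": 1
  }
}
```

```json
"indexSpec": {
  "stringDictionaryEncoding": {
    "type": "frontCoded",
    "bucketSize": 16,
    "formatVersion": 1
  }
}
```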
Take care when applying this new encoding: segments written like this will be unreadable by Druid versions older than 26.0.0. That is part of the reason why this mode is not the default yet.
What’s next?
Front-coding seems to provide a nice improvement to overall segment sizes in Apache Druid. I am proposing making front-coding the default in time for Druid 28, where backwards-compatibility concerns will be resolved at least for operators migrating from Druid 26 or 27. Before this change is merged, I also intend to allow defining a system-wide default indexSpec, so the behavior can be opted out of at the system level.
There is also a lot of room left for future work to continue iterating and improving the Druid segment format to get greater performance release-by-release.
There is a newer version of the original paper linked in PR 3922, along with another paper from 2020; both detail additional (much fancier) improvements such as Re-Pair, which uses an additional lookup to replace repeated fragments with smaller symbols.
I, along with the rest of the project, am looking forward to investigating these techniques to determine at what cost additional size improvements can be gained.
Additionally, it would be ideal to be able to vary the encoding and its parameters on a per-column basis instead of setting it at the segment level. This is true for the other forms of compression as well, and both could be addressed as a follow-up improvement. Fully automating this would also entail collecting statistics at indexing time, so that we could better determine the best storage strategy using inputs such as value cardinality and value distribution.

Try it out in your clusters! I'd love feedback about where it works and where it doesn't to help guide the parts I focus on next. You can find me on the Apache Druid Slack in the #dev channel, or start a thread on the dev mailing list, dev@druid.apache.org.