When you’re analyzing big data, you can run into tough queries that just don’t scale well because they need a ton of computing power and time to return precise results. Examples include:
- Counting distinct values: How many unique users visited the website last month? How many different products were sold last quarter?
- Finding quantiles: What is the 90th percentile response time for API requests? Or what are the median and 75th percentile values for sales transactions in the last month?
That’s where approximations can be helpful. By using a set of tools called “sketches,” you can generate results much faster while still getting a solid idea of how accurate they are, with predictable error margins. There are different types depending on what you’re using them for. For example:
- Count Distinct (HLL Sketch): Counting distinct values using a HyperLogLog (HLL) sketch can be up to 100 times faster than exact counting. This is because HLL approximates the count with a small, fixed amount of memory.
- Quantiles (KLL Sketch): Finding quantiles using the KLL sketch can be 10 to 50 times faster than exact methods, depending on the dataset size and the specific quantiles being calculated. The KLL sketch maintains an approximate distribution of the data, allowing fast quantile calculations.
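To make the "small, fixed amount of memory" idea concrete, here is a minimal HyperLogLog in Python. Treat it as an illustrative sketch under assumptions, not the implementation Druid or Apache DataSketches uses: the class name, the SHA-1 hash, and the register count (2^12) are all choices made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative, not Druid's code)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p                      # number of registers, fixed up front
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper (for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value: str) -> None:
        # Assumption: the first 8 bytes of SHA-1 serve as a 64-bit hash.
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Sketches built separately combine by taking the per-register maximum.
        if other.p != self.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self) -> float:
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Each hypothetical user "visits" twice; duplicates do not change the sketch.
hll = HyperLogLog()
for _ in range(2):
    for user_id in range(50_000):
        hll.add(f"user-{user_id}")

print(f"estimated distinct users: {hll.estimate():.0f} (exact: 50000)")
```

The sketch always holds 4,096 small integers no matter how many values it sees, and two sketches can be merged without revisiting the raw data. That is the property that makes it practical to build sketches at ingestion time and combine them at query time.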
When you need quick answers for interactive queries, data sketches might be your best bet. And for real-time analysis, they’re essential.
In Apache Druid, sketches can be built from raw data at ingestion time or at query time. Apache Druid 29.0.0 included two community extensions that enhance data accuracy at the extremes of statistical distributions, areas where traditional methods often falter. With these extensions, you can achieve higher accuracy without incurring extra storage costs. Let’s look at Spectator Histogram (contributed by Netflix) and DDSketch (contributed by Kong).
More Efficient Storage, Same Cluster Footprint: Netflix’s SpectatorHistogram Extension
Spectator Histogram lets you run quantile calculations on positive integer values while using less storage space (depending on your data modeling) than data sketches. It is optimized for typical measurements from cloud services and web apps, such as page load time, transferred bytes, response time, and request latency. If this sounds familiar, it might be because Ben Sykes, a software engineer at Netflix, gave two very thorough presentations on Spectator Histogram at Druid Summit 2022 and 2023.
Spectator Histogram has a more compact storage size compared to traditional data sketches. By storing entire histograms concisely in a single metric column, we can benefit from reduced data size in Druid without compromising accuracy.
“We [Netflix] were looking at a way to make that more efficient, because these percentile queries then had to be a GROUP BY, and they were just very expensive queries to run. We started investigating data sketches, but that would involve changing how the queries worked, because a lot of the computation for the percentiles happens in this bridging layer. It effectively asks for the raw histogram values and then computes the percentiles for us, whereas data sketches really want to compute the percentiles for you in Druid.
When we did these comparisons, that data sketches are themselves quite large, so it was actually using significantly more storage even to represent them as data sketches, compared to even our exploded out dimension methodology. It was kind of like a non-starter for us at that point. We’re already pushing the bounds of what we can keep in memory on this cluster, so we don’t want to make it just more expensive.”
The ability to compress values into a few bytes, coupled with efficient aggregation, results in a significant improvement in storage utilization while maintaining performance parity with other data sketches. This advancement opens new possibilities for handling large datasets within constrained memory environments.
“We can effectively say that we can store more data in faster storage with the same cluster footprint. So from that perspective, if you’re using a lot of percentile queries, then it is more performant.”
Be aware of the specific use cases where this extension works best. The primary advantage lies in storage optimization, making it a good choice for scenarios requiring efficient data representation. However, Spectator Histogram requires positive integer values, which limits its applicability in certain data contexts.
The decision to integrate Spectator Histogram into the latest Druid release was driven by a number of factors, including stability, feature completeness, and the need for a formal release milestone. While the extension has undergone rigorous testing and validation, future iterations may introduce additional functionalities such as SQL support.
True Value Error Over Rank Error with DDSketch
In the same release, Kong contributed the DDSketch community extension to Apache Druid. Compared to quantile sketches, DDSketch offers higher accuracy at the extreme quantiles. Specifically, it calculates values such as P10 and P90 more accurately and needs less K tuning.
“There was an issue where somebody else had asked if this existed and it didn’t get any traction. The fixed bucket histogram was declared dead, and DDSketch is honestly just a clever histogram where you get to just find your buckets. And so we thought, well, we should give it back [to the community]” – Hiroshi Fukada, Staff Software Engineer at Kong
The motivation behind the development of the DDSketch extension stemmed from the limitations of existing quantile sketches in accurately capturing high-percentile values of network latencies. While the native quantile sketches in Druid performed well for lower and mid-range quantiles, they fell short in providing accurate measurements for critical latency percentiles such as P95, P99, and beyond. This discrepancy prompted Kong to seek a more stable, memory-efficient, and accurate solution to meet the demands of their high-cardinality data and customer requirements.
“We wanted a stable algorithm that was fast, memory-capped and accurate, with some guarantees and parameters for size. And then we came across the DDSketch paper that Datadog published and poked around to see if Druid had anything like it. But the previous sketches that were baked in all had rank error guarantees, which we were not happy with because of the long tail nature of network latencies. We didn’t want to use something that had rank error. We wanted to use something that had true value error. If your request times are 10 seconds, it doesn’t matter if that answer is 10.1 seconds or 10.9 seconds. We wanted that guarantee.”
A significant achievement of the DDSketch extension is its substantial improvement in performance metrics compared to the previous quantile sketches. Kong reported a 5x reduction in size, a 2x to 3x increase in speed, and enhanced memory efficiency. These enhancements translate to significant savings in storage, compute resources, and memory utilization in their Druid cluster, making latency measurement more efficient and reliable for high-volume data processing scenarios.
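The DDSketch extension approaches latency measurement by prioritizing true value error over rank error. A rank error guarantee bounds how far the rank of the returned value can be from the requested quantile. In a long-tailed distribution, a small slip in rank near P99 can mean a large difference in the value itself. DDSketch instead bounds the relative error of the returned value, so the estimate stays within a fixed percentage of the true quantile value, regardless of its magnitude.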
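The value-error idea can be sketched in a few lines of Python. This is a simplified illustration based on the logarithmic bucketing described in the DDSketch paper, not the code of the Druid extension: the class name and parameters are hypothetical, and it omits the cap on the number of buckets that a production DDSketch enforces.

```python
import math
import random
from collections import Counter


class SimpleDDSketch:
    """Log-bucketed histogram whose quantiles have relative error <= alpha."""

    def __init__(self, alpha: float = 0.01) -> None:
        if not 0 < alpha < 1:
            raise ValueError("alpha must be between 0 and 1")
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)
        self.log_gamma = math.log(self.gamma)
        self.buckets = Counter()             # bucket index -> number of values
        self.count = 0

    def add(self, value: float) -> None:
        if value <= 0:
            raise ValueError("this simplified sketch accepts positive values only")
        # Bucket i covers the range (gamma**(i - 1), gamma**i].
        index = math.ceil(math.log(value) / self.log_gamma)
        self.buckets[index] += 1
        self.count += 1

    def merge(self, other: "SimpleDDSketch") -> None:
        # Merging just adds bucket counts (both sketches must share alpha).
        self.buckets.update(other.buckets)
        self.count += other.count

    def quantile(self, q: float) -> float:
        if self.count == 0 or not 0 <= q <= 1:
            raise ValueError("need at least one value and 0 <= q <= 1")
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                # This point is within alpha (relative) of every value in the bucket.
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise AssertionError("unreachable: cumulative count always exceeds rank")


# Synthetic long-tailed "latencies" in milliseconds (made-up data).
rng = random.Random(7)
latencies_ms = [rng.lognormvariate(4.0, 1.0) for _ in range(20_000)]

sketch = SimpleDDSketch(alpha=0.01)
for value in latencies_ms:
    sketch.add(value)

ordered = sorted(latencies_ms)
for q in (0.5, 0.9, 0.99):
    exact = ordered[int(q * (len(ordered) - 1))]
    approx = sketch.quantile(q)
    print(f"q={q}: exact={exact:.1f} ms, sketch={approx:.1f} ms, "
          f"relative error={abs(approx - exact) / exact:.4f}")
print("buckets used:", len(sketch.buckets))
```

Because bucket widths grow geometrically, the same 1% bound holds at P50 and at P99, which is the guarantee the Kong quote asks for. The bucket count depends on the spread of the values rather than on how many values were added.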
Looking ahead, Kong aims to further optimize the DDSketch extension by making it more compact. By leveraging alternative data serialization methods such as hash tables, Kong hopes to fine-tune DDSketch for scenarios with highly dispersed high-cardinality data, striking a balance between data size efficiency and compatibility with existing serialization frameworks. These ongoing refinements underscore Kong’s commitment to continuous improvement in latency measurement capabilities within the Druid ecosystem.
Ready to try these out for yourself?
For a full list of the latest functionality, head over to the Apache Druid download page, and see the documentation for more on the two community extensions mentioned in this post: Spectator Histogram and DDSketch.