Column stores such as Druid store data in individual columns instead of rows, and can apply different compression algorithms to different types of columns. The Druid 0.9.2 release adds new compression methods for long columns that can significantly improve query performance in certain use cases. In this blog post, we’ll highlight how these compression methods impact data storage size and query performance.
To illustrate these new methods, let’s first consider the example event data set shown below:
This data set consists of a timestamp column, a set of string columns (User, Country) that queries frequently group and filter on, called “dimensions”, and a set of numeric columns (Added, Deleted, and Delta) that queries often scan and aggregate, called “metrics”. In previous literature, we’ve covered the compression and indexing algorithms we apply to dimensions. In this blog post, we’ll focus on our recent improvements for long columns.
Existing Compression Strategies
In our example above, each long takes 8 bytes to store. For a 5,000,000 row table, an uncompressed column would require roughly 40MB of space. Prior to 0.9.2, Druid only leveraged lzf compression for longs.
Compression is done by dividing a column into blocks of a given size (currently 0x10000 bytes, or 8192 longs). Each block is compressed and persisted to disk. These compressed columns are memory-mapped, so to read a value in a column, the block that contains the value must be loaded into memory and decompressed.
An inefficiency arises because we must decompress an entire block even if we only need to read a small number of values within it, for example when filtering the data. The overhead of decompressing and copying a block from disk to memory is the same as if every value in the block were read.
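To make this concrete, here is a minimal sketch of how a single value is read under a block-based layout. This is not Druid’s actual implementation; the class, the block-size constant, and the decompressBlock helper are simplified stand-ins:

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

// Minimal sketch of a block-based column read. In Druid the blocks are
// compressed (e.g. with lz4) and memory-mapped; decompression here is a stand-in.
public class BlockColumnReader {
    static final int LONGS_PER_BLOCK = 8192;            // 0x10000 bytes / 8 bytes per long

    private final byte[][] blocks;                       // one compressed block per entry

    BlockColumnReader(byte[][] blocks) {
        this.blocks = blocks;
    }

    long readLong(int rowIndex) {
        int blockIndex = rowIndex / LONGS_PER_BLOCK;     // which block holds the value
        int indexInBlock = rowIndex % LONGS_PER_BLOCK;   // position inside that block

        // Even for a single value, the whole block is decompressed and copied.
        LongBuffer decompressed = decompressBlock(blocks[blockIndex]);
        return decompressed.get(indexInBlock);
    }

    // Stand-in for real block decompression (lz4/lzf in Druid).
    private LongBuffer decompressBlock(byte[] block) {
        return ByteBuffer.wrap(block).asLongBuffer();
    }
}
```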
A natural improvement is to use a compression technique that allows direct access from the file, so we don’t need to do unnecessary block decompression and byte copying to access sparse data.
Direct Access Strategies
To allow direct data access, we need the ability to quickly locate the block of data that contains the values we seek, in other words, an index into our data. A simple way to achieve this is to ensure that all blocks, and hence all values in the blocks, are the same length. However, long values don’t always require 8 bytes for storage; using 8 bytes to store a value such as 0x1 is wasteful. Thus, to store values efficiently, we’ve adopted two strategies from Apache Lucene: delta compression and table compression. Delta compression finds the smallest value in a set of longs and stores every other value as an offset from that smallest value, while table compression maps all unique values to an index in a lookup table and stores the index. These compression algorithms are shown below:
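The sketch below illustrates the two ideas in simplified form; it is not Druid’s code, and the real Lucene-style implementations bit-pack the results rather than keeping full-width offsets and indexes:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Simplified illustration of delta and table encoding for long columns.
public class LongEncodingSketch {
    // Delta: store the minimum once, then each value as an offset from it.
    static long[] deltaEncode(long[] values) {
        long base = Arrays.stream(values).min().orElse(0);
        long[] offsets = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            offsets[i] = values[i] - base;               // small offsets need far fewer bits
        }
        return offsets;
    }

    // Table: map each distinct value to a small index into a lookup table.
    static int[] tableEncode(long[] values, Map<Long, Integer> lookup) {
        int[] indexes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            lookup.putIfAbsent(values[i], lookup.size());
            indexes[i] = lookup.get(values[i]);
        }
        return indexes;
    }

    public static void main(String[] args) {
        long[] column = {1_000_000L, 1_000_003L, 1_000_001L, 1_000_003L};
        System.out.println(Arrays.toString(deltaEncode(column)));         // [0, 3, 1, 3]
        Map<Long, Integer> lookup = new HashMap<>();
        System.out.println(Arrays.toString(tableEncode(column, lookup))); // [0, 1, 2, 1]
    }
}
```

Once bit-packed, the small offsets or table indexes need only a few bits per value instead of a full 8 bytes.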
Both strategies have their limitations. Delta compression cannot handle data with offsets that exceed the maximum long value, and table compression cannot handle high cardinality data, as storing the lookup table would be costly. Given that the right format depends on the data, we’ve introduced an auto strategy that scans the data for its maximum offset and cardinality, and determines which format to use.
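A sketch of the kind of decision auto makes after a single scan is shown below. The class name, threshold handling, and fallback logic are illustrative only; Druid’s actual heuristics and on-disk formats are more involved:

```java
import java.util.HashSet;
import java.util.Set;

// Simplified sketch of the auto strategy's choice: prefer a lookup table for
// low-cardinality data, use delta encoding when the offsets fit in a long,
// and otherwise fall back to plain 8-byte longs.
public class AutoStrategySketch {
    enum Format { TABLE, DELTA, LONGS }

    static Format choose(long[] values, int maxTableSize) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        Set<Long> distinct = new HashSet<>();
        for (long v : values) {                          // single scan over the column
            min = Math.min(min, v);
            max = Math.max(max, v);
            if (distinct.size() <= maxTableSize) {
                distinct.add(v);
            }
        }
        if (distinct.size() <= maxTableSize) {
            return Format.TABLE;                         // small lookup table wins
        }
        // A negative difference means max - min overflowed a long, so delta
        // encoding cannot represent the offsets.
        if (max - min >= 0) {
            return Format.DELTA;
        }
        return Format.LONGS;
    }
}
```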
The graph below compares the compressed sizes of the lz4, auto, and none strategies using generated data sets of 5,000,000 values each, with the following value distributions:
- enumerated : values are drawn from only a few choices, with the probability heavily skewed towards one value
- zipfLow : zipf distribution between 0 and 1000 with a low exponent
- zipfHigh : zipf distribution between 0 and 1000 with a high exponent
- uniform : uniform distribution between 0 and 1000
- sequential : sequential values starting from a timestamp
The table below shows the bits per value for the auto strategy on each data set:
Please note that since the none strategy doesn’t perform any compression, its size is the same for all data distributions. The auto size is directly proportional to the bits per value. lz4 performs very well for the enumerated and zipfHigh data distributions, since these distributions contain the most repetition.
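As a rough illustration of how bits per value translate into column size: for the uniform distribution over 0 to 1000, delta (or table) encoding needs at most ceil(log2(1001)) = 10 bits per value, so 5,000,000 values occupy roughly 6.25MB instead of the 40MB required by plain 8-byte longs.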
Below we illustrate performance results for sequentially accessing all values once:
The none strategy has the same performance across all distributions, since it always reads 8 byte long values. The auto strategy’s performance is directly proportional to the corresponding bits per value shown above. lz4 performance mostly depends on the decompression speed for each data set, since reading is always done on 8 byte values.
Below are the performance results for sequentially accessing values while randomly skipping between 0 and 2000 values on each read:
Here we can see how the direct access strategies (barely visible) greatly outperform the block-based strategies for sparse reads.
When comparing compression techniques such as lz4 with encodings such as table, it is important to note that lz4 operates on blocks of bytes without any insight into their content, while table compression (with variable size values) requires knowledge of the data.
We can integrate these techniques and leverage multi-stage compression, applying a byte level compression such as lz4 on top of a type specific compression such as delta or table. In Druid, we’ve decided to call byte level compressions such as lz4 “compression”, and data level compressions such as delta/table “encoding”.
Compression strategies include lz4, lzf, and none. Encoding strategies include auto, which chooses between the delta and table encoding formats, and longs, which always stores a long as 8 bytes.
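As a sketch of how the two stages fit together, the example below delta-encodes a column and then compresses the encoded bytes. It uses the lz4-java library for the byte level stage and keeps full 8-byte offsets for simplicity; Druid’s own writers bit-pack the encoded values:

```java
import java.nio.ByteBuffer;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;

// Sketch of multi-stage compression: a type specific encoding (delta) followed by
// a byte level compression (lz4).
public class MultiStageSketch {
    static byte[] encodeAndCompress(long[] values) {
        // Stage 1: "encoding" - delta-encode against the minimum value.
        long base = Long.MAX_VALUE;
        for (long v : values) {
            base = Math.min(base, v);
        }
        ByteBuffer encoded = ByteBuffer.allocate(Long.BYTES * (values.length + 1));
        encoded.putLong(base);                           // store the base once
        for (long v : values) {
            encoded.putLong(v - base);                   // store each value as an offset
        }

        // Stage 2: "compression" - compress the encoded bytes with lz4.
        LZ4Compressor compressor = LZ4Factory.fastestInstance().fastCompressor();
        return compressor.compress(encoded.array());
    }
}
```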
Our results with lz4 compression and auto encoding are shown below:
Looking at the results, it might seem strange that auto performs so much better than longs when skipping data, as the file size and the performance for continuous data are comparable. This can be explained by breaking down the total reading time in the block layout strategy, which consists of lz4 decompression + ByteBuffer copying + reading the decompressed values. auto has better performance for decompression, while longs is faster at accessing data. When reading continuous data, these differences more or less cancel out, causing performance between the two strategies to be similar. When reading sparse data, the data access portion is basically gone, and the difference in decompression time shows.
One interesting thing to note is what would happen if each value were read multiple times, for example if a query required multiple aggregations on the same column. In such a case, the reading performance would become much more significant, and longs would have better performance than auto.
Unfortunately, there is no one compression and encoding combination that performs best in all scenarios. The choice is highly dependent on the value distribution of the ingested data, storage vs. performance requirements, and the queries issued. However, we do recommend four combinations depending on the use case:
- lz4 + longs (default) : good compression size for most cases, worst performance for heavily filtered queries, good performance if the column is used by multiple aggregators that fully scan the data
- lz4 + auto : average between lz4 + longs and none + auto. Occasionally offers better compression than lz4 + longs, and usually better compression than none + auto. This choice is better for filtered queries compared to lz4 + longs, and worse for queries that have multiple aggregators for the same column.
- none + auto : offers better compression compared to none + longs, but much worse compression compared to lz4 for highly repetitive data. Good performance for reading in general.
- none + longs : requires the most storage space (sometimes magnitudes higher than the other choices for highly repetitive data), but almost always performs better for all queries.
Finally, as part of this work, we’ve created a Druid tool that can scan segments and benchmark different compression options. You can use it via:
tools check-compression -d segment_directory -o output_file