Earlier this week, the Apache Druid community released Druid 0.17.0. This is the project’s first release since graduating from the Apache Incubator, and it therefore represents an important milestone. As always, you can visit the Apache Druid download page to download the software and read the full release notes detailing every change. This Druid release is also available as part of the Imply distribution, which includes Imply Pivot as well.
I’ll talk about a few of the major items in this release below.
Input sources and formats
Druid 0.17.0 includes a comprehensive, ambitious revamp of the ingestion layer to be oriented around new “inputSource” and “inputFormat” concepts. These new concepts replace the “parser” and “firehose” concepts, which are still supported but are now deprecated. Druid’s documentation and tutorials have been updated to reflect the new inputSource and inputFormat concepts, and I encourage you to take a look at how they work.
As part of this effort, the input format options for batch and streaming ingestion have become more aligned, and native batch ingestion has been amped up in two important ways: it now supports binary formats like ORC and Parquet, and it can now read data directly from HDFS.
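To give a flavor of how the new concepts fit together, here is a sketch of the ioConfig for a native batch task reading Parquet files from HDFS. The overall shape follows the 0.17.0 ingestion documentation, but the namenode address and paths are placeholders, and the docs cover many more inputSource and inputFormat types and options:

```json
"ioConfig": {
  "type": "index_parallel",
  "inputSource": {
    "type": "hdfs",
    "paths": "hdfs://your-namenode:8020/path/to/events/"
  },
  "inputFormat": {
    "type": "parquet"
  }
}
```

The same inputFormat definitions can be used with streaming ingestion too, which is what makes the batch and streaming options more aligned than the old parser-based approach.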
With these changes, many Druid users will be able to migrate away from Hadoop Map/Reduce and towards Druid’s native ingestion, simplifying deployments and reducing costs.
Range partitioning for native batch
Continuing on the theme of ingestion improvements, in this release we’ve added support for range partitioning to native batch ingestion. This functionality has always been available in Druid’s Hadoop-based ingestion, but can now be used without any need for Hadoop.
The Druid documentation discusses why partitioning is important, which boils down to performance and footprint. By improving locality, proper partitioning speeds up any queries that filter or group on partitioned columns, and improves compression of those columns and any other columns that are highly correlated with them.
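As a concrete sketch of what this looks like in an ingestion spec, the tuningConfig below range-partitions segments on a single dimension. The partitionsSpec type and fields follow the 0.17.0 native batch documentation; the dimension name and target row count here are purely illustrative:

```json
"tuningConfig": {
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "single_dim",
    "partitionDimension": "country",
    "targetRowsPerSegment": 5000000
  }
}
```

Queries that filter on the chosen dimension (country, in this hypothetical) can then skip entire segments whose value ranges don’t match, which is where much of the performance benefit comes from.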
If you aren’t taking advantage of partitioning already, I encourage you to give this a try. Our customers have found that it can have a massive impact on their use cases: in some cases we’ve seen up to a 3x improvement in footprint, and even more than that in performance. You may or may not see a similar impact, but it’s worth experimenting if you are interested in improving performance or reducing costs.
Parallel merging on Brokers
As Druid users know, Druid’s query stack is massively scalable. Customers have deployed Druid clusters composed of thousands of servers that achieve subsecond performance on massive amounts of data. This is powered by a design that involves fan-out from a Broker to a set of data servers: the Broker is responsible for receiving a query, sending it down to the data servers, and merging their result sets. Each data server returns a result set representing the aggregated data from that particular server.
The idea here is that the amount of raw data being queried is massive (billions or trillions of rows), but the aggregated results are much smaller than that (hundreds, thousands, or millions of rows). To scale out the raw-data-scanning stage, a single query can use every data server if needed. But a single query can only use a single Broker, and until this release, could only use a single thread on that Broker.
In Druid 0.17.0, we’ve added parallel result merging to Brokers. This allows a single query to use multiple threads on the Broker that handles that particular query, and it can dramatically speed up queries with large result sets coming from large numbers of data servers. This feature is on by default, and most users should not need to change any of its configuration properties, but they’re there if you need them.
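For those who do want to tune it, the relevant settings are Broker properties under druid.processing.merge. As a rough sketch (property names as I understand them from the 0.17.0 release notes; the parallelism value below is illustrative, not a recommendation):

```properties
# Broker runtime.properties
# Parallel result merging is enabled by default; set false to restore
# the old single-threaded merge behavior.
druid.processing.merge.useParallelMergePool=true
# Illustrative cap on the number of merge threads a Broker will use.
druid.processing.merge.pool.parallelism=8
```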
Towards SQL-compliant null handling
In this release, we have added official documentation for Druid’s SQL-compatible null handling mode. It was originally contributed as part of the effort to integrate with Apache Hive, but was not exposed as an official user-facing feature until today. This is an important move along the road to expanding SQL support in Druid.
This mode stands in contrast to Druid’s default, legacy null handling mode, which has two essential behaviors: nulls and empty strings are treated equivalently for all string operations, and numeric nulls are implicitly converted to zeroes for numeric operations. In SQL-compliant mode, Druid differentiates between nulls and normal values at both the storage and query layers. It also takes them into account when planning and optimizing Druid SQL queries.
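To make the difference concrete, consider a hypothetical "events" datasource with a nullable string column named referrer. In SQL-compliant mode the following two queries count different rows, whereas in legacy mode they match the same set of rows:

```sql
-- SQL-compliant mode: counts only rows where referrer is truly NULL.
SELECT COUNT(*) FROM events WHERE referrer IS NULL;

-- SQL-compliant mode: counts only rows where referrer is the empty string.
-- In legacy mode, '' and NULL are treated as the same value, so this query
-- and the one above return the same count.
SELECT COUNT(*) FROM events WHERE referrer = '';
```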
The SQL-compliant null handling mode is not on by default, and it does have a couple of small known issues; see the release notes for details. But it has been greatly improved for this coming-out release, and is substantially faster and more comprehensive than it was in prior Druid releases, where it was an unofficial feature.
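If you want to experiment with it, the mode is controlled by a runtime property that, to my understanding, should be set consistently on every service in the cluster:

```properties
# common.runtime.properties, on every Druid service
# false enables SQL-compatible null handling; true (the default) keeps
# the legacy behavior described above.
druid.generic.useDefaultValueForNull=false
```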
If you have a need for additional SQL compliance, I encourage you to try out this functionality and let us know your feedback. We intend to continue improving this mode and then make it the default behavior for Druid in a future release.
Other items
For a full list of all new functionality in Druid 0.17.0, head over to the Apache Druid download page and check out the release notes!