Introducing Apache Druid 27.0.0

Aug 11, 2023
Will Xu

Apache Druid® is an open-source distributed database designed for real-time analytics at scale.

Apache Druid 27.0 contains over 350 commits & 46 contributors.

This release’s focus is on stability and scaling improvements. We are introducing Smart Segment Loading as a new mechanism for managing data files (segments) as the database scales. At the same time, we have improved schema auto-discovery to address various edge cases. Lastly, we are introducing a long-awaited feature – querying from deep storage. We are excited to share this release with you and await your feedback.

Web console explore view

To find the Explore view, open Druid web console and click on … to the top right

This new Explore view is stateless and backed by SQL. It enables you to quickly visualize data within Druid using a point-and-click UI.

Asynchronous query and Query from deep storage

Druid has historically been optimized for high concurrency, low latency workloads. To provide guaranteed low latency performance, Druid requires the data to be pre-loaded onto historical nodes, which behaves like a pre-fetched cache during query time. The entire system is optimized for running queries that take no more than a few seconds.

However, at times you might want to execute longer-running, reporting-style queries. For example, looking at data from a quarter or a year back for trend analysis or comparison. Having data pre-loaded to service these types of queries can become cost prohibitive while executing such long-running queries can consume resources that would otherwise be available for low-latency, interactive queries, resulting in timeouts.

Thus, to solve those problems, in this release, we are introducing a new style of query that runs asynchronously, which can process vast amounts of data in the background. What’s more, this new query execution mode allows Druid to query data stored in deep storage without pre-caching. This works even if your data is not Druid segments. At the same time, we ensure long-running queries do not take up resources from low-latency queries. This feature unlocks the ability for you to de-couple compute and storage to find the optimal mix of compute and storage for your workloads.

While this is an experimental feature, it’s built on a reliable, battle-tested query engine, and we encourage you to try it out with real production use cases and let us know your feedback.

Smart segment loading

The Coordinator process gives Druid clusters the ability to balance, scale, and heal themselves automatically. In this release, we have made substantial improvements to the Coordinator process.

At the core are changes to the  Coordinator’s ability to prioritize and cancel loading requests to Historical nodes. When there are operations that cause a surge in replication requests, as you may find in large clusters, the Coordinator can now handle things much more gracefully, helping to reduce disruptions on cluster operations at scale.

Imagine you add 10 new nodes to your cluster. Instead of evening out the data files across all the nodes, with the new Smart segment loading strategy, the Coordinator will prioritize recent over old data. This applies to cases such as rolling upgrades, network outages, large batch ingestion, and many more.

In the past, to avoid a sudden surge of replication that might cause cluster instability, you may have used `replicationThrottleLimit`. But this is no longer required as the Smart segment loading system will automatically compute those values.

The Smart segment loading strategy means simpler configuration options for the Coordinator. Starting from this release, the following values are considered deprecated and will be removed in future releases.

  • maxSegmentsInNodeLoadingQueue
  • maxSegmentsToMove
  • replicationThrottleLimit
  • useRoundRobinSegmentAssignment
  • useBatchedSegmentSampler
  • emitBalancingStats

This set of feature changes enabling Smart segment loading will keep your Druid cluster much happier with much less chance of down times.

To try this out, please follow this part of the documentation.

Type aware schema auto-discovery and Array columns GA

In the Druid 26.0 release, we introduced support for type-aware schema auto-discovery as an experimental feature. In this release, we are graduating the support of the schema auto-discovery feature into GA status. It follows improvements to overall stability, especially around handling null value cases.Memory-based subquery limit

One more thing that we’d like to highlight in this release is the option to change the subquery guardrail from rows to bytes by setting maxSubqueryBytes.

In the past, Druid had a built-in guardrail for subqueries to prevent subqueries from running out of memory, leading to process failures. This guardrail is based on the number of rows and defaults to a relatively safe value of 500,000 rows.

However, if the subquery has a high number of columns, the server still has a chance to run out of memory. And more commonly, because this is a relatively safe threshold, Druid stops running many queries while plenty of memory is still available. This is especially true in cases of broadcast join queries.

The new memory-based subquery limit uses the frame data structure introduced as part of the multi-stage query engine to help measure and limit the amount of memory available for a given sub-query. It provides a safer and more flexible way of configuring the subquery guardrail.

Apache Iceberg support

Druid now has a new extension that can read and ingest data files from Apache Iceberg.

Iceberg-Druid integration is implemented as a standard Druid inputSource; meaning it can be used in both native batch and multi-stage query engines through EXTERN.

To read data from Iceberg into Druid, you must provide both the catalog (the host for metadata information) and the warehouse (the data file location).

Excitingly, when coupled with Asynchronous queries, you are now able to query Iceberg data through Druid directly.

Improving platform compatibility

We are constantly looking for ways to expand the compatibility of Druid with the underlying infrastructure fabric. In this release, Druid has more mature Graviton support by enabling system metrics and officially supports Java 17.

Last but not least, there are numerous improvements to the k8s native ingestion experience, including most important a change so that tasks will now be queued rather than being rejected if the Kubernetes task runner capacity is full.

Try this out today

For a full list of all new functionality in Druid 27.0.0, head over to the Apache Druid download page and check out the release notes!

Stay Connected!

Are you new to Druid? Check out “Wow, That was Easy” in our Engineering blog to get Druid up and running.

Check out our blogs, videos, and podcasts!
Join the Druid community on Slack to keep up with the latest news and releases, chat with other Druid users, and get answers to your real-time analytics database questions.

Other blogs you might find interesting

No records found...
Sep 06, 2024

Real-time analytics architecture with Imply Polaris on Microsoft Azure

This article provides an architectural overview of how Imply Polaris integrates with Microsoft Azure services to power real-time analytics applications.

Learn More
Jul 23, 2024

Streamlining Time Series Analysis with Imply Polaris

We are excited to share the latest enhancements in Imply Polaris, introducing time series analysis to revolutionize your analytics capabilities across vast amounts of data in real time.

Learn More
Jul 03, 2024

Using Upserts in Imply Polaris

Transform your data management with upserts in Imply Polaris! Ensure data consistency and supercharge efficiency by seamlessly combining insert and update operations into one powerful action. Discover how Polaris’s...

Learn More

Let us help with your analytics apps

Request a Demo