Dec 9, 2022

Pushing the Limits of Druid Real-time with Data Sharding

Most Druid applications aim for sub-second analytics queries – but this becomes difficult when dealing with large volumes of real-time data. We’ll take you through our series of engineering and ops efforts that ultimately helped us reach sub-second real-time queries at scale with Druid by using a custom data sharding mechanism.

At Singular, we process and store our customers’ marketing data so they can run complex analytical queries on it and ultimately make smarter, data-driven decisions. While many of these decisions can be based on historical batch-ingested data, more and more are being made on fresh, real-time data.

With Apache Druid as our client-facing single source of truth, we have used its real-time Kafka ingestion effectively for some time, ingesting all real-time data into a single data source. As we grew, however, that data source became a significant bottleneck, causing spikes in query times and service disruptions. After researching the issue, we discovered the problem: because we had a single real-time data source, different customers’ data was stored in the same segments. A heavy query from one customer could therefore consume almost all available resources. It also meant the data was stored sub-optimally, significantly increasing the volume of data that had to be scanned for any given query.
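To make the issue concrete, here is a minimal, hypothetical sketch of a per-customer query against a single shared real-time datasource using Druid’s standard SQL endpoint; the datasource, column, and host names are placeholders for illustration, not Singular’s actual schema. Because real-time segments are partitioned by time rather than by customer, even a query filtered to one customer has to consider segments containing every tenant’s rows:

```python
# Hypothetical example: one shared real-time datasource serving all tenants.
import requests

BROKER_SQL = "http://druid-broker:8082/druid/v2/sql"  # hypothetical Broker address

query = """
SELECT campaign, SUM(spend) AS total_spend
FROM realtime_marketing_data          -- hypothetical shared datasource
WHERE customer_id = 'customer-123'    -- hypothetical tenant column
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY campaign
"""

# Even with the customer filter, the Broker fans this out over every recent
# segment, since those segments interleave rows from all customers.
resp = requests.post(BROKER_SQL, json={"query": query})
resp.raise_for_status()
print(resp.json())
```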

The simple solution was to batch-run optimization and compaction tasks on the real-time data, but this was only partly effective and costly to operate. Another straightforward mitigation, allocating a scaled-up, dedicated Historical tier to serve the real-time data, improved stability but did not significantly reduce query times, and it too proved very costly.
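As a rough illustration of these two mitigations, the sketch below uses Druid’s standard Coordinator APIs to enable auto-compaction for the real-time datasource and to route recent data to a dedicated Historical tier. The datasource and tier names are hypothetical and the tuning values are placeholders, not the settings we actually used:

```python
# Hedged sketch of the two mitigations, using standard Druid Coordinator APIs.
import requests

COORDINATOR = "http://druid-coordinator:8081"  # hypothetical address
DATASOURCE = "realtime_marketing_data"         # hypothetical datasource name

# 1) Auto-compaction: periodically rewrite the many small real-time segments
#    into larger, better-optimized ones.
compaction_config = {
    "dataSource": DATASOURCE,
    "skipOffsetFromLatest": "PT1H",  # leave the freshest hour to the ingestion tasks
    "tuningConfig": {
        "partitionsSpec": {"type": "dynamic", "maxRowsPerSegment": 5_000_000}
    },
}
requests.post(f"{COORDINATOR}/druid/coordinator/v1/config/compaction",
              json=compaction_config).raise_for_status()

# 2) Dedicated Historical tier: load the most recent data onto a scaled-up tier
#    (Historicals started with druid.server.tier=realtime_tier), older data onto
#    the default tier.
load_rules = [
    {"type": "loadByPeriod", "period": "P1D",
     "tieredReplicants": {"realtime_tier": 2}},
    {"type": "loadForever", "tieredReplicants": {"_default_tier": 2}},
]
requests.post(f"{COORDINATOR}/druid/coordinator/v1/rules/{DATASOURCE}",
              json=load_rules).raise_for_status()
```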

Ultimately, we went for the more complex solution: splitting our real-time data into separate “shards” so that any given query runs over only a fraction of the total data. The sharding solution involved many operational complexities. We needed to know which data lives in each shard, keep the shards evenly balanced, and tune both our stream (Kafka) and Druid to handle the reduced tenancy and the new ingestion tasks. On top of this, we had to implement, deploy, and migrate workloads while maintaining full data availability for our customers. Since overcoming these challenges and releasing our Druid real-time sharding solution, we have seen significantly lower query times, better-optimized data storage, and faster data ingestion, ultimately creating a more efficient and reliable product.
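The following is a minimal sketch of the sharding idea under simplifying assumptions, not the actual implementation: customers are hashed to a fixed number of shards, and each shard gets its own Kafka topic and its own Druid datasource fed by a separate Kafka ingestion supervisor, so a query for one customer only touches that shard’s data. Topic, datasource, column, and host names are hypothetical:

```python
# Hypothetical sketch: one Kafka topic + one Druid datasource + one supervisor per shard.
import zlib
import requests

NUM_SHARDS = 4
OVERLORD = "http://druid-overlord:8090"  # hypothetical Overlord address


def shard_for(customer_id: str) -> int:
    # Deterministic customer -> shard mapping; a real deployment also has to
    # persist this mapping and rebalance shards as customers grow.
    return zlib.crc32(customer_id.encode()) % NUM_SHARDS


def supervisor_spec(shard: int) -> dict:
    # One Kafka ingestion supervisor per shard, each writing to its own datasource.
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": f"realtime_shard_{shard}",
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["customer_id", "campaign", "country"]},
                "granularitySpec": {"segmentGranularity": "HOUR", "queryGranularity": "MINUTE"},
            },
            "ioConfig": {
                "topic": f"marketing_events_shard_{shard}",
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": "kafka:9092"},
                "taskCount": 1,
                "taskDuration": "PT1H",
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


for shard in range(NUM_SHARDS):
    requests.post(f"{OVERLORD}/druid/indexer/v1/supervisor",
                  json=supervisor_spec(shard)).raise_for_status()

# A query for 'customer-123' then targets only realtime_shard_{shard_for('customer-123')},
# scanning a fraction of the total real-time data.
```

In practice, the customer-to-shard mapping also has to be tracked and rebalanced as customers grow, which is where much of the operational complexity described above comes from.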