Top 7 Questions about Kafka and Druid

May 13, 2023
Darin Briskman

With Apache Kafka® already downloaded over 5 million times and growing at about 30% per year, it’s clear that data streaming has evolved from an emerging technology into a foundational one for connecting data across enterprises and cloud providers.

There’s a similar growth path for Apache Druid®, the real-time analytics database created to provide high-concurrency, subsecond queries at TB and PB scale, combining streaming with historical data.

It’s not surprising that Kafka and Druid often go together. While not every project using Kafka has implemented analytics (yet), Druid is the most common choice for stream analytics. While not every project using Druid is using Kafka streams, the majority are using either Apache Kafka or a streaming service with a Kafka-compatible API.

As Kafka users explore Druid, here are a few common questions – and their answers!

Do I need real-time analytics?

If you are using Kafka, you probably do. After all, you’ve deployed a messaging service that delivers events at scale. While you likely deployed Kafka initially to support data delivery, you’ll also need observability across your application to ensure its stability and performance, as teams at Netflix and Walmart do.

You may also need to perform interactive data exploration, investigating anomalies and understanding why things are happening the way they are. Salesforce and Confluent use Druid this way.

Or maybe you’re building an application that will be used directly for analytics by your customers, like Reddit and Atlassian, where you can’t know how many queries will be run each second. Fortunately, Druid’s high concurrency supports thousands of queries running at the same time.

Why isn’t there a Kafka connector for Druid?

You don’t need a connector! Druid has built-in ingestion for Apache Kafka, which also works with Confluent Cloud and other streaming platforms that are compatible with the Kafka API.

So how do Kafka and Druid work together?

It’s very simple. Just set up ingestion for each Kafka topic you want. By default, Druid will create a table where each key in the event becomes a dimension (a column in the table). You can, of course, specify how to parse the data, which information to ingest, and, if you prefer, how to roll up the data, such as one row per second or one row per 15 minutes.
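
As an illustration, here’s a minimal sketch of that setup: submitting a Kafka ingestion supervisor spec to Druid’s indexing API. The datasource and topic name (“clicks”), the field names, and the addresses (a Druid router at localhost:8888, a Kafka broker at localhost:9092) are illustrative assumptions, not values from this article.

```python
# A minimal sketch of Druid's built-in Kafka ingestion: submit a "kafka"
# supervisor spec to Druid's indexing API. The datasource/topic name,
# field names, and addresses below are illustrative assumptions.
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clicks",  # the Druid table to create
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "page", "country"]},
            "granularitySpec": {
                "segmentGranularity": "HOUR",
                "queryGranularity": "MINUTE",  # optional rollup: one row per minute
                "rollup": True,
            },
            # With rollup enabled, keep a count of the raw events in each stored row.
            "metricsSpec": [{"type": "count", "name": "events"}],
        },
        "ioConfig": {
            "type": "kafka",
            "topic": "clicks",  # the Kafka topic to ingest
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submit the spec; Druid then supervises the topic's partitions continuously.
resp = requests.post(
    "http://localhost:8888/druid/indexer/v1/supervisor",
    json=supervisor_spec,
)
resp.raise_for_status()
print(resp.json())  # the supervisor id, e.g. {"id": "clicks"}
```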

Unlike databases that use a Kafka connector, there is no microbatching needed in Druid. Every event is immediately queryable upon arrival, so your real-time data is as fast as your stream.

Druid works with Kafka’s partition and offset features to guarantee exactly-once ingestion: every event is ingested into Druid once and only once. Druid doesn’t commit its Kafka offsets until the events are fully persisted, including a copy on durable deep storage (either cloud object storage or HDFS), so event data won’t be lost, even in the event of a system failure.

Why not just use ksqlDB?

Confluent created ksqlDB, a database for stream processing that’s available both as free, source-available software (under the Confluent Community License) and as part of Confluent’s commercial offerings. Using ksqlDB, you can create virtual tables to query Kafka topics, which is a great way to build simple applications with data streams.
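
For comparison, here’s a hedged sketch of that pattern through ksqlDB’s REST API: declare a stream over an existing Kafka topic, then run a push query against it. The topic, column names, and the server address (localhost:8088) are illustrative assumptions.

```python
# A sketch of ksqlDB's "virtual table over a topic" pattern via its REST API.
# The topic "clicks", the columns, and localhost:8088 are illustrative assumptions.
import requests

KSQLDB = "http://localhost:8088"

# Register a stream over the existing Kafka topic.
create_stmt = """
CREATE STREAM clicks_stream (user_id VARCHAR, page VARCHAR, country VARCHAR)
  WITH (KAFKA_TOPIC='clicks', VALUE_FORMAT='JSON');
"""
resp = requests.post(f"{KSQLDB}/ksql",
                     json={"ksql": create_stmt, "streamsProperties": {}})
resp.raise_for_status()

# Run a push query: results stream back continuously as new events arrive.
query = "SELECT page, COUNT(*) AS views FROM clicks_stream GROUP BY page EMIT CHANGES;"
with requests.post(f"{KSQLDB}/query",
                   json={"ksql": query, "streamsProperties": {}},
                   stream=True) as q:
    for line in q.iter_lines():
        if line:
            print(line.decode())  # rows arrive as a chunked JSON stream
```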

For most projects, though, you’ll need to know more than the current state of the stream – you need to understand data in context. How does the current event state compare to an hour ago? To yesterday? To the last time something similar was happening?

With Druid, you get both Kafka stream data in real time AND historical data, which can be data streamed in the past or data loaded into Druid from files or other databases.

Druid is able to query very large datasets, with TBs or PBs of data, from both real-time streams and historical tables, in under a second. Druid also supports up to thousands of concurrent queries, providing both high throughput and high concurrency.
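
To make that concrete, here’s a hedged sketch of a Druid SQL query over HTTP that compares the last hour of streaming data with the same hour yesterday. It reuses the illustrative “clicks” datasource and router address from the ingestion sketch above; /druid/v2/sql is Druid’s standard SQL API endpoint.

```python
# A sketch of querying real-time and historical data together with Druid SQL:
# total events in the last hour versus the same hour yesterday. The "clicks"
# datasource and the localhost:8888 router address are illustrative assumptions.
import requests

sql = """
SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
  SUM("events") AS raw_events          -- SUM of the rollup count = raw event count
FROM clicks
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
   OR (__time >= CURRENT_TIMESTAMP - INTERVAL '25' HOUR
       AND __time <  CURRENT_TIMESTAMP - INTERVAL '24' HOUR)
GROUP BY 1
ORDER BY 1
"""

resp = requests.post("http://localhost:8888/druid/v2/sql", json={"query": sql})
resp.raise_for_status()
for row in resp.json():  # default result format: a JSON array of row objects
    print(row)
```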

Many teams choose to use both ksqlDB and Druid: ksqlDB for quick data exploration or triggering alerts on specific values to monitor, and Druid for deep exploration, subsecond queries at scale, high concurrency, and combining real-time and historical data.

How many Kafka events per second can Druid handle?

A lot! Some Druid production clusters are ingesting over 5 million events per second, and over 400 billion events per day. The only limit is the storage and processing power of the cluster’s infrastructure.

There is also no limit on the number of Kafka topics that can be ingested concurrently.

Who’s using this?

Lots of organizations are using Kafka + Druid. We’ve already mentioned Netflix, Walmart, Salesforce, Confluent, Reddit, and Atlassian. Others include Target, Swisscom, British Telecom, DBS, and Wikimedia Foundation. There are likely well over 1,000 organizations using Kafka + Druid … but since both are open source, no one really knows.

Imply was founded by the creators of Druid to make it easier for developers to create applications with scale, speed, and streaming data. Imply Polaris, the Druid Database-as-a-Service, includes integration with Confluent Cloud for easy push and pull integration with Kafka.

Where can I learn more?

Along with the Apache home pages for Kafka and Druid, you can find more about the Kafka-to-Druid stack at imply.io.

For the easiest way to get started with Kafka and Druid, you can try a free 30-day trial of Imply Polaris—no credit card required!
