Overview
Confluent runs multi-tenant Kafka clusters across AWS, GCP, and Azure, powering thousands of customer workloads. To ensure reliability at scale, the company built a next-generation observability platform powered by Kafka and Apache Druid.
“We built an observability platform powered by Kafka and Druid. This solution ingests over 5 million events per second and handles hundreds of queries on top of that. And this gives us real-time insights into the operations of thousands of these Kafka clusters within Confluent Cloud.”
— Jay Kreps, CEO | Confluent
Challenge
As adoption of Confluent Cloud accelerated, telemetry volumes surged.
“Our telemetry pipeline ingests millions of events per second from thousands of multi-tenant clusters across cloud providers,” said Harini Rajendran, Senior Software Engineer at Confluent.
Legacy monitoring tools couldn’t keep pace, making it difficult to deliver timely alerts and insights both internally and to customers.
Solution
Confluent modernized its observability stack with a real-time analytics platform designed for streaming workloads. Kafka pipelines capture telemetry from every cluster and service, while Druid powers real-time query processing across high-cardinality metrics such as partitions, offsets, and consumer groups.
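In Druid, this kind of Kafka-to-Druid pipeline is typically wired up with a Kafka ingestion supervisor spec. The sketch below is illustrative only, not Confluent's actual configuration; the datasource name, topic, dimensions, and broker address are hypothetical placeholders.

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "cluster_telemetry",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": {
        "dimensions": ["cluster_id", "topic", "partition", "consumer_group"]
      },
      "granularitySpec": {
        "segmentGranularity": "hour",
        "queryGranularity": "minute"
      }
    },
    "ioConfig": {
      "topic": "telemetry-metrics",
      "consumerProperties": { "bootstrap.servers": "kafka:9092" },
      "taskCount": 4
    }
  }
}
```

Submitted to Druid's supervisor API, a spec like this continuously consumes the telemetry topic and makes events queryable within seconds of arrival.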
“By splitting high-cardinality data sources and introducing hot data tiers, we reduced P95 query latency by nearly 75%,” said Rajendran.
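Hot data tiering in Druid is expressed through retention load rules that pin recent segments onto faster "hot" Historical nodes. A minimal sketch of such rules follows, assuming a tier named `hot`; the periods and replica counts are illustrative, not Confluent's actual settings.

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1D",
    "tieredReplicants": { "hot": 2, "_default_tier": 1 }
  },
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 1 }
  }
]
```

Rules are evaluated top to bottom: the last day of data is kept replicated on the hot tier for low-latency queries, while older data remains available on default-tier nodes.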
Results
- 5M+ events per second ingested and processed in real time
- 75% reduction in P95 query latency
- Faster SRE response with improved alerting and lineage visibility
Why It Matters
Real-time observability is mission-critical for Confluent’s cloud service. By scaling its telemetry stack with Kafka and Druid, Confluent ensures:
- Reliable customer experiences across thousands of clusters
- Faster incident resolution and reduced downtime
- More efficient operations across its multi-cloud environment
Together, the executive vision and engineering execution created a strong foundation for Confluent’s continued growth and customer trust.
Confluent runs its mission-critical observability on Apache Druid. That same Druid query engine powers Imply Lumi, the industry’s first Observability Warehouse—delivering the real-time performance, scalability, and cost efficiency you need across your observability stack.