Continuing the Virtual Druid Summit Conversation with Athena Health

May 11, 2020
Karthik Urs, Athena Health

When an engaged technical audience asks great questions, it’s easy to run out of time during Q&A. And that’s exactly what happened at Virtual Druid Summit! Because our speakers weren’t able to address all of your questions during the live sessions, we’re following up with the answers you deserve in a series of blog posts.

Karthik Urs, Lead Member of Technical Staff, takes on your remaining questions as a follow-up to “Automating CI/CD for Druid Clusters at Athena Health”.

1. I’m wondering if you can talk about how you actually “test” the Druid cluster with your production queries and data?

There are two aspects here. Testing and benchmarking the queries falls under performance benchmarking and tuning. We have a test bed with multiple frameworks, the prominent ones being JMeter and Artillery. With these, we focus on concurrency testing and query response times at peak load. A small disclaimer here: we haven’t started playing with caching yet, so our testing on the benchmarking front is still nascent.
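As a rough illustration of that concurrency testing (not the actual JMeter/Artillery setup), here is a minimal Python sketch that fires a set of SQL queries at a Druid broker from many parallel clients and reports tail latency. The broker URL, datasource, and queries are hypothetical placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # pip install requests

# Hypothetical values -- substitute your broker endpoint and real workload queries.
BROKER_SQL_URL = "http://localhost:8082/druid/v2/sql"
QUERIES = [
    "SELECT COUNT(*) FROM my_datasource WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY",
    "SELECT channel, SUM(added) FROM my_datasource GROUP BY channel ORDER BY 2 DESC LIMIT 10",
]
CONCURRENCY = 50          # simulated concurrent clients
REQUESTS_PER_CLIENT = 20  # queries each client sends

def run_client(client_id: int) -> list:
    """Send queries sequentially and return the observed latencies in seconds."""
    latencies = []
    for i in range(REQUESTS_PER_CLIENT):
        query = QUERIES[i % len(QUERIES)]
        start = time.perf_counter()
        resp = requests.post(BROKER_SQL_URL, json={"query": query}, timeout=60)
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)
    return latencies

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(l for client in pool.map(run_client, range(CONCURRENCY)) for l in client)
    p95 = latencies[int(len(latencies) * 0.95) - 1]
    print(f"requests={len(latencies)} p95={p95:.3f}s max={latencies[-1]:.3f}s")
```

Dedicated tools like JMeter and Artillery add ramp-up schedules, scenario scripting, and richer reporting on top of this basic pattern.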

The second part, which I assume you also wanted to know about, is the validation of data. Our source is Snowflake. We are currently validating the data by running aggregate queries on both Snowflake and Druid and comparing the results. We also plan to test the cardinality of columns, etc. Snowflake has a feature called Time Travel, which is pretty cool: with it, we can run queries on Snowflake going back to the state the data was in when the ETL ran.
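To make that comparison idea concrete, here is a minimal sketch (with hypothetical connection details, table, and datasource names): count rows in the Snowflake source table as of the ETL cutoff using Time Travel, then compare against the row count of the corresponding Druid datasource.

```python
import requests                 # Druid SQL over HTTP
import snowflake.connector      # pip install snowflake-connector-python

# Hypothetical names -- substitute your own connection details, table, and datasource.
DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql"
ETL_CUTOFF = "2020-05-01 00:00:00"  # point in time the Druid ingestion was based on

def snowflake_count(conn, table: str) -> int:
    # Snowflake Time Travel: query the table as it looked at the ETL cutoff,
    # so both systems are compared against the same snapshot of the data.
    sql = f"SELECT COUNT(*) FROM {table} AT(TIMESTAMP => '{ETL_CUTOFF}'::TIMESTAMP_LTZ)"
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchone()[0]

def druid_count(datasource: str) -> int:
    query = f'SELECT COUNT(*) FROM "{datasource}"'
    resp = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=60)
    resp.raise_for_status()
    return resp.json()[0]["EXPR$0"]  # default column name for an unaliased expression

if __name__ == "__main__":
    conn = snowflake.connector.connect(account="my_account", user="my_user", password="...",
                                       warehouse="my_wh", database="my_db", schema="public")
    sf, dr = snowflake_count(conn, "events"), druid_count("events")
    print(f"snowflake={sf} druid={dr} match={sf == dr}")
```

The same pattern extends to per-column checks, such as comparing COUNT(DISTINCT col) on both systems for cardinality validation.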

2. Cool system! Have you thought about automating no-downtime rolling upgrades?

Not yet. Though we have the general idea, we haven’t started automating it.

3. Why such large zookeeper nodes? Do they have heavy mem/cpu usage in your cluster?

The ZooKeeper nodes are large instances, 2 vCPUs each. We are currently doing a complete backfill of data, and we have noticed ZooKeeper getting overwhelmed at peak ingestion/ETL times.

4. Do you do auto scaling of your broker or historical nodes based on queries?

That is the plan going forward. We just recently built our LMM stack on ELK, and we have enabled log emitters on each node. The next step is to scale brokers up based on CPU utilization, the number of threads (mainly CLOSE_WAIT; we don’t fully understand how threads behave in the Druid environment yet), and requests per minute.
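As a minimal sketch of what such a scaling rule might look like (the thresholds and request-rate-per-broker figure are illustrative placeholders, and the actual signals would come from the ELK pipeline described above):

```python
def desired_broker_count(current: int, cpu_pct: float, req_per_min: float,
                         cpu_high: float = 75.0, rpm_per_broker: float = 600.0) -> int:
    """Pick a target broker count from CPU utilization and request rate.

    Thresholds are illustrative placeholders, not tuned values.
    """
    target = current + 1 if cpu_pct > cpu_high else current       # CPU-bound: add a broker
    needed_for_load = -(-int(req_per_min) // int(rpm_per_broker))  # ceiling division
    return max(2, target, needed_for_load)                        # hypothetical floor of 2 brokers

if __name__ == "__main__":
    # Example: 3 brokers at 82% CPU serving 2,500 requests/minute -> 5 brokers.
    print(desired_broker_count(current=3, cpu_pct=82.0, req_per_min=2500))
```

The output of a rule like this would then feed whatever provisioning tooling (auto scaling groups, Terraform, etc.) manages the broker fleet.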

5. How is the state maintained for historical? Specifically, how do you handle if a historical node goes down in regards to state?

We are not explicitly handling it (yet). Our understanding is that as long as the Coordinator is alive and well, it manages the state of the Historicals. We haven’t put any safeguards in place yet, and we’re looking forward to postmortems from other folks who may have encountered problems in this specific area.

6. How did you handle downtime in production when data transfers from deep storage to historical takes time? For example, when we want to change our instance type. Did you add new servers first and remove the existing ones?

We are not in production yet, but we are getting close. We do indeed want to have rolling updates on Historicals.

You can find Athena Health’s talk and slides here.
