Continuing the Virtual Druid Summit Conversation with Athena Health
May 11, 2020
Karthik Urs, Athena Health
When an engaged technical audience asks great questions, it’s easy to run out of time during Q&A. And that’s exactly what happened at Virtual Druid Summit! Because our speakers weren’t able to address all of your questions during the live sessions, we’re following up with the answers you deserve in a series of blog posts.
1. I’m wondering if you can talk about how you actually “test” the Druid cluster with your production queries and data?
There are two aspects here. The first is testing and benchmarking the queries, which falls under performance benchmarking and tuning. We have a test bed with multiple frameworks, the most prominent being JMeter and Artillery. With these, we focus on concurrency testing and query response times at peak load. A small disclaimer here: we haven’t started experimenting with caching yet, so our testing on the benchmarking front is still nascent.
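The kind of concurrency test described above can be sketched in a few lines. This is a hypothetical illustration, not Athena Health’s actual harness: `send_query` stands in for an HTTP POST to a Druid Broker’s SQL endpoint, and the thresholds and request counts are arbitrary.

```python
import concurrent.futures
import statistics
import time

def benchmark(send_query, query, concurrency=8, requests=32):
    """Fire `requests` copies of `query` with up to `concurrency` in flight;
    return the per-request latencies in seconds."""
    def timed_call(_):
        start = time.perf_counter()
        send_query(query)  # stand-in for an HTTP call to the Broker
        return time.perf_counter() - start

    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(requests)))

def summarize(latencies):
    """Summarize latencies as mean, approximate p95, and max."""
    ordered = sorted(latencies)
    p95_index = max(0, int(len(ordered) * 0.95) - 1)
    return {
        "mean": statistics.mean(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }
```

Tools like JMeter and Artillery do essentially this at much larger scale, plus ramp-up schedules and richer reporting; the sketch only shows the core loop.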
The second part, which I assume you also wanted to know about, is validation of the data. Our source is Snowflake. We are currently validating the data by running aggregate queries on both Snowflake and Druid and comparing the results. We also plan to test the cardinality of columns, etc. Snowflake has a feature called Time Travel, which is pretty cool. With it, we can run queries against Snowflake’s state at the point in time when the ETL was running.
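The cross-check described above amounts to diffing two result sets. Here is a minimal sketch, assuming each system’s client library has already fetched rows into a `{group_key: aggregate_value}` dict; the tolerance parameter and the dict shape are assumptions for illustration.

```python
import math

def compare_aggregates(snowflake_rows, druid_rows, rel_tol=1e-6):
    """Return a list of (key, snowflake_value, druid_value) mismatches
    between two {group_key: aggregate_value} result sets."""
    mismatches = []
    for key in set(snowflake_rows) | set(druid_rows):
        sf = snowflake_rows.get(key)
        dr = druid_rows.get(key)
        # A key missing on either side, or values outside tolerance, is a mismatch.
        if sf is None or dr is None or not math.isclose(sf, dr, rel_tol=rel_tol):
            mismatches.append((key, sf, dr))
    return mismatches
```

A relative tolerance matters in practice because floating-point aggregates computed by two different engines rarely match bit-for-bit.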
2. Cool system! Have you thought about automating no-downtime rolling upgrades?
Not yet. Though we have the general idea, we haven’t started automating it.
3. Why such large zookeeper nodes? Do they have heavy mem/cpu usage in your cluster?
The ZooKeeper nodes are relatively large instances at 2 vCPUs each. We are currently doing a complete backfill of data, and we have noticed ZooKeeper getting overwhelmed at peak ingestion/ETL times.
4. Do you do auto scaling of your broker or historical nodes based on queries?
That is the plan going forward. We recently built our LMM stack on ELK and have enabled log emitters on each node. The next step is to scale brokers up based on CPU utilization, thread/connection counts (mainly connections in CLOSE_WAIT; we don’t yet fully understand how threads behave in the Druid environment), and requests per minute.
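A scaling rule over those signals could be sketched as a simple decision function. All of the thresholds below are illustrative assumptions, not values from Athena Health, and a real implementation would feed this from the ELK metrics and drive an autoscaling group.

```python
def target_broker_count(current, cpu_pct, close_wait_conns, reqs_per_min,
                        min_brokers=2, max_brokers=10):
    """Decide a target Broker count: scale out when any signal is hot,
    scale in when all signals are cold, otherwise hold steady.
    Thresholds are placeholder assumptions."""
    hot = cpu_pct > 80 or close_wait_conns > 200 or reqs_per_min > 5000
    cold = cpu_pct < 30 and close_wait_conns < 50 and reqs_per_min < 1000
    if hot:
        return min(current + 1, max_brokers)
    if cold:
        return max(current - 1, min_brokers)
    return current
```

Scaling by one node per evaluation, with floor and ceiling bounds, is a common way to avoid thrashing while the team is still learning how the metrics behave.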
5. How is the state maintained for historical? Specifically, how do you handle if a historical node goes down in regards to state?
We are not explicitly handling it (yet). Our understanding is that as long as the Coordinator is alive and well, it manages the state of the Historicals. We haven’t put any safeguards in place yet. We’re looking forward to postmortems from other folks who may have encountered problems in this specific area.
6. How did you handle downtime in production when data transfers from deep storage to Historicals take time? For example, when you want to change your instance type. Did you add new servers first and remove the existing ones?
We are not in production yet, but are getting close. We indeed want to have rolling updates on historicals.
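One common shape for such a rolling update is to upgrade Historicals one at a time, waiting after each node for the Coordinator to report segments fully available before moving on. The sketch below is a hypothetical outline, not Athena Health’s procedure: `stop_node`, `upgrade_node`, `start_node`, and `cluster_load_pct` are placeholder hooks you would implement against your own orchestration and the Coordinator’s load-status API.

```python
import time

def rolling_upgrade(nodes, stop_node, upgrade_node, start_node,
                    cluster_load_pct, poll_secs=5, timeout_polls=600):
    """Upgrade one Historical at a time; after each node, block until the
    cluster reports 100% of segments loaded before touching the next."""
    for node in nodes:
        stop_node(node)
        upgrade_node(node)
        start_node(node)
        for _ in range(timeout_polls):
            if cluster_load_pct() >= 100.0:
                break
            time.sleep(poll_secs)
        else:
            raise RuntimeError(f"cluster never reloaded after upgrading {node}")
```

Because segments live in deep storage, each restarted Historical can repopulate its cache without data loss; the wait between nodes is what keeps query capacity from dropping too far at any one moment.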
You can find Athena Health’s talk and slides here.