Continuing the Virtual Druid Summit Conversation with Athena Health
May 11, 2020
Karthik Urs, Athena Health
When an engaged technical audience asks great questions, it’s easy to run out of time during Q&A. And that’s exactly what happened at Virtual Druid Summit! Because our speakers weren’t able to address all of your questions during the live sessions, we’re following up with the answers you deserve in a series of blog posts.
1. I’m wondering if you can talk about how you actually “test” the Druid cluster with your production queries and data?
There are two aspects here. Testing the queries/benchmarking it falls under performance benchmarking/tuning. We have a test bed with multiple frameworks, prominent ones being jmeter and artillery. With this, we focus on concurrency testing and query response times at peak load. A small disclaimer here: we haven’t started playing with cache yet, so our testing on the benchmarking front is nascent.
The second part, which I assume you also wanted to know, is the validation of data. Our source is on Snowflake. We are currently validating the data by running aggregate queries on Snowflake and on Druid and compare them. We also plan to test cardinality of columns, etc. Snowflake has a concept called time travel which is pretty cool. With that, we can run queries on Snowflake going back in time to a state when the ETL was running.
2. Cool system! Have you thought about automating no-downtime rolling upgrades?
Not yet. Though we have the general idea, we haven’t started automating it.
3. Why such large zookeeper nodes? Do they have heavy mem/cpu usage in your cluster?
The zookeeper nodes are large instances. They are 2 vCPUs each and currently, we are doing a complete backfill of data and we have noticed zookeeper getting overwhelmed at peak ingestion/etl time.
4. Do you do auto scaling of your broker or historical nodes based on queries?
That is the plan going forward. We just recently built our LMM stack on ELK and we have enabled log emitters on each node. The next plan is to scale brokers up based on CPU utilization, # of threads (CLOSED_WAIT mainly, we don’t understand how threads behave in the Druid environment yet) and requests/minute.
5. How is the state maintained for historical? Specifically, how do you handle if a historical node goes down in regards to state?
We are not explicitly handling it (yet). Our understanding is, as long as coordinator is alive and well, it manages the state of historicals. We haven’t put any safeguards in place yet. Looking forward to postmortems of other folks who may have encountered problems in this specific area.
6. How did you handle downtime in production when data transfers from deep storage to historical takes time? For example, when we want to change our instance type. Did you add new servers first and remove the existing ones?
We are not in production yet, but are getting close. We indeed want to have rolling updates on historicals.
You can find Athena Health’s talk and slides here.
Other blogs you might find interesting
No records found...
May 21, 2026
A First Look at Lumi Loglake: Query Logs Where They Live
TL;DR: Imply Lumi Loglake is a lakehouse (separated compute/storage) architecture for unstructured logs that reduces costs from 40% up to orders of magnitude on your hardware/AWS/Azure bill used to run your...
Imply Lumi Major Release Preview: Continuing the Journey Towards Decoupled Observability/SIEM
We are getting ready to introduce the next major expansion of Imply Lumi and the observability warehouse. When we introduced the industry’s first observability warehouse, the goal was clear: decouple the...
Imply Lumi's Grafana Loki integration is now in Private Preview. The same logs you've loaded into Lumi for Splunk are now queryable natively in Grafana using LogQL with no second pipeline, no duplicate storage,...