May 1, 2020
Continuing the Virtual Druid Summit Conversation: Twitter has Answers
Swapnesh Gandhi, Twitter
Thanks again to everyone who attended Virtual Druid Summit, and for being so engaged – as we previously mentioned, our speakers received more than 150 questions across their collective sessions! Unfortunately, there wasn’t enough time to answer all of your very good questions during the live sessions. In an effort to bring you some closure, we’ve invited our esteemed speakers to address the remaining questions in a series of blog posts.
First up: Swapnesh Gandhi from Twitter delivers the answers you’ve been awaiting as a follow-up to “Analytics over Terabytes of Data at Twitter using Apache Druid”.
1. Any specific tunings/optimizations you did for the larger lookups apart from GC?
Yes, we do a few things. We use the lookup tiers feature a lot: we have separate lookup tiers for each historical tier, one for Brokers, and a canary tier that we use for test deploys. We do not load lookups in Peon tasks; we can do that because our real-time and batch use cases do not need these lookups.
During deploys of new lookups, we restart the cluster slowly to add a bit of entropy to when machines load lookups from MySQL. This helps reduce the load on MySQL.
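Lookup tiers are configured by registering lookups under a tier name on the Coordinator, while each service selects its tier via the druid.lookup.lookupTier property. A minimal sketch of a registration payload for the Coordinator's lookup config endpoint (tier name, lookup name, and map contents are hypothetical, not Twitter's actual setup):

```json
{
  "historical-hot": {
    "country_names": {
      "version": "v1",
      "lookupExtractorFactory": {
        "type": "map",
        "map": { "US": "United States", "JP": "Japan" }
      }
    }
  },
  "canary": {
    "country_names": {
      "version": "v1",
      "lookupExtractorFactory": {
        "type": "map",
        "map": { "US": "United States", "JP": "Japan" }
      }
    }
  }
}
```

Registering the same lookup under a canary tier like this lets you validate a new lookup on test nodes before rolling it out to the serving tiers.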
2. Do you use Spark for ingestion into Druid or go with standard MapReduce, or do you use a patch by Metamarkets with Spark?
No, we use standard MapReduce.
3. Monitoring: is there a threshold feature/governance to enforce performance expectations? Example: to stop an inefficient query before it impacts network, users, etc.
That’s a good question. We don’t have anything specifically tackling this. However, we are helped by the fact that most of our usage runs through the visualization tool (Pivot), which tries to optimize queries for interactivity, e.g., by running topN queries when possible. Other than that, we rely on timeouts in the various layers of our setup to terminate queries that we can’t run efficiently.
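Per-query timeouts of the kind mentioned can be set through the Druid query context. A hedged sketch of a topN query with a timeout, in milliseconds (datasource, dimension, and metric names are hypothetical):

```json
{
  "queryType": "topN",
  "dataSource": "tweets",
  "intervals": ["2020-04-01/2020-04-02"],
  "granularity": "all",
  "dimension": "country",
  "metric": "count",
  "threshold": 10,
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" }
  ],
  "context": {
    "timeout": 30000
  }
}
```

A query exceeding the context timeout is cancelled and returns an error, which is one way to keep a single inefficient query from tying up resources.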
4. By stripping dimensions, do you mean that older segments are dropped using drop rules or the data is reindexed after 30 days?
Yes, we reindex data after 30 days to strip high-cardinality or less-used dimensions. Also, for any new dimension we add, we do a cost-benefit analysis to understand whether the benefit of keeping that dimension beyond 30 days outweighs the cost of keeping it.
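Dropping dimensions from older data can be done with a Hadoop reindexing task that reads existing segments back via a dataSource inputSpec and writes them out with a trimmed dimensionsSpec. An abbreviated sketch, with required fields like the timestampSpec and metrics omitted for brevity (datasource name, interval, and dimension list are hypothetical):

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "dataSource",
        "ingestionSpec": {
          "dataSource": "tweets",
          "intervals": ["2020-03-01/2020-04-01"]
        }
      }
    },
    "dataSchema": {
      "dataSource": "tweets",
      "parser": {
        "parseSpec": {
          "format": "timeAndDims",
          "dimensionsSpec": {
            "dimensions": ["country", "client"]
          }
        }
      }
    }
  }
}
```

Any dimension left out of the dimensionsSpec is dropped from the rewritten segments, which is what makes this a useful pattern for shedding high-cardinality columns after their useful window passes.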
5. What do you use for deep storage backup?
A separate Hadoop cluster. We have tools that replicate data to another Hadoop cluster regularly.
6. Why is bigger node size better than bigger cluster size?
That is largely my opinion and as long as you have a reasonable balance, I don’t expect to see large differences in performance.
A couple of reasons I prefer bigger nodes: 1. In my experience, maintenance activities such as adding capacity or doing deployments take longer and are more tedious on a bigger cluster. 2. We haven’t tested this explicitly, but a large number of Historicals could lead to performance issues on Brokers, since each Broker maintains a connection to every Historical and needs to manage all of those connections. At one point, while migrating from smaller nodes to larger ones, the cluster was twice its usual size, and anecdotally we noticed some slowness that might be explained by this.
7. Do you load lookups in Peon task? What is the delay on Peon task startup?
No, we don’t load any lookups in Peon tasks. Neither our current batch nor our real-time use cases require loading lookups in Peons.
8. What are good resources for the underlying architecture, relevant configuration items and the effects it has on performance and reliability?
9. What type of queries are most used: timeseries, topN, groupBy?
Most of them are timeseries, then topN and lastly group by. A large chunk of our usage flows through the visualization (Pivot) which tries to run timeseries, topN queries when possible.
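The Pivot-style interactive queries described above frequently compile down to Druid timeseries queries. A minimal sketch of one, grouping results by hour (datasource and metric names are hypothetical):

```json
{
  "queryType": "timeseries",
  "dataSource": "tweets",
  "granularity": "hour",
  "intervals": ["2020-04-01/2020-04-08"],
  "aggregations": [
    { "type": "longSum", "name": "impressions", "fieldName": "impressions" }
  ]
}
```

Timeseries queries avoid the per-dimension-value work that groupBy does, which is why a visualization layer that can choose timeseries or topN over groupBy tends to stay interactive.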
10. What kind of segment balancer strategy do you use on coordinators? The default for druid.coordinator.balancer.strategy is cost, which doesn’t scale for large clusters in our experience.
We are using the default; we haven’t really tuned or played around with this config. But it’s a good suggestion – we will definitely look into it.
11. Do you have any specific tuning in Yarn for the batch ingestion? What is the queue size and how many queues?
We tune targetPartitionSize in the partitionsSpec of the indexing jobs; this config directly affects the number of reducers the batch ingestion uses. Beyond that, we tune basic things like the memory of mappers and reducers.
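In a Hadoop ingestion spec, targetPartitionSize lives in the tuningConfig's partitionsSpec, and mapper/reducer memory can be passed through jobProperties. A hedged sketch with hypothetical values (not Twitter's actual settings):

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "hashed",
      "targetPartitionSize": 5000000
    },
    "jobProperties": {
      "mapreduce.map.memory.mb": "4096",
      "mapreduce.reduce.memory.mb": "8192"
    }
  }
}
```

A larger targetPartitionSize means fewer, bigger segments and fewer reducers; a smaller one increases parallelism at the cost of more segments to manage.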
Regarding queues: our queues are separated into three tiers, and we assign jobs to them based on priority/SLAs. All Druid indexing jobs are assigned to tier 1 (highest priority). Since that tier runs several jobs other than Druid indexing, it would be misleading to share the queue size.
12. How do you determine incremental changes and what is the architecture to load this into historical node heap?
I am assuming this question is regarding lookups. We identify new entries based on the last modified timestamp for those entries. We use the druid-lookups-cached-global lookup extension. We store lookup data in MySQL and Druid services load data from MySQL.
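The druid-lookups-cached-global extension supports JDBC-backed namespaces with a tsColumn, which lets Druid poll only for rows modified since the last load rather than refetching everything. A hedged sketch of such a lookup definition (connection details, table, and column names are hypothetical):

```json
{
  "type": "cachedNamespace",
  "extractionNamespace": {
    "type": "jdbc",
    "connectorConfig": {
      "connectURI": "jdbc:mysql://mysql.example.com:3306/lookups",
      "user": "druid",
      "password": "secret"
    },
    "table": "user_segments",
    "keyColumn": "user_id",
    "valueColumn": "segment_name",
    "tsColumn": "updated_at",
    "pollPeriod": "PT10M"
  }
}
```

Here each service polls MySQL every ten minutes and uses the updated_at column to pick up incremental changes, matching the last-modified-timestamp approach described above.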
13. How do you handle multi-tenancy, if that’s the case?
We do not have a multi-tenant cluster at the moment. However, the platform team here at Twitter has completed their design and has a plan for what it should look like.
This is from Zhenxiao Luo who is working on that: “There are storage multi-tenancy and computation multi-tenancy, for storage side, we are planning to work on storage isolation, including storing each user’s segments under their own HDFS directory, instead of in a shared directory. For computation side, we plan to work on having reserved pools for historical nodes, so that some historical nodes computation is dedicated for some use cases.”
The reserved pools for Historicals are something they will implement; it’s not an open-source feature at the moment. Feel free to contact us if you want more details.
14. Do you find the Imply offering costly compared to running open source Druid?
A major reason for using Imply is their visualization tool, Pivot. As I mentioned before, we did look at other options when we were building out our offering and decided that Imply was the right choice for us.
15. What is the query granularity?
Most of our data sources are hourly; a couple of smaller data sources use daily granularity.