Druid Operator: Simplifying the management of Apache Druid in Kubernetes

Jun 13, 2023
Reena Leone

On this episode, Adheip Singh, founder of DataInfra, discusses the benefits and functionality of Druid Operator, a tool designed for managing Apache Druid deployments in Kubernetes. The operator acts as a bridge between Kubernetes and Druid, simplifying the process of scaling Druid clusters, supporting high availability and fault tolerance, and integrating with logging tools for troubleshooting and performance analysis. 

Listen to the episode to learn:

  • How Druid Operator enhances the user experience of running Druid on Kubernetes
  • How it provides a self-service platform for managing multiple Druid clusters efficiently and effectively
  • What improvements and customization options are in the works, such as the development of an ingestion controller for managing batch indexing jobs

About the guest

Adheip Singh is the founder of DataInfra, where he is building a centralized control plane for SaaS infra. Previously, he built SaaS solutions for Druid at Rill, Pinot at StarTree, and ClickHouse at ChistaDATA. He also maintains the Druid and Pinot Kubernetes operators in the community.


[00:00:00.890] – Reena Leone 

Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host Reena from Imply and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. And today we are talking about Druid Operator. For those who are unfamiliar, Druid Operator is a tool specifically designed for managing Apache Druid deployments in Kubernetes, which is one of my favorite topics for the show right now. And if by some chance you are unfamiliar with Kubernetes, it’s an open source platform that automates deployment, scaling, and management of containerized applications. And we are going to talk all about that today. To break it all down, I am joined by Adheip Singh, founder of DataInfra and a committer to Druid Operator. Adheip, welcome to the show.

[00:00:51.220] – Adheip Singh 

Hi Reena. 

[00:00:52.050] – Reena Leone 

Before we get into Druid Operator, I’d like to learn a little bit more about you and your journey. So can you tell me a little bit more about yourself and how you got to where you are today? 

[00:01:02.400] – Adheip Singh 

My name is Adheip Singh and I’m based out of Bangalore, India. I started my Druid journey in 2019, working on Apache Druid. It was a POC to stream call data records from Kafka to Druid and then to build some visualizations using Apache Superset. At that time, I was very new to the Druid ecosystem, so I was exploring what’s the best way to run Druid. And in 2019, the Druid operator got open sourced and I was the second contributor to this project. Since then, I’ve been diving deep into the Druid operator ecosystem and trying to run Druid on Kubernetes. And so far I’ve been here.

[00:01:45.520] – Reena Leone 

It’s awesome that you mentioned Kubernetes because I feel like in the Druid community that’s a hot topic. And when we were talking before, there hasn’t been too much info on it so far, or it’s just starting to get built out. But let’s take a step back and talk about Druid Operator and what it is. For those who don’t use Druid or haven’t used the operator, can you tell me a little bit more about it?

[00:02:07.880] – Adheip Singh 

I’ll take one step back and just try to explain what exactly an operator is. Operator was a term coined by CoreOS, and operators are basically Kubernetes controllers which reconcile the state of custom resources. Druid wasn’t designed to run on Kubernetes. When it came out, all the configuration management and all the tooling was built around virtual machines. So when you run a complex distributed database like Druid on Kubernetes, Kubernetes is not aware of Druid. It sees everything as a StatefulSet or as a pod. So the operator is essentially a bridge between what Druid wants and how to interpret that in Kubernetes. The operator pattern is basically that you extend the Kubernetes API by creating custom resources. And when you create those custom resources, the operator watches and reconciles those specific custom resources and creates the Druid clusters. So the operator is intelligent software which acts as a bridge between Druid and Kubernetes and manages the state of Druid.
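To make the reconcile idea concrete, here is a minimal sketch in Python. It is not the Druid operator's code (the operator is a Kubernetes controller, and nothing here calls the Kubernetes API). The NodeSpec fields, the image tag, and the reconcile function are hypothetical names chosen for illustration. The sketch only shows the core step: compare the desired state from a custom resource with the observed state and work out what has to change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Desired settings for one group of Druid nodes (hypothetical fields)."""
    replicas: int
    image: str


def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to move the observed state to the desired state."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in observed:
            actions.append(f"create {name}")   # declared but not running yet
        elif observed[name] != spec:
            actions.append(f"update {name}")   # running, but drifted from the spec
    for name in sorted(observed):
        if name not in desired:
            actions.append(f"delete {name}")   # running, but no longer declared
    return actions


# Illustrative values only; the image tag is a placeholder.
desired = {
    "brokers": NodeSpec(replicas=2, image="druid:example"),
    "historicals": NodeSpec(replicas=3, image="druid:example"),
}
observed = {"brokers": NodeSpec(replicas=1, image="druid:example")}

print(reconcile(desired, observed))
```

Once the observed state matches the desired state, the same function returns an empty list, which is what makes it safe to run repeatedly.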

[00:03:22.150] – Reena Leone 

So what are some of the main benefits of using Druid Operator in a Kubernetes environment? 

[00:03:27.750] – Adheip Singh 

Sure. In the current ecosystem, if you want to run Druid on Kubernetes, the most common way is using Helm charts. Helm is basically a configuration management tool. What Helm solves is configuration management. Once your configurations have been applied to a Kubernetes cluster, you need a controller, a piece of software which will reconcile those configurations. So there are two things: who is responsible for applying configurations and who is responsible for reconciling those configurations. The Druid operator will reconcile those configurations. The benefit to this ecosystem is this: Druid has different types of nodes. It has brokers, it has coordinators, it has routers, it has historicals. Then you can do tiering between these historicals. So how do you define a single manifest and say, this is what I want to achieve when I deploy a cluster? Essentially, the Druid custom resource is a single manifest file where you define the desired state of Druid. This is what you want a Druid cluster to look like. And when you submit that spec to the Kubernetes API, the operator will create a Druid cluster for you. The operator also manages upgrades. When you want to roll out a new Druid version, the operator makes sure that the version is rolled out to all the different Druid pods in an incremental way.

[00:04:52.140] – Adheip Singh 

So you define the order and the operator will make sure that it first upgrades historicals. Only once historicals are fully upgraded will it move to the next node type. And if you face any issue while rolling out, or the pods go into a bad state, the operator halts. In this way, your whole Druid cluster is still up and running. You can serve queries. There might be one pod which is in a bad state, so you can always come in and troubleshoot. The second benefit to this ecosystem is that the Druid operator helps in scaling Druid clusters. You can scale pods horizontally or you can scale them vertically. Scaling a system like Druid requires additional metrics. So as of now, the operator supports scaling Druid clusters vertically. It also supports the Horizontal Pod Autoscaling API, which is a Kubernetes-native API. So you can specify your HPA specs and the operator will make sure it reconciles and applies those configurations. When scaling Druid StatefulSets, which means historicals and the other node types, the Druid operator will basically add more storage to the underlying persistent volume claims. So whenever you want to increase the size of your storage, if your PVC is, let’s say, at 50 GB and you want to move your historical persistent volume claims to 100 GB, you can just edit that in your custom resource spec, and the operator handles it on the underlying Kubernetes side.

[00:06:36.110] – Adheip Singh 

It will delete the StatefulSets, expand the volume class, expand the volumes, and it’ll basically do the scale-up. So it’s basically automating all the manual steps required to scale a Druid cluster.
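The incremental rollout described in this answer (upgrade one node type at a time in a defined order, and halt if a group ends up in a bad state) can be sketched as follows. This is an illustration under assumptions, not the operator's implementation: the rolling_upgrade function, the order after historicals, and the health results are all hypothetical.

```python
def rolling_upgrade(order, upgrade_group):
    """Upgrade node groups one at a time; stop at the first group that fails.

    upgrade_group(name) returns True if the group is healthy after the upgrade.
    Returns (groups upgraded successfully, group where the rollout halted or None).
    """
    done = []
    for group in order:
        if not upgrade_group(group):
            # Halt: groups later in the order keep running the old version,
            # so the cluster as a whole can still serve queries.
            return done, group
        done.append(group)
    return done, None


# Hypothetical order and health results, for illustration only.
order = ["historicals", "brokers", "coordinators", "routers"]
healthy_after_upgrade = {"historicals": True, "brokers": False}

done, halted_at = rolling_upgrade(
    order, lambda group: healthy_after_upgrade.get(group, True)
)
print(done, halted_at)
```

In this run the historicals finish upgrading, the rollout stops at the brokers, and the coordinators and routers are never touched, which leaves room to troubleshoot the one bad group.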

[00:06:48.670] – Reena Leone 

It sounds like it really simplifies the process of scaling Druid clusters. 

[00:06:52.950] – Adheip Singh 

Yes, as I said, scaling Druid is very complicated. It requires metrics. So the Druid operator isn’t touching all the aspects of scaling. It’s touching the aspect of scaling horizontally using the Kubernetes API, because the Kubernetes API will have the metrics to scale the Druid cluster on the basis of CPU and memory. And the specific feature which was built was scaling Druid clusters vertically by adding more storage.

[00:07:24.630] – Reena Leone 

Does the operator integrate with any monitoring or logging tools so you can troubleshoot or do a performance analysis if you need to? 

[00:07:33.160] – Adheip Singh 

In the Druid operator, we support adding Log4j configuration. That is basically Druid’s native way of logging. The operator itself runs as a deployment in Kubernetes, and the native way of logging in Kubernetes is stdout. So if you have any agent running in your ecosystem, whether it’s Fluentd or Vector, you can always collect the operator logs and ship them to centralized storage. I personally have been using Parseable. It’s a log storage system. And on the Druid operator repo, we have integrated the Druid operator sending logs to Parseable for our end-to-end tests.

[00:08:19.530] – Reena Leone 

I mean, I know that open source technology has a lot of iterations and a lot of things in the works currently. Are there any limitations or drawbacks to using operator at the moment? Or is there anything that you’re working to improve?

[00:08:33.890] – Adheip Singh 

Definitely. Druid is a complex distributed system. And when you want to run it on Kubernetes, even the basic way of using Helm charts, as I said before, involves a lot of configuration. So when some developer wants to adopt the operator in an organization, there is always a chance of increasing the complexity, because you are adding another component between your native way of doing Kubernetes and the Druid ecosystem. So the operator definitely adds some level of complexity. But when you try to understand what benefit it’s really bringing, it really improves the user experience of running Druid. A single Druid operator can run up to 40 or 50 Druid clusters. This was something I saw when I was at Rill Data, which was a SaaS platform for Apache Druid. During my time there, a single Druid operator was reconciling up to 40 Druid clusters in production, and it was upgrading all the clusters at once. So that’s the power of this operator. Operators can give you a self-serve Druid platform.

[00:09:50.450] – Reena Leone 

Diving in a little bit deeper: how does Druid Operator ensure high availability and, say, fault tolerance in a Druid cluster?

[00:09:58.600] – Adheip Singh 

That’s a very interesting question. Kubernetes itself provides a lot of fault tolerance mechanisms, and the operator isn’t rebuilding any of them. It’s not rearchitecting or introducing any new concepts; it’s leveraging existing Kubernetes concepts. Whenever you apply a spec, that is basically a desired state. This is how you want a Druid cluster to look. The operator maintains the state of the application. So if you make any change, the operator will always reconcile the configurations. Let’s say in a production environment you come and delete a StatefulSet by mistake. The operator will make sure it recreates the StatefulSet, because it’s maintaining the state. It’s aware of what the desired state is and what the current state is. So the operator handles a three-way configuration with three states: one is the original state, then you have the current state, and then you have the desired state. The operator is aware of all three at each point of its reconciliation. The operator runs as a reconciliation loop. The underlying controller pattern is a combination of event-driven plus polling. So it’s not exactly a state machine; it’s based on observed state.

[00:11:28.300] – Adheip Singh 

So whenever an event occurs, it will observe the state and build the logic for it. So at each point the operator is very aware of what’s happening in a Druid cluster. So the operator can handle all the fault tolerance mechanisms: recreation of pods, upgrades, and rollbacks.
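The "event-driven plus polling" pattern mentioned here can be sketched with a queue that either delivers an event or times out, at which point the loop resyncs anyway. This is a toy model under assumptions, not the operator's code: run_controller, the set-based cluster model, and the bounded iteration count are all hypothetical and exist only so the example terminates.

```python
import queue


def run_controller(events, observe, recreate, desired, resync_seconds, max_iterations):
    """Reconcile on every event, and also on a timer when no event arrives."""
    log = []
    for _ in range(max_iterations):
        try:
            trigger = events.get(timeout=resync_seconds)  # event-driven path
        except queue.Empty:
            trigger = "resync"                            # polling path
        current = observe()                # act on observed state, not on the event payload
        missing = sorted(desired - current)
        for name in missing:
            recreate(name)                 # stands in for recreating the deleted object
        log.append((trigger, missing))
    return log


# The cluster is modeled as a set of object names (illustrative only).
desired = {"historicals", "brokers"}
cluster = {"historicals", "brokers"}

events = queue.Queue()
cluster.discard("historicals")        # someone deletes a StatefulSet by mistake
events.put("deleted historicals")     # the deletion arrives as an event

log = run_controller(
    events,
    observe=lambda: set(cluster),
    recreate=cluster.add,
    desired=desired,
    resync_seconds=0.01,
    max_iterations=2,
)
print(log)
```

The first pass is triggered by the deletion event and recreates the missing object. The second pass is triggered by the timeout and finds nothing to do, which is the behavior you want from a periodic resync.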

[00:11:48.200] – Reena Leone 

Would there be any use cases where the operator would need to be customized or extended to better suit the requirements of a specific use case?

[00:12:02.250] – Adheip Singh 

Yes, definitely. In the Druid community, one thing which I’ve been working on is building an ingestion controller in the Druid operator. The operator is a higher-level abstraction; underneath, inside the operator, everything is controllers. So we are building a new ingestion controller. This ingestion controller is responsible for reconciling your ingestion config. In Druid, if you want to start a batch indexing job, you need to submit an ingestion spec. In that ingestion spec you define your schema, you define all your inputs, and the Druid operator will reconcile those configurations. So you can express your ingestion spec as a Kubernetes manifest and the operator will maintain the state of that particular ingestion job. If you want to delete, create, or update, you can just apply on Kubernetes and the ingestion controller will reconcile. This is something which we have been working on in the Druid community. It is a work in progress, and we are planning to support all the native ingestion specs.
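One way to picture how a controller could decide between create, update, and delete for a declared ingestion spec is to fingerprint the spec and compare it with the fingerprint of what was last applied. This is purely an assumption for illustration; the interview does not say how the ingestion controller tracks state. The fingerprint and plan functions are hypothetical, and the spec below is a trimmed placeholder rather than a complete Druid ingestion spec.

```python
import hashlib
import json
from typing import Optional


def fingerprint(spec: dict) -> str:
    """Stable hash of a spec; key order does not affect the result."""
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()


def plan(desired_spec: Optional[dict], applied_fingerprint: Optional[str]) -> str:
    """Decide what to do with an ingestion job given the declared spec and what was last applied."""
    if desired_spec is None:
        return "delete" if applied_fingerprint else "noop"
    if applied_fingerprint is None:
        return "create"
    return "noop" if fingerprint(desired_spec) == applied_fingerprint else "update"


# Trimmed placeholder spec, not a complete ingestion spec.
spec = {"type": "index_parallel", "spec": {"dataSchema": {"dataSource": "call_records"}}}

print(plan(spec, None))                       # nothing applied yet
applied = fingerprint(spec)
print(plan(spec, applied))                    # unchanged since last apply
changed = {"type": "index_parallel", "spec": {"dataSchema": {"dataSource": "call_records_v2"}}}
print(plan(changed, applied))                 # manifest was edited
print(plan(None, applied))                    # manifest was removed
```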

[00:13:14.700] – Reena Leone 

Oh, that’s perfect because I was actually just going to ask you what’s in store for the operator and what you’re working on right now, because I know we have several Druid releases coming up this year. Is there anything else you’re working on in regards to Druid, not just the operator? Or is this your primary focus?

[00:13:35.040] – Adheip Singh 

At DataInfra, I have a wider focus, which is to run OLAP and query engines on Kubernetes using control planes. I see Kubernetes as a control plane, not just as an orchestration platform. On the Druid operator specifically, the operator was initially authored by Himanshu Gupta, who is a Druid PMC member, and it was in the druid-io repo. As the community progressed, we shifted the operator from the druid-io repo to the DataInfra repo. The previous operator was lacking a lot of documentation and maintenance. So this year has been mostly focused on building new documentation, and specifically on evangelizing the correct way to run the Druid operator, clearing up whatever myths there are about Helm versus the operator, and building tutorials. So I’m focused on this aspect of improving the Druid operator, and on the development side, definitely the ingestion spec controller.

[00:14:37.380] – Reena Leone 

For our listeners, where can they find that documentation in those tutorials? Are they on GitHub? Are they on your website? Where could they look those up? 

[00:14:46.030] – Adheip Singh 

So my website is still a work in progress, but by the end of this month they will be on the DataInfra website under Druid operator.

[00:14:57.240] – Reena Leone 

Well Adheip, thank you so much for joining me today and talking through Druid operator. I am so psyched to have you on the show. I can’t wait to see those tutorials when they go live. And good luck with launching your website. 

[00:15:08.740] – Adheip Singh 

Thank you, Reena. I’d like to just thank the Druid PMC, specifically Gian [Merlino], for helping move the Druid operator repo and collaborating on all the aspects of running Druid on Kubernetes.

[00:15:23.590] – Reena Leone 

If you want to learn more about things that we talked about today, including Apache Druid, please visit druid.apache.org. And if you want to learn more about Imply, visit imply.io. Until next time, keep it real.
