How Apache Druid Revolutionized Digital Turbine’s Analytics Infrastructure with Lioz Nudel and Alon Edelman
Who better to talk about real-world usage of Apache Druid than Digital Turbine, a leading mobile growth and monetization platform? They go way back with Druid. On this episode, Lioz Nudel, Engineering Group Manager at Digital Turbine, and Alon Edelman, Data Architect at Digital Turbine, discuss how Druid has significantly improved their analytics infrastructure in terms of performance and scalability. We cover their journey from MySQL to Druid, highlighting the scalability, performance, and agility that Druid offers, and delve into specific use cases, such as analyzing massive amounts of data and managing cloud computing costs.
Digital Turbine’s analytics landscape was once a complex network of systems, each with its limitations and challenges. However, with the introduction of Apache Druid, their entire analytics infrastructure underwent a remarkable transformation. Druid seamlessly replaced multiple analytics systems, becoming the core of their analytical operations. From offering secure access to customer-specific data to empowering internal teams with efficient reporting capabilities, Druid’s performance and ease of use were unparalleled.
Before adopting Apache Druid, Digital Turbine faced significant scalability issues with their previous analytics databases. As their data volume skyrocketed to 6 billion events per hour, conventional SQL databases like MySQL struggled to keep up. Apache Druid emerged as the perfect solution, offering the scalability and stability needed to process massive amounts of data swiftly. With Druid, Digital Turbine’s programmers could implement new features and make changes within hours or days, instead of weeks or months.
Listen to the episode to learn
- How replacing legacy systems with Apache Druid brought scalability, performance, and cost-effectiveness to Digital Turbine’s analytics infrastructure
- How they overcame challenges with Druid, including having to build their whole infrastructure on their own as such early adopters
- How Druid has become the backbone of Digital Turbine’s data operations
Learn more
- Druid Architecture & Concepts
- The Dynamic Duo: Apache Druid and Kubernetes
- Digital Turbine: a one-stop platform for user acquisition growth and monetization
About the Author
Lioz Nudel is the Engineering Group Manager at Digital Turbine, and Alon Edelman is a Data Architect at Digital Turbine.
Transcript
[00:00:00.250] – Reena Leone
Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. And today we are talking to Digital Turbine. They deliver end-to-end products and solutions for mobile operators, device OEMs, and other third parties to help them effectively monetize mobile content. And they’re Apache Druid users. Today’s guests are from the Ad Tech side, which you should know by now is a perfect use case for Druid. Joining me from DT today are Engineering Group Manager Lioz Nudel and Data Architect Alon Edelman. Lioz, Alon, welcome to the show.
[00:00:45.550] – Lioz Nudel and Alon Edelman
Hey, nice to be here.
[00:00:47.340] – Reena Leone
So I like to start off every show with like, a little bit about your background, how you got to where you are today. So Lioz, do you want to start?
[00:00:55.330] – Lioz Nudel
Yeah, sure. So I started around ten years ago, way back in Interactive, three acquisitions ago. Currently we’re at DT Digital Turbine. I started as Integration Manager and moved to the DevOps area around eight years ago. And currently I am the group lead of the DevOps and Data group.
[00:01:16.300] – Alon Edelman
And I am Alon. I joined 13 years ago as a DBA here, and now I’m a data architect that helps with all the different databases here in DT.
[00:01:27.600] – Reena Leone
Awesome. So can you tell me a little bit about DT, a little bit about Digital Turbine, what you guys do?
[00:01:33.320] – Lioz Nudel
Yes. So in one sentence, we can call it a one stop shop for a publisher, both for the user acquisition side and the app monetization side. Basically, Alon and I come from the adtech side of DT, and we’re working on the exchange. We have around 6 billion requests per hour there, which generate around 15 terabytes of raw data.
[00:02:02.710] – Reena Leone
Okay, so this makes sense for the Druid use case, which we’re going to talk about today. Between that and Ad Tech, like, that is like kind of the Apache Druid sweet spot, right?
[00:02:14.570] – Alon Edelman
Yeah, amazing.
[00:02:15.820] – Reena Leone
And I know that you guys actually go way back with Apache Druid. When did you first start using it?
[00:02:22.400] – Alon Edelman
We actually started at version 0.8. We checked for you… Around 2016. We started with MySQL, which did not scale the way we expected, and we were looking for different solutions. And this is how we got to Druid.
[00:02:43.660] – Reena Leone
So when I was doing a little bit of my research, I noticed that you have some complex data pipelines with streams coming in from Apache Kafka. You have aggregation with Spark, you have data storage systems. Where does Druid fit in here?
[00:02:59.770] – Alon Edelman
So this is actually an evolution from starting with a relational database. We started with MySQL and we looked for other things. And we are talking about something like 2013, right? It was ten years ago, and we actually pivoted to Cassandra. We started to create a lot of aggregations. This is the normal path, right? Create aggregations, predefined aggregations, and reporting over the aggregations. And that was pretty good, it worked for us, but the velocity was not great. Every time we wanted to change something, it took us a lot of time. We had to rebuild the aggregations, do everything from scratch, and it was extremely slow. So this is how we got to Druid: we wanted something that aggregates the data, we don’t really care about the historical raw data, and we started to look for different solutions. And as a customer of Metamarkets…
[00:04:04.580] – Reena Leone
Hmm mmh, going all the way back!
[00:04:06.510] – Alon Edelman
All the way back. We were very intrigued by Metamarkets’ infrastructure, and when Druid became open source and we understood a little bit of what was going on behind the scenes, I think we can call ourselves very early adopters of the system. We actually started when it didn’t even compile; we had to compile it ourselves.
[00:04:28.390] – Alon Edelman
It was very early. But as we invested more engineering resources, we saw it solved our velocity issues: once we start ingesting, we can do whatever we want with the data. That’s what sold us on it as the solution we were going to go with. So this is how we started with it.
[00:04:50.160] – Reena Leone
Awesome. And so what aspects of Druid make it a good fit for you today?
[00:04:55.440] – Lioz Nudel
First of all, the main reason is that it’s an aggregation database, a time series database. That’s the main thing we were looking for. And as Alon mentioned, we did it in the past with SQL, which was really a hassle, and that’s why we had Alon as a DBA at the beginning. And then scalability, stability, and cost were really important for us as well.
[00:05:20.090] – Reena Leone
Can you tell me a little bit about what made SQL challenging to do? Just for my own knowledge?
[00:05:26.810] – Alon Edelman
In one word, it doesn’t scale, right? When you have a SQL database, go back to the numbers: now we have 6 billion events per hour, right? Even ingesting them into MySQL today is not possible, and even ten years ago, when we had less traffic, it was very challenging. So of course, we went down the usual path. Again, talking about 13 years ago, we had sharding and triggers and a lot of different solutions that we tried to resolve the issue with, but eventually it was slow. Not only on the performance side, that too, but mostly in velocity. Every time we wanted to change something, it got really complicated. So we looked for something that would allow our programmers to move fast. And this is, I think, the main advantage of Druid, that we can move fast. You want something new? No problem. We can get it done in hours or days, not weeks or months.
[00:06:27.800] – Reena Leone
That actually is a really good segue to dive in and talk a little bit about use cases. Let’s get specific here. Can you share some specific use cases where you’re using Apache Druid and what you’re using it for?
[00:06:40.540] – Lioz Nudel
Yes. So we have two main use cases. The first and biggest one, which we started with, is of course that Druid is currently the core of our analytical system. We used Pivot back then, which was open source, and our engineers implemented our UI over Pivot; we needed many, many additional features, and our UI team implemented them for us. This internal fork is being used both by internal and external customers, business customers of ours, and it helps analyze all of the massive amounts of data that we have. This is the main reason we use Druid, but currently we are using Druid even for analyzing our cloud computing costs.
[00:07:34.530] – Reena Leone
Awesome.
[00:07:35.120] – Lioz Nudel
Yeah, we even had a blog post with Imply about this area in the past.
[00:07:42.450] – Reena Leone
Oh, very cool. I’ll have to go find that one. That must have been before my time, but yeah, no, that’s fantastic. Has Apache Druid improved performance and scalability of your analytics infrastructure?
[00:07:56.560] – Alon Edelman
No, it actually completely replaced it. Before that, we had a lot of analytics systems, and now we just use it with a permission infrastructure. When you have a customer, it can access only its own data, everything very secure, and when we have an internal system, it can see all the data. And as an analytics infrastructure, it’s very easy to create reports and get all the info over there. So we use it both externally and internally. There are almost no analytics at all outside of Druid; everything is done in there.
[00:08:31.320] – Reena Leone
Although I should mention, when we talk about Pivot, we did take it back. It’s now an Imply product.
[00:08:38.070] – Alon Edelman
When it was open source.
[00:08:41.830] – Reena Leone
Just for total transparency on the show. So people are like, let me see if I can find it.
[00:08:48.410] – Lioz Nudel
Oh, yes, that’s right. We moved to Turnilo; there is an open source equivalent.
[00:08:55.030] – Alon Edelman
We actually built our own unique usage on top of it, so it’s not really Pivot anymore. It’s something very internal and very specific for our customers, internally and externally.
[00:09:06.710] – Reena Leone
Oh, actually, can you tell me a little bit more about that? That sounds really interesting. Which is also one of the reasons I love open source so much, because sometimes you start with one thing, it becomes something entirely different.
[00:09:19.620] – Alon Edelman
We have really a lot of features. For example, we have a lot of state machines that change the way we present dimensions and metrics, and according to the usage, per user, per publisher, everything changes when you log in. We do some data pre-fetching so it will be faster when you first log in.
[00:09:42.330] – Lioz Nudel
We have several user limitations in the UI that limit the number of queries that go to Druid. For example, if you have two GROUP BYs together, we won’t fire all of the secondary queries, to limit the queries that go to Druid.
[00:09:58.510] – Alon Edelman
Security wise, we changed everything from a security perspective so it would be in accordance with our guidelines. There is a lot going on behind the scenes: how we access Druid, and which data source we access where. We have some UI implementations where we can drag and drop more, and we can stop queries in the middle. And over the years, we are talking about 2016, and now it’s 2023, we have a UI team that works on it, so it keeps evolving. Even now, new features are being implemented.
[00:10:39.300] – Lioz Nudel
I think the first big internal feature that the UI team created over Pivot was comparing date ranges, which was still missing in Pivot back then.
[00:10:52.530] – Reena Leone
So I imagine that you use a lot of different technologies depending on what your clients need, what publishers need. What were some key factors you considered when evaluating Apache Druid against other databases or analytics solutions?
[00:11:09.120] – Alon Edelman
So again, it was a long time ago.
[00:11:12.470] – Reena Leone
I’m taking you back down memory lane.
[00:11:15.350] – Alon Edelman
You’re the Druid historian. I am the company historian.
[00:11:20.650] – Reena Leone
That makes me feel good that you listened to one of my episodes when I said that.
[00:11:23.930] – Alon Edelman
I did listen to the episodes, I think most of them, if not all of them. So we actually evaluated the competitors even then, and they are still the main competitors right now: Druid, ClickHouse, and Pinot. The reason we chose Druid is because we did POCs on all of them, and Druid won both on cost and on performance. We got very good performance at reduced cost, and as part of the adtech community, it really fits our needs. And actually, if it was a race, Druid was not the first one, but the only one that reached the finish line.
[00:12:09.210] – Alon Edelman
It’s not that we are not using other tools. We do have ClickHouse here and other utilities. But for this use case, impressions, clicks, bids, auctions, it really wins by a landslide.
[00:12:20.830] – Reena Leone
I mean, that’s great. And that’s kind of why I asked the question, because I know companies are using different technologies; it depends on the use case. That’s kind of the brilliance of open source, or even just the real-time data space, right? Obviously I am a little biased towards Druid, but I like to keep it real on this show. I like to be honest and have transparency. Let’s shift gears a little bit. Have you run into any challenges with Druid, and if so, how did you solve them?
[00:12:49.930] – Lioz Nudel
As we mentioned before, we have a huge scale of data. Eventually even Druid started to become slow for us in our use case, and to solve this we had to scale up, and eventually the cost was pretty high. Then we started to think about an alternative solution. So we monitored our users and their queries, and we started to create smaller data sources, each with fewer dimensions in it, and querying the specific smaller data source worked way faster than querying one big data source. Our internal UI team developed, over the open source Pivot of course, a feature to query the smallest data source we have that contains the specific set of dimensions the user needs, and that solved the issue. That was one of the biggest implementations and features that we built together with the UI team.
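To illustrate the routing idea Lioz describes, here is a minimal sketch of picking the smallest pre-aggregated data source that still covers the dimensions a report asks for. The catalog, data source names, and dimensions are hypothetical placeholders, not Digital Turbine’s actual schema:

```python
# Hypothetical catalog: data source name -> the dimensions it was rolled up with.
DATASOURCE_CATALOG = {
    "events_full":           {"publisher", "app", "country", "os", "ad_format", "exchange"},
    "events_pub_country":    {"publisher", "country"},
    "events_pub_app_format": {"publisher", "app", "ad_format"},
}

def pick_datasource(requested_dims: set) -> str:
    """Return the smallest data source whose dimension set covers the request."""
    candidates = [
        (len(dims), name)
        for name, dims in DATASOURCE_CATALOG.items()
        if requested_dims <= dims  # the data source must contain every requested dimension
    ]
    if not candidates:
        raise ValueError(f"No data source covers dimensions: {requested_dims}")
    # Fewest dimensions first: the most aggressively rolled-up source wins.
    return min(candidates)[1]

print(pick_datasource({"publisher", "country"}))        # -> events_pub_country
print(pick_datasource({"publisher", "os", "country"}))  # -> events_full
```

In their UI this decision is made automatically per query, so users never need to know which roll-up actually serves their report.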
[00:13:58.410] – Reena Leone
That makes sense. I always talk about petabytes of data in these large data sets. And that might not be what everybody is dealing with in terms of Druid, but you definitely have a lot of data going on. I like that you had so much data that you slowed it down. That’s pretty impressive, to be honest.
[00:14:15.760] – Lioz Nudel
Yeah, maybe another major challenge, at least back then, was that we had to create the whole infrastructure on our own, because there were no official Chef cookbooks or Ansible automations or anything in the community. So we wrote everything on our own with our DevOps team. It took quite a lot of time to create something stable, but we built it, and we’re using it to this day. But these days you have the official Helm chart and you can deploy everything over Kubernetes, so it’s way easier than when we started.
[00:15:01.500] – Reena Leone
So I know when we were talking a little bit before, you had mentioned another one of the challenges that you had was around visibility. But you actually kind of came up with your own solution for that. Can you tell me a little bit more about it?
[00:15:14.150] – Alon Edelman
So what we did, we actually created a Clarity-like system. We take all the data and we put it into our own system, which of course is Druid as well, with Kafka and Druid. And then we have a very good visibility system that also connects to our own system, and this way we can see where the failure points are, what is slow, which is the slowest node. We can find problems in the infrastructure, problems in our system, problems in Druid, actually. And that took us, over the years, to a new challenge, which is how do we test configuration changes, how do we test new versions? So we have an entire system that runs all the queries against new clusters with the same data sets and compares the timing. We can run it with new variables: when we change the infrastructure, when we add a node, when we remove a node, when we change the segment size, when we change the data sources. Everything is measured in order to get good performance at a reduced cost. Always the reduced cost.
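As a rough sketch of that kind of replay-and-compare testing, the snippet below runs the same SQL against two clusters through Druid’s standard SQL endpoint (/druid/v2/sql) and reports the timing difference. The cluster URLs and the query are placeholders; DT’s actual system replays its real query workload:

```python
import time
import requests

# Placeholder router URLs for the current and candidate clusters.
BASELINE  = "http://druid-prod-router:8888"
CANDIDATE = "http://druid-test-router:8888"

# In practice this list would come from a captured production query log.
QUERIES = [
    "SELECT publisher, COUNT(*) AS requests FROM events "
    "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR GROUP BY 1",
]

def run_query(base_url: str, sql: str) -> float:
    """Run one query through Druid's SQL API and return wall-clock seconds."""
    start = time.monotonic()
    resp = requests.post(f"{base_url}/druid/v2/sql", json={"query": sql}, timeout=300)
    resp.raise_for_status()
    return time.monotonic() - start

for sql in QUERIES:
    base = run_query(BASELINE, sql)
    cand = run_query(CANDIDATE, sql)
    print(f"baseline={base:.2f}s  candidate={cand:.2f}s  delta={cand - base:+.2f}s  {sql[:60]}...")
```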
[00:16:24.410] – Reena Leone
That sounds so cool. On this show we’ve talked a lot about operational visibility and figuring out where there are issues, and whether to build your own system for it, and you built a system off of Druid. If I had known that, I would have had you come on my show for the visibility episode.
[00:16:43.890] – Alon Edelman
Okay, next time.
[00:16:44.850] – Reena Leone
Maybe next time. Next time. I feel like every time I do an episode it’s completely different, and I’m coming up with more ideas from talking to folks. But that’s not the only thing. You’ve been doing lookups a little bit differently too, haven’t you?
[00:16:59.980] – Lioz Nudel
Yeah, actually we started with our internal lookup system after a failure, an incident that we had. Until that point, we managed our lookups manually in the Druid UI, and it really sucked, because we have around 86, 87 lookups, and it’s a really big hassle to check that everything is correct there. So we created a new internal tool where, through code, we implement the new lookups into Druid. Using Druid’s API calls, we can insert, delete, or edit any of the lookups. Right now, everything is in Git and managed by Jenkins.
[00:17:52.170] – Reena Leone
Okay, cool.
[00:17:53.270] – Alon Edelman
All of our configuration is managed by code, but lookups were a little bit out of the ordinary, so we actually moved that too. And of course the rules, the deletion rules, the load rules: every time we migrated and someone did something manually, we would suddenly get a data source that grew because we forgot to add the delete rule, or we would get a lookup that no one had used for a while and suddenly get errors because someone deleted it or changed it. So now everything is managed by code, using the Druid API, which is amazing. Everything is done by code; every time something comes up, it’s compared and removed and changed. No one touches it manually anymore. It’s nice.
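For reference, here is a minimal sketch of managing lookups through the Coordinator’s lookup API, the kind of API-driven flow Lioz and Alon describe. The coordinator address, tier, lookup name, and mapping are illustrative placeholders, not their actual tooling:

```python
import requests

COORDINATOR = "http://druid-coordinator:8081"  # placeholder address
TIER = "__default"

def upsert_lookup(name: str, mapping: dict, version: str) -> None:
    """Create or update a simple map lookup through the Coordinator API."""
    spec = {
        "version": version,
        "lookupExtractorFactory": {"type": "map", "map": mapping},
    }
    resp = requests.post(
        f"{COORDINATOR}/druid/coordinator/v1/lookups/config/{TIER}/{name}",
        json=spec,
        timeout=60,
    )
    resp.raise_for_status()

def delete_lookup(name: str) -> None:
    """Remove a lookup that is no longer referenced anywhere."""
    resp = requests.delete(
        f"{COORDINATOR}/druid/coordinator/v1/lookups/config/{TIER}/{name}",
        timeout=60,
    )
    resp.raise_for_status()

# Example: a CI job (e.g. Jenkins) reads the desired lookups from Git and syncs them.
upsert_lookup("country_names", {"US": "United States", "IL": "Israel"}, version="v2")
```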
[00:18:39.420] – Reena Leone
Very cool. Yeah. We’ve been talking a lot about Kubernetes on the show lately because it’s been a hot topic in the community. And speaking of the community, everybody has been working hard over the last couple of years on improvements to Druid: the addition of the multi-stage query framework, and more recently, in Druid 26, schema auto-discovery, which is like flexible schemas, and also shuffle joins. I’m going to do an episode in the future about Druid and joins. Is there anything that you’re hoping to see in future releases?
[00:19:17.510] – Alon Edelman
Okay, so we are usually a very early adopter. We keep ourselves on our toes and keep improving. But we are pivoting the entire infrastructure to GCP, and that took a lot of resources, so we are still stuck on 0.22 and we didn’t have time to evaluate everything. But we are very excited right now. We are finishing the migration to GCP, and with 26 we are very excited about using, of course, the Kubernetes equivalent. We invested so much time in Kubernetes, so we are very excited and we are going to implement it, I think first of all running the MiddleManagers on Kubernetes. And of course, we are very excited about UNNEST and the array support. We do have our own solutions right now using arrays and tags, and we are very excited to pivot out of it. Pivot.
[00:20:11.170] – Reena Leone
Taking it right back. Taking it right back. I love a good callback, let me tell you.
[00:20:17.910] – Alon Edelman
So that’s where we are going to invest time as soon as we finish the migration. We already tested, and we already know that it will give us better performance at reduced cost. We are very cost oriented here as an adtech company, so we know it will be better. So as soon as we finish the migration, which basically happened today, in a week or two we are going to start moving to the new features.
[00:20:44.650] – Reena Leone
Awesome. I guess by the time this is up, that will already hopefully be happening for you too.
[00:20:50.460] – Alon Edelman
It will be amazing. We are very excited about the new features.
[00:20:53.800] – Reena Leone
Awesome. Well, thank you guys for joining the show today. This has been awesome. I always love to hear about how folks are using Druid, especially when you’ve been around for a while and have been using it since the earlier releases, to your point about being the unofficial Druid historian. So thank you again for joining me on the show today.
[00:21:14.310] – Alon Edelman
Thank you. It was a pleasure.
[00:21:16.410] – Reena Leone
All right. If you would like to know more about Apache Druid, please visit druid.apache.org. And if you want to know more about Imply, please visit imply.io. Until next time, keep it real.