Why Apache Druid is Not Like Other OLAP Databases
If you’re here, you likely know that Apache Druid is a high-performance, real-time analytics database. Even if by some chance you don’t know Druid, you’re probably familiar with some of the companies using it, such as Netflix, Salesforce, and Target. But what are they using it for? And why did they choose Druid in the first place?
Apache Druid is probably best known for fast, slice-and-dice analytics and subsecond query response on large data sets. Druid’s speed and scalability make it the right choice for the following key use cases: operational visibility, rapid data exploration, customer analytics, and real-time decisioning.
On this episode of Tales at Scale, we dive into each use case with real world examples. Listen to learn about:
- How GameAnalytics, the number one analytics tool for game developers, which monitors more than 100,000 games, went from managing two separate systems for historical and real-time data to a single Druid-based system that supports high concurrency, reducing cost and eliminating the precomputation that was stifling its delivery of real-time insights to game developers.
- How Atlassian started using Druid for Confluence’s analytics and no longer needs to pre-process data from Amazon Kinesis for ingestion into PostgreSQL, since Druid connects directly to Kinesis. They can now handle close to a million queries a day from a base of 250,000 customers.
- How live video streaming platform Twitch used Druid to build a system that can ingest 10 billion events a day without issues, query across terabytes of data in a single day, and handle concurrent users making 70,000 to 100,000 queries a day.
- And how Druid’s direct connections to Apache Kafka and Amazon Kinesis streaming data pipelines, and its ability to handle high concurrency coming not just from users but from machines, make it the right fit for real-time decisioning at scale.
- Atlassian Switches from PostgreSQL to Druid for Customer Analytics
- Data for All: How Twitch Used Imply to Build Self-Service Analytics
- Making the impossible, possible: A GameAnalytics case study
About the guests
Muthu Lalapet is the Director WW Solutions Engineering at Imply where he leads a team of sales engineers helping customers build real-time analytics applications with Apache Druid. Prior to Imply, Muthu was the Principal Data Architect at MapR Technologies, where he assisted customers in all aspects of their Big Data / Hadoop journey including finding the best architectural and design solution to meet their needs for real-time data processing, data flow, analytics and data archival.
David Wang is the VP of Product Marketing at Imply. David is responsible for the company’s positioning, product messaging, and technical content across Apache Druid and Imply-branded products. Previously, he served in leadership roles at Hewlett Packard Enterprise (HPE), Nimble Storage, and GE Digital. David repositioned HPE Storage for cloud data services; drove category creation for predictive flash storage at Nimble; and transformed GE’s municipal lighting business to smart city, analytics applications, leading to its eventual spin-off and acquisition. He holds a B.S. in electrical engineering from the Georgia Institute of Technology and an M.B.A from the University of Southern California.
[00:00:01.130] – Reena Leone
Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with analytics way beyond your basic BI. And I’m talking about analytics applications that are taking data and insights to a whole new level. On today’s episode, we are doubling down on Apache Druid: when to use it, what to use it for, and where it fits in the very crowded database space. Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics, or OLAP queries, on large data sets. When I say large data sets, I’m talking like petabytes of data. Thousands of companies are already using Druid today, from Netflix to Salesforce. But what is Apache Druid best for? What types of projects? What data sets are Druid users working with? And what are companies doing with Druid? Or maybe the better question is, what should they be building with Druid? To answer these questions, I am sitting down with David Wang, VP of Product Marketing at Imply, and Muthu Lalapet, Director of Worldwide Sales Engineering at Imply. David, Muthu, welcome to the show.
[00:01:05.160] – Reena Leone
Today we’re diving into Apache Druid and what makes it unique. It’s kind of like our own How I Built This episode for Druid. So David, I’m going to start with you. When we had Eric Tschetter on here, we talked about the origins of Druid and why it was built. To make it a very short story, at the time, none of the existing databases worked for what they needed. They just weren’t fast enough. Fast forward to today and there are somewhere around 300 databases on the market. What separates Druid from the rest?
[00:01:38.180] – David Wang
Okay, first of all, thanks so much, Reena, for having me on the show. That’s a great question. There are a ton of databases out there, and in reality, it’s about a change in use cases. When people think about analytics, what typically comes to mind are data warehouses. And so there’s obviously been an evolution of the data warehouse market, from the Teradata on-prem appliances, to Vertica, to Snowflake, to Databricks, to Presto and Trino. But all of those technologies fundamentally address the same exact workload: BI, business intelligence, the executive reports that every organization needs and every department uses for historical look-back reporting and dashboards. And that is not what Druid is all about. Druid is for something very special and something very different from that classical analytics workload.
[00:02:29.430] – Reena Leone
So what would Druid be used for, though? If we’re moving away from your traditional BI, what are we looking to use it for? Where is it best suited?
[00:02:42.690] – David Wang
Yeah, I think it really speaks to the three core differentiators that separate Apache Druid from every other OLAP database out there. And we frame this up as analytics meets applications. What does that mean? It’s really about trying to do large-scale aggregation, group-by type queries on high-cardinality, high-dimensional data. But doing that with the responsiveness of an application, doing that with the concurrency that you would expect of a user-facing application, doing that on operational, real-time streaming events. And that’s where it’s something very different. Because when you have a lot of data and you want to be able to crunch that really quickly against a data warehouse, you’re probably going to look at shortcuts like building an OLAP cube, for example, or only analyzing the most recent data, or relying on an aggregation and then slicing and dicing off the aggregation. But when you’re trying to do analytics that involve lots and lots of data, and you have to have that really fast responsiveness, that’s where Druid comes in. And of course, real-time data is a very big element for Druid too. That’s driving a lot of the use of Apache Druid.
[00:03:50.350] – Reena Leone
I was actually just going to ask you why data warehouses, or maybe a data lakehouse, don’t fit the bill. But if we’re talking about a whole new use case, that totally makes sense to me. Muthu, you get to work on the things that David just explained, right? Because you handle proofs of concept for us, and so you get to see some of those use cases manifested in real life. Before we get into examples, when companies come to Imply, what challenges are they facing? What are they asking us for help with?
[00:04:26.650] – Muthu Lalapet
Sure, definitely. Before I get started, I want to let everyone know that I love the How I Built This podcast from Guy Raz. It’s a pretty good one. And I listened to the first episode of Tales at Scale. That was a good one, too. All right, so as part of working at Imply for close to the last four years, I’ve done implementations, and now I’m heading up the sales engineering team, where I get to see all the POCs that we execute on a day-to-day basis. Like I said, I get to see what our prospects and our customers want to do with their data. And mostly it falls under those three buckets that David called out. They want to do subsecond analytics on terabytes, if not petabytes, of data, and they also want to do real-time analytics. Generally it’s with Kafka, sometimes with Pulsar as well. And most importantly, it’s about providing customer-facing analytics. When it comes to customer-facing analytics, your queries have to be really fast. You also have to support concurrent queries. A lot of users will be accessing the application, which means your system should be in a position to support thousands of concurrent queries at the same time.
[00:05:41.850] – Muthu Lalapet
So mostly it falls within these three buckets, if not two of them, when prospects come to us.
[00:05:47.550] – Reena Leone
When we talk about some of these use cases, one that comes to mind off the top of my head is operational visibility at scale, and maybe it’s just because this is the name of the podcast, but let’s go into that one to start us off. When someone is looking to build applications to help with that, what kind of data sets are they working with? What are they trying to solve?
[00:06:12.010] – Muthu Lalapet
Instead of looking into a particular data set or a use case, I’m going to give you some examples around what we are seeing. Okay?
[00:06:20.200] – Reena Leone
I love examples. Go for it.
[00:06:22.730] – Muthu Lalapet
All right, before I talk about examples, let’s talk about what operational visibility really means, right? We have all heard the term operations. Operations is generally the work of managing the inner workings of something, right? The inner workings of a business, a department, a component, or a particular unit inside a company. How do these operations people do their day-to-day work? What they normally do is define some metrics around what they are measuring. They collect the data and they analyze that data. And that’s how they make decisions on a day-to-day basis using these operations metrics. So let me give you some examples. Think about a product manager in a SaaS organization. They would want to know how their product is performing. Is it slow? Where are the customers getting stuck? Maybe they rolled out a new feature yesterday and they want to see how many people are using the new feature. Or they may be doing some A/B test, right? So to do all this, they need to collect some metrics. Normally in this kind of a use case, it is about clickstream events.
[00:07:30.610] – Muthu Lalapet
So they will be collecting clickstream events on the SaaS product: where people are clicking, what things they are using, and what they are not using. If your SaaS product is really popular, this could easily turn into billions of events per day. Let’s think about Confluence. Confluence is an Atlassian product. I’m using Confluence and Jira on a day-to-day basis. From a Confluence perspective, their clickstream data would be something like billions of events for a single day. So we need to build some kind of system that actually measures these metrics, analyzes them, and gives them back to the product manager so they can make some decisions on a day-to-day basis. That’s what we mean by operations. I’ll give you another example from the game developer perspective.
[00:08:13.250] – Reena Leone
Okay, now you’re speaking.
[00:08:16.990] – Muthu Lalapet
You know what, I used to fancy myself as a game developer. I built a game in HTML5 as well, about ten to twelve years ago. There used to be about a thousand people playing every day. And in that game I tracked every event, every word a particular gamer was trying to make. And just with 1,000 users a day, I used to get about 50,000 to 60,000 events a day. Now think about it: if your game is really popular and 20 million people are playing it every day, then you’ll be getting about a billion events a day. So from a game developer’s perspective, what they want to see is where the gamer is dropping off. Are the levels too hard? There could be 20 or 100 levels. What does the funnel look like? A thousand people came in. All thousand played level one. Only 500 went to level two, right? So they want to see all that funnel analytics, the retention ratios, the average revenue per user. All these are different insights that they want to see, and they want to make some decisions on how to make their game better.
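To make the funnel idea concrete, here is a minimal Python sketch. The event shape (a player ID plus the level reached) and the sample data are assumptions for illustration only; they are not GameAnalytics’ schema or anything described in the episode.

```python
from collections import defaultdict

# Hypothetical raw events: (player_id, level_reached).
events = [
    ("p1", 1), ("p1", 2), ("p1", 3),
    ("p2", 1), ("p2", 2),
    ("p3", 1),
    ("p4", 1), ("p4", 2),
]


def level_funnel(events):
    """Count the distinct players who reached each level."""
    players_by_level = defaultdict(set)
    for player_id, level in events:
        players_by_level[level].add(player_id)
    return {level: len(players) for level, players in sorted(players_by_level.items())}


def retention_ratios(funnel):
    """Share of first-level players who reached each later level."""
    entered = funnel[min(funnel)]
    return {level: count / entered for level, count in funnel.items()}


funnel = level_funnel(events)
ratios = retention_ratios(funnel)
print(funnel)
print(ratios)
```

The drop from one level to the next is what tells a game developer where players are getting stuck.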
[00:09:23.090] – Muthu Lalapet
That’s what we mean by operations. You’re looking into data about your product or your game, about something that you are measuring on a day-to-day basis, on a real-time basis, and trying to make some critical decisions. And then we talked about operations visibility. What is visibility? A game developer or a product manager is not a day-to-day data analyst. They aren’t going to be running SQL queries, or writing 20 or 30 lines of SQL code and waiting five or ten minutes to get the results back. They want data to be presented in a way that is easily accessible and easily analyzed. And it has to be really, really fast from a query perspective so that they can analyze the data, make some decisions, and move on. So that’s what we mean by operations visibility. Before I go to examples, I’m going to talk about the things that are needed to build an operational visibility application.
[00:10:20.540] – Reena Leone
It’s like you took the question right out of my mouth. Like what do people need to build this?
[00:10:25.190] – Muthu Lalapet
Like I said, with my simple game, 1,000 people are playing and it’s only about 50,000 events a day. But if millions of people are playing, it’s about a billion events a day. The first thing is you need to have a mechanism to collect this massive amount of data. That’s the first thing that you need. The second thing is that most often this is real-time data, because the analysts or the product managers want to see what’s happening right now and then make the decision. So you need a real-time way of collecting these massive amounts of data. That’s the second thing that you need. The third part is that these operational analysts or product managers are making decisions by looking at this data, right? So the queries have to be really responsive. They have to be really fast. They don’t have five or ten minutes to sit there and wait for the results. They want to know right away. The fourth part: think about bigger game studios like EA or Tencent. There could be thousands of product analysts and thousands of game managers who are looking into the data. That means your system should be in a position to support concurrent queries from a lot of users.
[00:11:28.960] – Muthu Lalapet
Higher concurrency is very important. And finally, for the visibility part, you need to have an easy way to present this data back to them. An easy way so that they can understand and look at the data and make sense right away. So these are five things that you need to build a real good operational visibility application.
[00:11:46.690] – Reena Leone
So we’ve been talking about the hypothetical, but do you have any examples of current Imply customers or folks that have done this at scale?
[00:11:56.330] – Muthu Lalapet
Oh, yeah, absolutely. A lot of our Imply customers are doing this at scale. I so fancy myself as a game developer right now, so let’s talk about the gaming thing. One of our customers is GameAnalytics. GameAnalytics, for those of you who don’t know, is the number one analytics tool for game developers. They provide insights for your games. If you are a game developer out there, you should definitely check them out. It’s a great tool. Let me give you some stats around GameAnalytics. In their monitoring system, they monitor about 100,000-plus games. On any given day, there are about 200 million daily active users playing the games that they are monitoring. Like I said, in my game, with 1,000 users, it’s about 50,000 events, and with 20 million users, it could be 1 billion events. Think about 200 million daily active users playing games. That’s what they’re monitoring with the help of Druid. And they’re getting 20 billion events per day. 20 billion events per day. And they do this analytics on about ten-plus terabytes of data per day. That’s the scale that we’re talking about here.
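The back-of-the-envelope numbers above can be checked with a few lines of Python. This sketch uses only the figures quoted in the conversation; the per-second rate is a simple daily average and says nothing about peak load, which is an assumption worth keeping in mind.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_user(events_per_day, daily_users):
    """Average number of events each daily user generates."""
    return events_per_day / daily_users


def average_events_per_second(events_per_day):
    """Daily volume spread evenly over the day (ignores peaks)."""
    return events_per_day / SECONDS_PER_DAY


# Muthu's own game: about 1,000 users and 50,000 events a day.
small_game_rate = events_per_user(50_000, 1_000)

# The same per-user rate applied to 20 million players.
projected_events = 20_000_000 * small_game_rate

# GameAnalytics: 200 million daily active users, 20 billion events a day.
ga_rate = events_per_user(20_000_000_000, 200_000_000)
ga_events_per_second = average_events_per_second(20_000_000_000)

print(small_game_rate, projected_events, ga_rate, round(ga_events_per_second))
```

At the quoted volume, the system has to absorb on the order of a couple of hundred thousand events every second, on average, while still answering queries.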
[00:13:03.070] – Reena Leone
That’s crazy. I’m trying to think of it, just trying to go through that without something like Druid would be a lot. I’m understating it there.
[00:13:13.680] – Muthu Lalapet
It would be a lot. But they tried to do this without Druid. Let me talk about what they did before and what the challenge was with that. And then we’ll talk about how we fixed it, right? So before, all this raw data, the raw events, came in from these games. When you play a game, the game developers are collecting lots of stuff behind the scenes: what you are clicking, what level you’re playing, all of that. That’s what we call a raw event. These raw events went into their real-time system. Before Druid, they had a homegrown tool for real-time processing. They did a lot of precomputation in that tool and stored all the data in DynamoDB. For those of you who don’t know, DynamoDB is a key-value store that is really good for fast writes and also fast reads on a single-key basis. But that’s not what analytics is all about. So anyway, they used this key-value store, DynamoDB, and they stored all those precalculated events in it. The key thing is the precomputed results.
[00:14:22.650] – Muthu Lalapet
So what I mean by precomputing is determining the queries the users are eventually going to make. You are trying to be some kind of psychic here, right? You’re trying to determine what queries the users are going to make, then you calculate the results of those queries ahead of time, and then you store those results in DynamoDB for faster retrieval. The reason you’re doing that is that DynamoDB is not good for analytics purposes. It’s really good for fast reads and fast writes, no doubt, but not for analytics. They cannot do group-bys, they cannot do any dimensional filtering. It is really good for single-record retrieval, not for doing analytics across multiple records. So they had to do a lot of this precomputation well beforehand.
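The trade-off between precomputing answers and aggregating at query time can be sketched as follows. A plain Python dictionary stands in for the key-value store here; it is an illustrative assumption and does not use the DynamoDB or Druid APIs, and the event fields are hypothetical.

```python
from collections import Counter

# Hypothetical raw game events.
raw_events = [
    {"game": "puzzle", "country": "US", "level": 1},
    {"game": "puzzle", "country": "US", "level": 2},
    {"game": "puzzle", "country": "DE", "level": 1},
    {"game": "racer", "country": "US", "level": 1},
]

# Precomputation: guess the question ("events per game") ahead of time
# and store each answer under a key.
precomputed = dict(Counter(event["game"] for event in raw_events))


def lookup(key):
    """Key-value style read: fast, but only for questions guessed in advance."""
    return precomputed.get(key)


def ad_hoc_group_by(events, dimension, **filters):
    """Aggregate raw events at query time on any dimension, with optional filters."""
    matching = (e for e in events if all(e[k] == v for k, v in filters.items()))
    return dict(Counter(e[dimension] for e in matching))


print(lookup("puzzle"))                     # answered from the precomputed store
print(lookup(("puzzle", "US")))             # a question nobody precomputed
print(ad_hoc_group_by(raw_events, "country", game="puzzle"))
```

The second lookup comes back empty because that combination was never precomputed, which is the "psychic" problem in miniature; the ad hoc aggregation answers it directly from the raw events.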
[00:15:10.200] – Reena Leone
Well, that makes sense too. And that’s why there are so many different databases on the market right now, because there’s no one database to rule them all, right? There’s like several that do very specific things. And that’s why Druid has its own sort of fit versus DynamoDB.
[00:15:26.720] – Muthu Lalapet
Absolutely. Not only that, in the previous solution they also had another system for supporting historical queries. So there was one for real-time queries and one for historical queries: two different systems to support real-time and historical. Supporting higher concurrency was also an issue. And they faced challenges with query response time and with the amount of work that needed to be done in precalculating. As the scale went up, the time it took to precalculate the results also went up. That means they were not providing any real-time insights. The game developers, the game managers, and the product managers could not make decisions by looking at the data, because they were looking at data that was one or two days old, not real-time data. That’s the challenge they faced.
[00:16:16.490] – Reena Leone
And I mean, if there’s anybody that’s the most patient group of people, it’s people playing games, immense amount of patience when things go wrong.
[00:16:29.610] – Muthu Lalapet
So that’s where Druid comes in, right? Druid is really good, like we said before, at providing subsecond analytics on terabytes of data, real-time analytics, and higher concurrency. They looked into multiple databases. One of the databases they looked into was a time series database. Time series databases are really good at operating on the time dimension, but not necessarily good at operating on dimensions other than time, like filtering on other dimensions and grouping by other dimensions. They aren’t good at that. And they looked at Druid. We helped them evaluate Druid as well. The major reason why they selected Druid is, number one, you can do real-time and historical analytics in Druid. You don’t have to have two different systems, which reduced a lot of cost for them in preparing those ETLs, maintaining those systems, and things like that. Along with that, as you know, Druid provides direct integration with Kafka and Kinesis. They had been using Kinesis for the real-time pipeline, and we plug directly into Kinesis and can ingest the data right away. So when an event shows up in Kinesis, you can query that event in about a second, if not less than a second.
[00:17:45.980] – Muthu Lalapet
And the third reason why they chose Druid is that you do not have to precompute all those results. From my perspective, that is the biggest winner. Because when you precompute results, you can do it for maybe 20 million records. But think about precomputing results for 20 billion records. It’s going to take time. You do not have to do that. They completely dropped that component from their stack after they implemented Druid, because Druid already takes care of that for you on the ingestion side and also on the query side. These are the three reasons why they use Druid.
[00:18:19.810] – Reena Leone
I mean, that sounds like a win-win all around, not to make another gaming pun. So I want to shift gears for a minute. We talked about operational visibility at scale, which is more of an internal process, right? But another reason that people are coming to us for expertise on Druid is for customer-facing analytics. They are providing a service for their customers to analyze and visualize their data. Can you tell me a little bit more about that?
[00:18:50.910] – Muthu Lalapet
Yeah, sure, absolutely. Customer-facing analytics is a really important use case for us, and it is more prevalent than we think. To be honest, we as individual consumers are using a lot of customer-facing analytics tools on a day-to-day basis. I’ll give you some examples from my own personal life, right? We have all used Google Maps, and are still using Google Maps. And Google Maps has a little-known feature called Timeline. If you have enabled it, it records wherever you go, and you can go back in time and see where you went and things like that. I was looking at it a couple of days back, and I’ve been using Google Maps since 2010. That’s about twelve years, right? So I clicked on Timeline, and it showed me a little analytics on my timeline. It told me that in the last twelve years I visited about 2,000 places, and it gave me my top ten as well. And guess what? I know I really like one of the small restaurants here in the Valley. I live in Phoenix. It’s a small kitchen in the back of a grocery store.
[00:20:00.830] – Muthu Lalapet
I know that I go there quite often. But you know what, I’ve been there about 190 days in the last ten years. That’s like half a year. That’s a lot.
[00:20:10.770] – Reena Leone
You’re like one of their best customers. I wouldn’t want to know that. I don’t want to know how many times I go to Whole Foods in a week.
[00:20:16.200] – Muthu Lalapet
So that’s an example of a consumer facing analytics, right? I’m using Google Maps and they’re giving me analytics on where I’m going and how often I’m going there and what my top ten is. Another example I can give you is YouTube. I’ve been using YouTube and there’s a feature in YouTube mobile app called Time Watched. If you click on that, it will tell you how much time you have been watching and what your viewing behavior is. I kind of looked into it and then my viewing behavior is about 23 or 25 minutes a day. I’ve been watching YouTube for like 30 minutes a day. That’s my viewing behavior, right? So these are examples of consumer facing analytics. I think I’m giving a lot of information about my personal YouTube behavior and things like that.
[00:20:59.110] – Reena Leone
First of all, 23 minutes on YouTube is like nothing. So you’re not giving too much information. I can’t imagine how many people are listening right now who didn’t know about these things, are probably looking them up right now and are very surprised because I’m going to do that as soon as we’re done with the show.
[00:21:14.070] – Muthu Lalapet
Definitely. I want to give one more example because this relates to operational visibility too. When we talked about operational visibility, what did I say? Operations is about looking into something and trying to make sure that something is working optimally, right? Lately I have been very interested to see how my body is operating. I want to see what’s inside my body and how that’s operating. To do that, you have to collect some metrics, right? So I’m collecting some metrics about my body on a day-to-day basis, on a real-time basis. I’m using something called a continuous glucose monitor. It’s from a company called Nutrients. Again, I’m not diabetic or anything like that, but I want to monitor my food intake and the impact it has on my glucose levels so I can alter my eating behavior and food habits and things like that. So I’m doing my own operational visibility on my own body right now. But anyway, this also ties back to consumer-facing analytics. This nutrition app gives me a really good dashboard on a day-to-day basis and tells me how I’m doing: what my metric is, what it was two days back or five days back, my weekly average, my monthly average.
[00:22:21.930] – Muthu Lalapet
And what food is bad for me, what food is good for me. It gives me a lot of good analytics. These are the examples of consumer-facing analytics that I’m talking about, right? So me as an individual, I’ve been using all these apps, and I’m used to all this consumer-facing analytics. Now let’s say I go to the office. I’m working from home, but I go to my office room, log in to Imply, and use a SaaS product. Let’s say Confluence or Jira, from Atlassian. I create a wiki in Confluence, let’s say. If I don’t see analytics on my wiki, I will be super bummed, because on a day-to-day basis I’m seeing all my viewing behavior, my glucose behavior, my sleep behavior. But now I’m using something at my workplace, and if that something is not giving me analytics based on what I’m doing, I’ll be really irritated. But good for Confluence, they do provide the analytics for their product. So that’s what we mean by user-facing analytics.
[00:23:25.570] – Reena Leone
You’re the most optimized human being I think I’ve ever met. Fully optimized. David, you’re training for a triathlon. Maybe you need some of these customer-facing analytics to optimize your triathlon performance.
[00:23:43.350] – David Wang
Absolutely. I mean, I obviously do look at all that data as well. I think there’s something to be said, though, about how that changes requirements, right? If you are a developer or architect charged with building an externally facing analytics application, does that actually change the requirements on what your backend data store needs to do? And I think that’s something that’s really important for us at Imply, because as we think about how people use Druid, that’s a really important choice point. Muthu, tell me if you think otherwise, but I do think the requirements on the underlying data store are much different when you’re talking about serving customers than when you’re simply serving internal stakeholders.
[00:24:29.370] – Muthu Lalapet
Absolutely. Yeah, that’s what I’m going to talk about next, right? So what do you need to do to build this kind of customer-facing analytics application? The first thing is you need a system that supports low-latency queries. No one wants to open up the YouTube app, click on something, and wait for ten seconds, or even five. Otherwise you’re going to go to some other streaming system, right? Same thing here. It has to be low latency. A query should be really fast. I don’t want to open up an app, go to the analytics section of the app, and wait there for 10 seconds. That’s number one. Number two, higher concurrency. Think of these apps, right? There are lots of sleep tracking apps, and almost everybody is using those nowadays. A lot of people are using Fitbits or Ouras. So these apps are going to get thousands of queries at any given moment, in any given second. These apps should be in a position to support higher concurrency, even for the analytics workload. Third, you have to have an easy way to present the analysis back. You’re getting a small screen in a mobile system or in a web system, and you’re also talking about general consumers here and in the other examples.
[00:25:40.660] – Muthu Lalapet
But from an enterprise perspective, the consumers we are talking about are product managers or analysts. You need to have an easy way to take that massive amount of data, synthesize it, and present it in a concise way. The fourth one is that in almost all cases, it is real-time data. They want to know what’s happening right now. Like the glucose monitor I mentioned: I want to know what’s happening right now. I just ate something in the morning. Maybe I ate a cookie. I want to know what impact that has on my body from a glucose level perspective. So real time is a really critical part of these things. So these four are really important: low latency, higher concurrency, real-time data, and an easy way to present this data back to the consumer. I’m going to move on to giving a specific example.
[00:26:26.970] – Reena Leone
I was just going to ask you that. I think one that might come to mind, I mean, I know we’ve mentioned them a few times is Atlassian, right? Because spoiler, they’re a customer. So we have some good insights on there. But can you use them as an example?
[00:26:42.020] – Muthu Lalapet
Absolutely. So, Atlassian. I think everyone knows what Atlassian is. For those of you who don’t know, they have a product called Confluence. They have other products like Jira too, but let’s focus on Confluence. Confluence is Atlassian’s cloud-based wiki and collaboration tool that they sell to enterprises. We are a Confluence customer at Imply, and every team uses Confluence to create wiki pages. One thing that Confluence provides is this: when you go to a wiki page that you or anyone else created, you will have noticed that right below the title, it says how many people have viewed the page. If you hover over that or click on it, you get more detailed analytics, like the views over time, who the top contributors are, and who the top commenters are. So you get all this analytics on your wiki page. That entire analytics is backed by Druid. Like I said, that analytics application should be in a position to support higher concurrency. Think about Confluence. Confluence has 250,000 customers all across the world. If you assume there are about 40 people at each of them using it, that’s about 10 million individual users.
[00:28:00.560] – Muthu Lalapet
And if just 10% of them click on that analytics at any given moment, that’s about a million queries that are going to hit the system. So they need to support higher concurrency.
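The back-of-the-envelope estimate above can be written out as a short calculation. This is only a sketch of the arithmetic quoted in the conversation: the 40 users per customer and the 10% click rate are the speaker's illustrative assumptions, not measured figures.

```python
# Back-of-the-envelope estimate using the figures quoted in the episode.
customers = 250_000        # Confluence customers (quoted in the episode)
users_per_customer = 40    # assumed average, as in the episode
click_percent = 10         # assumed share of users opening page analytics

total_users = customers * users_per_customer
analytics_queries = total_users * click_percent // 100

print(f"{total_users:,} individual users")
print(f"{analytics_queries:,} analytics queries if {click_percent}% click")
```

With these assumptions the estimate comes to 10 million users and about one million analytics queries.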
[00:28:10.080] – Reena Leone
I was going to say, weren’t they on Postgres before? I have a feeling that before Druid they were on Postgres.
[00:28:16.370] – Muthu Lalapet
You’re correct. They were on Postgres before. I’ll talk about what challenges they faced, but let me talk about the scale first so that people understand the challenge.
[00:28:24.210] – Reena Leone
The name of the show. Tell me all about the scale.
[00:28:26.310] – Muthu Lalapet
Exactly the name of the show. Right? So that’s why I want to focus on the scale. So the first scale I talked about is like millions of queries coming at any given point in time. The second thing is that Confluence is a cloud-based product. It’s a SaaS product. It’s a web-based product, a web native product. So your system should be really fast. It should be in sub-second. So when you click on something, it should respond right away. So sub-second query response time is really important to their use case. The third thing is how are they making all this analytics happen? They’re collecting all the clickstream data, they’re collecting on who’s clicking on the Wiki page, what is the top Wiki page, how many people are contributing? So that means clickstream data. It’s a massive amount of data and real time as well. So higher concurrency, low latency queries and real time. All these are the three different things that they need to solve. But before coming to Druid, they were using Postgres. Like you said, Reena, the architecture looks something like this. They collected clickstream events from Confluence. They pushed it into what they call a real time processing hub.
[00:29:29.650] – Muthu Lalapet
They were also using Kinesis, and they did some processing on top of it. They augmented some data and pushed the data back to Kinesis. And then there are some service processes. Think of them as ETL. They take the data out of Kinesis, do some ETL work, prepare the data for Postgres, and then write it to Postgres. That’s what their architecture was before Druid. But I’ll tell you what, there are challenges in this architecture, like what we discussed about Game Analytics. Number one, Postgres is an OLTP database. It’s not an analytical processing database. Any OLTP database is good at processing a single row. What I mean by a single row is, I want to select an employee record where employee ID equals 51. By the way, that’s my employee ID at Imply. So that’s a single record. They’re good at processing a single record, but they’re not good at aggregating or group-by queries. I’m not able to give an example on the fly right now, but they are not good at aggregating and grouping queries where they have to munch through a large number of records.
[00:30:45.780] – Muthu Lalapet
But that’s what the Confluence use case is all about. So OLTP databases are really not good at processing larger sets of data. Because of that, they were getting slower query response times. A lot of queries were performing in the 60-second, if not 70-second, range, and their entire Confluence analytics page was showing the spinning wheel of death, or whatever we call that. It’s still loading. Still loading.
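To make the two query shapes concrete, here is a minimal sketch contrasting a single-row lookup with a group-by aggregation. It uses Python's built-in SQLite purely as a stand-in so the example runs anywhere; it is not Postgres or Druid, and the `page_views` table and its columns are hypothetical.

```python
import sqlite3

# In-memory stand-in database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_views (id INTEGER PRIMARY KEY, page TEXT, user_id INTEGER)"
)
rows = [(i, f"page-{i % 5}", i % 100) for i in range(1, 10_001)]
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", rows)

# OLTP-style access: fetch one record by its key (touches a single row).
point = conn.execute(
    "SELECT id, page, user_id FROM page_views WHERE id = ?", (51,)
).fetchone()

# Analytical access: aggregate and group, which must read every row.
top_pages = conn.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC, page LIMIT 3"
).fetchall()

print(point)
print(top_pages)
```

The first query can be answered from a primary-key index, while the second has to scan the whole table. That scan is the kind of work that grows with data volume and that analytical databases are designed around.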
[00:31:06.140] – Reena Leone
We all live in fear of the spinning wheel of death.
[00:31:09.030] – Muthu Lalapet
That’s one of the challenges they faced with Postgres DB. The second challenge is that Postgres DB does not support real-time data. I told you about that existing architecture, where the data is coming into Kinesis and they have an ETL that takes data out of Kinesis, does some ETL work, prepares the data, and puts it into Postgres. So they need to do preprocessing in that ETL step to prepare the data out of Kinesis and then write it to Postgres. They’re spending a lot of resources on that. When I say resources, it’s not only about machine resources, it’s also about human resources, right? Maintaining, creating, developing, bug fixing, all that. They’re spending a lot of time on that as well to prepare the data. The third challenge that they faced is concurrency. When they were supporting maybe thousands of customers, that was fine, but now they’re supporting 250,000 customers worldwide, and Postgres is not meeting this challenge when it comes to concurrency. These are the three challenges that they faced in building that application.
[00:32:03.340] – Reena Leone
With all these examples, one thing that I can’t overstate is that we’re talking about massive amounts of data, right? We’re talking about huge data sets, especially now with the prevalence of real-time data. And so one thing that Druid is being used for, and that Druid is well suited for, is rapid drill-down and exploration, right? Because you need to kind of figure out what’s going on, and how do you even deal with that? Do you have any examples that you could share about rapid drill-down or how that would work? Because I feel like that’s pretty critical to how people make decisions, and it seems to be a challenge that has been coming up more recently.
[00:32:47.610] – Muthu Lalapet
Sure, we can definitely talk about that, but I want to address how Druid solved Confluence’s challenges, and then I will definitely talk about rapid drill-down. So I told you that with Postgres DB they were doing a lot of preprocessing to prepare the data to ingest into Postgres DB from Kinesis. But in Apache Druid you have a direct connector to Kinesis. You don’t have to do all the preprocessing. You simply point to the Kinesis stream, say what the data looks like, define your schema, and you can do a little bit of transformation. It’s not a problem at all. Druid can ingest the data right away, and it will make that event queryable in less than a second, so that you can run analytics queries on that real-time data combined with historical data as well. The second thing I talked about is concurrency. Like I said before, there are about 250,000 customers that they support, and that could easily be about 10 million individual users. And they’re getting close to a million queries a day. A million queries a day. And Druid is really capable of handling all this concurrency because of the unique scatter/gather approach that we built into the Druid system.
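As a rough illustration of "point to the Kinesis stream and define your schema", the sketch below builds a Druid Kinesis supervisor spec as a Python dictionary. The datasource name, stream name, column names, and region endpoint are all hypothetical, and the exact set of supported fields depends on the Druid version, so treat this as an assumption-laden outline and check it against the Kinesis ingestion documentation before use.

```python
import json


def build_kinesis_supervisor_spec(datasource, stream, endpoint):
    """Return a minimal Kinesis supervisor spec (all names are hypothetical)."""
    return {
        "type": "kinesis",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Which input field carries the event time.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                # Hypothetical clickstream columns to index as dimensions.
                "dimensionsSpec": {
                    "dimensions": ["page_id", "user_id", "event_type"]
                },
                "granularitySpec": {
                    "segmentGranularity": "hour",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "stream": stream,
                "endpoint": endpoint,
                "inputFormat": {"type": "json"},
                "useEarliestSequenceNumber": True,
            },
            "tuningConfig": {"type": "kinesis"},
        },
    }


spec = build_kinesis_supervisor_spec(
    datasource="confluence_page_events",      # hypothetical datasource
    stream="example-clickstream",             # hypothetical Kinesis stream
    endpoint="kinesis.us-east-1.amazonaws.com",
)
payload = json.dumps(spec, indent=2)
print(payload)
```

A spec like this is typically submitted to the Overlord's supervisor endpoint (`/druid/indexer/v1/supervisor`). The submission step is left out here so the example runs without a cluster.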
[00:33:57.530] – Muthu Lalapet
And the third issue that they faced: like I said before, Postgres is an OLTP database. It’s really good at processing a single row. Whereas Druid is a columnar database. The fundamental design of Druid is a columnar database. And we are engineered to process multiple records at the same time and provide all the analytical functions, like aggregates and group-bys and filtering, on top of it. So using Druid, they’re able to solve all three of those challenges. The queries are responding in less than a second right now. That’s a clear win for Atlassian, for their customers, and also for end users like me, who use Atlassian on a day-to-day basis.
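The scatter/gather idea mentioned in this answer can be sketched in a few lines: a query is fanned out to several chunks of data, each chunk computes a partial aggregate in parallel, and the partial results are merged. This is a toy illustration of the general pattern, not Druid's actual implementation; the `segments` data and function names are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data split into three chunks ("segments") of (page, views) rows.
segments = [
    [("page-a", 1), ("page-b", 1), ("page-a", 1)],
    [("page-b", 1), ("page-c", 1)],
    [("page-a", 1), ("page-c", 1), ("page-c", 1)],
]


def scan_segment(segment):
    """Scatter step: compute a partial views-per-page count for one chunk."""
    partial = Counter()
    for page, views in segment:
        partial[page] += views
    return partial


def scatter_gather(all_segments):
    """Fan the scan out in parallel, then merge (gather) the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(scan_segment, all_segments))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


result = scatter_gather(segments)
print(dict(result))
```

Because each chunk is scanned independently, adding more workers lets more chunks be processed at once, and the merge step only handles small partial results.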
[00:34:37.970] – Reena Leone
Let me just recap there. So pre-Druid sometimes it could take up to 60 or 70 seconds. And now they’re running at sub second.
[00:34:46.760] – Muthu Lalapet
Now they’re running about close to a second. Close to a sub second. It’s about a second like that.
[00:34:51.820] – Reena Leone
Wow. That is a massive improvement.
[00:34:54.770] – Muthu Lalapet
[00:34:55.680] – Reena Leone
Okay, so I’m jumping ahead because I want to pick your brain literally about every possible scenario that we can use Druid for, where it performs best. So, like I said, let’s get into rapid drill down. I have data on the brain right now. And so how does that work with Druid? How is Druid suited for that? Especially when we’re dealing with, like I said, massive data sets?
[00:35:22.490] – Muthu Lalapet
Sure. Before I give you examples, let’s talk about why we need this capability, why we need rapid drill-down or self-service analytics at all. So, a couple of years back, I worked for a big insurance company, a great company. I still love them. I was helping them build a data warehouse system. This is what you generally see in any enterprise: there will be a data analyst team whose day-to-day job is to run queries against a data warehouse system using some BI tools and prepare data. On my team at the time there were five or six data analysts, but that team would be supporting the entire organization: product managers, operational analysts, actuaries, claims analysts, insurance product analysts, pricing analysts. So there are different kinds of analysts in an insurance organization. And their job is to make decisions on how to price a product, how to price a coverage, how to adjudicate a claim. Actuaries are looking into the future and trying to design the product for the future based on the data they see today.
[00:36:42.430] – Muthu Lalapet
So their day-to-day job is not to run some SQL code. Their day-to-day job is to make decisions based on the data that they have. So what they normally do is raise a request to my team, the team of data analysts, and say, hey, I want to see the policies in California in this particular county, because a home fire happened, let’s say. Give me the list of policies for the last year. The data analyst would then go and run a SQL query on the data warehouse, get a list of policies in a CSV or an Excel file or something like that, and share the data back with the product analyst. So this takes maybe 2 hours, 3 hours, if not a day, right? Like I said, we had only about six data analysts supporting the entire organization. So sometimes the data analysts are busy working on something else, and the requests have to be queued. Now the product analyst looks into the data and says, this is great, but by the way, can you give me data only for this zip code, and possibly only people 40 plus?
[00:37:44.040] – Muthu Lalapet
Because that’s our demographic right now. Two days have gone by. The data analyst now goes and looks at the data, runs some queries, and gets the data back again. So it goes back and forth like this for a couple of days, if not a couple of weeks, until the product analyst gets the data that they need to make the decisions. This is really inefficient. It was about twelve years back that I was doing that. I was part of the data analyst team, and the team I was managing was doing that. So that was the general workflow. I see it in enterprises even today. When you go to any enterprise, you can see the data leaders or the CIO or the CTO giving big mission statements, saying that they want everyone to make decisions based on data and that they want to empower everyone in the enterprise to make appropriate decisions backed by data. That’s a great statement to make, and almost everyone says that nowadays. But generally what happens in the enterprise is what I described before, right? There’s a data analyst, and you spend a couple of days, if not a couple of weeks, to get the data and make decisions.
[00:38:51.890] – Muthu Lalapet
Yes, they are making decisions on data, but slowly. It’s taking time. And by the time they make the decisions, the market has moved on. So they are making decisions on something that is not valid anymore. And lastly, they also need to look into massive amounts of data to make the decisions. So that’s why self-service, rapid drill-down analytics is really important: so that we can empower those product analysts, operational analysts, and, in the insurance company, like I said, the claims analysts to make the decision right away with the data they have, which they can access directly rather than depending on a data analyst.
[00:39:34.630] – Reena Leone
No, that makes sense. I mean, back when everything was more batch oriented, right, and most people were using something like Tableau, maybe that would have worked out a little bit better. But now there’s no way that that is sustainable. It’s kind of funny. That wasn’t even that long ago. And thinking about it now, I can’t imagine how hard it must be to try to live a data-driven culture within an organization while having to do things so manually.
[00:40:04.250] – Muthu Lalapet
Exactly. Yeah. And it’s happening even today too. You’d be surprised. And lots of enterprises do that today as well.
[00:40:11.570] – Reena Leone
Not like I’m trying to put data analysts out of a job or anything. I’m just saying.
[00:40:17.590] – Muthu Lalapet
Let me tell you the challenges. Before we go to the challenges, let me tell you what’s needed to build that kind of self-service analytics system, or rapid drill-down system. Let’s go back to the insurance company. I worked there for a couple of years, so I know exactly how we operated there. Even in an insurance company, you’ll be dealing with massive amounts of data, terabytes of data. Think about insurance companies nowadays. They are giving you devices to monitor your vehicle usage and pricing your insurance based on that. That means they’re getting data about your usage of your vehicle, real-time data about your driving patterns. That’s a massive amount of data. So the first thing that they need is a system to analyze this massive amount of data. The second thing is low-latency queries. If you’re giving a self-service analytics system or a rapid drill-down system to an end user, like a product analyst or an operational analyst, they don’t want to click on some widget and wait for 10 seconds or 20 seconds. They want to make the decision right away.
[00:41:25.630] – Muthu Lalapet
So they want to see the query come back immediately, in sub-second time if possible. And number three, again, higher concurrency. You might have only six data analysts. It’s okay if the queries perform slowly if you have only six people running massive queries. That’s completely fine. But think about 200 different analysts, product analysts, pricing analysts, claims analysts, actuarial analysts in the insurance company, let’s say, 200 different people asking the system at the same time to get some insights. That’s a lot of queries. And that’s a small insurance company, 200 people. Think about bigger companies, where there are thousands of people querying the system on a daily basis. So concurrency is really, really key. The last part is to have a unified and consistent user interface that can provide you the self-service and rapid drill-down capabilities on the data. These are the four things that you would need to build that kind of a system. So far we have talked about what self-service is and what rapid drill-down really means. We addressed the challenges and what’s needed to build this kind of a system. I’m going to give you an example of one of our customers who did that using Imply. Twitch is one of our customers.
[00:42:29.340] – Muthu Lalapet
For those of you who don’t know, Twitch is an Amazon company, and I think most of us know Twitch. Twitch is a platform for live video streaming. It started off with live video streaming, but now you can do kind of anything live, like cooking shows and music shows and live Q&A and things like that. But the leading driver of Twitch web traffic is video game live streaming. And they also conduct live esports and things like that. It’s a pretty good site. To give you an understanding of the scale, and I want to talk about the scale because our podcast name is Tales at Scale, right?
[00:43:06.440] – Reena Leone
I love that we keep bringing it back to that.
[00:43:08.970] – Muthu Lalapet
Exactly. Yeah. The numbers here are stats from about two years back. It could be much more now, after the pandemic and everything. It’s about 4 million plus unique creators streaming each month, and 20 million plus average daily visitors. 20 million plus average daily visitors. That’s what we’re talking about here. And in terms of actual event data, it’s close to 10 billion events per day. 10 billion events that they’re tracking, not on a monthly basis, on a daily basis. That results in about 1.3 terabytes of raw data per day. And there are about 70,000 to 100,000 queries a day that they need to support internally. I’m not talking about externally. They want to give analytics to the internal product analysts. It’s about 70,000 to 100,000 queries a day that they want to support. That’s the kind of scale that we’re talking about here.
[00:44:04.750] – Reena Leone
I can’t even process those numbers. But everything on the consumer side just seems like everything flows so easily. We never think about on the user side how many events we’re creating. Right, that’s true.
[00:44:17.350] – Muthu Lalapet
I want to stress the number of events, because that’s in the billions. Like I said, the small game that I developed was generating 50,000 events per day. Here you have 20 million visitors, and you’re tracking almost everything: where they are pausing, where they’re forwarding, where they are increasing the speed of the video. All those events you’re tracking result in about 10 billion events a day. That’s massive. First of all, to build that system, you need a system that can ingest those 10 billion events a day seamlessly, without any issues. In some systems you can ingest, but you won’t be ingesting the latest data. You’ll be ingesting data about something that happened 30 minutes, if not 2 hours, back. But they want to see what’s happening right now. So you want to have a system that actually ingests the data and allows the users to query the data right away. That’s really, really key. The second thing they need is a system that can query across terabytes of data. A single day is 1.3 terabytes of data. Think about six months. That’s easily running into hundreds of terabytes of data, 200 terabytes if not more.
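The scale figures quoted here can be turned into a quick estimate of average ingest rate and six-month data volume. This is a sketch of the arithmetic only: treating six months as 180 days and assuming events arrive evenly through the day are simplifications, not figures from the episode.

```python
# Scale figures quoted in the episode.
events_per_day = 10_000_000_000   # about 10 billion events per day
raw_tb_per_day = 1.3              # about 1.3 TB of raw data per day

# Simplifying assumptions for the estimate.
seconds_per_day = 24 * 60 * 60
days_in_six_months = 180

avg_events_per_second = events_per_day / seconds_per_day
six_month_raw_tb = raw_tb_per_day * days_in_six_months

print(f"average ingest rate: {avg_events_per_second:,.0f} events/second")
print(f"raw data over six months: {six_month_raw_tb:,.0f} TB")
```

Under these assumptions the average works out to well over 100,000 events per second, and six months of raw data lands a little above 200 TB, which is consistent with the "hundreds of terabytes" described above. Real traffic is bursty, so peak rates would be higher than the average.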
[00:45:21.910] – Muthu Lalapet
So you need to provide a system that can allow the analysts to query across these data sets. The third thing, like I said before, is concurrent users: 70,000 to 100,000 queries a day. That’s a lot of queries that they’re going to fire on a single day. Lots of concurrent queries, lots of concurrent users, and you need to have a system that can support that. And finally, you need a system with a consistent user interface and easy data exploration, so they can look into not only what’s happening today but compare it to what happened yesterday, or one week back, or one month back. An easy and consistent user interface to support this kind of data exploration is also really key. And Druid is the database for doing this. They are using Druid today. They are one of our customers. On top of Druid, they also use Imply Pivot. Pivot is an Imply product. It’s an easy tool for self-service analytics and rapid drill-down exploration on data that’s backed by Druid. Inside Twitch, everybody, including the CEO of Twitch, uses Pivot on a day-to-day basis to look at the metrics of what’s happening right now in the Twitch ecosystem.
[00:46:30.820] – Reena Leone
I think this is like our first Pivot namedrop. Can you explain a little bit about what Imply Pivot is?
[00:46:37.550] – Muthu Lalapet
Absolutely. Imply Pivot is one of our products. Imply Pivot provides a visualization layer on top of Druid, and it is best suited for self-service analytics on Druid data. It also provides rapid drill-down capabilities, and it provides multiple visualization options that you can use to create dashboards and data cubes. Not only that, you can use Pivot to share your dashboards with your team. You can embed Pivot in your own applications. Some of our customers have white-labeled Pivot and exposed Pivot to their customers too. That’s also possible.
[00:47:20.650] – Reena Leone
Awesome. I figured like, I know what that is and David, you know what that is. But just for our listeners, just wanted to make sure that we got that well defined. But one thing, so we’ve been talking obviously about a lot of data. Part of the reason for that is a lot of streaming data and that’s not going to stop anytime in the future. And then on top of that is the need for there’s a lot of real time data, right? So not things that get ingested overnight, things that are happening in real time. And when we talk about real time, I’ve had a lot of work with Real Time Decisioning at my last company. We had a customer decision engine that collected and analyzed customer data using AI and machine learning. And one of the things that you could do is as you navigated a website, it helped create propensity models and propensity scores to figure out how to customize the experience as you browsed. Right, so among other things. But the data, it was collecting real time data, combining it with historical data, and then depending on the company, that could be just massive amounts of data.
[00:48:28.680] – Reena Leone
If you think about global brands, that could have any number of customers on their digital properties at any given time. Right? That’s just one small example. But everything we’ve talked about today, I feel like has a layer of real time data. And that’s another way that I think Druid really shines. And we talk about how data, we need real time data to inform decisions, not just in say, like this very specific or real time decision engine example. But can you tell us a little bit more about maybe some customers that we have that are using Druid to really process and analyze and ingest real time data for decisioning purposes?
[00:49:11.860] – Muthu Lalapet
Sure, that’s definitely a good use case for Druid. And some of our customers, they use Druid for real time decisioning, not only for looking into the analytics and making decisions, but having machines to make decisions as well, automatically. So before we go deeper into that, let’s try to understand what real time decisioning is all about.
[00:49:33.450] – Reena Leone
I once did a whole study on it, and it was fascinating to see how many people defined it differently. And it kind of almost was in like a use case by use case basis. But let’s define it here, right now. What is real time decisioning in terms of what we’re talking about?
[00:49:49.220] – Muthu Lalapet
Yeah, definitely. But let’s get to the basics, right? Real-time decisioning: let’s first look at what decisioning is. Let’s forget systems. As humans, we make decisions on a day-to-day basis, and we make decisions based on some data at our disposal. Same thing with systems too. They need data, and they need massive amounts of data, to make those decisions. They need data about what’s happening right now and what happened before. For the propensity model that you talked about, Reena, at your previous company, for sure they would have collected the clickstream events of all their customers to see what’s happening right now and also what happened before: what that particular individual is doing right now, and what they did before. That means they need to collect this massive amount of data, isn’t it? So you need data. That’s number one. And number two, we’re talking about real time here, real-time decisioning. So what’s needed for real-time decisioning? We humans can look at the data and take a couple of seconds, or a couple of minutes, to make a decision. But we want machines to make a decision right away. So you talked about your example, about the system deciding what to show on the screen based on the clickstream pattern.
[00:51:00.370] – Muthu Lalapet
If that system takes more than 5 seconds, then obviously you’re losing it right there. Right? So that means your decisioning system should be in a position to make that decision really fast, maybe in less than a second. To make a decision in less than a second, it needs to query the data. So probably your queries respond in less than 100 milliseconds. So the system could take the remaining 900 milliseconds to make the decision. That means you need a system that can provide faster query response time. Make sense so far?
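The latency budget described here (about 100 milliseconds for the query, leaving roughly 900 milliseconds for the decision) can be sketched as a request to Druid's SQL HTTP endpoint with a query timeout set in the query context. The router URL, the `clickstream` datasource, and the query itself are hypothetical, and the request is only built, not sent, so the example runs without a cluster.

```python
import json
import urllib.request

# Budget from the discussion: decide within one second, query within 100 ms.
TOTAL_BUDGET_MS = 1000
QUERY_BUDGET_MS = 100
DECISION_BUDGET_MS = TOTAL_BUDGET_MS - QUERY_BUDGET_MS


def build_sql_request(router_url, sql):
    """Build (but do not send) a POST request for Druid's SQL endpoint."""
    body = {
        "query": sql,
        # Query context timeout, in milliseconds.
        "context": {"timeout": QUERY_BUDGET_MS},
    }
    return urllib.request.Request(
        url=f"{router_url}/druid/v2/sql",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sql_request(
    "http://localhost:8888",  # hypothetical router address
    "SELECT COUNT(*) AS events FROM clickstream "  # hypothetical datasource
    "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' MINUTE",
)
print(request.full_url)
print(f"{DECISION_BUDGET_MS} ms left for the decision logic")
```

A caller would send this with `urllib.request.urlopen(request, timeout=QUERY_BUDGET_MS / 1000)` and fall back to a default decision if the query does not return in time. That fallback policy is a design choice for the decision system, not something prescribed in the conversation.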
[00:51:31.010] – Reena Leone
Oh yeah, 100%. I think that system was doing like sub, like 200 milliseconds if I remember correctly.
[00:51:38.610] – Muthu Lalapet
We definitely need to talk to them and see what they’re using, and probably we can sell them Druid. They can do much.
[00:51:45.470] – Reena Leone
I try to dig into it because I was like I wonder if they knew, but I’m not entirely sure. But again, they’re not the only folks out there that are using real time data in collaboration with AI and machine learning.
[00:52:00.580] – Muthu Lalapet
Yeah. So let’s get back to what’s needed. First, like I said, you need a system to collect all this data in real time, store it, and do analytics on both real-time and historical data. Second, you need a system that responds in less than 100 milliseconds so that the decision can be made immediately. That is low-latency queries. Third, machines are not going to take a rest, right? There could be multiple decision systems running in parallel. That means they’re all firing queries like anything. So what does that translate to? Higher concurrency. It translates to much, much higher concurrency. Let’s say you have a thousand people sitting and clicking on a button at the same time. It won’t be a thousand queries in the same second. But if you have a thousand machines, or rather a thousand instances of these decision systems, querying the system, it could be a thousand queries in the same millisecond. That’s possible. That means you need to have a system that can support highly concurrent queries. So these are the three things that you need to build any real-time decisioning system. And guess what? Druid is fit for that. Druid has direct connectors to Kafka and Kinesis, like we mentioned before, which is what our customers are using right now.
[00:53:12.990] – Muthu Lalapet
So we can plug into those real-time systems and capture all the events that are happening right now. And Druid provides analytics on both historical and real-time data together, so you do not have to build two different systems, like what Game Analytics had. One system is good enough. The second thing is that Druid is built for low-latency queries. Twitch was firing close to 100,000 queries a day. I was talking about Atlassian’s use case, with close to a million queries a day. All were performing in about sub-second time. And that’s the kind of latency that you need for real-time decision-making systems as well. Third, higher concurrency. Druid supports higher concurrency at the lowest cost. And these decision systems can run these queries much faster and make the decisions they need to make. Oh, yeah. You asked about an example. I forgot about that.
[00:54:04.590] – Reena Leone
Well, I kind of gave an example. It was my turn.
[00:54:10.350] – Muthu Lalapet
We should definitely talk to your previous company, for sure.
[00:54:14.290] – Reena Leone
I’m trying to think. I don’t remember what the system was built on, or if I can reveal that company’s secrets. Okay, so we’ve really dug into exactly what you should use Druid for, what challenges folks have been facing when handling massive amounts of data, where Druid fits, and what they come to us for. However, if somebody heard this and is like, I want to get started with Druid: David, where do folks go? How do people get started with Apache Druid?
[00:54:52.770] – David Wang
Yeah, great question. So, being an Apache project, it’s obviously free to get started. You can go to the druid.apache.org site and just download Druid to get started immediately. If you’re looking for a cloud service, where you don’t have to worry about managing the database or managing the infrastructure under the covers, and you don’t really want to become an expert in Druid’s ins and outs, then the really easy way to get started is to get an account and start a free trial with Imply Polaris, which is our database as a service for Druid. So, two really easy ways to get started right off the bat.
[00:55:25.740] – Reena Leone
Awesome. And then if they want more, Muthu, you and your team can do proofs of concept. Come talk to us. We can go through and kind of figure out what is needed, what you’re doing, and where things can be optimized. Once again, thank you, David. Thank you, Muthu, for being here and really walking us through everything that Druid has to offer. If you want more information on Druid or Imply, please visit imply.io. Until then, keep it real.