Speed, Scale and Streaming: Building Analytics Applications

Mar 28, 2023
Reena Leone
 

Insights play a crucial role in propelling organizations toward success and helping them outperform their competitors. These valuable insights originate from diverse sources, including internal data from operational systems and external data from partners, vendors, and public sources. Traditionally, analytics focused on data warehousing and business intelligence, where periodic queries on historical data fueled executive dashboards and reports.

However, visionary organizations like Netflix, Target, Salesforce, and the Wikimedia Foundation have realized the immense potential of merging analytics and applications in a fresh and innovative way. Instead of sporadic analytics queries on historical data, developers at these organizations are building a new generation of applications capable of seamlessly querying vast volumes of real-time and historical data, ranging from terabytes to petabytes, and delivering lightning-fast responses even under heavy load.

This is what real-time analytics applications are all about. Their technical requirements combine the scalability of data warehouses with the speed and responsiveness of transactional databases, transforming the way organizations leverage data-driven insights to make informed decisions and optimize operational efficiency.

Listen to the episode to learn:

  • What is a real-time analytics application?
  • Key use cases for real-time analytics applications
  • What components make up real-time analytics applications?
  • What features of Apache Druid make it suitable for real-time analytics?
  • How to get started building your own real-time analytics application

About the guest

Darin Briskman is Director of Technology at Imply, where he helps developers create real-time data applications. He began his career at NASA in the 1980s (ask him about rockets!) and has been working with large and interesting data sets ever since. Most recently, he’s had various technical and leadership roles at Couchbase, Amazon Web Services, and Snowflake. When he’s not writing code, Darin likes to juggle, blow glass (usually not while juggling), and work to help children on the autism spectrum learn to use their special abilities to work better with the neuronormative.

Transcript

[00:00:00.810] – Reena 

Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with analytics way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. Now, we’ve covered real-time data on this show, and obviously we’ve touched on Apache Druid a ton, and use cases, but what we haven’t actually tackled is building real-time analytics applications. And that’s right there in the intro. So let’s get down to business and talk about what real-time analytics applications are, why you might consider building one, what you need to build one, and, you know, maybe a database design for them, and how to get started. Joining me today is Darin Briskman, Director of Technology at Imply and author of the newly released O’Reilly report, Building Real-Time Analytics Applications: Operational Workflows with Apache Druid. Darin, welcome to the show.

[00:00:55.640] – Darin

Thanks, it’s great to be here.

[00:00:57.350] – Reena 

Darin, I like to ask all of my guests how they came to be where they’re at, what their journey has been in the data space. I know that you have quite a history with data and technology, so can you give me an abridged version of your background and how you got here?

[00:01:16.630] – Darin

I will try. Like at least 70% of people who work in IT and data today, I never set out to do this for a living. So I began my career as a physicist and ended up working at NASA, where, for complex reasons, I ended up in IT operations, working with large data sets and databases and stuff that was cutting edge back then, like Cray supercomputers sitting in trailers at Johnson Space Center in Texas. And ever since then, I’ve been doing things with data. I found it really interesting. I moved around a bit, spent a fairly long time with IBM, where I did a lot of work with IBM’s data initiatives and helped integrate companies like Netezza and Cloudera as IBM got serious about data. And in 2014, I retired from IBM, founded a little startup, ended up at Couchbase, which was interesting for about a year, and then a friend of mine convinced me to come join them at Amazon. So I spent four years doing database analytics and machine learning at Amazon Web Services. And then, okay, I was done with my career. I retired, I have all the money I need. And that lasted about six weeks before my wife said, “You’re bored. Get out of the house.

[00:02:33.400] – Darin

You need either to find another company or find a job.” And she was right. So I looked around and ended up joining a startup called Snowflake. And Snowflake was interesting, and I was there for the IPO, and I got to work on customers and professional services and strategy and some other things. And at the very end of 2021, it looked like it was time for me to look at new things, and I ended up leaving Snowflake and joining Imply. So I’ve been here, we’re recording this now in March of 2023, so I’ve been here about 14 months, and I’m really enjoying the world of real-time analytics and the speed, scale, and streaming that makes it work.

[00:03:17.060] – Reena 

Well, that is like the whole topic of today’s show because you’ve recently published a report on building real time analytics applications. And so let’s get right into that. What is a real time analytics application? Like, let’s define it because part of the reason for this show is talking about the difference between real time analytics and BI tools and where that intersection is and where they’re different. So what is a real time analytics application?

[00:03:44.090] – Darin

There isn’t really an official definition, and many different groups are using the term in different ways. But usually it means I have data that is coming in as events. And here we’re spending just a little time on a tangent, because for the last 30 or 40 years, since the beginning of data warehousing and the world of OLAP, we’ve really focused on analyzing transactions. And a transaction means I only care about the current state. I only care about how much inventory there is in the warehouse right now; I don’t care about how it got to where it is. The reason that was true is that it was too expensive to keep all of the data, keep all of the events, keep the whole record of how it got there. Well, what we’re finding in today’s world is that compute is cheap, storage is super cheap, and insights are even more valuable. So today it’s not that you can’t afford to keep all the events; it’s that you can’t afford to throw them away. So real-time analytics has become about all of the events streaming in. And they’re streaming in initially from things like clickstreams, ad bidding, Internet of Things, and log analytics.

[00:04:57.050] – Darin

But we’re starting to see the world of streaming grow to all sorts of transactional and other applications, because a stream is not only a good way to move data into analytics; it’s a good way to move data between microservices, and it’s a good way to move data to and from external services. On my website, I might be using streams to move purchasing data to, say, Visa or Mastercard or PayPal or Paytm and then back. And then, how do I analyze all this stuff? That’s really where the real time comes in. It’s, as I said, speed, scale, and streaming. So almost always, something that’s real-time analytics involves those three things. I have some sort of streaming data being delivered by some sort of data streaming platform. That could be Apache Kafka, it could be one of the many things compatible with Apache Kafka, like Confluent or Redpanda or Azure Event Hubs, or it could be something else, like Pulsar or Amazon Kinesis. That stream is going to be delivering thousands or tens of thousands or, for many of our customers, millions of events every second. And then you have to have a database that can ingest that stuff in real time, so that I can analyze the data as soon as it arrives.
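
To make that concrete, here is a minimal sketch, not from the episode, of how a Kafka stream might be wired into Druid: you POST a supervisor spec to Druid’s ingestion API, and Druid starts consuming the topic continuously. The broker address, the topic name (“events”), and the use of schema discovery are all illustrative assumptions.

```python
# Hedged sketch: start continuous Kafka ingestion by POSTing a
# supervisor spec to Druid. Broker, topic, and schema are assumptions.
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "events",  # hypothetical topic name
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        },
        "dataSchema": {
            "dataSource": "events",
            "timestampSpec": {"column": "timestamp", "format": "auto"},
            # Let Druid discover dimensions (Druid 26+); older versions
            # need an explicit dimensions list here instead.
            "dimensionsSpec": {"useSchemaDiscovery": True},
            "granularitySpec": {"segmentGranularity": "hour"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# The router proxies this to the Overlord; once the supervisor is
# running, rows become queryable almost as soon as they arrive.
resp = requests.post(
    "http://localhost:8888/druid/indexer/v1/supervisor",
    json=supervisor_spec,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"id": "events"}
```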

[00:06:09.930] – Darin

I don’t want to know 20 minutes from now that something interesting may have happened 20 minutes ago. I need to know it now so I can take action on it. And it also means we end up with large data quantities, because each event in a stream is usually small, usually in the one-to-two-kilobyte range, but you do millions of those per second, which is tens of billions per day, and the data adds up very quickly. So the thing that really makes this interesting, if I’m going to do real-time analytics, is that I need to be able to handle a large quantity of data very quickly and be able to query it very quickly. And this is not what traditional analytics databases were built for. So this is creating kind of a new world of analytics applications.
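
That arithmetic is worth spelling out. A quick back-of-the-envelope, using illustrative numbers in the range Darin mentions:

```python
# Back-of-the-envelope math on stream volumes. Numbers are illustrative.
events_per_second = 1_000_000   # "millions of events every second"
event_size_bytes = 1_500        # "usually in the one-to-two-kilobyte range"

events_per_day = events_per_second * 86_400
bytes_per_day = events_per_day * event_size_bytes

print(f"{events_per_day / 1e9:.1f} billion events/day")  # ~86.4 billion
print(f"{bytes_per_day / 1e12:.1f} TB/day raw")           # ~129.6 TB/day
```

At roughly 130 TB of raw events per day, you are into petabytes in about a week, which is why “you can’t afford to throw them away” changes the database requirements so much.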

[00:06:54.170] – Reena 

And so this has kind of become the default Apache Druid podcast. And when we talk about Apache Druid, its use case as a database fits really, really well, dare I say perfectly, with the use case for real-time analytics applications. Can you tell me a little bit more about how Apache Druid is perhaps the right choice for this type of analytics application?

[00:07:19.560] – Darin

Sure. So one of my favorite sayings from Winston Churchill, which like most of his sayings, he may or may not have actually said, is that an ounce of history is worth a pound of logic. And the ounce of history here is, where did Druid come from? Druid came from four really brilliant engineers who were working on a problem in ad tech.

[00:07:39.700] – Reena 

Yes. We had Eric “Cheddar” on the show to tell us all about the Druid origin story. Please go back and listen to that episode.

[00:07:46.790] – Darin

But the super short version of that is, in order to do real-time bidding, they needed something that could pull in a billion events in less than a minute and query a billion events in less than a second.

[00:07:57.300] – Reena 

And nothing was fitting the bill at the time.

[00:07:59.450] – Darin

Exactly. They tried all the existing technologies and they just didn’t work. And because they were young and a little bit foolish, they said, we’ll make our own database. How hard can it be?

[00:08:08.650] – Reena 

We definitely touched on that.

[00:08:10.400] – Darin

Turned out to be pretty hard. Yeah, but they succeeded: they created Druid, and then they did what’s great about Silicon Valley, which is they used it for their company, but then said, well, there’s not enough of us to really develop this, so let’s make it open source. And it was quickly adopted by Netflix and by other companies, and it really grew into the great community that Druid is today. So if we look back at that: speed. Speed is partly ingestion: how quickly can I get data into the database? Druid is able to do high-speed ingestion from batch. So I have a chunk of data that’s sitting in a file, or maybe I pull it out of Oracle or SQL Server; I can pull that stuff in really fast. But it’s also really fast for streams. So I can have thousands, tens of thousands, millions, I know of one installation that’s tens of millions, of events coming in per second, and it can put all those in there and make them immediately available for query. You don’t have to wait for a micro-batch or something like that. The moment it arrives, single-digit milliseconds after arriving, it’s available to query.
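
As an aside, here is a hedged sketch of what that batch side can look like in recent Druid versions (24+), using SQL-based ingestion: you submit an INSERT over an external file as a task. The file URL, table name, and columns below are made up for illustration.

```python
# Hedged sketch: batch-ingest a JSON file into Druid via the SQL task
# API (MSQ engine). URL, datasource, and columns are assumptions.
import requests

ingest_sql = """
INSERT INTO events_batch
SELECT
  TIME_PARSE("timestamp") AS __time,
  "user", "page", "country"
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/events.json"]}',
    '{"type":"json"}',
    '[{"name":"timestamp","type":"string"},{"name":"user","type":"string"},
      {"name":"page","type":"string"},{"name":"country","type":"string"}]'
  )
)
PARTITIONED BY DAY
"""

resp = requests.post(
    "http://localhost:8888/druid/v2/sql/task",
    json={"query": ingest_sql},
)
resp.raise_for_status()
print(resp.json())  # returns a taskId you can poll for completion
```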

[00:09:22.860] – Darin

And then there’s scale: lots of people are using Druid at very different sizes. I mean, you can run it on a laptop, but you can also run it on clusters of tens of thousands of servers. So a lot of customers have petabytes, tens of petabytes, whatever, while still doing subsecond queries. Now, the reason for this is a really cool thing called scatter gather architecture. It’s probably deeper than we want to go today.

[00:09:40.590] – Reena 

We talked about scatter gather. We did a whole episode talking about the new MSQ engine and framework, and what scatter gather versus shuffle mesh is. Oh, we go there.

[00:09:51.550] – Darin

I’m not going to repeat that whole story either, except to make a long story short: how can I do a subsecond query on a petabyte of data? Well, I’m not really doing one subsecond query across a petabyte. I’m really doing 2,000 subsecond queries across 500-megabyte chunks, micro-partitions, or as we call them, segments, hitting all of those in parallel and then gathering the results back together. That’s how you get the speed and the scale. And of course, as we said, the streaming is inherent; Druid was designed explicitly for streaming data. So this is what Druid is designed for. And pretty much all of the thousands of projects out in the world that are using Druid have some combination of: they needed to be really, really fast, or really, really big, or to deal with streaming data. And in many cases, all three.
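
As a back-of-the-envelope illustration of the scatter/gather idea (illustrative numbers, not a benchmark):

```python
# A big query is scattered into per-segment scans and the results are
# gathered back. Druid segments are commonly sized around 500 MB; the
# data sizes and parallelism below are hypothetical.
SEGMENT_SIZE_BYTES = 500 * 1024**2

def scatter(total_bytes: int, parallel_scans: int) -> None:
    segments = total_bytes // SEGMENT_SIZE_BYTES
    waves = -(-segments // parallel_scans)  # ceiling division
    print(f"{segments:,} segment scans, "
          f"{waves:,} waves at {parallel_scans:,}-way parallelism")

scatter(10 * 1024**4, 2_000)   # 10 TiB -> 20,971 scans, 11 waves
scatter(1024**5, 20_000)       # 1 PiB  -> 2,147,483 scans, 108 waves
```

The point is that no single machine ever scans a petabyte; each core scans a few hundred megabytes, and the cluster does that in parallel.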

[00:10:55.970] – Reena 

Actually, that brings up a really good point. So there are plenty of companies who are building real-time analytics with Druid. Who’s getting it right? Who has a really great use case? I know we’ve talked about a few on the show before, but who really is getting the real-time analytics application idea right?

[00:11:17.500] – Darin

So that’s actually a harder question than it sounds, because everyone has their own use case and their own needs.

[00:11:24.190] – Reena 

Maybe what are some of the most interesting use cases?

[00:11:26.720] – Darin

Well, there’s one where everybody can go use the interface and play with it, and that is Wikimedia. If you use your favorite search engine and look for Wikimedia, this is the reporting that’s done for Wikipedia and all of the other projects in the Wikimedia Foundation, and they’re using Druid to provide it. Now, in their case it’s not really real time, because given the nature of what they’re doing, it’s okay if it’s a few minutes out of date. But what they need Druid for is that interactive data exploration, so that I can easily drill in and say, okay, how many mobile devices in Finland hit pages that are not in English, and be able to just point and click and easily run those kinds of complicated queries and get answers back right away in the interface. And this is all public stuff; you can go look for Wikimedia and find it. And of course, being the Wikimedia Foundation, it’s 100% open source. They’ve put everything they’ve done on GitHub if you want to recreate it and do it yourself. Another good example, and this is a little more complex, is ThousandEyes.
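
Before we get into ThousandEyes, here is a hedged sketch of the kind of Wikimedia-style drill-down Darin just described, expressed as Druid SQL. The datasource and column names are hypothetical, not Wikimedia’s actual schema.

```python
# Hedged sketch: "mobile devices in Finland hitting non-English pages"
# as a Druid SQL query. Datasource and columns are assumptions.
import requests

query = """
SELECT COUNT(*) AS page_views
FROM pageviews
WHERE country = 'Finland'
  AND access_method = 'mobile'
  AND page_language <> 'en'
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
"""

resp = requests.post("http://localhost:8888/druid/v2/sql",
                     json={"query": query})
resp.raise_for_status()
print(resp.json())
```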

[00:12:31.930] – Darin

So ThousandEyes is a system monitoring company. They were a startup; they were acquired by Cisco a few years ago, and they still operate pretty much autonomously as ThousandEyes. The name tells you what they’re doing. They’re putting sensors all over your systems so that you can watch your network equipment, you can watch your physical machines if you’re on prem, you can watch your virtual machines when you’re on cloud, you can watch your applications. You can have up-to-date information on everything happening on your network, and you can set the necessary alerting to go with that monitoring, so when something’s going wrong, you know it immediately. And this is particularly useful for people who are running complex applications and many services, because this is the whole thing about reliability. If, say, Amazon goes down, everybody knows there’s a big problem and nobody blames you for it. But when your application normally takes 5 seconds and one day it’s taking 30 seconds, then you get people yelling. And that’s much harder. It’s easy to diagnose that something’s wrong when it’s like smoke coming out of a building. It’s much harder to diagnose when it’s just slower than it should be, or wonky somehow.

[00:13:36.640] – Darin

And this is what ThousandEyes lets you do. Now, when ThousandEyes started out, the database that they used was MongoDB, and MongoDB is awesome. Lots of people use Mongo together with Druid. It is really easy to develop with. It’s a document database; they have all sorts of cloud and open source and on-prem options and ways to run it. But it’s not designed for real-time applications. So when they were at the prototype stage, Mongo was great. And when they had five customers, Mongo was okay. And when they had 20 customers, it started slowing down. And when they were getting up towards 100 customers, it was taking minutes. Many minutes.

[00:14:12.920] – Reena 

Like minutes in this instance is a lifetime.

[00:14:16.890] – Darin

Yeah, we joke about this, but when you’re online, seconds really count. 5 seconds doesn’t sound like a long gap, but if you’re staring at a website or listening to a podcast, then 5 seconds seems like forever.

[00:14:31.550] – Reena 

I’m glad that I know you well enough to know what you were going to do right there.

[00:14:36.310] – Darin

That’s good. So what ThousandEyes ended up doing was moving their back end onto Apache Druid. In this case, they chose to partner with Imply; this was important enough to them that they wanted the support and other things you get. And that dropped their time all the way back down to one to two seconds for everything. And they are much bigger now. I don’t even know how many customers they have, but becoming part of Cisco, of course, exposed them to a lot more customers who have picked them up and put them into their systems. So ThousandEyes is now able to scale, have the speed that they need, and deal with the streaming data, which is the core of everything that we were talking about with real time. The other part of that that’s important, something we didn’t really talk about much, is concurrency. I want to mention that for a moment because it’s often important in real-time applications, where it’s not as important in traditional reporting. Right? If I’m doing monthly reports or sales reports or quarterly reports, I’m running one report every so often, and I don’t really care that much how long it takes.

[00:15:41.050] – Darin

I don’t want it to take days. But if I’m running my monthly sales report, whether it takes five minutes or it takes ten minutes, who cares? It’s once a month, at two in the morning; even for daily reports, the time doesn’t matter. And for that matter, I don’t care that much about uptime. I mean, I want the system to mostly be up, but again, if the thing needs to go offline for maintenance for an hour, fine. And then concurrency: I only have maybe one or two or three analysts running a report at the same time. In real-time analytics, all of that is different. First of all, I constantly have data arriving and ingesting. I have no time for outages, whether they’re planned outages or unplanned outages, right? I can’t say, okay, everybody stop doing analytics for an hour on Sunday afternoon, because there’s stuff happening on Sunday afternoon in most of these cases. Similarly, I need to make sure that these things can run lots of concurrent queries. Because when I’m ThousandEyes and my customers are running queries to see what’s happening in their environment, how many are going to run at a time? Might be one, might be 10,000; I don’t know in advance.

[00:16:41.140] – Darin

So one of the things about real-time analytics databases, and Druid in particular, is that there’s really no limit on concurrency. You will have to add some hardware if you want to do 1,000 concurrent or 10,000 concurrent queries, but the software can handle that. And this is really key for real-time analytics: having no limits on the concurrency, and also no downtime. One of the things about Druid is that it continuously backs itself up as part of its architecture. There’s something called deep storage, which is not just for backup; it’s part of operations. But it means I have a copy of all of the data sitting in some sort of object store. On the cloud, that’s something like S3 or Azure Blob Storage or Azure Data Lake Storage. And if it’s on premises, then you’ll have some sort of reliable storage, like an HDFS array. The data is constantly backed up there, so even in an outage you don’t lose any data. And no planned downtime: I have to do upgrades once in a while, but they’re all rolling. Druid is designed so that I can do any sort of upgrade I need to do while the cluster is up, and it stays up.

[00:17:46.660] – Darin

And for that matter, when hardware fails, and I said when, not if, because hardware always fails eventually, Druid just keeps running at a lower capacity until you replace or add more hardware. And I can add or remove hardware while everything is up and running; it’s completely dynamic. So I know I’m sounding a little bit like a sales pitch, but this is one of the things that got me really excited about Druid and why I came here. Because I’ve been running mission-critical systems for 35 years, I’m really old, and all the time we would have outages. In the Druid world, the only way you have an outage is if you lose a whole cloud region or the whole data center.

[00:18:25.790] – Reena 

And at that point, what can you do?

[00:18:27.570] – Darin

Well, if it’s mission critical, then you spin it up in your reserve data center. And many of our customers who are that critical will keep an extra copy of the storage somewhere else, so we can restore the whole thing and have it up and running again in minutes. So it really is high availability. I don’t really like the term disaster recovery, because in Druid you don’t recover from the disaster. It’s business continuance: things just keep happening. You don’t have a disaster that you recover from. Now, cloud outages don’t happen that much, but there was a significant one this year in Southeast Asia with Google, and there was a huge one in Northern Virginia about a year and a half ago with Amazon, and our customers didn’t lose data and didn’t go down, because it was an environment designed to keep running. So that’s really important if you’re doing something like ThousandEyes, where people are depending on reliable services all the time.

[00:19:22.050] – Reena 

When you mentioned high concurrency, another one that came to mind, and we talked about this before, is Atlassian, right? And it’s specifically around Confluence. Not to be confused with Confluent, that’s another use case, different story, also a Druid user. But that’s another reason for their Druid instance. One of the many reasons is the ability to handle high concurrency.

[00:19:48.260] – Darin

Yeah, so if you’re a developer, you already know all about Atlassian, and you’ve probably used Jira and Confluence and those other tools. If you’re not a developer: Jira is a ticketing system that’s very commonly used to manage development tasks, and Confluence is the wiki product. It’s the place where you can put information and keep it. We use it at Imply, among others. And if you are running a development shop, you’ve got lots of people putting in and changing content. I think anyone out there who’s ever dealt with an intranet realizes the biggest problem is keeping it current, because it’s easy to write web pages. But then people go looking for information, and they find stuff that’s, like, four years old and it’s misleading. So the way you deal with that is to have good analytics and the ability to quickly pull up: where is there activity? What pages are people looking at? What pages haven’t been touched in six months, so maybe we should drop them or update them, and so on. And if you’re using Atlassian Confluence, that’s under a button that, with no imagination, they named Analytics.

[00:20:48.520] – Darin

And how do they actually do that? They do that with Druid. Every time you have a web click, every time you have activity on Confluence, that is an event. That event goes over a Kafka stream, and that Kafka stream is ingested into a Druid cluster, so that we have real-time information about what every Confluence customer is using. Now, there’s still security and privacy: this system doesn’t tell anyone what’s on those pages; it tells people which pages they’re using and how that’s happening. So you, as a Confluence customer, anytime you want to, can click and see analytics, and can also set alerts: give me a report of how many things haven’t been looked at for three months, or six months. Or the reverse side: which pages have been seen more than a thousand times today, because something weird is going on if that’s happening, and figure those things out. And everything else we talked about: does it need to be subsecond real time for this? Probably not, but I still want it pretty fast. Is it big scale? Absolutely. Atlassian is used by tens of thousands of development shops around the world. Literally millions of developers use Jira and other Atlassian tools every day.
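
As an aside, the producer side of the pipeline Darin describes (click, Kafka event, Druid) might look something like this hypothetical sketch; the topic name, event fields, and broker address are all assumptions, not Atlassian’s actual schema.

```python
# Hedged sketch: each page view becomes a small JSON event on a Kafka
# topic that a Druid cluster then ingests. All names are assumptions.
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "timestamp": int(time.time() * 1000),  # event time in millis
    "space": "ENG",        # which wiki space was touched
    "page_id": "12345",    # which page (not its contents)
    "action": "view",      # view / edit / comment, etc.
}
producer.send("confluence-activity", value=event)
producer.flush()
```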

[00:21:56.450] – Darin

And of course you need to be able to search through very large quantities of data quickly. And with high concurrency. So it’s another place that just fits into the things that Druid is good for. And we could do this all day because like we said, there’s thousands of projects out there with Druid that are live right now.
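
For flavor, here is a hedged sketch of the “hot pages” alert query Darin mentions, in Druid SQL; the datasource and column names are hypothetical.

```python
# Hedged sketch: pages viewed more than a thousand times today.
# Datasource and columns are assumptions.
import requests

hot_pages_sql = """
SELECT page_id, COUNT(*) AS views_today
FROM confluence_activity
WHERE action = 'view'
  AND __time >= TIME_FLOOR(CURRENT_TIMESTAMP, 'P1D')
GROUP BY page_id
HAVING COUNT(*) > 1000
ORDER BY views_today DESC
"""

resp = requests.post("http://localhost:8888/druid/v2/sql",
                     json={"query": hot_pages_sql})
resp.raise_for_status()
for row in resp.json():
    print(row["page_id"], row["views_today"])
```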

[00:22:11.260] – Reena 

Speaking of projects, so we’ve talked about some use cases, companies that are building real-time analytics applications. If I want to build a real-time analytics application, how do I get started, where do I go? What would be your recommendation? I feel like my company needs this. I have a use case that fits. What do I do?

[00:22:33.560] – Darin

There’s a few things I recommend. One is Druid has a fantastic and truly helpful community. So if you go to Druid’s home page, which is druid.apache.org. 

[00:22:46.660] – Reena 

We always make sure to plug that.

[00:22:47.960] – Darin

Absolutely. And right in the middle at the top of the page, it says Join the Slack. So click on that. You don’t have to have a Slack license; if you own a copy of Slack, it will integrate in, and if you don’t, you just do it through a browser. And now you have instant communication with thousands of people in the Druid community who are very friendly and welcoming and helpful. And not just nerds like me, but real nerds, like the ones who created Druid, are on there every day. So you’ll get answers to any questions that you put there. Then, well, I’ll toot my own horn a little: you should download a free copy of Building Real-Time Analytics Applications. There are two places you can get that. The easiest is to go to imply.io, and you’ll find it pretty easily; or just search around for it and you’ll find it. Or you can download it from oreilly.com, and again, look for Building Real-Time Analytics Applications. It’s about 35 pages. This is not like a gigantic book that you’re committing a lot of time to.

[00:23:47.640] – Reena 

It’s an easy read. It’s an easy read.

[00:23:49.640] – Darin

Easy read. My wife described it as either a one-bath or a two-poop read, which is about the right length, if you want to use it that way. And it’s free. So I won’t tell you what to do with the book after you’re done reading it; that’s up to you. And then, to actually do things: well, I could set up my own Druid environment. It’s not that hard to do, but if you’ve never done it before, it’s a little bit tricky, because part of what makes Druid have such cool speed, scale, and streaming is that it’s got a lot of moving parts. So you’ve got to set up a lot of things. To avoid that, if you go to imply.io/polaris, or just go to imply.io and click on Polaris or Free Trial, that will set you up with Polaris. Polaris is Druid as a service. It’s free for 30 days; you don’t even need a credit card, just an email and a name, and that will give you a fully functional and quite scalable Druid environment that you can use to start writing stuff and see how things go.

[00:24:45.230] – Darin

And then after 30 days, if you still want to do more, you can either start paying for it or you can just grab the open source version, because by then you’ll probably understand it well enough that you’ll know what you need to do to set it up and make it work. And then the one other thing I’d suggest you look for at imply.io is some documents about how to do your own proof of concept. These are white papers, about 10 to 12 pages long, but they kind of walk you through saying: I want to do a proof of concept, so what do I need to think about? What concept do I want to prove? How do I make that work? There’s one that’s focused on open source Druid and one that’s focused on Polaris, the database as a service, but they’re about 90% identical, and they give you a chance to look through these and say, how could I set this up and do my project? But with Polaris, if you already know what you want to do, you put in the request, and two to three minutes later you get an email with: here’s your Polaris environment.

[00:25:41.270] – Darin

You can start loading data immediately. And, you know, I’ve seen people, you know, have running applications in ten minutes.

[00:25:48.050] – Reena 

And as we brought up before, you can run this on a laptop.

[00:25:50.980] – Darin

Well, if you’re running Polaris, it’s all in a web browser. It runs on the cloud, but, you know, you don’t care; you just need something with a web browser. So you could use a laptop, or you could use a liquid-nitrogen-cooled mainframe if that’s what you have available. Whatever you’ve got, all you really need is Chrome or Firefox or Safari or Edge or some browser that can connect. So it’s easy to do, and I’d urge you to try it, because if you’re a technologist, even if you’re not doing real-time analytics today, it’s one of those topics you need to know about, because it’s coming. Today it’s a relatively small part of the total analytics landscape, but it’s growing really fast. And the reason for that is streaming is growing really fast. Both streaming and real-time analytics are growing at above 100% a year. If you’re not using it now, you probably will be. So it’s a good idea to get to know a little something about it.

[00:26:36.610] – Reena 

Oh, yeah, and streaming is definitely not going anywhere.

[00:26:39.570] – Darin

I’d argue with you. It’s going everywhere.

[00:26:41.290] – Reena 

Okay, fine, fine. Semantics. It’s going everywhere.

[00:26:44.450] – Darin

But it is absolutely true, because when streaming first came out, it was for delivering things like clickstreams. If you know the Kafka history, it started at LinkedIn and then ended up at Netflix and other places where people were looking at clickstreams. Today, it’s used to deliver all sorts of data that happens as events: telemetry data, IoT data, autonomous vehicle data, gaming console data, and still web clickstream data. But we’re also seeing streaming as the main way to connect different microservices and complex applications. So streaming is moving from kind of a niche product to one of the core building blocks of enterprise applications. It’s everywhere. And if you’re going to use streaming for your applications, before long you’re going to need to analyze what’s going on. You’re going to need a real-time analytics application, and the best way to build that is with a real-time analytics database like Druid.

[00:27:38.010] – Reena 

Well, Darin, thank you so much for joining me and talking me through building real-time analytics applications. And like Darin said, if you want to learn more about Druid, visit druid.apache.org. And if you want to learn more about Imply or Polaris, or need help getting started, please visit imply.io. And until next time, keep it real.
