All things Apache Druid 27.0: From deep storage queries to new visualization with Will Xu
On this episode of the “Tales at Scale” podcast, we go all in on the newly released Druid 27.0. We explore the new web console explore view, which aims to enhance the developer experience by enabling easy visualization and experimentation with data in the Druid database. And we dive into the highly anticipated asynchronous query and query from deep storage feature, which reduces the cost of ownership for large data sets by querying infrequently accessed data stored in cheaper storage hardware.
In addition to new features, Druid 27 included some improvements to make Druid even better. This includes improvements made to schema auto-discovery, which was released earlier this year in Druid 26. Platform compatibility improvements have also been made, including progress on native integration with Kubernetes for data processing and ingestion.
But that’s not all! Listen to this episode to learn more about:
- The integration of Apache Iceberg, a high-performance format for managing large analytic tables, into Druid
- Improvements made to schema auto-discovery, making it easier to ingest data without specifying a schema
- The future roadmap for Apache Druid, which includes adding window functions to the query language, improving performance on high cardinality data sets, supporting Spark for loading and reading data, and bringing the Kubernetes-based ingestion to fruition
- Introducing Apache Druid 27.0.0
- Schema Auto-Discovery with Apache Druid
- Druid Architecture & Concepts
About the Author
Will Xu is a product manager at Imply. Prior to joining Imply, his product management career included roles on Apache Hive, HBase, and Phoenix, as well as being the first product manager for Datadog. His experience with data and metrics includes product management for Facebook external metrics and the Microsoft Windows data platform.
[00:00:00.410] – Reena Leone
Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real time data, and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. And we are back again with another Druid release. It seems like only yesterday when we were discussing Druid 26, and here we are at Druid 27. Thanks to the dedication of the Druid community, this release was made possible by over 350 commits from 46 contributors. And joining me today to talk about what’s in this release is Will Xu, product manager at Imply. Will, welcome to the show.
[00:00:40.780] – Will Xu
I’m happy to be here.
[00:00:42.050] – Reena Leone
Since you’re a first time guest on Tales at Scale, let’s start a little bit about your background and how you got to where you are today.
[00:00:49.760] – Will Xu
Yeah, I’d love to talk about that. I have been with Imply, the company, for a little bit over three years at this point. Before joining Imply, I was a product manager working at Meta on external metrics, and before that, I was the product manager working at a company called Hortonworks, today known as Cloudera, on some big data products. Some of those you might know the name of, such as Hive, HBase, Phoenix, and also Druid. And before that, I was the first product manager for Datadog.
[00:01:44.080] – Reena Leone
Okay, so you had familiarity with Apache Druid before joining Imply?
[00:01:48.350] – Will Xu
[00:01:49.330] – Reena Leone
So, Will, for Imply, you kind of coordinate the releases and you know everything that’s going on with Druid. And here we are at 27. It’s out, it’s available. Let’s talk about what’s in this release, starting with what’s been added to Apache Druid. Can you tell me a little bit about the new web console Explore View?
[00:02:07.880] – Will Xu
Absolutely. Yeah. One of the things we’ve noticed is Druid is a great low latency real time database. It’s really awesome for building applications, but a lot of times the developers who are building or deploying the database are not necessarily experts in front end development. And a lot of the front end libraries, such as D3 or other popular charting libraries, require quite a bit of work to set up and to connect to a database. So as a community, we banded together and said, what if we want to enable an end to end, really easy to access developer experience so that you can start visualizing and experimenting with data in the database and see how it can show up in a realtime analytic application? That’s why we started embedding a really simple example view into the tool so that once you have the cluster up and running, once you have the Druid database up and running and have loaded some sample data, you can easily start doing visualization and exploration on this data. The hope here is this will form the basis of a really nice development experience and you can port the same view to an application that you’re building with relative ease.
[00:03:23.830] – Will Xu
And also this will allow people from the front end team, the people who are doing analysis, as well as the back end engineering team to collaborate in the same view very easily. So you can show someone over, like, a Zoom screen share to say, hey, is this how you want your data to look? And you can quickly adjust things before you make that into an application or code. So that’s the intent behind the “Vad view”. The “Vad view” provides a lot of the very classic charting types that you might see, things like line charts or stacked area charts, and all those things can be built on top of existing Druid data with relative ease. And then what’s really cool about this is there’s one little option in the top right corner of the view, if you try it out: if you drag and drop and build some beautiful charts, it will actually show you the query that’s generated by the view to render the chart. And you can just copy paste the query into your code to replicate the same experience in your application. So, going back, this whole feature is built around making it really easy for new developers, or existing developers developing new experiences, to easily adopt Druid for their applications.
[00:04:40.830] – Reena Leone
A couple of things too, since we’re talking about something that’s very visual, there’s already a blog post that kind of gives you a screenshot of what this is and instructions on how to get this view yourself. And did you say it was the “Vad view”? Is that what you called it?
[00:04:54.990] – Will Xu
Yeah, I said the developer view. But the main person who has been contributing to this is Vad [Vadim Ogievetsky], and he’s been the main contributor for the web console in the open source community for the last however many years. And this is definitely one of the latest additions that he’s bringing to us.
[00:05:14.150] – Reena Leone
Actually, he did our Druid 26 episode, and I almost feel bad because visualization is Vadim’s whole thing and I do an audio podcast, so I always think about that when I have him on the show. Okay, so it is amazing to hear that that is available. But another thing that I saw come through in this release is asynchronous query and query from deep storage. And this is really exciting because I feel like a lot of people have been waiting for it.
[00:05:43.920] – Will Xu
Yes. Yeah, this is one thing that we have also been iterating on quite a bit over the course of the last year. For those who might not have the full background on why Reena is so pumped up about this feature and why everyone is so excited: this is probably one of the most requested features from the community. Looking at the data industry as a whole over the course of the last ten years, one of the big pushes is to be able to fit the workload a lot more elastically so that we can maximize resource utilization and minimize cost. And what that means is, for a low latency database such as Druid, we require a lot of specialized hardware. The data needs to stay on solid state drives, which are very fast, and the data needs to be preloaded so that you can have really high concurrency. All of those contribute to a lot of cost. It does make sense to have those kinds of costs associated with a high concurrency, low latency workload, because on a per-query basis you are paying the minimum.
[00:06:50.140] – Will Xu
However, as the data ages, let’s say you have data that’s a year old or two years or even five years old that you access only once in a month or a year for things like audits or for longer term reporting, it’s less ideal to have it sit on really expensive hardware. So the entire industry is pushing towards a mode where we can decouple the compute and storage, where the storage can be on really cheap storage hardware such as object stores. The typical things that you hear are Amazon AWS S3 or Azure Blob storage or GCP Blob storage, while the compute layer that does the actual query processing is spun up and down on demand. So this will allow you to basically pay an on demand price for infrequently accessed data, which will drastically reduce the total cost of ownership for large data sets. And this is precisely what this feature is built to do. It allows you to query really old data that’s infrequently accessed on demand by using the elasticity provided by cloud infrastructures. Now, one of the biggest challenges for a feature like this, that you will see a lot of other databases struggle with, is that the longer running reporting style query is using the same resources as the hot low latency queries. Someone will decide to run a monthly sales report that takes 8 hours, and it will use up the majority of the resources in the infrastructure, causing everyone else who is running one to two second low latency queries to wait.
[00:08:27.840] – Will Xu
And that’s not the experience we think is going to be ideal. So in Druid, in this case, we’ve decided to go down the route of separating the hot and cold tier completely. For the hot tier we’ll ensure low latency, we’ll ensure high concurrency, and for querying from deep storage we will be using an asynchronous query mechanism that uses the current ingestion system. Under the hood, it uses the asynchronous multi-stage query engine, which will elastically scale up and down and use its own dedicated resources. This ensures your long running, infrequently accessed queries do not touch the resources used by the hot tier queries, ensuring that both people who are running long queries and people running short queries have an optimal experience. So in this release, Druid 27, this will be shipping as an experimental feature. And please do give it a try. It’s pretty awesome.
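Editor's sketch: to make the asynchronous flow described above concrete, the snippet below builds (but does not send) the HTTP requests a client might use to submit a long-running SQL statement and later check on it. The endpoint path `/druid/v2/sql/statements` and the `executionMode` context key follow our reading of the Druid 27 documentation and should be verified against your version; the host, port, and the `sales_archive` table name are purely hypothetical.

```python
import json

# Assumed Router address; replace with your own cluster's URL.
DRUID_URL = "http://localhost:8888"


def build_async_statement(sql: str) -> dict:
    """Describe the HTTP request that submits a SQL statement asynchronously.

    The request is only described, not sent, so this sketch runs without a cluster.
    """
    return {
        "method": "POST",
        "url": f"{DRUID_URL}/druid/v2/sql/statements",
        "headers": {"Content-Type": "application/json"},
        # executionMode=ASYNC asks Druid to run the query in the background
        # (assumption based on the Druid 27 docs).
        "body": json.dumps({"query": sql, "context": {"executionMode": "ASYNC"}}),
    }


def status_url(query_id: str) -> str:
    """URL to poll for the state of a previously submitted statement."""
    return f"{DRUID_URL}/druid/v2/sql/statements/{query_id}"


def results_url(query_id: str) -> str:
    """URL to fetch results once the statement has finished."""
    return f"{status_url(query_id)}/results"


if __name__ == "__main__":
    # "sales_archive" is a hypothetical datasource holding rarely accessed data.
    request = build_async_statement(
        "SELECT COUNT(*) FROM sales_archive WHERE __time < TIMESTAMP '2020-01-01'"
    )
    print(request["method"], request["url"])
    print(request["body"])
```

In practice a client would send the first request, read the returned query ID from the response, and poll the status URL until the statement finishes before fetching results.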
[00:09:22.480] – Reena Leone
It’s so exciting, because I feel like when MSQ came out a few releases ago, and I think it was the fourth episode of the show, we kind of dove into your query options, like the Ferrari or the SUV, depending on what you needed from your data. Did you need it to be faster? Did you need a lot of storage? Right. So it’s fun to see it come to fruition from the start of the show.
[00:09:49.550] – Reena Leone
Another cool thing I saw in this release was Apache Iceberg support. And so for those who are unfamiliar, Apache Iceberg is a high performance format for huge analytic tables. So it’s designed for improved performance and scalability and consistency for managing large data sets, which is why it makes sense in this case. It’s compatible with Spark and with Flink and now with Druid. Can you tell me more about that?
[00:10:13.940] – Will Xu
Yeah, sort of going back to the old days when Hive and Spark were initially introduced, there has been this conversation around how you manage vast amounts of data. In the Hadoop world, you will see a lot of the historical data objects being stored in what we typically call data lakes today. They’re usually Parquet or ORC files stored in a giant pool on HDFS or in S3. It’s very cheap. But in order for the database to figure out where the files are and what metadata is associated with them, there’s usually a metastore. The most typical one that you see out there is the Hive metastore. It’s also the technology that AWS Glue is built on top of. Iceberg is really the second incarnation of that. It’s a new technology that marries the management of the metadata as well as the underlying physical data objects into a single solution. It’s really popular in some of the largest deployments that we see in the world today. For example, Netflix is using Iceberg extensively. Apple is using Iceberg extensively for their internal deployments and as a sort of data lake layer which contains a lot of historical data.
[00:11:36.610] – Will Xu
Sometimes you really want the ability to accelerate that. You really want to serve that data at really low latency. And this is sort of where a database like Druid comes in. And the Iceberg integration this time around was contributed by Apple to the community, because they also use Druid internally. That’s why they want to sort of build a bridge between those two solutions. And the Iceberg connection really allows Druid to treat the Iceberg lake of data as a source and be able to ingest that data and start accelerating and serving queries at really high concurrency, at really low latency. And coupled with the asynchronous query as well as the multi-stage query engine, you can very easily just run a SQL query to select some table or some data set and then insert into the Druid database. And then the engine will basically take care of the rest. It will do the transcoding and the right indexing so that the data becomes available for querying on the Druid side.
[00:12:42.610] – Reena Leone
In addition, this is kind of like a release between bigger releases, but I feel like there’s a lot here. Another one was smart segment loading. But before we dive into that, as I was researching a little bit, you talk a lot about the coordinator since that’s part of it. Can you tell me more about that?
[00:12:58.600] – Will Xu
Yeah, absolutely. So Druid is a database. It’s built on a very key concept, which is complete decoupling of responsibility. And what that means is ingestion is its own system, querying is its own system, and cluster management is its own system. Being a really large distributed database, it can operate on thousands of nodes. And as you are scaling up and down, as you are adding new nodes or new data sets, something needs to happen for the segments or the data to move around and ensure the entire cluster stays balanced, ensure there are no hotspots. This is the job of the coordinator process. It coordinates the nodes within the cluster to cooperate with each other and assigns data sets to different nodes as your cluster’s shape varies. In this case, Druid’s unit of data storage is a segment file. Think about it like a really fancy zip file that contains a lot of data. And as the cluster is scaling up and down or changes in shape, the coordinator is what assigns or loads those segments around the cluster to ensure balance.
[00:14:14.210] – Reena Leone
I’ve never heard of a zip file described as fancy. So that’s the first one. Thank you, Will, that makes a lot of sense. Okay, so let’s dive into now that we’ve established what the coordinator is, for my own knowledge, let’s talk about smart segment loading. What does it do?
[00:14:28.660] – Will Xu
Yeah, as you can imagine, for a really large cluster, let’s say you have 1000 nodes and your data is being distributed in those fancy zip files. You can have millions of those zip files, and moving them around is quite tricky and complicated because it requires copying the data, it uses a lot of network bandwidth, and in a large cluster the nodes are ephemeral. So, for example, at any given time, any node can fail. And you can add nodes, you can remove nodes, your data shape can change. The traditional management of this in a lot of clusters is highly manual. So there’s a cluster administrator or there’s a DevOps team whose job is building a system that monitors the cluster and does the balancing through scripts or other management processes. The coordinator is trying to automate that job away so that the cluster will auto balance itself. However, it has some limitations. For example, it will say, hey, node A, you just came up, can you load XYZ file? And then the loading instruction gets into a queue, and then loading takes like an hour or two because there’s a lot of data.
[00:15:48.590] – Will Xu
Now what happens in the middle if node A decides to quit for whatever reason? Today that queue doesn’t get canceled. That queue sort of sits there until it finishes, and then a second node comes up to take over node A’s responsibility. Now the entire queue needs to be redone by the second node. And as you can imagine, as the cluster complexity increases, as you add more nodes, there are a lot of instances where this will introduce instability. Either the data is being under replicated or over replicated, or we’re seeing hotspots in the cluster where certain nodes are holding on to too much data. And then the impact from the end user perspective is you’ll start to see query slowness, you’ll see query timeout failures. And under extreme scenarios, the cluster might fall over because it goes into a state where it basically gets completely out of balance. And then smart segment loading is really smart. It’s basically the second incarnation, if you will, of the coordinator system. It accounts for all those factors. It will look at whether a node is healthy, it will look at what jobs are being assigned to each of the workers in the system, or each of the nodes in the system, and make sure no one is taking on too much work or too little work.
[00:17:14.920] – Will Xu
It will ensure the storage stays balanced across the cluster. It will prioritize depending on the availability of data. So it accounts for all those factors and then makes smart decisions to do the cluster balancing operation. And that will ensure the cluster is very stable and ensures the queries are performing well.
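Editor's sketch: the balancing idea described above can be illustrated with a toy model. This is not Druid's coordinator algorithm, just a minimal greedy placement that skips unhealthy nodes and always gives the next segment to the least loaded healthy node. The `Node` class, the node names, and the segment sizes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy stand-in for a data node that holds (segment_id, size) pairs."""
    name: str
    healthy: bool = True
    segments: list = field(default_factory=list)

    @property
    def used_bytes(self) -> int:
        return sum(size for _, size in self.segments)


def assign_segments(nodes, segments):
    """Greedy balancing: largest segments first, each to the least loaded healthy node."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy nodes available")
    for segment in sorted(segments, key=lambda s: s[1], reverse=True):
        # min() returns the first node on ties, so placement is deterministic.
        target = min(healthy, key=lambda n: n.used_bytes)
        target.segments.append(segment)
    return nodes


if __name__ == "__main__":
    cluster = [Node("a"), Node("b"), Node("c", healthy=False)]
    assign_segments(cluster, [("s1", 40), ("s2", 30), ("s3", 20), ("s4", 10)])
    for node in cluster:
        print(node.name, node.used_bytes, [seg_id for seg_id, _ in node.segments])
```

A real coordinator also has to handle replication, in-flight load queues, and nodes that disappear mid-load, which is exactly the complexity the conversation above describes.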
[00:17:37.970] – Reena Leone
I mean, I just said this wasn’t that major of a release, but these things feel like they’re pretty important and it’s nice to see so many new features added and I feel like this happened pretty quickly. But another thing with every release is improvements. Right? So in this one, this is no exception to that. And this included actually schema auto discovery, which we talked about in Druid 26. What’s improved there?
[00:18:02.650] – Will Xu
Yeah, schema auto discovery is an experimental feature we introduced in Druid 26, and in this release we have made some fixes around the edge cases to ensure it’s a little bit more stable. And we consider it generally available in this release, so please do use it in production. For those who haven’t been following our releases very closely, schema auto discovery is a feature that allows you to ingest effectively any data without specifying a schema, and we will also detect the right data types and the right shapes and ingest them to make them queryable. Imagine if you have MongoDB or if you have a Parquet file lying around: you don’t have to do any kind of design. Just say, Druid, load this piece of data, and then the engine takes care of the rest and you see a queryable table coming out of the other end with really high performance. So the main improvements in this release are mostly fixes to ensure we don’t run out of memory, to ensure the indexes are not using too much space. But as we progress, we will continue to invest in this feature, because we see being able to ingest any shape of data with a high degree of flexibility as a pretty good user experience improvement for everyone involved in developing these databases.
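Editor's sketch: in practice, "without specifying a schema" means the ingestion spec asks Druid to discover dimensions instead of listing them. The snippet below builds the relevant fragment of a spec as a Python dictionary. The `useSchemaDiscovery` flag follows our reading of the Druid documentation for this feature; the datasource name `events` and the timestamp column `ts` are hypothetical, and the rest of a full spec (input source, tuning, and so on) is deliberately omitted.

```python
import json


def build_data_schema(datasource: str, timestamp_column: str) -> dict:
    """Return a dataSchema fragment that relies on schema auto-discovery.

    No dimensions are listed; Druid is asked to detect column names and
    types from the incoming data.
    """
    return {
        "dataSource": datasource,
        "timestampSpec": {"column": timestamp_column, "format": "auto"},
        # Assumption: this flag enables type-aware schema discovery.
        "dimensionsSpec": {"useSchemaDiscovery": True},
    }


if __name__ == "__main__":
    # "events" and "ts" are placeholder names for illustration only.
    print(json.dumps(build_data_schema("events", "ts"), indent=2))
```

The only column you still name explicitly is the timestamp, since Druid partitions data by time.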
[00:19:26.070] – Reena Leone
That’s awesome. Let me see, I know there were some other improvements Will that I might be forgetting… What else was in this particular release?
[00:19:33.760] – Will Xu
Yeah, so there are a few other interesting pieces. They are all related to improvements in platform compatibility. So what that means is today Druid will work great on Graviton instances or Arm instances that are supported on AWS today. We’re ever expanding the underlying hardware platforms that we can support. The more interesting thing in the platform improvements that we’re doing, and please do keep an eye on this, is we are moving towards a world where we will do native integration with Kubernetes, using the Kubernetes scheduler and auto scaler for doing data processing and ingestion.
[00:20:13.890] – Reena Leone
Oooo that’s exciting.
[00:20:16.070] – Will Xu
It’s absolutely awesome. Yeah. The naming of it is very confusing in the community. As always, the community gave this feature the name Middle Manager-less ingestion or Middle Manager-less data processing. I think it’s one of those two. What the system does is, today Druid has a process called the Middle Manager. The Middle Manager schedules processes to do data processing. And this is great if you’re on Amazon or if you’re on Google Cloud. The Middle Manager will reach out to the cloud provider and then coordinate with the cloud provider to do things: hey, add a few more virtual machines or add a few more containers so that I can run my tasks successfully. There’s a whole system built around this, but there’s a lot of complexity. It only works with specific cloud vendors under specific configurations. It’s really hard to maintain. So this is not something that we see a lot of people successfully adopting. However, because Kubernetes has been getting ever more popular, pretty much every single deployment cluster that we see Druid running in has some sort of Kubernetes fabric sitting either below it or beside it. And Kubernetes has a great built in scheduler for managing jobs.
[00:21:43.170] – Will Xu
So the community came around to say, why don’t we just use that system for job scheduling? Why don’t we leverage the existing auto scaling infrastructure provided by Kubernetes to solve this problem? And that’s precisely what the community did. We took out the Middle Manager. Now the Druid task will be scheduled directly against the Kubernetes fabric, and then the fabric itself will figure out all the scaling, the resource assignment, and all the other operational logistics of running a Druid ingestion process, or data processing process in this case. This feature is still considered experimental in the community, even though we do see quite a few large customers starting to adopt it. And please keep a close eye on this, because we’ll definitely be maturing this and getting it to generally available quality in the next few releases.
[00:22:36.610] – Reena Leone
Actually that is a really good segue to a question. I mean, I don’t mean to put you on the spot, but can you give us a preview of what is on the roadmap and kind of what’s to come maybe in 28? If it’s too soon, that is totally okay too.
[00:22:52.550] – Will Xu
Yeah, absolutely. So one of the exciting things I guess we didn’t really have a chance to talk about yet is that this year Druid actually has a community roadmap. If you go to GitHub, at github.com/apache/druid, this is where the project homepage is, and if you search under the issues you will see a community roadmap, and the community roadmap talks about everything that the community wants to do in 2023. And what are some of the interesting things on the community roadmap? I can talk about those. The community roadmap is divided into a few categories. There’s the query language: how do we want to improve our compatibility with SQL? The second piece is the query engine: how do we improve the processing engine itself? The third bucket is what we call data management, and it has to do with how to get data in and how to make sure it stays there. So for example, a lot of the coordinator changes that we’ve been talking about fall into this bucket. And then lastly, operations and platform support. So things like the Kubernetes runner and the Java support fall into the last bucket. Those are the four big buckets that we’re going to work on in the community this year.
[00:24:08.940] – Will Xu
So what are some of the exciting things that are coming on the roadmap? On the query language side, we are aiming to add window functions.
[00:24:17.290] – Reena Leone
Yay, window functions. That’s another one, along with cold tier, that everyone was very excited about.
[00:24:25.080] – Will Xu
Yes, absolutely. Yeah. Besides asynchronous query and querying from deep storage, window functions are absolutely awesome. As you can imagine, Druid is a real time database. Everyone’s sending event data, and to stitch event data together you need some windows, because you can’t look back forever. And the time window you’re looking at is the window function’s window, and it’s going to really make a lot of people’s lives easier. So you can do things like cumulative sum. It can be as simple as you have a sensor on top of a door and you are counting how many people are walking through, and then you want to say, on a per day basis, how many people are walking through. That’s a cumulative sum of the entering-the-door event. In this case, window functions will solve that use case very easily. Besides window functions, let’s talk about one exciting thing in the query engine portion. In the query engine portion, we are aiming to improve how the database performs against really high cardinality data sets. Today, Druid has reasonable performance using a lot of techniques around approximation to do large high cardinality Group Bys. But we have a lot of use cases and we’re seeing a lot of users who want to do exact Group Bys on really high cardinality columns.
[00:25:45.300] – Will Xu
For example, I want to do analysis of network traffic flow, or I want to do analysis of users’ conversion journey behaviors, which requires some kind of exact Group By computation. And this is one of the core engine areas that we are going to revamp to make sure it’s really, really high performance.
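Editor's sketch: the door sensor example from the window function discussion above amounts to a running count of entry events that restarts each day. The Python below computes that directly; in standard SQL the same idea is usually written as something like `COUNT(*) OVER (PARTITION BY day ORDER BY event_time)`, though how Druid will expose it was still roadmap work at the time of this episode. The function name and the sample timestamps are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def running_count_per_day(event_times):
    """Return (timestamp, running count for that day) for each door event.

    Events are processed in time order and the count restarts at 1 on each
    new calendar day, which mirrors a window partitioned by day.
    """
    counts = defaultdict(int)
    result = []
    for ts in sorted(event_times):
        counts[ts.date()] += 1
        result.append((ts, counts[ts.date()]))
    return result


if __name__ == "__main__":
    # Hypothetical door sensor readings.
    events = [
        datetime(2023, 8, 1, 9, 0),
        datetime(2023, 8, 1, 9, 5),
        datetime(2023, 8, 2, 8, 0),
        datetime(2023, 8, 1, 17, 30),
    ]
    for ts, count in running_count_per_day(events):
        print(ts.isoformat(), count)
```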
[00:26:05.590] – Will Xu
On the data management portion, besides everything that we’ve talked about on schema auto discovery and the ability to do smart segment assignment, one other important thing that we want to do this year is to support Spark as a tool to either load data into or read data from Druid. As you can imagine, a lot of big data shops have adopted Spark as their standardized ETL solution, and we definitely want to embrace that and make sure people can leverage their existing tools effectively with Druid as a database. And then on the operations side, before the end of the year, I would imagine the highest impact work that we’re going to focus on is going to be bringing the Kubernetes based ingestion to fruition and making that generally available. So those are the exciting things that are coming on the roadmap.
[00:27:00.620] – Reena Leone
And those are definitely very exciting. And also, like Will said, this is all publicly available on the Druid GitHub, and I will link to that as well if anyone wants to take a look, or even better, if you want to get involved and help bring these things to light. I mean, this has been a great conversation. Thank you, Will, for joining me and talking me through everything in the Druid 27 release. It’s been wonderful to finally get you on the show.
[00:27:26.580] – Will Xu
Absolutely. Yeah. I’m excited to see what the community is able to bring to us in the next few releases. I’m really excited for the rest of the year as well as next year.
[00:27:36.280] – Reena Leone
Speaking of Apache Druid, if you want to learn more about it, head over to druid.apache.org. And if you want to learn about what we’re up to here at Imply or to check out Will’s blog on the Druid 27 release, please visit imply.io. Until next time, keep it real.