From ANSI SQL Support to Multi-topic Kafka Ingestion: What's New in Druid 28 with Will Xu

On this episode, Will Xu, Program Manager at Imply returns to discuss the release of Druid 28, highlighting the latest updates and features. We delve into the improvements in ANSI SQL support, making the query engine more compliant with the ANSI SQL standard without compromising performance. The release also includes additional Apache Calcite support, allowing for more performance improvements in query planning.

Window function, a Druid wish list item for many users, has been released as an experimental feature in Druid 28. This is also a significant step towards full ANSI SQL compliance, enabling users to perform trend analysis and bucket analysis on time series and event data. Additionally, features such as async queries and queries from deep storage have graduated from experimental to generally available (GA), offering a cost-effective way to query infrequently accessed data. The release also introduces enhancements in handling array data, Schema Auto Discovery, and SQL UNNEST, providing improved support for analyzing complex data sets.

In addition, listen to the episode to learn:

How to streamline streaming ingestion, such as multi-topic Apache Kafka ingestion and concurrent append and replace
How UNION ALL support allows for merging multiple tables together even if there are missing columns
When the 2024 Apache Druid roadmap will be published

Learn more

About the Guest

Will Xu is a product manager at Imply. Prior to joining Imply, his product management career has included roles at Apache Hive, Hbase, Phoenix as well as being the first product manager for Datadog. His experience with data and metrics includes product management for Facebook external metrics & Microsoft Windows data platform.

Transcript

[00:00:00.650] – Reena Leone

Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid real time data and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. And we’ve got one more Druid release before we say goodbye to the monumental year that was 2023. Druid 28 is already here, and as always, super big thanks to the dedication of the Druid community. This release was made possible by over 420 commits from 57 contributors. You all are awesome. Thank you so much for making this happen. Joining me once again to talk about the Druid 28 release is Will Xu, Product Manager at Imply

[00:00:45.770] – Reena Leone

Will, welcome back to the show.

[00:00:47.680] – Will Xu

Excited to be here.

[00:00:48.940] – Reena Leone

Okay, so it seems like we just had you on talking about Druid 27, and here we are at Druid 28. Let’s get right into this release, starting with some of the latest updates to MSQ. One of those things was ANSI SQL support improvements. Let’s start there. Can you tell me about that?

[00:01:04.600] – Will Xu

Yeah, absolutely. So a lot of people out there, they are porting their workload from one database to another database. And one thing that ensures their workload works across different databases is the SQL standard. Or in the industry, people usually refer to ANSI SQL standard. Druid in itself, especially on the query engine side, is fairly compatible with the latest ANSI SQL spec. And what we are doing are some of the fundamental behavior changes to make sure engine is more compliant with the ANSI SQL standard. And some of the changes that we’re doing are around null value support, around tristate logic, and around Boolean support. Now one of the things that you might know about is a lot of database treats null and empty streams differently. So something is empty, isn’t quite empty because for example, a person can come in to register in a user portal and then this person might not have a last name due to cultural differences. And this person registering there without a last name is different than, say, a person didn’t register or didn’t put in last name. So the database somehow needs to be able to distinguish between those two conditions.

[00:02:20.440] – Will Xu

And historically to be able to process queries effectively with good performance, now you need to check two states, whether it’s empty or it doesn’t exist. And that is very cost prohibitive in terms of running expensive queries. And the changes we have done in this release is to make us not only very standard compliant so that we can distinguish between those two different states, but also do not impact any of our performance on the query side. And so when we talk about like ANSI SQL compliant, it’s ANSI SQL compliant without any performance penalty. So in this release we have introduced the default state for Boolean values to be very strict. We have introduced the tristate logic to properly support the no values and then also we have turned on the no value support by default into the production state. And there’s no performance penalty to do this whatsoever. So all those reasons combined to basically allows us to behave very much like any other database that is ANSI SQL standard compliant without the performance penalty.

[00:03:28.670] – Reena Leone

And I think one other improvement I think I remember from the release notes was also additional Apache Calcite support. Is that in there in this release?

[00:03:39.450] – Will Xu

Yeah. So Druid, like many Java based databases, used a library called Calcite to do query planning. And due to various reasons, the Calcite Library in Druid has been stuck on a very old version for a couple of years at this point. And that prevents us from leveraging some of the latest performance enhancements the Calicite library has introduced. So in this release we have updated the Calcite library to latest version. I think it’s 1.35 or 1.37 and that has all the latest and great features for Calcite. So on a tangential side you should be able to see some more performance improvements in terms of query planning. And over the course of next year we will be adding more integration into the Calcite engine to get the most out of the library itself. So this sets up us for success for 2024 for sure.

[00:04:38.110] – Reena Leone

Awesome. As I was going through, one of the things I noticed about this release is that a few experimental features have been added as well. A wish list item that we’ve talked about, window function. What’s the latest on that?

[00:04:51.790] – Will Xu

Yeah, so window function is quite exciting. It brings us one step closer to full ANSI SQL compliance and as part of the ANSI SQL standard, I think it’s like SQL 99 or 206, there’s introduction of Window function and we are fully supporting basically all the Postgres equivalent window function features in this release. As an experimental release. And what that means is for a lot of data sets where you’re looking for trends or you’re looking for bucket analysis for data, this becomes incredibly useful. One example I always give people is let’s imagine in the world where you have sensors on top of doors within the building and then you’re counting how many people are working through the doors. And then you want to understand across the entire day, cumulatively, what is the sensor that has most people walk through the door with and using window function, what you can do is you can bucket the people walking through the door by the sensor ID it has. And then you can say for a given sensor, say on the east side of the building, door number two. This is cumulatively how many people has walked through the door.

[00:06:10.130] – Will Xu

Because Window function allows you to essentially partition your data based on some dimensions of your choice. In this case the sensor ID, and then order the data within, in this case at what time the person has passed through the door and then perform operations on top of this, preordered data, pre production preordered data. And in this case you can basically do cumulative sum of how many people has worked through the door. It is generally very useful for a lot of analysis and it’s especially useful for data that people usually put into Druid, which are time series data or event data because all those data, you need some notion of partitioning or ordering for them to make sense. So not only window function we’re introducing in this release makes it more SQL standard compliant. It also allows a lot of common use cases that people would rather do on Druid today.

[00:07:03.990] – Reena Leone

Awesome. You know what, it’s fun to kind of see that got mentioned a few releases ago and we’ve been doing the show for about a year now and see where things were kind of ideas or things being worked on things on people’s wish lists for Druid. And now to come full circle to have them be added as an experimental feature is really cool to see. Speaking of experimental features, a few features have actually graduated from experimental to GA, like Async queries and queries from deep storage. Again, query from deep storage, another wish list item. So let’s kind of dive into what those look like in Druid 28. Let’s start with async queries and query from Deep storage.

[00:07:45.470] – Will Xu

Yeah, I’m excited to talk about this. We’re finally graduating this to generally available as a feature. Async queries. And querying from deep Storage really allows you to run queries in a task engine mode that’s separated from the Druid current regular query engine mode. And the way it works is you issue a query without having the data to be preloaded or precached. If your data, say, sit somewhere in a three storage bucket and you’re only querying this data once a week or once a month, you don’t necessarily want to pay for the ongoing storage and compute cost for it. We can spin up compute resources on demand and then query those data and then spin those resources down. It’s an extremely cost effective way for querying data and it’s very useful if you have scheduled reports, you have download workloads where people are accessing data infrequently. If you look at the other piece of feature that was built into the Druid engine itself in the past few releases is the native Kubernetes integration. The native Kubernetes integration allows you to spin up pausing Kubernetes on demand and then spin them down without having any persistent resources running.

[00:09:02.110] – Will Xu

So in this case, if you issue a asynchronous query, the asynchronous query will spin up the Kubernetes layer, automatically, find the compute resources, do the computation, spin them down and give the resources back. So it’s extremely, extremely cost effective for infrequently access data. And in this release, what makes this GA is now it supports a popular loading format. That means if you’re downloading hundreds of gigs of data, all of it is not going to crash and burn because everything is bottled on a single node. The system now is designed to handle really large scale result set and it has a lot of other redundancy, a lot of fault tolerance built in to make it function well in various cloud environments. So those are the things that we’ve been doing in the background since the last release, in the last three months. And then at this point we’re pretty comfortable that you can use this well in production kind of scenarios. And please do try it out and let us know.

[00:10:02.260] – Reena Leone

Well, also I should mention kind of the way that we’re talking about this release is a lot of these things work together, right? And build upon each other to make a better Druid. On that note, I’d like to talk a little bit more about arrays actually, because I know that was another thing that was improved upon in this release.

[00:10:23.930] – Will Xu

Yeah, for sure. So if you listen to the previous podcast we’ve talked about, we have made auto type reference schema discovery ingestion. It’s a mouthful to work. What that means is if you have data with complex data in that, we’ll auto load everything without you having to tell us what the data shape looked like, which is great, right? So if you have some random database, you’re like let’s load this into Druid to accelerate it easy, right? Just like SyDAC started to do it in your run and we auto figure out the right typing for all the columns. If it’s a number if it’s float or it’s like a string or whatnot. But one of the things that’s very tricky, what we found is handling arrays. What that means is if you look at a lot of common data out there, social media, for example, for every single post or picture you post, there’s a bunch of hashtags. The list of hashtags you have is essentially a array. Or if you look at like e commerce data, and then each order that user, very commonly there’s more than one item the user is ordering.

[00:11:33.330] – Will Xu

And then that is, an array. Druid historically has a lot of issues and challenges on handling arrays because going all the way back to the start of today’s episode, we talked about ANSI SQL standard compliance is treating empty and nonexistent is very different. So let’s say I ingested some order data. The order data has like no item in it. Is it because the user has submitted an order without an item? Or is it because the user submitted an order where the item’s name is empty? And now we’re able to distinguish between all those scenarios correctly so that your existing data can be ported very easily into Druid. And that makes the whole onboarding and data discovery experience super easy. And if you are a platform owner, you’re looking for integration to say, how do I put Druid into my overall platform workload without a lot of effort? This is one of those features that’s going to make your life super easy.

[00:12:38.610] – Reena Leone

While we’re kind of talking about things that have gone to GA and additional support. SQL UNNEST, I believe, went GA in this release. Is that correct?

[00:12:49.420] – Will Xu

Yes, yes SQL UNNEST went GA in this release as well. SQL UNNEST historically has been used for dealing with precisely array data, which is what we’ve just talked about. If you have a bunch of array data, how do you analyze it in a query? And UNNEST basically allows you to, as the name suggests, unnest the array data and then flatten them into a bunch of rows so that you can do normal operations on them. And then shipping the array discovery in this release as a generally available feature, as well as the UNNEST as a generally available feature basically put things together to allow you to auto discovery, auto ingest and easily analyze array data, which is like out there. It’s a very common shape, like in everyday real world scenarios. And then all those things cumulate together in this release to make that experience good for people who want to analyze those kinds of data.

[00:13:47.900] – Reena Leone

Another cool feature worth mentioning actually is multi topic Apache Kafka ingestion, because it kind of removes the need for dedicated resources in Druid to consume low traffic topics or accept higher ingestion latency to spin up tasks on demand. So how does that exactly work in Druid 28?

[00:14:07.730] – Will Xu

Yeah, so let’s talk about streaming ingestion, right? A little bit.

[00:14:11.730] – Reena Leone

Love that topic. Love that, Will!

[00:14:13.410] – Will Xu

Yeah, absolutely.

[00:14:16.090] – Reena Leone

We talk about it a lot.

[00:14:18.130] – Will Xu

It’s the bread and butter for Druid. A lot of people use Druid for the purpose because it supports you to query data that’s coming from Kafka or Kinesis or like other streams directly. And one thing that we’ve noticed over time is a lot of organization. If they start adopting Kafka as like a streaming platform, they have a few very big and very busy topics where majority of their business data are coming through. But there are a lot of cases where people are seeing lower volume topics in their Kafka environment. Some of them goes to the extreme to say like one or two events a day as a topic. Now Druid originally was designed as a system to maintain really low latency. So that means if your event goes into a Kafka, it will be ingested into Druid almost like instantaneously and made available for querying. And to maintain that kind of latency, Druid needs to basically run a process 24/7 to monitor the Kafka topic as events come in. And then as the events come in, Druid immediately picks up and then spin this out for doing query. But keeping a process up and running 24/7 for a topic where one or two events happen a day is very expensive.

[00:15:44.070] – Will Xu

It’s very cost prohibitive. You’re essentially wasting a CPU core for majority of the day, which are super expensive. So what can we do in this case, right? One option is we can spin the core system, monitor something and then spin something back up that introduce a lot of latency. That means Druid is no longer responsive as a real time system. That’s not very useful. So instead what we’re thinking about is what if we combine all those low volume topics together? So basically, instead of ingesting one topic, one event per day, you’re looking at 1000 topics, and each of them has like one event a day. And now that’s like a substantial amount of traffic in aggregate to make it worthwhile for you to spend like a CPU core monitoring all those thousand topics together. And this is what the multi-Kafka topic ingestion is about. It effectively takes a Reg X expression as the topic name. So anything that we can detect as a topic name we can start loading data from. And then you can combine a bunch of load traffic topics together into a single ingestion task. And then the output can be written to a single destination with a way for you to tell apart, like where topics coming from.

[00:17:03.750] – Will Xu

And what’s cool about this is because not all topics have matching data schema. You will have different data shapes going into different topics. And if you remember from the last time, we have introduced a feature for Schema Auto Discovery. That means if you don’t specify schema, we can auto discover the shape of your schema. And then in this release we have introduced the auto discovery of arrays. So if you combine multi-Kafka topic ingestion together with Schema auto discovery, now you can combine mixed content with mixed topic into data sources where it allows you to do analysis very easily. So this is how things are fit together to give you a much lower cost floor, if you may, for ingesting ready complex data set in the enterprise setting.

[00:17:53.390] – Reena Leone

Oh, you know what? This actually reminds me of one other thing that was in this release. When we’re speaking about streaming, ingestion and query performance is concurrent append and replace. That’s experimental, but that was added in this release, right?

[00:18:07.900] – Will Xu

Yes, absolutely. So as you are thinking about this, right? So Druid is a immutable OLAP data warehouse. What that means is everything is optimized for querying. The way I sort of usually explain this to people is imagine Druid is like a giant engine that does zip files. If you’re familiar with zip files, you have like a thousand tiny files, and then you’re zipping all the files together. And by default, let’s say you have one zip file per day, you have one for Monday, you have one for Tuesday, you have one for Wednesday, et cetera. And then when you query them, we unzip your file, we make it really fast. You process the data because zip files are tiny compared to the raw data itself, which is great right? Now, what if today is Friday? And then you suddenly started having people come in to say like we have some data that came in, just came in on Friday, that’s supposed to be in Monday. This is a very common phenomenon in any kind of like real world applications. People who return some stuff after they order, right? Or people who change their minds or data needs to be reconciliated.

[00:19:24.230] – Will Xu

You always have late arrival data. It’s a very common phenomenon. So how do you deal with late arrival data? Basically, you have to unzip the file on Monday, put all the new files together with the old files and then rezip it together. As you can imagine, like doing this for every single tiny file that’s arriving is very cost prohibitive and it’s near impossible because you constantly might have files arriving. What if in the process you’re trying to rezip the files, some new data has landed? Right, and this is what this feature is about. Like concurrent compaction with ingestion or concurrent compaction with append, which is like we will take care of the handling of this complex scenario where you have late arrival data coming in on constant basis. You’re also trying to unzip and rezip the file so that the query performance is very efficient. The experimental nature of this feature means there are still a lot of issues with it. Right now. There are certain scenario where it doesn’t quite behave well. We’re going to be continued to invest into this and then come back to us in next release. Obviously we will be talking more about this feature.

[00:20:39.710] – Will Xu

By the time that we make this generally available, what this will do is ensure the data files you have within Druid are always optimized in the shape where it has no fragmentation, and then your query performance is really good and it also will allow you to have streaming ingestion coming in in the same way as any other jobs. You are doing data processing within Druid while we’re optimizing the file format. And that’s sort of the underlying nature of the feature. And this goes back to the original topic of everything fits together. If you’re using Kafka topics streaming ingestion, you will have data from multiple topics arriving at different time intervals. And then they will have a lot of different dates where the data are sending and then they will always be like late arriving and then you will end up with fragrant data. But with this feature, essentially you will have your data in a much more optimized shape because we can operate on them while they are still sending. And now you have really good query performance even if you’re using this multi-Kafka topic ingestion.

[00:21:51.230] – Reena Leone

I’m trying to think about all the major points of this release. I feel like we’ve covered most of them. I mean, unless you want to kind of talk a little bit. I guess the last one is kind of UNION ALL that was in this release as well.

[00:22:03.780] – Will Xu

Yeah, so UNION ALL is a very unique thing out of all the features we’ve talked about. In the past, if you want to do any kind of query across different data sources in Druid or any other databases, for a matter of fact, use this feature called SQL UNION ALL. It basically does a combination of the tables so that they feel like as if it’s like a single table. Now, what if you have one table missing a column in this combined version? Right? Historically, Druid doesn’t behave well. Basically it basically says, sorry, you cannot combine those two tables because one of them is missing a column. But that’s sort of like an unrealistic expectation in the real world because when we’re combining tables, there always, always be missing columns. So what this feature does for this release, the MSQ UNION ALL under the hood, actually supports ability for you to query a bunch of tables together even if there’s missing columns and then be able to use them for reindexing purposes. And what that means is you can now use this feature as a way for merging multiple tables together as a single ingestion job. And it’s super useful if you have a bunch of tiny tables you’re trying to merge together.

[00:23:30.160] – Reena Leone

Yeah, we’ve been talking about UNION ALL in relation to unions versus joins when it comes to Druid and how to kind of negotiate those or navigate those within how Druid works.

[00:23:44.140] – Will Xu

Yeah

[00:23:44.860] – Reena Leone

But I was going to say it’s kind of funny. So where coming to the end of the year, I feel like I’m the Ghost of Druid past, present, and now let’s talk about future because this is our last release of 2023. But we already are kind of looking ahead into what’s going to happen for Druid in 2024, which at the time of this release is right around the corner. How is the roadmap shaping up for the next release? And by the way, folks, this will be public! Will, what can you tell me about that?

[00:24:18.290] – Will Xu

Yeah, so we are in the midst of coming together with the community’s input. What will be the roadmap look like for Druid for next year? If you’re interested, please do keep an eye on the Apache Druid GitHub Repo. We will be posting the draft as well as the final version of the roadmap there sometime in the January February time frame for 2024. And then that will be the baseline for what we’d be working towards against for the next year.

[00:24:50.410] – Reena Leone

Will, I think we’ve kind of covered all the major points for Druid 28. This has been great.

[00:25:01.640] – Will Xu

Yeah. I’m super excited for shipping this release together with the help of the team and then we will be continue working towards more future releases with the community.

[00:25:14.610] – Reena Leone

29 and 30 coming up in the new year.

[00:25:18.210] – Will Xu