Apache Druid 26.0: Breaking Down Druid’s Latest Release

May 24, 2023
Reena Leone

On this episode, Vadim Ogievetsky, CXO and co-founder of Imply discusses the new features in Druid 26.0: Schema auto discovery, arrays, and shuffle joins. Schema auto discovery simplifies data ingestion by allowing users to define a partial schema and letting Druid automatically discover the rest. Arrays now offer SQL-compatible representation and functions, making it easier to work with array data. Shuffle joins enable the joining of large datasets, overcoming scalability limitations. These features align Druid with other databases, enhancing its compatibility and reducing the learning curve for new Druid users.

Listen to the episode to learn:

  • How schema auto discovery works and how it differs from schemaless
  • How shuffle joins make getting data into Druid simpler, easier, and less expensive
  • How Druid 26 introduces the ability to parse metadata from Kafka messages, making it easier to extract useful information

About the guest

Vadim is one of the original authors of the open source Apache Druid® project and co-founder and CXO at Imply. Prior to Imply he worked at Metamarkets (acquired by Snapchat). Vadim holds an M.S. in Computer Science from Stanford University. At Stanford he was part of the Data Visualization group where he contributed to Protovis and D3.js.


[00:00:00.490] – Reena Leone

Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. And today is exciting bonus episode time because Druid 26 is now available. To talk us through what’s included in this release, I’m joined by someone who goes way, way back with Druid, Vadim Ogievetsky, Apache Druid PMC and co-founder of Imply.

[00:00:34.950] – Reena Leone

Vadim, welcome to the show. I always like to start off with asking people about their journey and their background, like how long they’ve used Druid for, but you are actually one of the first people to use Druid.

[00:00:47.010] – Vadim Ogievetsky

Yeah, well, Reena, thanks so much for having me on the show. I’m very excited to be here. As you said, I’ve used Druid for quite a while, probably in terms of time, at least longer than anybody else has, because I was the first one, and I still remember that day where I was sitting next to Eric, who is kind of like the person that incepted the Druid idea and started the project. And this one time I was sitting next to him in an office and he sent me a query, he sent me a link on HipChat, and I went to it and the link was an API call and it dumped a bunch of data at me and I was like, what is this? What’s going on? And he was like, you just ran a Druid query. Okay, nice. And it all started from there. So it was a beautiful moment. One of those moments that you don’t appreciate how beautiful it is until you look at it. In hindsight, we were lucky enough to.

[00:01:59.050] – Reena Leone

Have Cheddar on the show, and he kind of gave us the Druid origin story, and then we even talked a little bit about what’s coming. But actually, what is already here is what we’re talking about today, which is Druid 26. So from version one to now, I am excited to dive into what’s new with it. The last couple of years, there’s been a tremendous amount of work on Druid, I feel like, to make it easier to use, to improve ingestion and queries, extending the architecture. The multistage query engine has opened up a lot of doors. And then here we are today at 26. What are some of the key features of this release?

[00:02:38.760] – Vadim Ogievetsky

Yeah, well, as you said, there’s been so much work on Druid lately and so much cool stuff getting added. And one of the big things that we’re driving towards with the project is making it so that Druid has a SQL compatible interface to people. So, you know, SQL, you’re familiar with SQL databases, you come to Druid and you have a great level of familiarity. And I think the stuff that’s coming in this release is really in that theme. There’s kind of like two big things that I guess I want to talk about. There’s the set of features around arrays. So there’s the ability to store arrays as top level columns, the ability to unnest those arrays, and then Schema Auto Discovery, that will store arrays for you. So SQL compatible arrays as well, I should add. And then there’s also the ability to do shuffle joins. So joining a large amount of data to another large data set is now possible. And that’s very exciting. And obviously that’s something that kind of is expected from a SQL system. And here we are adding that. Obviously the way you interface with that is with the SQL join keyword. So I’m excited to dive into those things and, I don’t know, where do you want to start?

[00:04:10.720] – Reena Leone

Let’s start with Schema auto discovery and arrays. Let’s kind of talk through that because I feel like that’s one of the big ones. But first, can I ask you, is Schema Auto Discovery the same thing as schemaless?

[00:04:23.110] – Vadim Ogievetsky

I think yes. Whenever you see people say schemaless, they mean Schema auto discovery. I try not to use the word schemaless, and I discourage people from using it because Schema Auto Discovery is cooler than schemaless. I guess let’s jump to that first and then we’ll tie it all back. But yeah, Schema Auto Discovery basically lets you just say, instead of providing an explicit set of columns and saying, these are my columns and these are their types, you can just say, hey, go and detect them. And it’s really useful. It makes ingesting data into Druid really simple. You don’t have to write out your columns or do anything. And those columns could have varying types, including arrays, which we’ll talk about in a little bit, and they’ll just be auto detected. Now, if you’re loading data in batch from files, obviously it saves you time from having to figure out what your columns are, auto-detect them. But the real, real awesomeness comes if you’re loading data from streaming, because there you have a stream, you connect to it once and you’ll be consuming it for years. There is no end date. And that stream will evolve with what columns you have in there, what Schema you have, and what the types of existing columns are.

[00:05:59.510] – Vadim Ogievetsky

And you want the database to be able to automatically discover new fields. And obviously, it’s important for Druid to discover columns because it’s a columnar database: for performance reasons, everything is stored as a column. So we’re not just taking whatever data we ingest from the stream and writing it as is. We actually make it columnar, make it really fast and optimized. But the reason I don’t like the term schemaless is because it kind of implies that you have no Schema at all. But the real feature that’s actually part of Druid is more sophisticated, because you can define part of a Schema, a partial Schema, and then you can say discover the rest. Right? So here’s like these, say, ten columns are really important to me. I have very specific ideas about what their types should be and specific properties, and I want to explicitly define them, and everything else I want to be kind of discovered. Or you can say only those ten. So it gives you the flexibility. I guess schemaless to me evokes the idea that you don’t define anything and you just say discover everything.

[00:07:28.140] – Vadim Ogievetsky

But the real thing that’s released lets you do more. I mean, you can do schemaless, but you can also do what I like to call flexible Schema or Schema Auto Discovery. So if you see any of those terms written, I think they’re all synonyms of each other.

[00:07:43.360] – Reena Leone

But I think the point is that this gives you the opportunity to choose how you want to do it.

[00:07:48.700] – Vadim Ogievetsky

Exactly. You have a dial. Let’s say you have 100 columns. You can define 50 of them, you can define none of them, you can define all of them. Whatever makes sense in your use case, that is what you should do.
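For readers who want to try this dial, it maps onto the `dimensionsSpec` of a native ingestion spec. A minimal Python sketch, assuming the `useSchemaDiscovery` flag described in the Apache Druid 26 documentation (verify the exact field names against your Druid version; the column names here are invented):

```python
# Sketch of a Druid dimensionsSpec that pins down a few important
# columns explicitly and lets schema auto discovery handle the rest.
# "useSchemaDiscovery" is the Druid 26 feature discussed above.

def build_dimensions_spec(explicit_dims, discover_rest=True):
    """Build a dimensionsSpec dict for a native ingestion spec."""
    return {
        "dimensions": explicit_dims,          # columns with explicitly chosen types
        "useSchemaDiscovery": discover_rest,  # auto-detect everything else
    }

spec = build_dimensions_spec(
    explicit_dims=[
        {"type": "string", "name": "customer_id"},  # hypothetical column
        {"type": "long", "name": "status_code"},    # hypothetical column
    ],
)
```

Leaving `dimensions` empty with discovery on is the “discover everything” end of the dial; listing every column with discovery off is the other end.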

[00:08:05.850] – Reena Leone

So you mentioned also that what kind of goes along with this is arrays. Can you tell me a little bit more about that? And then also UNNEST. Those three things kind of all go together, right?

[00:08:16.860] – Vadim Ogievetsky

Oh, yeah, absolutely. Well, so to kind of understand this fully, we need to go back two releases ago to Druid 24, where we introduced the concept of nested data. This is like JSON, arbitrarily nested columns. And obviously when you have that, you effectively have a little bit of Schema Auto Discovery happening within that column by itself. And if you imagine a column just containing JSON values, one of the things that it might well contain is arrays. So it might have an array of strings, or an array of this, an array of that. And we want to store that as arrays. So one of the things that was introduced is this concept of being able to store arrays and then have expressions that return them. And when I say arrays, I mean SQL compatible arrays. Now, I want to pause here and I want to say that Druid for a long time supported the concept of what you and I might call arrays, but it represented it in this special way called MultiValue strings. And the main cool thing about MultiValue strings, the reason Druid implemented it the way it did, was because MultiValue strings, when they just had one value in them, were just strings.

[00:09:55.970] – Vadim Ogievetsky

So they were like a generalization over strings. This is not true for arrays. Like, an array with one value in it is a different thing than just a single value. But it worked. And MultiValue strings, they have advantages and disadvantages but one clear advantage to them, they kind of have this ability to automatically unnest themselves. And this is why unnest is going to become super useful. Because to make arrays behave in this way, where when you operate on the individual values of the array, instead of the whole array as a whole, you need to use the unnest function. So we introduced this concept of arrays in Druid 24, but now you can have arrays as actual top level columns. You can actually have a column in your data, maybe tags, and you store it as an array. And the biggest advantage of arrays is that the way it’s implemented is very, very SQL standard. So this MultiValue thing, even though it has this automatic unnest built into it and this coolness, it’s very hard to deal with from the SQL endpoint. If you access Druid through SQL, it’s going to be like unpleasant to work with.

[00:11:33.090] – Vadim Ogievetsky

MultiValue strings and arrays make it completely SQL compatible there.
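To make the unnest behavior concrete: conceptually, UNNEST turns one row holding an array into one row per array element, copying the other columns alongside each element. A toy Python model of that semantics (illustrative only, not how Druid implements it):

```python
def unnest(rows, array_col, out_col):
    """Yield one output row per element of rows[array_col],
    carrying the remaining columns along with each element."""
    for row in rows:
        for value in row[array_col]:
            out = {k: v for k, v in row.items() if k != array_col}
            out[out_col] = value
            yield out

rows = [{"page": "home", "tags": ["a", "b"]}]
flat = list(unnest(rows, "tags", "tag"))
# one input row with a two-element array becomes two output rows
```

In Druid SQL the same idea is written with the UNNEST keyword against an array column; the Python above just shows what the row expansion means.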

[00:11:38.750] – Reena Leone

Oh, that’s fantastic. Adding arrays in this type of way, do they have additional benefits beyond just SQL compatibility?

[00:11:46.300] – Vadim Ogievetsky

Yeah, they have benefits. I think they can do more. I mean, if you have array data in your input data, then representing it as arrays is more truthful than representing it as MultiValue strings, which is what Druid would have done before. In particular, there’s a concept of an empty array, or you can have your array be null, or you can have an array with a single value of null in it. And those are all different things that we can now represent as different states. And before, those were all kind of munched together into one state, and that was suboptimal. It’s a small thing in the grand scheme of things, but I’m a big stickler for representing things as they are, and it’s just a much neater picture. And couple that with the fact that the way you use it should be very familiar from SQL. We added a lot of array functions that operate on them that, again, should be very familiar to everybody. I think it creates a much nicer picture.
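The distinction Vadim is drawing is easy to see in code. With SQL-compatible arrays, an empty array, a null array, and an array containing a single null are three different values; the old MultiValue representation collapsed them into one state. A small illustration:

```python
# Three values that SQL-compatible arrays keep distinct.
empty_array = []        # an array with no elements
null_array = None       # no array at all
array_of_null = [None]  # an array whose single element is null

# All three are pairwise different values, so a query can tell
# "no tags", "tags unknown", and "one unknown tag" apart.
distinct_states = [empty_array, null_array, array_of_null]
```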

[00:13:08.680] – Reena Leone

Just so I’m clear. So arrays, UNNEST, and then Schema Auto Discovery are all kind of like a total package to help understand and sort your data better and be more compatible with SQL.

[00:13:22.060] – Vadim Ogievetsky

Yeah, I mean, they all work together.

[00:15:08.750] – Reena Leone

Okay, I get you. Yeah. Because sometimes Druid kind of does things in a very Druid way that kind of increases the learning curve when you’re moving from, say, like a different database.

[00:15:19.490] – Vadim Ogievetsky

Well, this whole project of SQL compatibility, it’s taming the Druid way and bringing it more in line with how other databases do stuff and unlocking new features along the way as well. But yeah, I think that’s the core of that project, which is why everything we’re talking about here really fits into that.

[00:15:49.580] – Reena Leone

Which brings me to the next big set of features, which is shuffle joins. Let’s talk about that for a minute, because Druid’s been able to do broadcast joins for a little bit now, but with those, there’s a limit on the scale of data that you can join. How are shuffle joins different?

[00:16:06.780] – Vadim Ogievetsky

Yeah, so Druid has been able to do broadcast joins for some time. I can’t even recall the release when it was introduced.

[00:16:15.690] – Reena Leone

I think it’s 14.

[00:16:17.350] – Vadim Ogievetsky

It was a long time ago. And even before that, Druid had the concept of lookups, which were also internally broadcast. So the thing that a broadcast join is good for is joining a large set of data, your main fact table, to a small lookup style data set. So, for example, maybe you have customer IDs in your main data set and this is how you store them. And you want to denormalize it and get customer names in your data set either instead of or as well as the customer IDs. So that’s a good use case for broadcast joins. But there are many use cases where you want to do a join, but both sides of the join, both the left and the right, are large, so you can’t really take one side and send it to all the servers because it’s not going to fit there. So a shuffle join basically shuffles the appropriate data to the appropriate servers, and no single server ends up seeing the whole data set from either side of the join, which lets us do much bigger joins. So I think one very common example, I’m a very visual person, so I like to paint examples, is imagine you have an auction style data set.

[00:17:56.120] – Vadim Ogievetsky

This is very common in ad tech, for example, where you have auctions and you have bids on the auctions. You’re going to have a lot of auctions, but you’re going to have even more bids because each auction probably garnered several bids. And if you want to do…or in the same realm, you might also have auctions and ads that were served and then clicks, like ads that were clicked. So now you have your auctions, your bids, and your clicks. And those are three big data sets, none of which can be broadcast anywhere. So before, if you wanted to do analytics on top of this data in Druid, you would have had to have another system to preprocess that data for you. You’d have to do this join in some other technology. There’s many things that could have done it for you, but now you don’t need to. You could just put that in Druid and have Druid do that join for you.
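The core idea of a shuffle join can be sketched in a few lines: hash both sides by the join key so that matching rows land on the same worker, then join locally within each worker, so no worker ever holds a full side. A toy Python model (the worker count and row shapes are invented for illustration; the real multi-stage engine shuffles across processes):

```python
def shuffle_join(left, right, key, n_workers=3):
    """Toy shuffle join: partition both inputs by hash(key),
    then hash-join each partition independently."""
    left_parts = [[] for _ in range(n_workers)]
    right_parts = [[] for _ in range(n_workers)]
    for row in left:                       # shuffle the left side
        left_parts[hash(row[key]) % n_workers].append(row)
    for row in right:                      # shuffle the right side
        right_parts[hash(row[key]) % n_workers].append(row)

    joined = []
    for lp, rp in zip(left_parts, right_parts):
        index = {}                         # local hash table per worker
        for r in rp:
            index.setdefault(r[key], []).append(r)
        for l in lp:
            for r in index.get(l[key], []):
                joined.append({**l, **r})
    return joined

auctions = [{"auction_id": 1, "site": "x"}, {"auction_id": 2, "site": "y"}]
bids = [{"auction_id": 1, "bid": 10}, {"auction_id": 1, "bid": 12}]
result = shuffle_join(auctions, bids, "auction_id")
# auction 1 garnered two bids, auction 2 none, so two joined rows
```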

[00:19:07.440] – Reena Leone

And this would happen typically when? At ingestion?

[00:19:11.690] – Vadim Ogievetsky

Yeah, well, so going back two releases, Druid 24 is when we introduced the Multistage Query Framework architecture that basically lets you compute these big, big queries. Since it was introduced, that is what Druid’s SQL-based ingestion is built on top of. So one of the cool newish features of Druid, not in this release, but in a recent release, is that you can specify your ingestion with SQL and it runs in this multistage query architecture. And the shuffle joins are built on top of MSQ. So right now you would do it at ingestion time. So exactly as I said, let’s say if you wanted to use Druid and you had your big auction data set and your bids data set and your clicks data set, and you wanted to calculate a click-through rate, and how many bids you have per auction, you would do that at ingestion with multi-stage query. But one of the things that is kind of upcoming in that world is that we want to make that usable also just at query time, which it is usable today as an experimental feature. Using it for ingestion is fully supported.

[00:20:50.480] – Vadim Ogievetsky

Using it for query is experimental because that API will change. But in the future we will settle on the proper API. And then you could do all of these transformations, including shuffle joins, at query time.
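An ingestion-time shuffle join is expressed as plain SQL submitted to the multi-stage query engine. A hedged sketch of what such a statement might look like (the table and column names are invented; check the Druid SQL-based ingestion docs for the exact REPLACE and EXTERN syntax your version supports):

```python
# A sketch of a Druid SQL-based ingestion statement that performs a
# shuffle join between two large inputs at ingestion time.
# "auctions" and "bids" stand in for existing datasources or EXTERN inputs.
INGEST_JOIN_SQL = """
REPLACE INTO auction_bids OVERWRITE ALL
SELECT
  a.__time,
  a.auction_id,
  a.site,
  b.bid_price
FROM auctions a
JOIN bids b ON a.auction_id = b.auction_id
PARTITIONED BY DAY
"""
```

The statement would be submitted to the SQL-based ingestion task endpoint like any other MSQ ingestion; the join itself is what gets executed as a shuffle under the hood.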

[00:21:11.280] – Reena Leone

Well, that’s cool to hear. And it’s funny, like you mentioned auctions, and wasn’t that kind of the first real use case for Druid? Like, back in the day? Wasn’t it kind of where it came from?

[00:21:22.390] – Vadim Ogievetsky

Yeah, well, you know, I have, as you could probably tell from introduction, I have some nostalgia for the very early days, really fun times. And not that we’re not having fun.

[00:21:34.890] – Reena Leone

Now, but you tell me we’re having fun on the show.

[00:21:37.470] – Vadim Ogievetsky

This is a different kind of fun. And back then, when I ran the first query of Druid, it was on an auction data set. So I felt it was very appropriate to build, to kind of bring an example that brings Druid all the way to its very roots.

[00:21:59.050] – Reena Leone

I feel like doing this show has kind of started me down a path of being like the Druid historian as I keep track of all these milestones.

[00:22:08.210] – Vadim Ogievetsky

Oh, absolutely. Well, yeah, I have a lot of Druid history up here, so maybe I can come on the show again to talk about that.

[00:22:16.600] – Reena Leone

Yes, I would absolutely love that. Okay. We kind of talked about sort of the big things in 26, but are there any more, I don’t want to say minor, I don’t want to downplay them, but are there other features that you’re excited about?

[00:22:30.730] – Vadim Ogievetsky

There’s a whole bunch of features, and I definitely recommend anybody interested to read through the release notes and to not just focus on the highlights, which I kind of covered right now, but to also look at all the little minor things that changed and improved. I want to shout out to one specific feature that I think is interesting, and it also changes a default, which is very exciting always because it means that even people that don’t listen to this podcast or read documentation or do anything, they will experience this whether they want it or not, unless they turn it off. So I guess whether they want it or whether they’re not paying attention.

[00:23:16.110] – Reena Leone

I like that you went from ominous to like a little less.

[00:23:20.170] – Vadim Ogievetsky

So one cool feature that was added in Druid 23 actually, so a few releases ago, was this ability to parse out metadata from Kafka. So if you’re connecting Druid to Kafka specifically, a Kafka message has a payload, but it also has a key, and it can have headers and a timestamp. So it’s really like a clown car in there. And since forever you could read the payload of a Kafka message. In Druid 23, there was this notion of a Kafka input format, a way to parse out all these extra details in a Kafka message. But it was kind of an obscure feature that, unless you specifically went looking for it, you probably wouldn’t have used. We didn’t, I think, make a lot of fanfare about it. It was more like, when somebody needed something in particular, we would tell them about it. But now a change that happened in Druid 26 is with the data loader. So Druid has this web UI, and that is actually the reason why I’m no longer just a user of Druid. I’m also a developer, because that’s where I contribute to the project.

[00:24:49.990] – Vadim Ogievetsky

So Druid has this web UI and it has a point and click data loader. So you can load files or a Kafka topic just by clicking with your mouse on the screen, and you can select having schema discovery on or off. That’s one of the toggles. But one of the interesting things is that if you now connect Druid to Kafka, if you go through the web flow to set up a new Kafka load, by default it will actually use this Kafka input format. So it will be parsing out this metadata for you, which is really neat. I mean, you might have some useful things in there and now it’s just going to appear for you without having to do anything else. And I think that’s something that a lot of users will benefit from and probably won’t even consciously know or realize that something changed. So I love features like that. I love features that just silently make something a little bit better.

[00:26:05.630] – Reena Leone

Well, sometimes you don’t know what you’re looking for until you’re looking at it. Right. If you’re already pulling out all the metadata, you may not have thought to do that before, but now that you can now that those are data points that you want.

[00:26:22.310] – Vadim Ogievetsky

Exactly. Absolutely. And one important thing is that one specific place where people really benefit from this is that Druid really benefits from having some notion of timestamp attached to each event that it ingests. And yeah, you unlock a lot of usefulness if you have that. And usually that comes from the data. But what if your data doesn’t have a timestamp? Well, if your data is coming from Kafka, then Kafka has a timestamp of when your data was inserted into Kafka, like the insert time. And now, because that’s automatically being surfaced into the data, if you don’t have a timestamp in your data, that one will be used as the primary timestamp by default. So it lets users get into a better state without thinking. And that’s exactly why I wanted to shout out this feature. Obviously not at the same level as the arrays or unnest or the shuffle joins, but still a nice to have.
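For reference, the Kafka input format that surfaces this metadata is configured in the ingestion spec. A sketch of its shape, with field names as I recall them from the Apache Druid docs (treat every name here as an assumption and verify against your version’s documentation):

```python
# Sketch of the "kafka" inputFormat that surfaces message metadata
# (key, headers, Kafka timestamp) alongside the payload.
# Field names follow the Druid docs as best I recall; verify per version.
kafka_input_format = {
    "type": "kafka",
    "valueFormat": {"type": "json"},           # how to parse the payload
    "headerFormat": {"type": "string"},        # decode headers as strings
    "headerLabelPrefix": "kafka.header.",      # headers become prefixed columns
    "keyColumnName": "kafka.key",              # message key column
    "timestampColumnName": "kafka.timestamp",  # Kafka insert-time column
}
```

The `kafka.timestamp` column is the one Vadim describes being used as the primary timestamp when the payload itself has none.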

[00:27:26.170] – Reena Leone

Yeah, but it all depends on who’s using what, right? So the little features also are what make Druid so awesome. And to your point, this is something that’s going to be by default, and in a way, if it’s already doing it, it’s making it simpler. It’s one less thing that people have to figure out on their own or do manually.

[00:27:44.910] – Vadim Ogievetsky

Absolutely. That’s exactly that.

[00:27:47.090] – Reena Leone

Is there anything else that we can cover for Druid 26 today, or are we all going to have to just wait for like 27 and beyond?

[00:27:55.170] – Vadim Ogievetsky

Well, I did want to cover one more thing. Just a word of caution for people that maybe listen to this podcast and are just very excited. We mentioned arrays and Schema Auto Discovery and how arrays are different from MultiValue strings. Well, the reason that Schema Auto Discovery ties into arrays is that if you use it, and you have arrays in your data, like, for example, tags for an article, then you will get arrays coming out in Druid, and that’s great. If you’re just starting to use Druid, you’re loading a new data set. That’s perfect. That’s exactly what we want. But if you already have an ingestion that’s set up today and it’s running and it’s doing stuff, and you have arrays in your data today, it’s probably ingesting stuff as MultiValue strings. And if you just switch those to arrays cold turkey, basically, if you just switch to this auto discovery and they become arrays, there might be something, maybe an app, that relies on these columns actually being MultiValue strings and not arrays, and you might have some confusion coming out of that.

[00:29:28.830] – Vadim Ogievetsky

So if you are excited about Schema Auto Discovery and you want to upgrade an existing use case, just check to make sure that your MultiValue strings turning into arrays will be all cool with everything that’s consuming that data. We’re going to detail a migration path for that, but it’s definitely something to be aware of.

[00:29:57.370] – Reena Leone

Well, then that’s good to know. Just so someone’s like, what just happened? Why is this set up this way?

[00:30:03.130] – Vadim Ogievetsky

Absolutely. But otherwise, yeah. Please go and have a fun time with these features.

[00:30:09.910] – Reena Leone

That’s what we always say. Like, the best thing to do is just get into Druid and try it out for yourself. Or if you’re listening to this podcast and you’re new to Druid, go download it. Go try it. You will not have this problem because arrays will be the default for you.

[00:30:24.370] – Vadim Ogievetsky

There you go.

[00:30:25.180] – Reena Leone

Vadim, thank you so much for coming on the show. I’m so excited to finally have you. If anybody listening wants to know more about Druid 26 or anything coming with Druid, please visit druid.apache.org. And if you want to know more about Imply, please visit imply.io. Until next time, keep it real.
