Inside Apache Druid 29.0: Getting up to Speed on Druid’s Performance, Ecosystem, and SQL Compliance with Sergio Ferragut
On this episode, we’re back with yet another Apache Druid release – Druid 29.0.0! Sergio Ferragut, Senior Developer Advocate at Imply returns one last time to walk us through improvements and enhancements to Druid’s performance, ecosystem, and SQL standard compliance.
We dive into new capabilities in Druid 29.0, such as EARLIEST / LATEST support for numerical columns, system fields ingestion for enhanced batch ingestion, and additional support for arrays, including UNNEST and JSON_QUERY_ARRAY. And it wouldn’t be an open source project without community-contributed extensions: Spectator Histogram and DDSketch have been introduced in this release, offering efficient quantile calculations and support for long-tailed distributions.
Listen to this episode to learn more about:
- Multi-stage query engine (MSQ) enhancements, including Azure Blob Storage and Google Cloud Storage support for fault tolerance
- How PIVOT / UNPIVOT expands on SQL coverage for Druid
- Ingesting directly from Delta Lake and querying Delta Lake tables directly from Druid SQL
Learn more
About the Guest
Sergio Ferragut is a database veteran turned Developer Advocate. His experience includes 16 years at Teradata in professional services and engineering roles. He has direct experience in building analytics applications spanning the retail, supply chain, pricing optimization and IoT spaces. Sergio has worked at multiple technology start-ups including APL and Splice Machine, where he helped guide product design and field messaging. He joined Imply as a Developer Advocate in 2021 to help and to learn from the Apache Druid open source community.
Transcript
[00:00:00.250] – Reena Leone
Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data, and analytics, but way beyond your basic BI. I am talking about analytics applications that are taking data and insights to a whole new level.
[00:00:18.360] – Reena Leone
And we are back after a little break with Apache Druid 29. I swear I didn’t plan it this way, to go from release to release, but here we are. Blame it on the Druid community cranking out the releases. This time around, there were 350 commits and 67 contributors. So thank you, guys. Thank you for the work that you do in making Druid what it is. Through the last few releases, across features, improvements, and the roadmap, there are three themes, I’d say, that have emerged: performance, ecosystem, and SQL standard compliance. And as we talk through the main features, MSQ enhancements, extensions, et cetera, they’ll align to at least one of those. But I don’t want to spoil the episode by getting ahead of myself. So let me introduce our guest today.
[00:01:07.190] – Reena Leone
Joining me returning again is Sergio Ferragut, senior developer advocate at Imply. Sergio, welcome back to the show.
[00:01:14.940] – Sergio Ferragut
Hi Reena, thanks for having me.
[00:01:16.820] – Reena Leone
Awesome. So, okay, Sergio, you’ve been here, we’ve talked a lot, especially about joins. I swear I’m not going to talk about joins this show, but since this is your first time covering a Druid release with me, I would love to get a little bit of an idea of your history with Druid, especially since you’re in the community talking to folks every day. So refresh my memory: which was the first version of Druid that you used?
[00:01:46.130] – Sergio Ferragut
I started two and a half years ago and the release was 0.18.0.
[00:01:53.110] – Reena Leone
Yeah, before we moved the decimal point.
[00:01:55.410] – Sergio Ferragut
Right, before we moved the decimal point. And the primary things that have changed since then: there’s been a huge development around SQL-based ingestion and another query engine called MSQ. That’s one of the major changes that occurred; it first appeared in [Druid] 24. And there’s also been a lot of evolution of Kubernetes support, particularly the elasticity of ingestion with Kubernetes-jobs-based ingestion, which gives you full elasticity of tasks.
[00:02:35.100] – Reena Leone
We love talking about Druid and Kubernetes on this show, as I’m sure you are well aware. All right, speaking of things that we love to talk about on this show: Druid releases. Okay, so here we are, we’re at 29. This is the first one of 2024, and I think I’d like to start off with something you mentioned, the multi-stage query engine. We had some features and improvements related to it, since that has been a big part of this new Druid since 2022. In this release, I know we have introduced support for Azure Blob Storage as well as GCP blob storage as options to enable fault tolerance on Azure and GCP clouds, which is cool, since when MSQ launched, I think it was only AWS S3, right?
[00:03:23.310] – Sergio Ferragut
Right, that’s all you could do when MSQ launched. MSQ has the ability to use local storage for intermediate files because it works in multiple stages, but if you lose a node, you wouldn’t be able to pick up from a previous stage. So it initially had the option of storing the intermediate files it works on in S3, in order to provide checkpointing, higher availability of those jobs, and recovery of those jobs. Now that’s been expanded to also support Azure and GCP. So that’s good news for Azure and GCP users.
[00:04:02.690] – Reena Leone
Awesome. Okay, so sticking with MSQ for a minute, and going back to one of the key things I mentioned at the top of the show, ecosystem. Like, what is an open source technology without an ecosystem? One thing that we’ve added in regards to that is export, which is experimental right now, but hopefully GA in the future. What can you tell me about that?
[00:04:25.750] – Sergio Ferragut
Right, so the data that gets ingested into Druid, typically there’s some value added to it, right? There’s some aggregation occurring with streaming data, or there are some transformations in batch ingestion, or even in streaming as well. So the data is enhanced in some way by coming into Druid. And there are many downstream systems that can benefit from those data transformations, or even additional transformations at query time. So what export allows you to do is parallelize the execution of a query using the multi-stage query engine and export the results to external files on an S3 bucket, as an example. Right now, like you said, it’s experimental. The first format that it supports is CSV, but we’re looking at adding support for other file formats so that results can be exported to Parquet files, ORC files, and so forth.
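For readers following along in text, here is a minimal sketch of what an export statement can look like with the experimental feature, based on Druid 29’s syntax for writing query results to an EXTERN destination. The bucket, prefix, datasource, and column names are placeholders, and the export destination typically also has to be allowed in cluster configuration:

```sql
-- Minimal sketch of an MSQ export to S3 as CSV (experimental in Druid 29).
-- Bucket, prefix, datasource, and column names are placeholders.
INSERT INTO
  EXTERN(S3(bucket => 'example-bucket', prefix => 'druid-exports/channel-daily'))
AS CSV
SELECT
  TIME_FLOOR("__time", 'P1D') AS "day",
  "channel",
  SUM("added") AS "total_added"
FROM "wikipedia"
GROUP BY 1, 2
```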
[00:05:22.690] – Reena Leone
I should also mention that if you have a file format that you are interested in, you should let us know, you should talk to the community, you should join us in Slack. Let us know what should be included.
[00:05:36.970] – Reena Leone
Speaking of Excel-related terminology, another new feature is my favorite one to say: PIVOT / UNPIVOT. What can you tell me about that?
[00:05:47.420] – Sergio Ferragut
So this goes along the lines of SQL compliance, right, and enabling other kinds of SQL operations. PIVOT and UNPIVOT are SQL operators that change the shape of the data. PIVOT allows you to take data that’s in rows and convert it into columns. So if you have, like, a very long, thin report, you can instead reorganize the data so you have more columns and a shorter report, essentially by taking the values of rows and pivoting them into columns. UNPIVOT is the opposite, right? You may have multiple columns that you want to operate on in a single way. Maybe you want to sort the values of multiple columns and then look at the sorted set as a single column. So it allows you to take multiple columns and convert them into rows by merging those values into a single column and therefore having more rows.
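As a rough illustration, here is a hedged sketch of both operators over a hypothetical "sales" table with city, quarter, and revenue columns; the tables, values, and aliases are invented for the example, so check the Druid 29 docs for the exact grammar:

```sql
-- PIVOT: turn the values of "quarter" into columns (wider, shorter result).
SELECT *
FROM "sales"
PIVOT (SUM("revenue") AS "revenue" FOR "quarter" IN ('Q1' AS "q1", 'Q2' AS "q2"))

-- UNPIVOT: fold the q1/q2 columns back into (quarter, revenue) rows
-- (taller, thinner result). Assumes a table that already has those columns.
SELECT *
FROM "quarterly_sales"
UNPIVOT ("revenue" FOR "quarter" IN ("q1", "q2"))
```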
[00:06:43.110] – Reena Leone
This is like such a dorky thing, but I get phrases stuck in my head and I just say them around my house and one of them is me just going pivot, unpivot, pivot unpivot, pivot. And so now I can’t hear it or see it written out without my head going pivot, unpivot, pivot, unpivot.
[00:07:04.390] – Sergio Ferragut
That’ll become our slogan.
[00:07:07.210] – Reena Leone
I’m trying not to create taglines here. Okay, all right. Staying on task, staying on task. Another addition to Druid 29 is earliest/latest support for numerical columns. I think previously you were limited to using numerical last/first aggregators at query time only, but now folks can use them on the ingestion side as well, which is super cool. What else can you tell me about this?
[00:07:34.160] – Sergio Ferragut
Right, so to describe that a little further: yes, earliest and latest are aggregate functions, right? You use them in a GROUP BY, and they’ll give you either the earliest value of a given column or the latest value of a given column. Now, doing that at query time means that you’re going through all of the raw data. If you’re doing rollup ingestion, in the past you couldn’t use earliest and latest on numeric columns and store the rolled-up version of that, but that’s been enhanced now. You can use earliest and latest at ingestion time, which will store a data structure, a partial aggregation, so that you can then aggregate further at query time. But it does let you reduce rows significantly by enabling rollup at ingestion.
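To make that concrete, here is what the query-time side looks like on a hypothetical datasource with a numeric column; per the release, the same aggregators can now also appear during rollup ingestion, where Druid stores the partial aggregation:

```sql
-- EARLIEST / LATEST on a numeric column at query time:
-- the first and last "added" value (by __time) for each channel.
SELECT
  "channel",
  EARLIEST("added") AS "first_added",
  LATEST("added")   AS "last_added"
FROM "wikipedia"
GROUP BY 1
```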
[00:08:24.490] – Reena Leone
So I hear you mentioning ingestion. And another one I think you and I actually talked about recently was system fields ingestion. How does this change ingestion for folks?
[00:08:35.150] – Sergio Ferragut
Well, it actually enhances batch ingestion quite a bit, right? Because particularly in data lakes, where Parquet file formats are common, the data is partitioned along potentially multiple dimensions. And what that ends up building is folders within an S3 bucket, as an example, as a storage mechanism. And the subfolder names themselves carry column values. So a folder says something like customer ID equals four: if you’re partitioning on customer and you have a partition per customer ID, you’ll have a folder that says that. And then under that you may have another folder that says, I don’t know, sales region, whatever other column you’re partitioning the data on. And that’s used for pruning at query time on data lakes, right? So it helps that. But when you’re ingesting from there into Druid, you normally don’t have access to those fields if you’re just reading the files. So that forces people to transform the data and include those columns in the data files before ingesting into Druid. With this new feature, which is called ingesting system fields, you get access to the full path and the file name of the files that you’re ingesting, even if you’re ingesting a set of files, so that you can transform those as part of your batch ingestion and extract the customer ID or the region or whatever fields are in the path.
[00:10:05.820] – Sergio Ferragut
So this completely avoids the need to transform the data ahead of time. You can directly ingest these files and grab those additional columns from the folder names. So it becomes very useful for ingesting from Delta Lake or from any data lake.
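Here is a hedged sketch of what that can look like in SQL-based ingestion. It assumes the input source accepts a "systemFields" property and exposes a "__file_path" field; the bucket layout, regex, and column names are placeholders, so check the Druid 29 docs for the exact field names your input source supports:

```sql
-- Sketch: pull a partition value (customer_id) out of the file path while
-- ingesting Parquet files from S3. Paths, columns, and the regex are placeholders.
REPLACE INTO "sales" OVERWRITE ALL
SELECT
  TIME_PARSE("ts") AS "__time",
  REGEXP_EXTRACT("__file_path", 'customer_id=([0-9]+)', 1) AS "customer_id",
  "amount"
FROM TABLE(
  EXTERN(
    '{"type":"s3","prefixes":["s3://example-bucket/sales/"],"systemFields":["__file_path"]}',
    '{"type":"parquet"}'
  )
) EXTEND ("ts" VARCHAR, "amount" DOUBLE, "__file_path" VARCHAR)
PARTITIONED BY DAY
```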
[00:10:22.270] – Reena Leone
That’s great, because I know we focus a lot on streaming data, right, and things coming in from streaming, but so many folks are using Druid with batch, so it’s nice to see that that has been added. Okay, we were talking about Delta Lake. We were talking about ecosystem support, kind of going back to that. In Druid 29, we’ve added support for sourcing data from Delta Lake. Why is this important? Why does this matter?
[00:10:52.680] – Sergio Ferragut
Well, we’re trying to look for more sources of data, right, and to fit better into different data ecosystems. So we introduced [Apache] Iceberg [support] in [Druid] 28, and in 29 we’re introducing direct Delta Lake access. And given that MSQ is used for asynchronous queries as well, it actually enables not just ingestion from Delta Lake, but also direct queries that can use portions of data that are in Druid and also access Delta Lake directly to resolve queries. So it gives you a query engine to access Delta Lake and ingest data from it.
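As a rough sketch of what that direct access can look like: this assumes the Delta Lake extension is loaded, the table path and columns are placeholders, and the input-format argument is a guess, since the delta input source reads the table’s Parquet files itself; the exact EXTERN arguments may differ, so verify against the extension’s documentation:

```sql
-- Sketch: reading a Delta Lake table through EXTERN with the "delta" input source.
-- Table path and column names are placeholders.
SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"delta","tablePath":"s3://example-bucket/delta/events"}',
    '{"type":"parquet"}'
  )
) EXTEND ("event_time" VARCHAR, "user_id" VARCHAR, "amount" DOUBLE)
LIMIT 10
```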
[00:11:34.750] – Reena Leone
Okay, shifting gears. I feel like one thing we haven’t mentioned so far is arrays, which I feel like we’ve been talking about for a few releases now. Each subsequent release seems to have additional functionality and support for them, which is super cool. And 29 is no different. Let’s take a minute to talk about UNNEST and JSON_QUERY_ARRAY. Can you explain those a little bit?
[00:12:03.580] – Sergio Ferragut
Sure. So in general, you’re right, arrays have been evolving in Druid over the last few releases, right? Some array-type functions were added on the query side, I believe in 27, and some other functionality in 28. But what’s happening in 29 is that we can now ingest arrays directly. So if you have arrays in your source data, you can ingest arrays. This gives you the ability to index the individual values of the array, so you can filter on any of the values contained in the array data type very efficiently, like Druid is known to do, and also filter on the whole value of the array. Say the array is a list of, let’s say, cat names, and you’re looking for a particular person who has three cats named Joe, Mary, and Bob. You can find the person who owns those three very efficiently, because the full values of the arrays are also indexed automatically. So that’s one side of it, right?
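Staying with the cat example, here is a hedged sketch of those two kinds of filters on an ingested ARRAY column; the "pet_owners" datasource and its columns are invented for illustration:

```sql
-- Filter on elements inside the array: owners whose "cat_names" array
-- contains all of Joe, Mary, and Bob (element-level index).
SELECT "owner"
FROM "pet_owners"
WHERE ARRAY_CONTAINS("cat_names", ARRAY['Joe', 'Mary', 'Bob'])

-- Filter on the whole array value: owners whose array is exactly that list,
-- in that order (the full array value is indexed too).
SELECT "owner"
FROM "pet_owners"
WHERE "cat_names" = ARRAY['Joe', 'Mary', 'Bob']
```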
[00:13:11.740] – Reena Leone
Someone with those cat names is listening to this right now. How did they know that? Where are they getting their data from?
[00:13:20.450] – Sergio Ferragut
And the other side of it is we’ve also added support for nested arrays of objects. So we can now have arrays of JSON objects and use UNNEST to essentially expand a row that has an array of objects into all of the individual objects and query the individual fields within those objects as part of that unnest. So that enables a bunch of interesting use cases.
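Here is a hedged sketch of that pattern, assuming an "orders" datasource with a nested "order_json" column whose "$.items" path holds an array of objects; all names are placeholders:

```sql
-- Expand an array of JSON objects into one row per object, then read
-- individual fields out of each object.
SELECT
  o."__time",
  JSON_VALUE(item, '$.sku') AS "sku",
  JSON_VALUE(item, '$.price' RETURNING DOUBLE) AS "price"
FROM "orders" AS o
CROSS JOIN UNNEST(JSON_QUERY_ARRAY(o."order_json", '$.items')) AS items(item)
```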
[00:13:50.500] – Reena Leone
Awesome. You know what’s interesting about this release? We actually had a couple of extensions contributed. And one might sound familiar because it was contributed by Ben Sykes, who’s a software engineer at Netflix and has been a speaker at Druid Summit the last few years. It’s called Spectator Histogram. I actually got a chance to speak to him directly about it and ask how this is different from, say, DataSketches, and it really comes down to storage and using way less memory during your queries and ingestion. But I think it’s better to use his words; this is what he told me. He said, “We can effectively say that we can store more data in faster storage with the same cluster footprint.” So from that perspective, if you’re using a lot of percentile queries, then it’s more performant. As I was trying to understand Spectator Histogram a little more, that’s, like, a little bit more in the weeds. What can you tell me about this extension?
[00:14:50.290] – Sergio Ferragut
So Ben Sykes is great, right? He probably gave you a better answer than I can give you. What I understood of the work that Ben did is that, yes, it’s a different approach to doing quantile sketches, which is way more efficient in terms of memory footprint and storage footprint, but it has a few limitations. It can only use positive integer values as its data. So if you are doing a quantile data sketch that is only using positive integer numbers, it should be a no-brainer to switch over to Spectator Histogram, because it’s going to be more efficient, and it provides pretty much the same levels of accuracy that the data sketch does. So it should be a no-brainer, like I said. That’s it. That’s my answer. It’s probably not as good as Ben’s.
[00:15:50.470] – Reena Leone
Yeah, well, he’s done some really detailed documentation about it, and, you know, he’s been presenting the evolution of this and how they’ve been using it at Netflix over the last couple of Druid Summits. So I’ll make sure to link out to those videos where he really goes into detail. It’s awesome to see this contributed to Druid as an extension in this release. So that’s pretty exciting. But you know what? That wasn’t the only one. We had another community-contributed extension. Is it DDSketch? Is that what we’re calling it? DDSketch? That’s what we’re calling it, which makes sense because it comes from the folks at Datadog. So how does this compare to quantile sketches?
[00:16:35.150] – Sergio Ferragut
This one is for a specific scenario, right? And what I understood from what I’ve read, I haven’t actually used it myself, is that it does particularly well at calculating quantile values on long-tailed distributions. Long-tailed distributions are ones where most of the data is accumulated at the lower values, and then there’s a long tail: there are higher values, but very few of them, towards the end. So it has a very long tail. And quantile approximation algorithms tend to struggle with such skewed distributions. So this one is particularly good at dealing with that kind of skewed distribution.
[00:17:15.780] – Reena Leone
I should just mention this now: thank you again to the community for all of your contributions, all of your commits, all the work that you do for every release, and for really cranking them out. Open source Druid would not be in the place that it is now without a dedicated community of engineers helping to make it the best real-time analytics database that it can be.
[00:17:40.690] – Reena Leone
Okay, I think. Have I missed anything? Sergio, is there anything else that we should touch on for this release? Off the top of my head, I feel like we’ve gone through kind of all the major points.
[00:17:53.640] – Sergio Ferragut
I believe we have. Yes, at least the major ones. Like you said, there’s a ton of commits, so there’s a lot of fixes and minor adjustments to things that we’re not touching on. But yeah, those are the major changes in Druid 29.
[00:18:09.870] – Reena Leone
I mean, if people want it, I will go through the release notes on an episode, but I feel like that’s enticing content, right? Here we go: this is going to be a three-hour podcast. I’m going in detail. How much time do you have?
[00:18:28.000] – Reena Leone
Okay, since I have you here and you’re part of the Dev Rel team with me, I think I want to wrap this episode up, which we’ve done before, but just reminding people where they can learn more about Druid 29 or just Druid in general. So if someone is totally new and they’ve stumbled upon this podcast, where should they go to learn about this stuff?
[00:18:50.240] – Sergio Ferragut
We’ve got some basic courses on Druid to start with on the learn.imply.io website, where you can take hands-on courses using Druid. The other thing that we’ve been working on, and continue to work on release by release, including for Druid 29, is the learn-druid [Github] repo. It’s a repository based on a Docker Compose that brings up Kafka, Druid, Jupyter notebooks, and a data generator, and you can use that environment with the notebooks it provides to go into a lot of detail about how to do different ingestions, how to do different kinds of queries, where to use approximations, and how to use approximations. So we’ve got tons of examples of both ingestion and query in there, including most of the new features that we’ve talked about here today.
[00:19:44.380] – Reena Leone
Awesome. And then what if someone has used, say, Druid 0.18.0 and hasn’t used another one since, but wants to learn more about what’s new? Where should they go?
[00:19:56.700] – Sergio Ferragut
Well, like I said, the learn-druid repo is probably the best place because that’s where we’re putting all of the examples from all of the releases. So we’re trying to build a comprehensive learning environment in the learn-druid repo that includes functionality from all of those releases up to the latest. And a part of what we do in Dev Rel is keep that repo up to date.
[00:20:19.210] – Reena Leone
I’m not sure if I’ve asked you this before, but what’s one piece of advice you’d give to someone looking to get started with Druid that maybe you wish someone told you upfront when you started?
[00:20:31.260] – Sergio Ferragut
Well, when I started, batch ingestion wasn’t even SQL, right? It was JSON. So I’d say don’t even go for the JSON. If you’re doing batch ingestion, just go to SQL. If you’re a database guy like me, SQL is your main language, and the SQL-based ingestion functionality in Druid is phenomenal. It really allows you to do all sorts of transformations at ingestion time, access external files, do joins, all sorts of stuff that lets you transform the data before it hits Druid.
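For anyone who wants to see what that advice looks like in practice, here is a minimal sketch of SQL-based batch ingestion reading an external file and transforming it on the way in; the URL and column names are placeholders:

```sql
-- Minimal SQL-based batch ingestion: read an external JSON file, transform
-- at ingestion time, and write it into a new datasource.
REPLACE INTO "example_table" OVERWRITE ALL
SELECT
  TIME_PARSE("event_ts") AS "__time",
  UPPER("country") AS "country",
  "amount"
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/data.json"]}',
    '{"type":"json"}'
  )
) EXTEND ("event_ts" VARCHAR, "country" VARCHAR, "amount" DOUBLE)
PARTITIONED BY DAY
```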
[00:21:01.650] – Sergio Ferragut
The other thing I’d say, and I tend to say this to people: if you’re using open source Druid, or you want to do a PoC on open source Druid, bringing up a Druid cluster for a PoC is not a simple task, right? You will need to configure it and tune it for a particular workload. So if you want to avoid that, I’d say go for [Imply] Polaris. Polaris is our SaaS solution. There are free credits you can use, and you don’t have to worry about setting up Druid; Druid is set up for you, so you can just start ingesting data and querying. So if you’re trying to test the technology, that’s the easiest way to do it.
[00:21:42.560] – Reena Leone
Why does my brain go, “With great power comes a little bit of complexity”? No, wait, that’s not the quote. That’s not it. I’m not going to get in trouble with copyright. Okay, Sergio, it has been amazing to have you back on the show. Thank you so much for joining me, and I’m sure you’ll see us working together in the future on a number of Dev Rel programs.
[00:22:07.440] – Reena Leone
If you’re listening at home and want to learn more about Apache Druid, head over to druid.apache.org or the developer center at imply.io/developer. If you’re interested in learning more about what we’re up to, visit imply.io. There’s also a great Druid 29 release blog by previous guest Will Xu. Until next time, keep it real.