Apache Druid News! Druid 30.0 is Live and Druid Summit 2024 Announced with Hugh Evans and Will Xu

Aug 01, 2024
Reena Leone
 

On this episode, we go all in on Druid 30.0 with co-host Hugh Evans and returning guest Will Xu. New features and enhancements in Druid 30 include improvements in ingestion experiences for Amazon Kinesis and Apache Kafka, better support for Delta Lake, and advancements in Google Cloud Storage and Azure Blob storage integrations.

And we get down to the details with technical enhancements such as GROUP BY and ORDER BY capabilities for complex columns, new IN and AND filters for faster query processing, and the stabilization of the concurrent append and replace API for handling late-arriving data in streaming. And for experimental features, we check in on the centralized data source schema feature for better performance and the introduction of TABLE APPEND syntax for UNION operations, which aligns more closely with SQL standards. 

Listen to this episode to learn more about:

  • The latest on arrays
  • When window functions will be GA (spoiler: not yet but almost there)
  • The benefit of upgrading Druid, from the numerous quality-of-life improvements to bug fixes and stability enhancements

About the Author

Will Xu is a product manager at Imply. Prior to joining Imply, his product management career included work on Apache Hive, HBase, and Phoenix, as well as being the first product manager for Datadog. His experience with data and metrics includes product management for Facebook's external metrics and the Microsoft Windows data platform.

Transcript

[00:00:00.000] – Reena Leone

Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. And today we are changing it up a little, and I have a co-host with me, my developer relations teammate, Hugh Evans. Hugh, welcome to the show.

[00:00:27.030] – Hugh Evans

Hi, Reena. It’s great to be here.

[00:00:28.940] – Reena Leone

So Hugh, since you’re new, I always like to start out with people’s journeys. Can you give me a high-level overview of your background and how you got to be my teammate?

[00:00:39.930] – Hugh Evans

Sure. So my first job back in 2018 was teaching workshops with the Raspberry Pi Foundation over the summer. And since then, I’ve really thrown myself into a tech career. I did an apprenticeship back at IBM, and then I spent a few years working as a consultant on data engineering and cloud infrastructure projects for some big retailers here in the UK. Since then, I’ve spent a lot of time in the London tech community, meeting lots of cool people and learning lots of cool stuff. I helped organize a meetup about AI called AI & Deep Learning for Enterprise. And yeah, off the back of that, I got into DevRel. I’m really enjoying it.

[00:01:14.350] – Reena Leone

Yeah, and I love having you as a member of the team. It’s been great. Actually, speaking of something that we’re working on together, big news. Druid Summit 2024 has been announced. And even bigger news, it’s happening in person. And in the San Francisco Bay Area in California. It’s taking place October 22nd, and we will both be there because we are on the committee and we are helping plan it. We are pretty excited. The Call for Speakers is now open. Hugh and I are looking through the submissions. If you have a Druid use case story, we would love to hear it. And then Hugh, you’re focusing more on Druid integrations with other technologies. That’s correct, right?

[00:02:01.450] – Hugh Evans

Yeah, that’s right. If you’re building something that isn’t necessarily 100% Druid focused, but does sit in the ecosystem and talk to Druid and get data from it and do cool things there, we’d love to hear from you. I think our community really would as well.

[00:02:13.380] – Reena Leone

And then also, if you aren’t really sure about what abstract to submit, you can come talk to us. We are here to help you define your story, run through your presentation, answer any questions you have about the event itself. So yeah, just hit us up. All of the links will be available where this podcast is posted. So feel free to reach out, and we are super happy to help you with that. And then we get to hang out in real life. So that will be super exciting.

[00:02:45.850] – Reena Leone

Speaking of exciting news, so why we’re doing the show today is because Druid 30 is live. And as always, this release was made possible by the Druid community. This time, it was 50 contributors who delivered over 400 commits that include new features, a bunch of improvements, bug fixes, the usual. We’re going to tackle everything in Druid 30 systematically, like we’ve done in the past, focusing on three main themes, again, performance, ecosystem integrations, and then, of course, SQL capabilities. And joining us to talk us through everything in Druid 30, returning guest, Will Xu, Product Manager at Imply. Will, welcome back to the show. I think you have the record as recurring guest here.

[00:03:29.770] – Will Xu

Very excited to be back. It’s very exciting to talk to you about all the new stuff that’s coming to Druid.

[00:03:36.400] – Reena Leone

Dare I say, you are a fan favorite.

[00:03:39.370] – Will Xu

Thank you.

[00:03:40.650] – Reena Leone

Okay, let’s get right into it. Druid 30, the big 3.0. Let’s just start with some of the ecosystem improvements, starting with some of the ingestion work that went into this release, because there was a lot. Will, can you tell us a little bit overall about the improvements to the ingestion experience in Druid 30?

[00:03:59.900] – Will Xu

Yeah, absolutely.

[00:04:00.920] – Will Xu

If you look at our ecosystem integrations, we really want to play nice with all the other kids on the block, as they say. On the streaming side, we have made substantial improvements for both [Amazon] Kinesis and [Apache] Kafka. There is a community-contributed extension from the RabbitMQ folks that supports RabbitMQ Super Streams as an alternative to Kafka. There are also improvements on batch ingestion around supporting Delta Lake, GCS, Azure, and a few other clouds, to allow us to easily and quickly source data from them. And all of those are tied up nicely with an overall improved user experience. If you go into the Druid web console, you will be able to see a lot more information around each individual ingestion task that you're running, and be able to better troubleshoot or monitor its behavior and performance. Overall, we have added a lot of improvements in this area with the aim of making it much easier for you to source data from various systems into Druid, to help you accelerate the queries and workflows you have today.

[00:05:12.410] – Hugh Evans

When it comes to Kafka, can you tell me a bit about the new parallel incremental segment creation that’s coming in Druid 30?

[00:05:19.590] – Will Xu

Yeah, absolutely. With Kafka ingestion, you can have some very wide streams; in this case, what we mean is event data that has more than a thousand metrics or dimensions per event. During ingestion, at some point we need to checkpoint everything: put everything onto disk, do the compression, do the indexing, et cetera. That is a very time-intensive operation, and during that time, we cannot make any modifications or changes to the data. In those large-stream situations, that tends to increase the latency, or lag, of the ingestion while the persisting is happening. Persisting data from the stream into deep storage can take anywhere from a minute to five or even ten minutes, depending on how much data you have in memory at that point. And it can contribute to increased lag when you're querying data: your query is no longer 10 seconds behind in terms of latency, you're talking about anywhere between a minute and even 10 minutes while the persisting is happening. This is mostly a function of how many columns are involved in the persist. And in the past, everything was done single-threaded, so if you have a lot of data that happens to be persisting simultaneously, it makes the situation much worse.

[00:06:46.990] – Will Xu

What happened in this release is, and we're very grateful to the team at Rivian, the electric truck company, they contributed a patch that allows the persists to be parallelized. In those large-stream situations, you're usually working on a very large server, anywhere between 64 and 128 cores, and being able to leverage all the cores to do the persistence simultaneously drastically shortens the window. So instead of taking 128 seconds, the entire thing now takes a couple of seconds, like a second or two. That becomes a tiny blip on the monitoring system in terms of latency, and it substantially improves the experience for people who are querying real-time data.
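
For readers who want to try this, here is a minimal sketch of where the new setting lives in a Kafka supervisor spec; the `numPersistThreads` tuning property comes from the Druid 30 release notes, and everything else here is an illustrative placeholder:

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": { "type": "kafka", "topic": "wide-events" },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsInMemory": 150000,
      "numPersistThreads": 4
    }
  }
}
```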

[00:07:38.030] – Reena Leone

Awesome. I love the folks at Rivian. Pramod has been a guest on the show before, and I know they’ve been doing a lot of work with Kafka. But while we’re on the topic of streaming, I believe there were some improvements to Amazon Kinesis as well. What’s been done on that front?

[00:07:55.340] – Will Xu

For Kinesis ingestion, very similar to Kafka ingestion, we have also made substantial improvements in terms of scaling and reliability. Imply Polaris is a SaaS offering, and in a SaaS offering, we need to operate against people's streams really reliably. There's an expectation that when you ask Imply, as a service, to ingest your data from Kinesis, it should continue to work even without you monitoring it. In this release, we have introduced some additional guardrails around dealing with Kinesis streams, and in this case, they are really there to help with individual events that are very, very big. Instead of having a strict sizing limit on how many rows you can have, we can now restrict how many bytes are in the events that we pull off the stream. That provides a degree of flexibility where bigger events can be easily ingested without tripping the guardrail limit. And it protects the integrity of the platform, making sure we don't run out of memory when a lot of data is streaming in, while also ensuring that you can ingest more complex and more comprehensive events into the system in single blocks, making things easier.

[00:09:17.720] – Will Xu

The other improvement around Kinesis ingestion: if you look at the way most streaming systems scale, it's by doing a thing called sharding. What that does is divide your events into multiple pipelines, sorted based on certain attributes. One example: if you're tracking and monitoring users' behavior on a website, what you usually want to do is put all the relevant events for a given user into a single pipeline, so that when you're doing analysis or processing, all the events from this user sit close to each other and you don't have to scan across pipelines. Each of those pipelines is what we call a shard. Historically, for Kinesis ingestion, the way the system determined ingestion latency was by looking at the latency across all the shards, and we use that for a lot of things like monitoring and alerting, as well as, more importantly, autoscaling. So if there's a huge traffic spike, we can auto-scale the system accordingly.

[00:10:23.780] – Will Xu

The problem with doing the monitoring across more than one pipeline is that sometimes your visitors might concentrate on one of the many shards, and suddenly one shard within your overall pipeline has a much higher latency. The auto-scaling system is not able to accommodate that, because the overall latency it's looking at still looks pretty good, while this one shard is pretty bad.

[00:10:50.510] – Will Xu

So the change that was introduced in this release is that we can now do the lag detection on a per-shard basis for Kinesis, or, in the Kafka case, on a per-partition basis, which was already supported previously. This allows the auto-scaling system to be a lot more dynamic and to adjust to and match your traffic spikes.
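
As a rough sketch of the byte-based guardrails described above, a Kinesis supervisor tuningConfig can bound ingestion by bytes rather than record counts; treat the property names here (`recordBufferSizeBytes` and `maxBytesPerPoll`, per the Druid 30 release notes) and the values as something to verify against the docs for your version:

```json
{
  "type": "kinesis",
  "spec": {
    "tuningConfig": {
      "type": "kinesis",
      "recordBufferSizeBytes": 100000000,
      "maxBytesPerPoll": 1000000
    }
  }
}
```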

[00:11:15.510] – Reena Leone

Okay. And just to be clear, if you’re just listening to the show for the first time, you’re coming in on Druid 30, what we were just talking about, that is available in Imply Polaris, which is a database as a service that was built from Druid, correct?

[00:11:29.860] – Will Xu

Yes, that’s absolutely true. A lot of the improvements you’re seeing across various systems are our lessons learned from operating this database as a service.

[00:11:39.070] – Reena Leone

One thing that we have previously mentioned in Druid releases is Delta Lake support. I think it was the last release, if I’m not mistaken, that ingestion from Delta Lake was added. Oh, right, because MSQ can also be used for async queries, which enabled querying Delta Lake tables directly from Druid SQL. It’s coming back to me now. But getting back to the current release, I know there were some data processing issues that were fixed with Delta Lake support. Can you tell me more about that?

[00:12:16.420] – Will Xu

Oh, yeah, absolutely. We are very excited to work with larger data lakes. That's where a lot of people's data is sitting, and using Druid to accelerate querying of the data in a large data lake is a very common use case. In the previous release, we supported Delta Lake as a source with some limitations. One of the key limitations was that when we tried to ingest data from Delta Lake, we essentially had to process all the columns. You don't have to ingest all the columns: let's say your table has a thousand columns in Delta Lake, and you ask the Druid Delta Lake integration to load 100 columns out of the thousand, we'll happily do so for you. The problem is, it still processed all the cells in all the columns, and that is very time consuming, especially for large, complex data lakes. Sometimes you even have a flexible schema upstream, and you end up with tens of thousands of columns to process. The improvement in this release is a technique called predicate pushdown and projection pushdown. What this means is we're pushing all the filters into the upstream data lake before we start loading data from it.

[00:13:39.060] – Will Xu

And that drastically reduces the amount of data that we have to process. Essentially, now we only process the data you ask us to load and ingest. So there's no wasted compute or network bandwidth, and as a result, loading and ingesting from Delta Lake is substantially faster.
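
As a hedged sketch, pushing a predicate down into the lake might look like the following in a Delta Lake input source; the `filter` shape is our reading of the druid-deltalake extension docs, and the table path and column are hypothetical:

```json
{
  "type": "delta",
  "tablePath": "s3://my-bucket/my-delta-table",
  "filter": {
    "type": ">=",
    "column": "event_date",
    "value": "2024-06-01"
  }
}
```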

[00:13:56.130] – Reena Leone

Continuing with improvements to things we previously talked about, I want to shift gears to something you mentioned earlier in this conversation, Google Cloud Storage. What is going on with GCS and MSQ, Druid’s multi-stage query framework?

[00:14:12.280] – Will Xu

Yeah. I would like to talk about the two clouds together, alongside Amazon S3. Amazon S3 has always been supported by Druid. It's by far one of the most popular blob stores, and most of our users are using Amazon S3. But at the same time, we're seeing more and more organizations tackling their cloud strategy by spanning more than one cloud. Some of them start on Amazon and move into Azure or Google; some start on Google and move into Amazon. Naturally, we want to provide them a fabric that allows them to easily deploy across clouds. The improvements we have made in this release focus on supporting GCS as a blob storage destination, as well as richer support for Azure Blob Storage. Now you can span your data across multiple storage accounts on the Azure side, which is a very Azure-unique concept. What this does is, in addition to the existing S3 support, provide a cross-cloud fabric that allows you to easily query your data from any of the clouds and use any of the clouds as your underlying infrastructure for hosting Druid.

[00:15:32.230] – Will Xu

And all of those changes are really there to give you added flexibility.
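
For the MSQ piece specifically, here is a sketch of what pointing durable intermediate storage at GCS might look like in runtime properties; the property names follow Druid's durable-storage docs, and the bucket and prefix are hypothetical:

```properties
# Hedged sketch: durable storage for MSQ on Google Cloud Storage
druid.msq.intermediate.storage.enable=true
druid.msq.intermediate.storage.type=google
druid.msq.intermediate.storage.bucket=my-gcs-bucket
druid.msq.intermediate.storage.prefix=msq-intermediate
```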

[00:15:37.360] – Hugh Evans

So I know we were talking a bit about this the other week, Will, but whilst getting [the] learn-druid [GitHub repo] ready for [Druid] 30, I was looking at adding examples around GROUP BY and ORDER BY on complex columns. Do you mind talking a little bit about why GROUP BY and ORDER BY on complex columns is interesting for performance?

[00:15:56.820] – Will Xu

Yeah, for sure. Well, the name of this feature is very literal: GROUP BY and ORDER BY for complex columns. But maybe let's take a step back and look at a real-world use case for this, and then hopefully it will make a lot more sense. When you are sending data into an event-stream data store like Druid, oftentimes you want to give your users the ability to add customized dimensions to tag the data. For example, if you are tracking visitors to your website, using that same example, and a visitor comes to your site, you might want to add additional attributes for this visitor: what geo-location they're in, what items are in their shopping cart, et cetera, as customized information alongside their visitor data. So that when you're doing analysis, you can say, Hey, I want to see everyone who has a certain item in their shopping cart, and what their workflow and user experience journey is on the website that I am currently monitoring or analyzing. Usually, that tagging data is very dynamic and very fluid, because there's no fixed structure to what information you want to add.

[00:17:17.810] – Will Xu

Today, it might be what's in their shopping cart; tomorrow, it might be their local weather. All that information is usually stored in an object, like a JSON block, alongside the dimension data that's coming in. Now, the issue with most databases is you have to keep that data lying around. If you want to do further analysis, you cannot collapse those records together, because each one of them looks very different. GROUP BY and ORDER BY on complex columns in this release is tackling exactly that issue. It essentially allows you to collapse and roll up multiple rows together, where some of the columns are those customized attributes. Now, instead of looking at individual website visits alongside what items are in people's shopping carts, you can look at, okay, what is this user's behavior day to day or hour to hour, by rolling up individual events into a coarser granularity bucket. This substantially reduces the amount of data that you have to store in the database, because instead of looking at individual milliseconds across a bunch of properties, you're now looking at hour or day granularity, which can be a 60 to 100x reduction in data, and it will drastically improve how fast you can serve queries when you're doing analysis.

[00:18:44.390] – Will Xu

And this capability is very unique to Druid. With this release, the introduction of GROUP BY and ORDER BY on complex columns makes it very easy for you to aggregate all those rows together, reduce the granularity of the data, and reduce the amount of data you have to store and process. If you're going from second to hour or day granularity, we're talking about anywhere between a 60 and 100x reduction in data volume, and that's pretty substantial. It makes serving dashboards much easier. In the previous two releases, we introduced the capability of ingesting data with flexible schemas. That means you can tag arbitrary objects onto your event data, and Druid will happily ingest them without you having to manage or specify a schema. But as I mentioned just now, that does grow the volume of data in the database substantially. What this feature does is reduce and compress that data for you, making it much lower impact in terms of storage and compute, and making it easy to serve queries over large, complex objects.
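
To make that concrete, here is a minimal sketch of the kind of query this unlocks; the table and columns are hypothetical, with `attributes` standing in for a complex (JSON-typed) column:

```sql
-- Roll millisecond-level events up to hourly buckets, grouping on a JSON column
SELECT
  TIME_FLOOR(__time, 'PT1H') AS visit_hour,
  attributes,
  COUNT(*) AS visits
FROM website_visits
GROUP BY 1, 2
ORDER BY attributes
```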

[00:20:02.450] – Hugh Evans

Nice. So all the usual benefits of GROUP BY and ORDER BY, but now for this?

[00:20:07.850] – Will Xu

Yeah, absolutely.

[00:20:09.490] – Hugh Evans

Super exciting. So on the topic of things that bring in performance improvements, another thing we’re getting in Druid 30, these new IN filters and AND filters. What’s going on with those?

[00:20:19.450] – Will Xu

Yeah, we strive for performance. Druid is a database; we care about being very, very fast in a very cost-effective way, and we continue to add more and more features to the query engine to make things faster. The IN filter is a very common use case. Let's say you're building a dashboard and you're asking, Hey, find the items in categories A, C, D, and E. What you're providing is essentially a list of categories to filter on, and what we usually translate that into is an IN filter. The AND filter is very similar, in that what you're looking for is, say, people who have bought both item A and item B, so you're looking for the intersection of those two sets. We have made substantial improvements to both. For the IN filter, based on the testing we have done, it's about three times faster than before if your filter set is large. And the AND filter is a lot smarter, especially when you're looking at the intersection of two unequal sets. What that means is, imagine you are filtering for people who bought both item A and item B, and there are very, very few people who bought item B, but a lot of users who bought item A.

[00:21:42.490] – Will Xu

In this case, instead of computing the set of people who bought A and then intersecting it with the people who bought B, the engine is smart enough to say, Let's find everyone who has bought B, because it's a very small set, and then apply the A filter to it: okay, of all the people who have already bought B, let's find the people who have also bought A. That speeds up the query substantially, because now you don't have to process the entire A set. So for both the IN filters and the AND filters, the improvements make your queries much faster. And it's completely transparent to you. You don't have to optimize your query; the engine just figures out the most efficient way to process it.
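
For illustration (the schema is hypothetical), these are the shapes of query that benefit; note that nothing in the SQL changes, since the reordering happens inside the engine:

```sql
-- Large IN lists and asymmetric AND filters both get faster in Druid 30
SELECT item, COUNT(*) AS purchases
FROM orders
WHERE category IN ('A', 'C', 'D', 'E')
  AND buyer_segment = 'repeat'
GROUP BY item
```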

[00:22:26.460] – Hugh Evans

So on to concurrent append and replace improvements. Some news on that front: the API is stable now and it’s ready for production testing. Can you talk a bit about where we’ve come from to get there and why that’s exciting?

[00:22:40.140] – Will Xu

Yeah, absolutely. Concurrent append and replace is a very crucial feature for dealing with data fragmentation. Historically in Druid, if you have streaming data coming in, your data might not arrive in sequence; some of your data from yesterday might come in today. In crazy cases, we have seen people's data arriving as much as 30 days late. For example, if you own an electric car, your car might not be connected to the Internet, so by the time it connects to report its telemetry data, the data can be a month old or sometimes even older. And that causes data fragmentation, because we can't co-locate data arriving today with today's data when it might actually be from yesterday or the day before. Historically, Druid has had a process called compaction that allows you to merge that fragmented data together. The problem is it couldn't work while streaming data was being written, because then they fight each other. It's like, I don't know which one is true: is the compacted version that's replacing the old data the truth, or the new data that's just landed?

[00:23:57.790] – Will Xu

And that caused a lot of problems in larger deployments, where your data just stays fragmented because the system can't figure it out. What this feature does is allow the defragmentation, or compaction, system to work alongside the streaming system, sharing the underlying locking mechanism so that compaction can happen. There's no more fragmentation, while your late-arriving data can happily feed into the system. In the past, the feature flag to turn this on was quite complicated, and you had to make a lot of manual changes to enable it. That was by design: we didn't want to make it too easy for people to turn on, because we were afraid it might cause data corruption. But in this release, we have done enough testing on our end. At Imply, we have actually deployed this onto a few internal services, and we have run them for a couple of months at this point, and we're very confident in the quality of the feature. So we're making it much, much easier for everyone to turn this on and start trying it. All you need to do is set one feature flag at the cluster level, and hopefully you will no longer see any data fragmentation on your streaming data sources, because the compaction system can now coexist happily with the streaming ingestion system.
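
To our understanding, the cluster-level flag Will mentions is the `useConcurrentLocks` task context, which can be defaulted for all tasks; here is a sketch of the runtime property, worth verifying against the concurrent append and replace docs:

```properties
# Hedged sketch: opt all tasks into concurrent append and replace
druid.indexer.task.default.context={"useConcurrentLocks": true}
```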

[00:25:21.590] – Hugh Evans

That’s really cool. So that’s much broader access, then, to that experimental feature that’s going to really help people with late-arriving data in streaming. Another really exciting experimental feature, this one again looking more at performance, is this centralized data source schema thing. Can you tell me a bit more about what that is and why it’s going to be good for performance?

[00:25:41.630] – Will Xu

Oh, yeah, for sure. Centralized data source schema is still very much an experimental feature, but it's the future direction, if you will, of where the database is heading. Historically, Druid has been a very cloud-native database. What that means is the entire thing is designed with the cloud in mind, and everything is architected as microservices. For example, Druid has this component called the broker that does all the planning and coordination when you issue a query. It fans out the query you issue to a bunch of historical nodes, which hold the data, to do the processing, so that you can run an analytical query over a vast amount of data. In order to plan and execute the queries you give it, the broker has to be aware of where the data is sitting on the individual historical nodes. To construct that view, each broker runs a query against every historical it's aware of and says, Hey, what data do you have on A? What data do you have on B? Et cetera. Now, imagine you have a cluster of a hundred nodes or a thousand nodes, and you have five brokers. Each broker has to run this query against each historical node.

[00:27:06.240] – Will Xu

The beauty of that is the system is completely decoupled. It's very loose. That means if any of the nodes dies, or you're adding nodes, or you're scaling, it's very easy, because there's no expectation that any particular node is available or present. The downside, obviously, is that it's a very time-consuming operation, especially if you're doing a cluster restart or scaling your data substantially.

[00:27:35.640] – Will Xu

So the community came together and asked: what if we find another mechanism for providing this level of consistent view without having to rely on a very tightly coupled architecture? Still providing cloud flexibility, but reducing the compute cost of constructing the schema. And this is what the community came up with: a centralized data source schema. Instead of each broker asking each historical node what data it has, there's now a process in the background that synchronizes the schema across all the historicals into the metadata store, and the coordinator is now in charge of distributing this to the various brokers. This way, we have a centralized, consistent view of what data is available on the cluster, and it also reduces the compute cost for the brokers of constructing it, so it will make upgrades and scaling much faster.

[00:28:33.190] – Will Xu

And then this paves the way for Druid to start gathering statistics on how big the tables are, how big the segments are, doing processing, so that the broker can do smarter query plannings down the road. So there’s a lot of benefits for this. And I think the future direction is going to get to a place where the brokers are no longer required or having to construct any of the schema themselves, and then they can completely outsource this capability into the coordinator, so the entire cluster can be more cohesive while still being very loosely coupled. So it’s like cloud native.
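
As a sketch, the Druid 30 release notes describe enabling this experimental feature with runtime properties along these lines; the exact names are worth double-checking in the docs for your version:

```properties
# On the Coordinator
druid.centralizedDatasourceSchema.enabled=true
# On MiddleManagers, so ingestion tasks publish their schema as well
druid.indexer.fork.property.druid.centralizedDatasourceSchema.enabled=true
```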

[00:29:10.530] – Hugh Evans

One last thing. Let’s talk a little bit about SQL. I know we’ve been seeing some moves in the project toward more standard SQL for queries. Can you tell me a little bit about TABLE APPEND syntax for UNION operations and how that’s bringing us closer to the SQL standard?

[00:29:28.080] – Will Xu

So if you look at Druid, Druid started off as this weird database that is both very schema-flexible, because you can't really control the schema upstream, and supports SQL, which requires a strict schema. So how do you support a strict schema when your underlying schema can actually change? This has been a challenge in the past. What usually happens is people have data coming in from various Kafka topics or various Kinesis streams, loading into different tables. But when people are doing analysis across those tables, they say, Okay, let's merge those tables together into one logical entity so that I can scan everything together and filter everything together. The problem is that because SQL requires each table to have a defined schema, and because there's no guarantee that each event coming in from each stream has all the columns in it, this merging wasn't possible before. So if you're using the native query engine, you can query across tables just fine, but in the SQL world, you didn't really have the ability to query across tables.

[00:30:42.320] – Will Xu

So the TABLE APPEND feature that we've introduced in this release provides the best of both worlds. It keeps the flexibility to ingest data without having to define the schema, but also gives the SQL engine enough information to make merged queries across tables possible. In this case, even if one of the tables is missing some columns, as long as the schemas are compatible, we will merge the columns automatically based on names, so that when the query runs at the SQL layer, it will happily query across the tables and make it really easy to process mixed data shapes in your cluster.
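
Concretely, the new syntax looks like the following (table names hypothetical); columns that exist in only one table come back as nulls for rows from the other, provided the shared columns are type-compatible:

```sql
-- Union two streaming tables with overlapping but not identical schemas
SELECT user_id, COUNT(*) AS events
FROM TABLE(APPEND('events_from_kafka', 'events_from_kinesis'))
GROUP BY user_id
```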

[00:31:23.310] – Reena Leone

So Will, jumping back in here, it wouldn’t be a release show after Druid 26 if we didn’t talk about arrays. So in Druid 30, what’s changed here and what’s been improved in this release?

[00:31:34.350] – Will Xu

Yeah, arrays are definitely another ANSI SQL standard capability. Before the existence of arrays, Druid supported multi-value dimensions that allowed you to store things like Instagram post tags alongside your data. That is limiting, because it's string data only, and some of the tags we're seeing from people are geo-coordinates or temperatures or other numerical values. And sometimes you even want to have empty or null values in your tags, which is different from empty strings, and which was actually not possible until now. So we have added all those capabilities to the Druid array type over the past few releases. In this release, what has changed is that we made it easier for you to build applications: when you are using one of those SQL template features, what we call parameterized queries in the documentation, one of the parameters can now be an array of mixed value types. The engine will automatically figure out, Oh, the array you passed in this parameter is actually a number, not a string anymore; so let's match it as a number, or match it as a null value. So it's a lot more programmatically consistent.

[00:32:58.870] – Will Xu

You could have done this in previous releases, after Druid 26, manually, by constructing the query dynamically in your code. But this just makes it so much easier if you're exposing this as a REST API to the rest of your teams, because now they don't have to worry about constructing the underlying SQL query. They can just pass in some parameters with array values, and the engine will figure out how to map them to the right data types at query time.
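
Here is a sketch of what that looks like over the SQL HTTP API; the query and names are hypothetical, but the payload shape follows Druid's parameterized-query docs, with ARRAY as the newly supported parameter type:

```json
{
  "query": "SELECT * FROM events WHERE ARRAY_CONTAINS(tags, ?)",
  "parameters": [
    { "type": "ARRAY", "value": [1, 2, 3] }
  ]
}
```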

[00:33:29.680] – Reena Leone

Okay so speaking of things that we have mentioned before on this show, can you give me an update on window functions since it’s been such a hot topic and on many wish lists of many guests who have stopped by Tales at Scale, what’s going on with everyone’s favorite experimental feature?

[00:33:49.820] – Will Xu

Yeah, window functions are such a fascinating feature. They're very popular because they enable a lot of the use cases people want, especially if you want to do any comparison of what happened in the past versus today, for example. They're used in so much software for data analysis and processing. A while back, we established a standard test suite for this feature, and that test suite basically allows us to monitor how close we are getting to making the feature generally available. The tests cover things like performance, functionality, correctness, et cetera, and we have been steadily improving the pass rate for the window function test cases. The goal is getting to 100%. Once we hit 100%, that means we have done everything that we know of and we can make this feature generally available. I think we started at about 690 to 700 test cases passing two releases ago. Last release, I think we had about 700 passing out of a set of roughly a thousand test cases. And in this release, we are, I think, two test cases away from passing everything, before we can say this is 100% ready for release.

[00:35:21.780] – Will Xu

So definitely, it's already really good for a lot of use cases. There are some edge cases; what's not quite working is covered in the documentation. We highly encourage people who need window functions to start trying them out and see if there are any problems or challenges you're facing, and we will keep pushing this forward. Our aim is that in Druid 31 we'll pass the 100% mark, and we can make this feature generally available.
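
If you want to experiment, window functions are standard OVER-clause SQL; while the feature is experimental, you opt in with the `enableWindowing` query context flag. Here is a minimal sketch against the familiar wikipedia quickstart table:

```sql
-- Set "enableWindowing": true in the query context while this is experimental
SELECT
  channel,
  __time,
  COUNT(*) OVER (PARTITION BY channel ORDER BY __time) AS edits_so_far
FROM wikipedia
```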

[00:35:55.630] – Reena Leone

I made a joke when Peter Marshall was on this show for our 2023 wrap that when window functions goes GA, we’re getting T-shirts. Will, I’ll make sure that you have one as well.

[00:36:06.350] – Will Xu

Yeah, I would absolutely love some T-shirts.

[00:36:07.760] – Reena Leone

Will’s shirt is just going to say, Window functions is GA. That’s it. Okay, Will, to close this out: why should folks upgrade to the latest version of Druid if all of the things we’ve mentioned on this episode didn’t convince them already?

[00:36:21.650] – Will Xu

I think if you look at each Druid release, what's not covered in content like this, but is covered in depth in the release notes, is a lot of quality-of-life improvements and various bug fixes and stability improvements. If you're not entirely convinced by what we've talked about today, I would highly encourage you to take a closer look at the release notes. The community spends a lot of time polishing the release notes and adding detailed content, so they should give you a pretty good idea of whether the new features, bug fixes, and stability improvements we're making are applicable to your needs or not.

[00:37:05.930] – Reena Leone

Well, it has been a pleasure to have you on the show again. And Hugh, thank you for being my co-host today. To learn more about Apache Druid, head over to druid.apache.org or the developer center at imply.io/developer. If you’re interested in learning more about what we’re up to at Imply, please check out imply.io. And if you’re interested in participating in Druid Summit 2024, please head over to druidsummit.org and get your abstract in for our call for speakers. Until next time, keep it real.
