A Year in Review: Apache Druid in 2023 with Peter Marshall

Feb 14, 2024
Reena Leone

As we wrap up the eventful year of 2023, it’s time to take a moment to reflect on the amazing journey we’ve had in the realm of analytics projects, particularly with Apache Druid. From major feature releases to significant enhancements, it’s been a great year for Druid – thanks to the dedication of the vibrant community behind it. The growth and transformation of Druid from where it was a year ago to where it stands now is truly remarkable and a testament to the collective effort of everyone involved.

One of the most exciting aspects of the year has been the enhancements and improvements to the multi-stage query framework (MSQ), which has opened up new possibilities in terms of data processing and analytics applications. The power of SQL-based ingestion, schemaless ingestion, and the ability to query from deep storage have revolutionized how data is handled. These advancements have paved the way for a more seamless and efficient data processing experience.

Looking ahead to the future of Druid, there will be a focus on SQL completeness, improved query execution, and enhanced integrations with other systems like Grafana and dbt. The community continues to grow and thrive, with more organizations and individuals contributing to the project, adding their unique perspectives and expertise to the mix.

As we gear up for the new year, stay connected with the Druid community.

About the Guest

Peter Marshall is an award-winning speaker who leads community developer relations at Imply and is a long-time contributor of videos, tutorials, and documentation on Apache Druid. He has worked with adopters and users of Apache Druid for 5 years. He has over 20 years of architecture experience and has a BA (Hons) degree in Theology and Computer Studies from the University of Birmingham.

Transcript

[00:00:00.570] – Reena Leone

Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. This might be a longer episode because it is our 2023 wrap up. This was a big year for Tales at Scale since we just turned one. Yay. Happy birthday to us. But also for Apache Druid. Major feature releases happen this year along with improvements and enhancements, and that is all thanks to the dedication of the Druid community. Simply put, Druid today looks very different from Druid a year ago and way different than two years ago. Joining me to talk through the year that was 2023 is Peter Marshall, director of developer relations at Imply. Oh my God. Peter, welcome to the show.

[00:00:51.870] – Peter Marshall

Hi. Thank you very much for having me. Nice to be here.

[00:00:55.220] – Reena Leone

Well, since you are a first time guest, let’s dive into your origin story here. So tell me how you got to where you are today.

[00:01:03.210] – Peter Marshall

Oh my goodness. Okay, well, in the dim and misty 1990s I did a degree in theology and computer studies. I know. Strange. So I do have an art degree in a science, which is quite unusual.

[00:01:20.330] – Reena Leone

I actually do too!

[00:01:21.180] – Peter Marshall

Weird, right?

[00:01:25.930] – Reena Leone

I have a music business degree. That’s a bachelor of science.

[00:01:29.450] – Peter Marshall

Nice. Nice. So I started off working in a government body in London. I was there for eight or nine years and after that I became a solution architect working on the UK’s e-government program actually. So I was covering things like e-form systems and EDRMs and CRMs and call center systems, all sorts of things. And then I was plucked out of public sector, stuck with enterprise architecture, but then started to move a lot more into the data space. And I was really lucky then to have worked with and known the founders of a data consultancy in London and they introduced me to the big yellow elephant, except the variation with the r at the end. And that’s when I started to hear about these weird things called Spark and Kafka and other things made by the Apache Software Foundation and again helped along the way by good mentors, moved into a company out of Alpharetta in Georgia actually, who do master data management. And then in short order, that’s when I joined Imply almost five years ago Reena, it seems crazy, I can’t believe it.

[00:02:49.500] – Reena Leone

Wow…wow

[00:02:52.070] – Reena Leone

Okay, so before joining Imply, were you familiar with Druid or did that come after you joined?

[00:02:58.100] – Peter Marshall

Hell no. I had no idea what Druid was. I didn’t even really know what Kafka was. I’ve been very blessed in my career, right? I’ve got to thank so many people that helped me get… to join Imply, but also along the way, so many people in IT. And when I joined Imply, I had to go through a very big learning process, right? Because I was trained in Oracle and I loved Microsoft SQL Server, right? So someone coming along to me and saying, oh yeah, Druid is made of all these Java processes and it uses ZooKeeper. And oh, by the way, we live up to the Apache Way. That was like, do I need to go on some kind of course to understand what language you’re speaking? All the functionality of Druid, like why it was scalable, why you’ve got redundancy, why it does massively parallel processing. I knew about this in theory from university, but actually using it and applying it was massively new. It was a lot of exposure, a lot of hard learning to do. Yeah.

[00:04:09.230] – Reena Leone

But I think that gives you a deep sense of empathy for what developers are going through, especially when they first start out with Druid.

[00:04:15.710] – Peter Marshall

Oh my. So like, I see people in the Slack channel who I suspect are in the same situation that I was four and a half years ago, maybe even six or seven years ago, who haven’t used these big data systems, like maybe even never heard of the Hadoop ecosystem at all. Had no idea. And so that’s one of those things that enabled me as a developer advocate back then to hopefully help people who aren’t working for Yahoo and aren’t working for these big companies like Netflix with massive engineering teams. And that’s why the developer team at Imply, developer advocacy team at Imply has grown the way it is because we like to have, I guess you would say, empathy for the people who are developers and developer advocates. That’s a massively important thing.

[00:05:15.990] – Reena Leone

Absolutely. So speaking of Druid and the beast that it is, so much happened, I feel like this year in Druid, we had several releases this year. Let’s kind of talk through some of the things that have happened. I would say probably like the biggest sort of carryover from the previous year was all of the enhancements and everything brought about by MSQ, our multi stage query framework that was added at the end of the previous year. But a lot has happened there. Let’s kind of talk through some of those things. What would you say was probably the biggest MSQ enabled thing that happened in 2023?

[00:05:57.360] – Peter Marshall

Well, let me just say this, right. There are people out there listening to this who maybe have used Druid since… I know there’s some people out there who’ve used Druid since version one, right? I picked it up when it was Druid 15 and the community added the Druid console. So those of you who are out there who have used the Druid console, four and a half years ago, there wasn’t one, right? You can’t imagine what it was like without the console, crafting JSON things in Windows Notepad. But then 24 has come along and Gian Merlino talks about this a lot. He’s talked about MSQ. And I was kind of going, oh, well, that’s really interesting. I can do an insert statement in SQL. Okay, cool. I can see that. That’s cool. But I didn’t really have the vision that the PMC has about what Druid was going to turn into. I didn’t grasp it. And when 28 came along, right, 28 is kind of a realization of their dream of where Druid is going. And it’s so different to what Druid version one was like, being able to do SQL-based ingestion and not having to prefetch all your data.

[00:07:09.560] – Peter Marshall

Querying from deep storage has been a massive thing this year. You can now query from deep storage. You can do that via MSQ. You want to keep your really more complicated queries away from the interactive queries? Fine. You can beef up the MSQ engine and separate those workloads down. It’s like, this last year, Druid has been given… there’s another gear in the engine, right? You can use this other gear for harder climbs, like for more difficult tracks. And yes, you can still press the sport button and you can still accelerate at the speed of my ridiculous Polestar 2, which I own, by the way. Thanks very much. Yes. Cheers. You can still do that.
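
For readers who want to see what SQL-based ingestion and query from deep storage look like from a client, here is a minimal Python sketch. It only builds the HTTP requests and does not send them. It assumes a Druid router at http://localhost:8888, and it assumes the MSQ task endpoint (/druid/v2/sql/task) and the asynchronous statements endpoint (/druid/v2/sql/statements with executionMode set to ASYNC) as described in the Druid documentation for recent releases. The "events" table, the example.com URL, and the column names are made up for illustration.

```python
import json
import urllib.request

# Assumption: a local Druid router on its default port.
ROUTER = "http://localhost:8888"

# Hypothetical ingestion statement: table, URL, and columns are illustrative only.
INGEST_SQL = """
INSERT INTO "events"
SELECT TIME_PARSE("timestamp") AS __time, "user_id", "action"
FROM TABLE(EXTERN(
  '{"type": "http", "uris": ["https://example.com/events.json"]}',
  '{"type": "json"}',
  '[{"name": "timestamp", "type": "string"},
    {"name": "user_id", "type": "string"},
    {"name": "action", "type": "string"}]'
))
PARTITIONED BY DAY
"""


def build_request(path, sql, context=None):
    """Build (but do not send) a JSON POST request for a Druid SQL endpoint."""
    body = {"query": sql}
    if context:
        body["context"] = context
    return urllib.request.Request(
        ROUTER + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# SQL-based ingestion is submitted as an MSQ task.
ingest_request = build_request("/druid/v2/sql/task", INGEST_SQL)

# Query from deep storage goes through the asynchronous statements endpoint.
deep_storage_request = build_request(
    "/druid/v2/sql/statements",
    'SELECT "action", COUNT(*) AS "n" FROM "events" GROUP BY 1',
    context={"executionMode": "ASYNC"},
)

# To submit against a running cluster (not executed here):
# with urllib.request.urlopen(deep_storage_request) as response:
#     print(json.load(response))
```

The two requests differ only in path and context, which mirrors the point made in the conversation: the same SQL dialect serves both ingestion and the slower, deep-storage query tier, while interactive queries keep their own path.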

[00:07:51.540] – Reena Leone

Wow.

[00:07:52.550] – Peter Marshall

Oh my God. I am not sponsored by Polestar.

[00:07:57.850] – Reena Leone

We love car metaphors on the show. We love car metaphors!

[00:08:01.930] – Peter Marshall

Nice. So that’s a big thing. And if people haven’t tried MSQ or they don’t know what MSQ is about and what this has done for Druid, please go look. Right. Because it’s a real benefit to people who are currently using Druid or maybe haven’t thought about using Druid before. You should try this thing out. The other big area is definitely this move, not led by me, but by the community, right? That this move towards having much more compatible, standardized SQL. Now, it’s not just that. Oh yeah, great. We now support arrays. It’s not just that. It’s also about behavior. It’s about what happens inside Druid when somebody does something using the SQL language. And the biggest example of that being in 28 was the move towards SQL standard behavior around null handling, for example. So we’ve had this concerted effort to make Druid more powerful, more useful, and all the while also just keeping an eye on what people already know and how they expect databases to work. And it’s been such an exciting year, right? I mean, like ten years of Druid, what a year to have delivered on these things.
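
To make the null handling point concrete, here is a small sketch of what SQL-standard null behavior means in practice. It uses Python's built-in sqlite3 module purely as a stand-in for a SQL-standard engine; it is not Druid, and the "visits" table is invented for illustration. The contrast being drawn is with Druid's older default mode, which, as far as the Druid documentation describes it, treated null strings like empty strings and null numbers like zero.

```python
import sqlite3

# Stand-in engine: SQLite follows SQL-standard three-valued logic for NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (country TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("UK", 3), ("US", None), (None, 5), ("", 2)],
)


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# NULL and the empty string are different values.
null_countries = scalar("SELECT COUNT(*) FROM visits WHERE country IS NULL")
empty_countries = scalar("SELECT COUNT(*) FROM visits WHERE country = ''")

# A comparison with NULL is "unknown", so the NULL row is filtered out.
not_uk = scalar("SELECT COUNT(*) FROM visits WHERE country <> 'UK'")

# Aggregates skip NULL instead of treating it as zero.
counted_clicks = scalar("SELECT COUNT(clicks) FROM visits")
avg_clicks = scalar("SELECT AVG(clicks) FROM visits")

print(null_countries, empty_countries, not_uk, counted_clicks, avg_clicks)
```

Under a legacy-style mode that folds null into empty string and zero, the filter and the average above would give different answers, which is why the change in defaults matters to anyone arriving with expectations from other databases.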

[00:09:21.410] – Reena Leone

We had Will Xu on the show to do the Druid 28 episode. That was the last episode that we did. And it was interesting the way we’re doing releases or the way that Druid is working where we’re thinking through how all the pieces fit together now. So it’s not just individual features or enhancements here and there, it’s how the whole system works together. Right? So when you talk about standardizing SQL a little bit more, when you talk about thinking through how we do arrays and not just being like we do arrays now, I think that is kind of a change in even how we’re viewing the Druid roadmap, right?

[00:09:59.700] – Peter Marshall

Yes, absolutely. If you look at the gamut of Apache projects that are out there, and there are a lot of them, Druid is one of those that’s been around for a long while now, right? And the approach that the community is taking to say releases or to testing or to thinking about how it responds to what the needs of users are is really maturing in a really good and exciting way. Right? It’s becoming a respected, stable, really valuable element of someone’s data architecture. When Druid one was released, the founders of Imply being part of this original development team are like full of energy and they still are, but full of energy to get this thing working. And now I feel like we’ve stepped up in maturity as a community around the Druid project. It’s becoming something that I hope CTOs are looking at and going quite seriously. Oh yeah, cool. Why don’t I use that instead of a, b or c? It’s a good secure project.

[00:11:04.470] – Reena Leone

So let’s kind of go through some of the updates for folks. We’re looking at it at a macro level. Let’s dive in. Love saying dive in. There were improvements to SQL-based ingestion that happened this year. There was in-database transformation that happened. There was schemaless ingestion or schema auto-discovery. So you can have a schema or you don’t have to. The options are there for you. So it’s flexible. You mentioned that cold tier and query from deep storage is now finally available. And that’s been a wish list item for a lot of folks this year. I have talked to death about joins, and now we have joins in Druid, and joins in Druid is a thing and has been since like version 18. But you have different ways that you can do them, from broadcast joins to shuffle joins. So I’m not going to talk about that as much because I have talked about that so much. But yes, you can do joins and you can do shuffle joins, at ingest and at query. And then again the separation of compute and storage. Are there any highlights for you this year that you think are particularly cool, I mean other than query from deep storage? Because that’s everyone’s favorite thing.

[00:12:22.500] – Peter Marshall

I think, because it’s close to the work I’ve been doing with building out the learn-druid GitHub repo with these Python notebooks (if anybody hasn’t tried it, go and try it out), it’s the movement in Druid towards this more standardized SQL language, right? And having a language which is more familiar to people who already know how to use SQL. And it’s not just that the movement is towards standardization of the SQL dialect, like movement towards that. I’m sure Will’s probably spoken to everybody about this already, I’m sure. It’s also the behavioral aspect of it. It’s like when you do something with a null, this is what you expect the behavior to be. And in 28 that kind of behavior is a lot more embedded. And I can see that that’s the route that the PMC and all the committers and contributors are moving towards, is like just having good SQL standardization. The other of course being like the query from deep storage aspect. For years the only way for you to be able to query your data in Druid was to use the prefetch, technically the coordinator distributing the data onto the historical servers ready for query.

[00:13:33.100] – Peter Marshall

But now you don’t have to do that, right? You can leave some of that data in deep storage and address it through the other workhorse in Druid now, the MSQ engine. That’s massive because you’ve got another tier from which your data can be pulled. You don’t have to spin up 1000 million historicals to deal with your petabytes of data just because you want to query it. Now you can use the power of this segment format and the power of the parallelism in Druid to get at that. You just know that it’s going to take a bit longer. It’s not being prefetched. So these are two really important things that have happened this last year and come to maturity. And I would be wrong if I didn’t mention window functions, right? Everybody’s excited about the experimental window functions. Remember this? And I’m really looking forward to that going GA. There’s colleagues of ours, Reena, who are working really hard in the community to get these window functions out for people. And that’s just going to open the door to so many people who will know me, who’ve spoken to me at conferences or seen me at meetups or whatever and gone, yeah, but can we do lag or lead please?

[00:14:45.130] – Peter Marshall

Well, this is the direction we’re going in. And yeah, long may this continue. As I said, it’s a really exciting project.
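
For anyone who has not met LAG and LEAD before, here is a short sketch of what those window functions return. It runs against Python's built-in sqlite3 module as a stand-in SQL engine (SQLite 3.25 or later is needed for window functions), not against Druid, and the "daily" table is invented for illustration. Whether a given Druid release accepts the same statement depends on the state of its window function support.

```python
import sqlite3

# Stand-in engine; requires SQLite 3.25+ for window function support.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day TEXT, events INTEGER)")
conn.executemany(
    "INSERT INTO daily VALUES (?, ?)",
    [("2023-12-01", 100), ("2023-12-02", 130), ("2023-12-03", 90)],
)

# LAG looks one row back and LEAD looks one row ahead within the ordered window.
rows = conn.execute(
    """
    SELECT day,
           events,
           LAG(events)  OVER (ORDER BY day) AS previous_day,
           LEAD(events) OVER (ORDER BY day) AS next_day
    FROM daily
    ORDER BY day
    """
).fetchall()

for row in rows:
    print(row)
```

The first row has no previous day and the last row has no next day, so those positions come back as NULL (None in Python). That is the building block for day-over-day comparisons without a self-join.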

[00:14:54.080] – Reena Leone

If we’re back at conferences by the time that this [window functions] goes GA, we should just have shirts that say it: “Window Functions are GA”

[00:15:04.450] – Peter Marshall

I love it.

[00:15:05.700] – Reena Leone

People can just walk up and be like, yes, thank you very much. We were waiting for that. Thank you.

[00:15:12.530] – Peter Marshall

I love it.

[00:15:14.230] – Reena Leone

Speaking of the community, let’s talk about what’s gone on in the Druid community a little bit, because none of this would happen without them. Let’s talk about how has the community grown. You’re working with folks day in and day out. What’s that looking like now?

[00:15:31.200] – Peter Marshall

So just take a look at the last six months, right? There were 100 organizations that made some sort of contribution to the Druid repo, and that could be code, but it’s also things like the start-druid script. It’s things like updates to documentation. These are people who are working at Amazon, at Confluent, at Cisco, at Oracle, at Salesforce, a systems integrator in India, a startup in Brazil. Right? And it doesn’t matter whether people work for a big company or a little startup or if they’re an independent person. That’s the joy of an Apache project. Everyone can get involved in it, whether it’s just answering a question someone has about how they should use Druid or whether it’s amending a typo on a docs page. Right? The way that these projects work, it’s why it’s beneficial, it’s why the Apache Way is so important to me and to anyone who’s involved in these Apache projects.

[00:16:30.020] – Reena Leone

Yeah, I mean, if you’re in the Slack channel and you’re asking a question, you could have someone from Imply answer, someone from Netflix, someone from Apple, you never know, right? Because there’s so many folks working towards making Druid better. I think that’s kind of like the fun of it, right? Like everyone has this central point to kind of come together and collaborate. And that’s one of the reasons that I went back to open source and working for a company based on an open source technology was that type of community.

[00:17:02.310] – Peter Marshall

Yes.

[00:17:03.220] – Reena Leone

Actually, speaking of cool people from cool organizations, I’m going to talk about the show for a little bit because we were fortunate enough to have so many guests join us on this undertaking that I have been working on for the last year. It’s been fantastic. And to your point, we’ve had folks from the open source community, we’ve had folks who are Imply customers, we’ve had folks from startups, we’ve had folks from larger companies all come and talk about what they’re doing with Druid and their use cases, and they’ve been so cool to share them with me. The first one was Gwen Shapira, who’s the co-founder and CPO of Nile, talking me through real-time data, when you need it, when you don’t, and how you utilize that data to build trust with your customers. That was a fantastic episode to kind of kick off my first external guests. Lakshmi from Reddit was on the show and it was fantastic to dive into how Reddit is powering their ad platform with Druid. We talked a lot about Kubernetes this year because Kubernetes in Druid is a huge thing. So Yoav Nordmann from Tikal was here to talk about that, as well as Adheip Singh, who has founded his own company, Data Infra.

[00:18:18.890] – Reena Leone

And I think we actually did that episode before the company officially launched. It was like in stealth mode. So that felt like a fun preview. We had Josh Patterson from Voltron Data really talking back to your point about open standards and then diving into the greater Apache ecosystem with Apache Arrow, which was a little bit of diversifying from just talking about Druid. And then probably the biggest one I think for us because of our partnership with Confluent was when Kai Waehner came on the show to talk about Kafka, Druid and Flink, which are three technologies that really work together pretty seamlessly, especially in the era of streaming. Jaylen Stoez came from Ibotta to talk about how they do fraud detection using Druid, which was super cool. And then of course, I feel like Atlassian has been like the heroes of this last half because they’ve done so much with us, talking about how they use Druid with their Confluence product, how their Big Data Platform Team is helping to do customer-facing analytics using Druid. So I had Kasi and Gautam, who have spoken at Druid Summit before, on the show. So many fantastic people.

[00:19:34.460] – Reena Leone

Oh, yeah. And then how can I forget Digital Turbine talking again about how they’re using Druid to power what they’re doing in the ad tech space. Who am I missing? Did I miss anybody, Peter? And then beyond just the folks that we work with at Imply, I’m trying to think. I feel like there are just so many folks that joined the show this year.

[00:19:56.400] – Peter Marshall

It’s really wonderful, right? Because this is like a place where if you’re in the Druid community, you can come and hear other people in the Druid community and find out who they are and what they’re doing with it. Right? So congratulations to you, Reena. Really good year.

[00:20:12.310] – Reena Leone

And we’re going to make the next year even better. I’m so excited. I’m so excited for what’s to come, so let’s get into that a little bit, right? Okay, let’s talk future. I’ve got my tarot cards out. No one can see. I’m predicting what will happen in Druid 29. Druid 30. What is on the horizon for Druid? I’m just kidding. I’m just kidding.

[00:20:34.690] – Peter Marshall

Well, I’ll tell you the things that I’m excited about, and one of them is window functions. Obviously we can’t run away from window functions, but people who maybe are following Hellmar Becker’s blog, right? Recently he did one on updates. Now at Imply, that’s a big discussion we’re having about what contributions we can make around proper SQL completeness on things like update, delete and merge. Right. The kind of functionality that you can get in Druid with some kind of like, if-you-know-what-you’re-doing kind of approaches. But to have that in the SQL dialect itself will just be awesome, right? That’s going to be brilliant. Something that we’re working on in the developer advocacy part of Imply is connecting better with people and technologically with other systems that people use. So things like Grafana or dbt or other projects that we see people connecting to in a Druid-based data architecture, getting those integrations and getting those connections. So if you’re in one of those communities, then please come and speak to me. We’d like to work with you. I’m also hearing about stuff that people are working on not just at Imply, but in other places too, to really improve the way that queries execute.

[00:21:58.910] – Peter Marshall

So doing shuffle operations in the MSQ engine if you know that these stages talk to each other, how do we do that in memory? How do we do that without dropping into S3 or something? How do we do that? How do we only get a particular part of a segment that we need when a query executes? How do we use caching more effectively? Right. All of this again, is because we’re all driving towards a database where performance is the thing that everyone cares about, is performance that we want to get and eke out of this database as much as possible. So I’m super excited about the next year and this rising level of activity across the world in people contributing to and getting involved in the Druid project. It’s going to be a good year.

[00:22:47.340] – Reena Leone

I’m so excited. And you know what, Peter? Speaking of people getting involved in the Druid project, say they’re our first time listener to this show. How would they get involved? How can they start to learn Druid? How can they start to figure out some of these things that we’ve mentioned? If they have used a past version of Druid, where do they go?

[00:23:08.590] – Peter Marshall

So if you’re like, oh, I like the sound of Peter Marshall’s voice, I’d like to listen to him some more. But it’s not just me. There’s also Sergio Ferragut and others who people may know of. And we do open source tech talks, right? So these aren’t like, oh, we’re going to try and sell you something. This is just us sharing with you our knowledge of Druid and how it works. And it could just be really helpful as like a basic intro to how Druid is built, where it came from, and the kind of use cases it gets applied to. You want one of those? The imply.io website has that on there. You can go and sign up and come to one of these either ad hoc or monthly. Right? That’s the first thing that I suggest people do. If you’re kind of like, nah, I just want to get on with it, then definitely go and have a look at the GitHub repo. It’s called learn-druid in the implydata organization on GitHub. So https://github.com/implydata/learn-druid. And that has a really good number of Python notebooks backed by a Docker container with all the kind of components you need to learn how Druid works, right?

[00:24:18.770] – Peter Marshall

You can just go and play with it and you can do some really cool things like work out how Apache DataSketches can help you do approximate distinct count operations, which not a lot of people know about. So go and have a look at that as well. Of course, on the developer center at imply.io, there are a ton of videos and articles and really cool things you can do built around Druid. So definitely go and have a look at imply.io/developer. There’s some really cool content in there. If you’re interested in learning more about what we’re up to on the Imply side, go and have a look at the Imply blog. So Will Xu’s Druid 28 release blog is in there. And if you are someone who hasn’t used Druid for a while and you don’t know what MSQ is, make sure you read Gian’s blog post on the future of Druid. It’s from a few months ago now, but it’s a really good introduction to what MSQ does.
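
As a rough illustration of how a sketch can estimate a distinct count without remembering every value, here is a toy K-Minimum-Values sketch in plain Python. This is not the Apache DataSketches library and not how Druid implements its sketch aggregators; the class name, the choice of k, and the hashing scheme are all assumptions made for the example. The idea it shows is the general one: keep only a small, fixed-size summary of the hashed values and estimate the distinct count from it.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-Minimum-Values sketch for approximate distinct counting."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, to ignore repeats

    @staticmethod
    def _hash(value):
        # Map any value to a pseudo-random float in [0, 1).
        digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64

    def add(self, value):
        h = self._hash(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while under capacity
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


sketch = KMVSketch(k=1024)
for i in range(200_000):
    sketch.add(f"user-{i % 50_000}")  # 50,000 distinct values, each seen 4 times

approx = sketch.estimate()
print(f"approximate distinct users: {approx:.0f}")
```

The sketch stores at most 1,024 hashes however many rows pass through it, and its estimate typically lands within a few percent of the true count. In Druid itself, the DataSketches extension exposes this kind of estimate through SQL functions such as APPROX_COUNT_DISTINCT_DS_THETA and APPROX_COUNT_DISTINCT_DS_HLL; the notebooks in the learn-druid repo are the place to see the real thing.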

[00:25:12.280] – Reena Leone

A new shape for Druid. I know it well. I know it so well. That’s literally what it’s called.

[00:25:20.680] – Peter Marshall

Cool. That’s where I would say if you’re a Druid user already, go and join these meetup groups. So meetup.com. Just search for Druid and you’ll find not just our meetups, but other meetups too. And yeah, look out for events coming throughout the year next year. We’re all very excited to be kicking off this year with getting people together and eating pizza. Who knows?

[00:25:42.950] – Reena Leone

Oh my gosh, I’m so excited. And I will hopefully be at some of these as well. So if my voice sounds familiar, it’s probably me. Peter, you’ve covered everything. I don’t even have to do my usual sign off. I mean, other than if you want to learn more about Apache Druid, visit druid.apache.org.

[00:26:05.350] – Peter Marshall

Awesome. Great minds think alike.

[00:26:09.250] – Reena Leone

And you know, if you want to learn more about what we do, visit imply.io. Peter, thank you so much for doing this episode with me. I am so happy that you’re here and I’m so excited for the new year. So all I have left to say is, until next time, keep it real.
