Druid and Joins Debunked! with Sergio Ferragut and Hellmar Becker
There’s a persistent myth surrounding Apache Druid’s capabilities when it comes to handling joins. This misconception has created confusion and led to some skepticism about the platform’s abilities. On this episode, Sergio Ferragut, a senior developer advocate at Imply, and Hellmar Becker, a senior sales engineer at Imply, debunk the myth and shed light on the reality of Druid’s join capabilities.
While Druid isn’t a traditional relational database and doesn’t handle table joins in the same way as SQL joins, it does have the ability to perform joins using different methods, such as lookups and shuffle joins. There are nuances to how Druid handles joins and while it may not be optimized for large fact-to-fact joins at query time, it offers alternative approaches.
Druid’s approach to joins may require a different mindset and modeling work, but it ultimately offers valuable capabilities for addressing complex data analytics challenges.
Listen to the episode to learn:
- How Druid processes aggregation and filtering on the historicals, returns partial results to the brokers, and can process joins on the broker
- Why Druid chose to optimize for the use case where joins are done ahead of time, pre-flattening the data so that large queries can be solved in parallel
- How organizations, such as a big telco operator, have successfully utilized Druid for handling joins at scale.
Learn more
- Joins in Apache Druid
- Apache Druid 26.0: Breaking Down Druid’s Latest Release
- Druid Architecture & Concepts
About the Guests
Sergio Ferragut is a database veteran turned Developer Advocate. His experience includes 16 years at Teradata in professional services and engineering roles. He has direct experience in building analytics applications spanning the retail, supply chain, pricing optimization and IoT spaces. Sergio has worked at multiple technology start-ups including APL and Splice Machine, where he helped guide product design and field messaging. He joined Imply as a Developer Advocate in 2021 to help and to learn from the Apache Druid open source community.
Hellmar Becker is a Senior Developer Advocate at Imply. He has worked in data analytics for more than 20 years in various pre- and post-sales roles. Hellmar has worked with large customers in the finance, telco and retail industries, and spent several years at the big data company Hortonworks, and more recently at Confluent.
Transcript
[00:00:00.490] – Reena Leone
Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data, and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level.
[00:00:18.240] – Reena Leone
Now, we know that I have a definite bias towards Druid, but I’m realistic about what it’s meant for and what it’s not meant for. Every technology has its key use cases and places where it’s a good fit and where it’s not a great fit, right? But there are some misconceptions about Druid’s capabilities.
[00:00:34.420] – Reena Leone
I’ll actually be joined by today’s guest at Druid Summit this year to talk about them, but stay tuned at the end of the show for more info on that.
[00:00:41.920] – Reena Leone
Back to today’s episode. We are debunking a pretty persistent Druid myth, one that keeps coming up. Druid can’t do joins. And sure, Druid isn’t a relational database, so table joins don’t quite work like traditional SQL joins. That’s because Druid was designed for fast queries on large scale datasets, and it primarily focuses on aggregating and filtering data across dimensions and measures.
[00:01:04.630] – Reena Leone
But that doesn’t mean that it can’t do any joins. It just does them differently. To get to the bottom of this, I’m joined by returning guests Sergio Ferragut, senior developer advocate at Imply, and Hellmar Becker, senior sales engineer at Imply, and they are the myth-busting team for today. Sergio, Hellmar, welcome back to the show.
[00:01:25.480] – Sergio Ferragut
Hey Reena, thanks for having me back.
[00:01:27.760] – Hellmar Becker
Hey, Reena.
[00:01:28.720] – Reena Leone
So since you’re all my first repeat guests, we’ve already kind of dug into your backgrounds and experience with Apache Druid. You know your stuff. So I wanted you to come back and help me out with this nasty little rumor I keep hearing about. I’d even go as far as to say it’s like database trash talk that Druid just can’t do joins.
[00:01:47.270] – Sergio Ferragut
Yeah, that’s not right. So Druid can do joins. Druid actually does joins in two different ways. Druid has a concept of lookups, which allow you to do fast, in-memory joins. Well, we’ll get into the explanation, and it also newly does shuffle joins, but we’ll get into the shuffle joins in a second. But normally it solves any joins in the broker. So it processes the aggregation and the filtering in the historicals, returns the data back to the brokers and can process joins in the broker. So Druid can do joins.
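For illustration, here is a minimal sketch in Druid SQL of the kind of join Sergio is describing; the datasource and column names (orders, currency_rates, amount, currency_code) are hypothetical, not from the episode. The filtering and aggregation of the large side fan out across the data servers, while the small right-hand side is materialized as a subquery and joined without any shuffle:

```sql
-- Hypothetical example: join a large fact datasource against a much
-- smaller dimension datasource. The heavy lifting (filtering and
-- aggregating "orders") runs in parallel on the historicals; the small
-- right-hand side is materialized and joined without a shuffle.
SELECT
  r.currency_name,
  SUM(o.amount) AS total_amount
FROM orders AS o
JOIN currency_rates AS r
  ON o.currency_code = r.currency_code
WHERE o.__time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY r.currency_name
```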
[00:02:27.700] – Reena Leone
Okay, so Sergio, then why does this rumor sort of persist that it can’t?
[00:02:33.730] – Sergio Ferragut
Well, I guess it’s because it’s doing the joins in the broker. Right. So what that means is it’s parallelizing all of the aggregation and filtering that happens in the data layer, right? In the historicals and in the streaming tasks, all of that occurs in parallel. They each return a subset of the results, a partial result, back to the broker. And that’s where the broker can take multiple sub-queries or multiple data sets and do the joins. The problem is, it’s a single thread. So yes, if you return a lot of data to the broker, it’ll take a while to solve that join. Right? So yeah, it’s not meant for large fact-to-fact joins at query time.
[00:03:16.380] – Reena Leone
Well, I guess that kind of goes back to what is the purpose of Druid, right? It’s not meant for certain things. It’s meant to be fast.
[00:03:28.770] – Sergio Ferragut
Right. Data warehouses and data lakes are very much designed to do large joins right at query time. And because they can do that, they can solve a larger variety of fact-to-fact joins, but they’re not going to respond in subsecond response times once you have sufficiently large data sets, if you have to do that kind of join, right? So yeah, Druid made a decision to optimize the use case where the joins are pre-done, so the data is pre-flattened so that we can solve queries, large queries, in parallel, in subsecond response times with ad hoc aggregation. But yeah, without fact-to-fact joins.
[00:04:12.860] – Reena Leone
So maybe to fully squash this misconception we should go back to where maybe it started. So broadcast joins weren’t introduced until Druid 14, I believe. But with those there was still a limit on the scale of the data that you can join. So it was broadcast joins versus, say, lookups. Can we talk a little bit about those?
[00:04:33.750] – Hellmar Becker
Yeah, maybe I can take this. Lookups are… well, first of all, Sergio just said it, a standard Druid join is good for joining one large set of data against a smaller one, like fact versus dimension, but it happens on the query node, on the broker, and we’ll come to why that is in a moment. So this kind of joining data is quite performant, but it’s all processed in a single thread, so it cannot really take advantage of parallelism. So what was introduced in order to alleviate that problem was broadcasts via so-called lookups. So lookups are data sets, and you typically want to keep them small; they are kept in memory on every single data server. So each data server has a copy of that lookup table. So imagine you have something like a country list or an address list. So your typical small warehouse dimension would live in such a lookup table. And these are blazing fast. But then again, at the expense of permanently keeping a copy of each of these data bits on all the nodes.
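As a sketch of what using a lookup looks like in Druid SQL (the lookup name country_lookup and the web_events datasource are hypothetical, not from the episode), the LOOKUP function applies the in-memory key/value table on each data server, so no data has to be moved around to perform the “join”:

```sql
-- Hypothetical example: "country_lookup" maps country codes to country
-- names and is kept in memory on every data server.
SELECT
  LOOKUP(country_code, 'country_lookup') AS country_name,
  COUNT(*) AS events
FROM web_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY LOOKUP(country_code, 'country_lookup')
```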
[00:05:51.640] – Reena Leone
Well, I know that since then, in May, Druid 26 was released, and that included shuffle joins. So, Hellmar, can you talk to me a little bit about what are shuffle joins?
[00:06:03.260] – Hellmar Becker
Yeah, so, as we already hinted, there are still a lot of use cases where you do want a join where both sides of the join are too large to fit on a single server, and also too large for you to send a copy of one of the tables to all of the servers, because it’s just not going to fit. So what you want to do with a shuffle join is you shuffle the input data in such a way that the bits that need to be joined together are cut up into packages, and each of these packages is sent to one server to process. So it’s like MapReduce, where you actually have map, shuffle, reduce. And that is what is now implemented in Druid. And that lets us do much bigger joins.
[00:06:51.980] – Reena Leone
But still, they aren’t subsecond at a certain size, right? It goes back to kind of the purpose of Druid.
[00:06:59.790] – Hellmar Becker
It goes back to the purpose of Druid, and it also goes back to the fact that you can’t beat physics. Druid, as you probably know, Reena, was deliberately designed to use a scatter-gather query pattern, where you assume that you are able to cut up your query workload into packages that can be processed each by one server without communicating with anybody else. And that works only if you have one big table. It works if all the data that one subtask needs is available on the same server. And in order to be able to process general joins, you need to introduce a shuffle phase that puts the right data pieces together. And that comes at a cost. Always. Every database system that allows these generalized joins has to pay that price.
[00:08:01.520] – Reena Leone
Yeah, especially the larger the data set, the more difficult it becomes.
[00:08:06.240] – Hellmar Becker
That’s right.
[00:08:07.000] – Reena Leone
But all this hullabaloo about joins, there are folks who are kind of circumventing them entirely with unions. Sergio, I think you mentioned Target is doing that, right?
[00:08:18.370] – Sergio Ferragut
Yeah, we should link that video in the description here. But that’s the Target Druid Summit presentation that occurred, I believe, in 2021. The guy from Target described what they were doing: they prepare the tables such that they have the same set of columns, aggregation columns, even though they may be talking about different things. Like some is sales data and some is sales forecast data and others may be inventory data, but they have the same set of columns that they’re aggregating on when they ingest them. So when they query them, they can query them independently and use a union to bring the results back together. So it’s a pseudo join, it’s not really joining the data, but it does bring it back together in a single result set and can be used for different metrics from different sources of data. And that’s what Target is doing. So it’s an interesting way of getting around the lack of fact-to-fact joins at query time.
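A rough sketch of that pattern in Druid SQL (the datasources sales and sales_forecast and their columns are hypothetical, not Target’s actual schema): each datasource is modeled with the same grouping columns, queried independently, and the results are stacked with a top-level UNION ALL:

```sql
-- Hypothetical example of the "pseudo join": two datasources share the
-- same grouping columns, so their aggregates can be combined into one
-- result set without a fact-to-fact join.
SELECT store_id, 'sales' AS metric, SUM(amount) AS value
FROM sales
GROUP BY store_id
UNION ALL
SELECT store_id, 'forecast' AS metric, SUM(amount) AS value
FROM sales_forecast
GROUP BY store_id
```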
[00:09:23.490] – Reena Leone
It’s Target. I imagine they have a lot of data that they’re dealing with all the time.
[00:09:28.180] – Sergio Ferragut
Right. And the other thing we should mention, right, is that with the introduction of shuffle joins, it allows you to do that preparation of the data ahead of time, right. Because Druid can now do the shuffle join, right. And you can issue a select to Druid that naturally uses the shuffle join, but that is not going to be a subsecond query. Right. I think we’ve established that. So the way we’re seeing implementations use Druid’s shuffle join is to use it for ingestion. So they ingest from multiple sources, potentially multiple external sources. So they may be transforming some data that they’ve already loaded into Druid. But the point is you can do that shuffle join at ingestion time. You do a batch of processing such that you do those joins ahead of time, so that when you do the queries you do get a scatter-gather and you have all of the dimensions available to do ad hoc analysis, aggregating and filtering on all those dimensions.
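Here is a hedged sketch of that ingestion-time pattern using Druid’s SQL-based, multi-stage ingestion; the target table, the external file URI, and the schema are all hypothetical. The shuffle join is paid once, at ingestion, producing a pre-flattened datasource that query-time scatter-gather can aggregate and filter directly:

```sql
-- Hypothetical example: pre-join an external fact source against a
-- dimension datasource at ingestion time, writing a flattened table.
REPLACE INTO sales_enriched OVERWRITE ALL
SELECT
  TIME_PARSE(f."timestamp") AS __time,
  f.store_id,
  d.region,
  f.amount
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/sales.json"]}',
    '{"type":"json"}',
    '[{"name":"timestamp","type":"string"},{"name":"store_id","type":"string"},{"name":"amount","type":"double"}]'
  )
) AS f
JOIN store_dim AS d ON f.store_id = d.store_id
PARTITIONED BY DAY
```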
[00:10:27.090] – Reena Leone
You’re saying we just have to do things a little bit differently. Not that it can’t, it’s just a little bit different?
[00:10:32.942] – Sergio Ferragut
Correct.
[00:10:33.050] – Hellmar Becker
Yeah, basically, I think what you’re building then is like a materialized view, right?
[00:10:38.130] – Sergio Ferragut
Essentially, yeah.
[00:10:39.410] – Reena Leone
Very interesting. But yeah, I think that’s one thing to consider is why are you using Druid? What are you trying to accomplish in the first place? And there are always going to be trade offs. However, maybe this is a little bit of a heated question, but I’m all about transparency. How much slower would it be? It’s not going to be like a snail’s pace, right?
[00:11:04.790] – Sergio Ferragut
Well, it depends, right? Hellmar, you want to give it a shot? It’s probably going to be at least an order of magnitude.
[00:11:14.250] – Hellmar Becker
Well, we are working with this big telco operator that is using Druid at scale, and I think they are maybe not quite subsecond when they really, I mean, they really join all their network data and their mobile data together, but they still achieve query performance in the range of seconds.
[00:11:31.490] – Sergio Ferragut
While doing the shuffle join?
[00:11:34.890] – Hellmar Becker
Yes.
[00:11:35.950] – Sergio Ferragut
And this is in Druid?
[00:11:38.200] – Hellmar Becker
In Druid, yeah.
[00:11:39.660] – Sergio Ferragut
Okay, that’s interesting.
[00:11:42.910] – Hellmar Becker
Actually. No, they don’t do a shuffle join. They do the join on the broker.
[00:11:48.290] – Sergio Ferragut
Okay. Right.
[00:11:49.490] – Hellmar Becker
So they have pretty beefy brokers and that is what they decided to do.
[00:11:54.420] – Reena Leone
Wait, did you say beefy brokers? Because I love that. That’s going to be another thing from this.
[00:12:02.790] – Sergio Ferragut
But yeah, you can do that. Right. You’re absolutely right, Hellmar. Let’s make sure that people understand that you can do these joins on the broker and you don’t necessarily need an incredibly beefy broker. It helps to have a beefy broker for sure. But the point is you design your queries such that they’re not returning billions of rows to the broker to do the join. They’re returning maybe hundreds of thousands. Right. And you could still do something fast. But yes, the join on the broker is a bottleneck that you need to design around.
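As a sketch of that design principle in Druid SQL (datasource and column names are hypothetical): aggregate each side down in a subquery first, so the broker only has to join a small number of pre-aggregated rows rather than billions of raw ones:

```sql
-- Hypothetical example: each side is aggregated in parallel on the data
-- servers; the broker then joins two small intermediate result sets.
SELECT
  s.store_id,
  s.total_sales,
  r.total_returns
FROM (
  SELECT store_id, SUM(amount) AS total_sales
  FROM sales
  GROUP BY store_id
) AS s
JOIN (
  SELECT store_id, SUM(amount) AS total_returns
  FROM returns
  GROUP BY store_id
) AS r ON s.store_id = r.store_id
```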
[00:12:41.560] – Hellmar Becker
Right.
[00:12:42.000] – Sergio Ferragut
You need to think about it.
[00:12:43.640] – Hellmar Becker
There are a lot of considerations that come into play. If you come from a more traditional background, you think in terms of a relational database, and then you will often find yourself having to rethink a few things, and then all of a sudden it becomes easy in Druid. Just don’t stick to your old terms.
[00:13:06.880] – Reena Leone
Right. And you shouldn’t probably do that with open source tech anyway because things are always changing. Or if you’re working with us at Imply, there’s always new versions. The community is always working to improve things, make them as fast as possible. So if you’re still doing the things the old way, then that’s on you. Not to call anybody out, but that’s kind of on you.
[00:13:33.410] – Sergio Ferragut
So yeah, I guess one of the things we can conclude on is, yes, Druid does require that you, like Hellmar says, think a little differently and that you do the modeling work. There is some modeling work to do and you need to think about what kinds of patterns of queries you want to solve and with what response times, and you build the data structures in Druid that will support those.
[00:13:57.370] – Hellmar Becker
Well, come to think of it, Sergio, you also have this rich experience in RDBMSs. Right. And what you often don’t think about is how much mental work goes into designing your relational data model to do proper joins. Because you spent a lot of time figuring out how to partition your data, which kinds of indexes to put on both sides of the join to make them really perform well, because otherwise your relational database is also going to suck at joins.
[00:14:27.440] – Sergio Ferragut
Right, right.
[00:14:28.680] – Hellmar Becker
It’s just that the ways you think out these things in Druid are new, which sometimes makes them stand out more. But in the end, these are cycles that you need to put into proper planning with any system.
[00:14:43.530] – Reena Leone
I love a good planning cycle, let me tell you. Okay, so to conclude, Druid can do joins, it just does them a little bit differently.
[00:14:54.360] – Sergio Ferragut
Absolutely.
[00:14:55.330] – Reena Leone
All right, I think we squashed that, guys. I think we have it. This is here. So if anybody has any questions, we can just point them in this direction. Hellmar, Sergio, thank you again for joining me on today’s show. As I mentioned in the intro, we’ll be doing a whole session, Demystifying Druid Myths, at Druid Summit 2023, taking place on December 6. We’ll actually show you how Druid does joins, updates and more. Registration is now open on druidsummit.org. In the meantime, if you want to learn more about Apache Druid, please visit druid.apache.org. And if you want to learn more about what we’re all doing here at Imply, please visit imply.io. Until next time, keep it real.