Everything You Need to Know About SQL-Based Ingestion in Apache Druid

Mar 14, 2023
Reena Leone
 

The addition of the multi-stage query framework in Apache Druid 24.0 enabled several other key features – one of them being SQL-based ingestion. By embracing SQL-based ingestion, Apache Druid becomes more accessible and user-friendly, as it aligns with the skills and knowledge of a broader community of developers. This simplifies the usage of Druid, empowering developers to leverage their existing SQL expertise while benefiting from Druid’s powerful capabilities.

By leveraging the familiar SQL language, users can easily define ingestion tasks by specifying the desired data sources, transformations, and schema mappings. This flexible process enables the ingestion of data from various databases, streaming platforms, and file formats. With SQL-based ingestion, Apache Druid empowers developers to effortlessly load, transform, and optimize their data for fast querying and analytics. By embracing the versatility and efficiency of SQL, Apache Druid simplifies the data ingestion workflow, allowing users to focus on extracting valuable insights from their datasets.

Listen to the episode to learn:

  • How SQL-based ingestion works specifically in Apache Druid
  • How partitioning works in Druid and how it relates to SQL-based ingestion
  • The benefits of using SQL-based ingestion in Druid
  • How to get set up if you’re new to Druid


About the guest

Sergio Ferragut is a database veteran turned Developer Advocate. His experience includes 16 years at Teradata in professional services and engineering roles. He has direct experience in building analytic applications spanning the retail, supply chain, pricing optimization and IoT spaces. Sergio has worked at multiple technology start-ups including APL and Splice Machine where he helped guide product design and field messaging. He joined Imply as a Developer Advocate in 2021 to help and learn from the Apache Druid open source community.

Transcript

[00:00:00.730] – Reena Leone

Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I'm your host, Reena from Imply, and I'm here to bring you stories from developers doing cool things with analytics way beyond your basic BI. I'm talking about analytics applications that are taking data and insights to a whole new level. Today we're talking about the relatively new SQL-based ingestion in Apache Druid. How does this change the game for developers working with Druid? How does it work, and how does it differ from SQL-based ingestion in other databases? Why was it added, and what's coming in the future? To answer these questions, I'm joined by Senior Developer Advocate at Imply, Sergio Ferragut. Sergio, welcome to the show.

[00:00:38.250] – Sergio Ferragut

Hi.

[00:00:38.830] – Reena Leone

So you are a developer advocate at Imply, and if someone is attending a Druid meetup, they'll likely meet you. Can you tell me a little bit about your journey to get to this spot?

[00:00:47.610] – Sergio Ferragut

Sure. Well, I come from a background in databases. I started my career in data warehousing with Teradata and spent quite a bit of time there implementing data warehouses and working with engineering on the high availability side of Teradata. From there I moved to a few startups. This is my third startup since then, and I love the startup environment and the energy. So I'm enjoying my time at Imply here.

[00:01:19.130] – Reena Leone

Awesome. So you have a lot of experience with databases. How did you hear about Apache Druid? Or was that after you came to Imply?

[00:01:26.850] – Sergio Ferragut

It was actually after. Before I came to Imply, for sure, I had not heard of Druid. So it's interesting. My job is now interesting because I actually have to promote Druid and let more people know about Druid. That's sort of what a developer advocate does.

[00:01:43.620] – Reena Leone

It's been really interesting. I've been asking a lot of guests how they've heard about Druid, and I feel like everybody has a different story, and most of them seem to be random happenstance. But let's just get right into the topic of today, which is SQL-based ingestion. Now, SQL-based ingestion as a concept isn't new by any means, but it's new to Druid. The work around the new multi-stage query engine actually helped enable this feature. So can you tell me a little bit more about it?

[00:02:11.050] – Sergio Ferragut

Sure. Let's start with the multi-stage query framework. Druid before MSQ used a scatter-gather approach for queries. This means that it had a broker service that distributed the query into subqueries that it sent to multiple nodes, processed them in parallel, got results from each one, merged the results and returned them back to the user. So it's a two-stage process, right? The historicals resolve the bulk of the calculation, and the broker does the finalization to return the results. With the multi-stage query engine, things change. It enables, like its name says, multiple stages of processing: each stage has different tasks to do, and it has the ability to reorganize the data between stages, and that enables a lot of functionality that Druid didn't have before.

[00:03:17.820] – Reena Leone

Yeah, we talked a little bit about the broker being a bottleneck a couple of episodes ago, when we were talking about what's new. But I know that SQL-based ingestion was a really important step in actually making Druid a little more simple and easier to use. Can you explain that to me a little bit?

[00:03:36.220] – Sergio Ferragut

Sure. Prior to SQL-based ingestion, all ingestion was submitted to Druid through a JSON spec. That's a complex nested structure with lots of different sections and attributes that is, quite frankly, not very friendly to code, right, or to adjust and to tune. So the idea of SQL-based ingestion is to make it a lot easier. It's essentially writing insert or replace statements followed by a select, and almost everybody knows SQL. So defining the transformations and the ingestion just through SQL makes it a lot easier to do batch ingestion.
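
To make that concrete, here is a minimal sketch of what such a statement can look like; the table, file, and column names are illustrative, not from the episode:

    INSERT INTO "events"
    SELECT
      TIME_PARSE("timestamp") AS __time,   -- Druid's primary time column
      "user_id",
      "page"
    FROM TABLE(
      EXTERN(
        '{"type": "http", "uris": ["https://example.com/events.json.gz"]}',   -- input source
        '{"type": "json"}',                                                    -- input format
        '[{"name": "timestamp", "type": "string"}, {"name": "user_id", "type": "string"}, {"name": "page", "type": "string"}]'
      )
    )
    PARTITIONED BY DAY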

[00:04:23.280] – Reena Leone

So what types of ingestion SQL statements can you do now in Druid?

[00:04:32.250] – Sergio Ferragut

With SQL-based ingestion, you can do inserts and replaces. So you do an insert to append data to a table, and you can do a replace to replace a section of data or to rewrite a whole table. Those are, at a high level, the statements that you can do. Now, they're followed by a select, right? You're inserting something that you're transforming, and the select is really where you're defining what it's going to do. You can do straight selects that are selecting time and attributes and a bunch of columns from external sources. That's the other aspect of it that's interesting: the from clause has an external table function capability that allows you to access external sources, external sources of data like files in blob storage, or external databases where you can submit SQL. So the select statement allows you to build, like I said, a straight select or an aggregation query. At the time that you're ingesting the data, you can ingest the raw data, the raw events, and transform them with SQL expressions as they're coming in, or alternatively aggregate the data in some form as it's being ingested so that it's prepared for efficient queries once it's ingested.
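
As a hedged sketch of the replace side of that (again with made-up table, file, and column names), a statement can overwrite a single time interval, or the whole table with OVERWRITE ALL, and it can aggregate the data as it ingests:

    REPLACE INTO "pageviews"
      OVERWRITE WHERE "__time" >= TIMESTAMP '2023-03-01' AND "__time" < TIMESTAMP '2023-03-02'
    SELECT
      TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS __time,   -- roll raw events up to hourly
      "page",
      COUNT(*) AS "views"
    FROM TABLE(
      EXTERN(
        '{"type": "http", "uris": ["https://example.com/2023-03-01.json.gz"]}',
        '{"type": "json"}',
        '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}]'
      )
    )
    GROUP BY 1, 2
    PARTITIONED BY DAY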

[00:05:49.110] – Reena Leone

And we’re talking about batch ingestion here, correct?

[00:05:52.270] – Sergio Ferragut

Correct.

[00:05:52.980] – Reena Leone

Okay, awesome. Since I have you on the show, maybe I should take a minute to talk about partitioning, right. Since I feel like this is an area that you have a lot of expertise in and that you have talked about a bit, and I know this kind of ties in, can you give me kind of a brief overview of partitioning in Druid?

[00:06:10.430] – Sergio Ferragut

Sure. Let's step back and say there's a distinction between streaming ingestion and batch ingestion. With streaming ingestion, you cannot do it through SQL yet; we're working on enabling that. So you do have to specify the JSON spec that describes what topic it's reading from, how many tasks it's going to use, or how parallel it is, and how it's ingested. In that case, because of how real-time ingestion works, the data isn't really repartitioned. By partitioning we mean organizing the data in a specific way. Real-time ingestion brings the data in as it arrives through the stream's partitions, builds segments out of it, and organizes them in time. Now let's go back to batch ingestion. With batch ingestion, because you have the ability to reorganize the data in batch, because you have the whole data set that you're going to ingest, you can organize it in two different ways. You can organize it in time: in SQL-based ingestion, it's really just specifying your insert, your select, and then after that a PARTITIONED BY clause, and that just tells it how to partition by time. You can do it by hour, you could do it by week, by day, by month.

[00:07:35.190] – Sergio Ferragut

The most typical implementations and the recommended granularities for that are hour or day, but it depends a lot on the density of your data. The second level of organizing the data, after you've separated the data into blocks of time, is that within each block of time you can do secondary partitioning. And that's what I believe you're referring to. In traditional batch ingestion, you could do hash partitioning. This takes the values of certain columns that you select, hashes them into a value, and then assigns them into hash buckets, trying to create partitions that are all relatively the same size, around 5 million rows or so each. So that organizes the data by hash value. When you query that data using the dimensions that it was hashed on, you can prune the data further and just select the individual partitions that contain the particular values you're looking for. Now, with SQL-based ingestion we use a specific kind of partitioning called range partitioning, which is also available in native ingestion. But range partitioning turns out to be the best method almost always. So that's why it was selected as the default partitioning mechanism for batch ingestion with SQL.

[00:08:59.050] – Sergio Ferragut

So you specify it by using a CLUSTERED BY clause in the statement. So again it's the insert, followed by the select, followed by PARTITIONED BY, followed by CLUSTERED BY, where you specify which dimensions you want to cluster by. And what that does is it takes the cluster-by dimensions, looks at the overall distribution of values within those dimensions, and partitions the range in such a way that it builds segments that are of approximately equal size, approximately the few million rows that you're targeting for each segment. So it does that automatically. And again, it's at query time that this enables fast queries, because we know what range of values exists within each segment file; that's part of the metadata. And when the query comes in, we can prune the segments and select which specific segments are interesting to query, and therefore accelerate the queries.
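
Putting those pieces in the order Sergio describes, an illustrative sketch (hypothetical names again) looks like this; the trailing clauses control the time partitioning and the secondary range partitioning:

    INSERT INTO "clickstream"
    SELECT
      TIME_PARSE("ts") AS __time,
      "channel",
      "country",
      "bytes"
    FROM TABLE(
      EXTERN(
        '{"type": "http", "uris": ["https://example.com/clicks.json.gz"]}',
        '{"type": "json"}',
        '[{"name": "ts", "type": "string"}, {"name": "channel", "type": "string"}, {"name": "country", "type": "string"}, {"name": "bytes", "type": "long"}]'
      )
    )
    PARTITIONED BY DAY                  -- primary partitioning: one time chunk per day (could be HOUR, WEEK, MONTH)
    CLUSTERED BY "channel", "country"   -- secondary (range) partitioning within each day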

[00:09:56.190] – Reena Leone

Yeah, because I was seeing something like that: using SQL-based ingestion makes batch ingestion around 40% faster. And that's what we would recommend now if you're doing batch ingestion in Druid.

[00:10:10.310] – Sergio Ferragut

Right. SQL-based batch ingestion not only has all those capabilities of making things simple in SQL, but it's also more efficient than native batch ingestion. And I assume mileage will vary based on the data demographics and how specifically it needs to resolve a particular ingestion. But yeah, we've seen around a 40% performance improvement in many batch ingestions.

[00:10:36.310] – Reena Leone

Can you walk me through the process of setting this up in Apache Druid or has it changed from the way that you would typically do ingestion?

[00:10:45.080] – Sergio Ferragut

Actually, it was introduced in Apache Druid 24, I believe, and in Apache Druid 25 it became the default.

[00:10:55.330] – Reena Leone

We talk about it like it was so long ago. That happened fairly recently, just a few months ago.

[00:11:03.330] – Sergio Ferragut

But anyway, it's become a default configuration. So you really don't need to worry about it unless you're upgrading from a previous release. And then it's really just an extension that gets added to the list of extensions that are loaded, which is the multi-stage query extension, druid-multi-stage-query.

[00:11:22.080] – Reena Leone

Let's get into maybe the next level down in terms of SQL-based ingestion. So one of the questions that I came up with was: what is the syntax for creating tables and defining columns? Can you talk a little bit about that?

[00:11:38.780] – Sergio Ferragut

So in the past, with native ingestion, you had to individually define everything. Now it's really just the content of your select, which is really interesting. Just to take that from the beginning: if you do an insert for the first time or a replace for the first time, that table doesn't exist yet. You just follow with the select part, where you produce all of the columns that you want to create in that table, and that actually defines the schema of the table. You always have to select a time column. Time is the primary partitioning, so you select some transformation of your input time, and then you select a bunch of attributes with whatever transformations you want. You can do aggregation in the SQL, but the resulting set of columns from your select is really what defines the schema of the table, regardless of whether you're doing straight ingestion or aggregate ingestion.
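
As a small, hypothetical illustration of that: on the first insert or replace into a table, the output columns of the select, including the required __time column, become the table's schema.

    -- Creates (or recreates) a "sales" table whose schema is exactly these four columns
    REPLACE INTO "sales" OVERWRITE ALL
    SELECT
      TIME_PARSE("order_time") AS __time,      -- transformed input time
      "product",
      UPPER("region") AS "region",             -- any SQL expression can define a column
      CAST("revenue" AS DOUBLE) AS "revenue"
    FROM TABLE(
      EXTERN(
        '{"type": "http", "uris": ["https://example.com/sales.json.gz"]}',
        '{"type": "json"}',
        '[{"name": "order_time", "type": "string"}, {"name": "product", "type": "string"}, {"name": "region", "type": "string"}, {"name": "revenue", "type": "string"}]'
      )
    )
    PARTITIONED BY DAY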

[00:12:34.790] – Reena Leone

And then can you also explain this concept of virtual columns?

[00:12:40.690] – Sergio Ferragut

In the past, in native ingestion, this concept of virtual columns was the idea that you have some input columns that you may not be ingesting into your final table. You're transforming them in some way, creating some expression, and that is considered a virtual column that you then ingest into a physical column. In Druid with SQL, the concept is a little less distinct, right? Because you don't have to actually do anything different. Every column that you're selecting is essentially an expression. It could just be the column, but that's also a valid SQL expression. Or it could be a complex expression with case statements and calculations and so on, and that becomes essentially a virtual column. It's a transformation that you're doing as you're ingesting the data; in SQL it just becomes natural.
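
For instance, in the sketch below (illustrative names), the passed-through column and the derived CASE expression are written the same way; the derived one is what native ingestion would have called a virtual column:

    INSERT INTO "http_logs"
    SELECT
      TIME_PARSE("ts") AS __time,
      "status_code",                                      -- passed through as-is
      CASE
        WHEN "status_code" >= 500 THEN 'server_error'
        WHEN "status_code" >= 400 THEN 'client_error'
        ELSE 'ok'
      END AS "status_class"                               -- derived ("virtual") column
    FROM TABLE(
      EXTERN(
        '{"type": "http", "uris": ["https://example.com/logs.json.gz"]}',
        '{"type": "json"}',
        '[{"name": "ts", "type": "string"}, {"name": "status_code", "type": "long"}]'
      )
    )
    PARTITIONED BY DAY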

[00:13:32.760] – Reena Leone

So now we’ve kind of shifted, and we mentioned before that this is like the most recent version of Druid. You’d have to upgrade to the most recent version to get all this. Should everybody do this to use SQL based ingestion? Are there certain use cases where this would make more sense versus the native ingestion that we had before? Would you recommend everybody just go this way for their batch processing?

[00:13:58.970] – Sergio Ferragut

Yeah. The improvements in ease of coding, the speed of the ingestion itself, and control of its parallelization make it a lot simpler with SQL-based ingestion than it was with native ingestion. So, yeah, I think at this point the recommendation is just to shift over to SQL-based ingestion. If you have the time to convert your previous ingestions, I think you'll get a boost in performance. So that's a good thing. And just maintaining SQL is way easier than maintaining the JSON specs.

[00:14:34.080] – Reena Leone

And then I know we talked about it briefly, but can you tell me anything else about the streaming ingestion that’s coming?

[00:14:42.830] – Sergio Ferragut

Well, it's actually not well defined at this point in terms of how we're going to do it.

[00:14:46.670] – Reena Leone

I'm trying to get some inside knowledge, by the way, because I don't know any further than that. So I'm like, do you have the inside track?

[00:14:54.300] – Sergio Ferragut

I have my hopes for it. There are other technologies that define streaming SQL, like Kafka's ksqlDB, which has this idea of creating a stream as part of a definition you can write with SQL. So I'm hoping for something like that, where we do a similar thing: we create a streaming ingestion and then provide a select that has a from clause pointing to a topic in Kafka or a topic in Kinesis or something of that nature. That's what I'm hoping for, but it's not clearly defined at this point.

[00:15:36.300] – Reena Leone

All right, we’ll see. We’ll keep an eye on it. So if anybody else is like me, and they want to learn more about Apache Druid, where should they go, what should they do?

[00:15:48.190] – Sergio Ferragut

The first place I'd go would be learn.imply.io. That's where we publish training courses. We have a number of courses that are available right now. They're currently free. I don't know if they'll be free forever, so take advantage of it.

[00:16:04.740] – Reena Leone

Get them now. Get them now.

[00:16:07.510] – Sergio Ferragut

But we have courses right now on Druid basics, on MSQ, so on SQL-based ingestion, on metrics, and on logs, and we're trying to expand the curriculum in the Learn Imply environment. There's also the Slack channel. We support the community through the Druid Slack channel; that's the primary Q&A environment that we use. We also answer questions on Stack Overflow and the Google user group for Druid, but I'd recommend Slack as the main source of Q&A.

[00:16:46.560] – Reena Leone

And speaking of meetups, you host a lot of meetups and office hours, and I don't know when the next one is or what the schedule is. How regularly do you do those?

[00:16:58.200] – Sergio Ferragut

So office hours is a meetup. You sign up on Meetup.com, you just join any of the Apache Druid groups that are on Meetup.com, and you'll see that every two weeks, on a Friday, we have one at least in the Americas. We actually have them every two weeks on different days of the week: one for the Americas, one for Europe, and one for Asia-Pacific, where we do an informal meetup where we invite folks that are Druid practitioners, new to Druid, or just want to discuss use cases. And we discuss whatever comes up, right? Whether it's somebody with a use case who wants to figure out whether Druid is a good fit, or somebody that's troubleshooting some particular ingestion issue, where we can dig deep into the troubleshooting side. So it's always an interesting session because it's always different.

[00:17:55.830] – Reena Leone

Yes. And I would like to encourage people: if you're not familiar with Druid, if you have questions, you guys are immensely helpful, Sergio, you and your whole team, with explaining things, especially today. I've learned everything I need to know, I think, about SQL-based ingestion. But I want to encourage folks: you don't have to be an expert in Druid to attend. In fact, it's better that you aren't. This is the place to learn, and everyone's welcome.

[00:18:19.070] – Sergio Ferragut

That's right. That's what we do in the developer relations team. We're promoting the use of Druid, and we're trying to teach people how to do it and how to implement it successfully.

[00:18:27.990] – Reena Leone

Because that's one of the keys of open source, too: the community and learning from each other. That's how technology gets better. Well, that's going to do it for this week's episode. You know what? Thank you, Sergio. Thank you so much for joining me today.

[00:18:42.480] – Sergio Ferragut

Thank you. Thanks for having me.

[00:18:44.470] – Reena Leone

Of course. And so if anyone wants to learn more about Apache Druid, please visit druid.apache.org. And if you want to learn anything more about Imply or join any of the classes or things that we talked about here, please visit imply.io. Until then, keep it real.
