Documenting Apache Druid Experiments

Apr 25, 2023
Reena Leone
 

Writing code and building a real-time analytics application is one thing, but writing about your work is an entirely different skill set. How do you explain your process, your data architecture, any challenges you encountered, etc. in a way that is understood by folks who aren’t familiar with the project? How do you tell the story while showcasing your expertise? And how do you decide what the heck to write about in the first place? 

Listen to the episode to learn about:

  • How to create interesting technical content
  • How to overcome writer’s block or not knowing what to write about
  • How to get started with Apache Druid

About the guest

Hellmar Becker is a Senior Developer Advocate at Imply. He has worked in data analytics for more than 20 years in various pre- and post-sales roles. Hellmar has worked with large customers in the finance, telco, and retail industries, and spent several years at the big data company Hortonworks and, more recently, at Confluent.

Transcript

[00:00:00.650] – Reena Leone

Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. Today we’re taking a little bit of a departure from use cases and projects to talk about, well, how you talk about use cases and projects. As developers, most of your time is spent building, figuring things out, making things work. But how do you talk about your work? How do you explain how you fixed an issue or share something cool that you tried? How do you come up with what to write about in the first place? Or even how do you present your ideas in front of an audience? To help answer these questions, my guest today is Hellmar Becker, prolific Druid blogger and sales engineer at Imply. Hellmar, welcome to the show.

[00:00:51.340] – Hellmar Becker

Yeah, great to be here. Thanks for having me.

[00:00:54.300] – Reena Leone

So I always like to ask my guests a little bit about their background and how you got to where you are today. So can you tell us a little bit about how you became an amazing Druid blogger and sales engineer at Imply?

[00:01:06.650] – Hellmar Becker

Hi Reena, well, it’s a long story. I’ve been with Imply for the better part of the last three years. I’ve been in the data space for more than 20 years. I used to do what used to be called web analytics back in the day, and then multichannel analytics and various other names, clickstream analytics, if you will. I worked at a bank at one time, then the big data hype came out. I worked for some time at Hortonworks, the big data company, which was the first time I actually encountered Druid, and from there I went to Confluent, the Kafka folks, and then to Imply, and here I am. And I’ve done pre-sales for the better part of the last ten years.

[00:01:46.000] – Reena Leone

So you were familiar with Apache Druid before you came to Imply?

[00:01:49.100] – Hellmar Becker

Well, familiar is probably saying a little bit too much, but I was one of the few people who had heard about it, yes.

[00:01:54.600] – Reena Leone

So we’ve been talking a lot on the show about how Druid has changed in the last year or so, especially with the addition of the MSQ framework. And we’re on the horizon of 26.0 and 27.0, I believe, this year. One thing that you wrote about on your blog is window functions. Can you tell me a little bit more about that?

[00:02:17.280] – Hellmar Becker

Yeah, window functions are, I think, an essential part of modern SQL and modern analytics, and it was about time for Druid to offer those as well. What you do with window functions is you relate rows into groups, looking forward, looking back, and you create aggregations without only having a main aggregation. So things like having a ranking inside buckets of values, or things like putting clicks together into a session, or finding out how the revenue of an online store today relates to yesterday’s revenue. All these things are answered with window functions. One of the most important and intriguing use cases is funnel analytics, where you follow the journey of a user of any kind of service through a sequence of steps. And it’s crucial that these steps occur in a certain sequence in time. To answer these kinds of questions, you use window functions.
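
The uses listed above (comparing a day with the previous day, ranking inside buckets, and grouping clicks into sessions) can be sketched with standard SQL window functions. This is an illustration only, not Druid code: it runs the queries through Python’s built-in sqlite3 module, which needs SQLite 3.25 or later for window functions. The table names, column names, and the 30-minute session gap are assumptions made up for the example, and Druid’s own SQL dialect and window function support may differ.

```python
import sqlite3


def query(create_sql, insert_sql, rows, select_sql, params=()):
    """Load rows into a throwaway in-memory table and run one SELECT."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(create_sql)
        conn.executemany(insert_sql, rows)
        return conn.execute(select_sql, params).fetchall()
    finally:
        conn.close()


def revenue_vs_previous_day(rows):
    """rows: (day, amount). LAG looks back one row in day order."""
    return query(
        "CREATE TABLE revenue (day TEXT, amount REAL)",
        "INSERT INTO revenue VALUES (?, ?)",
        rows,
        """
        SELECT day, amount,
               amount - LAG(amount) OVER (ORDER BY day) AS change
        FROM revenue
        ORDER BY day
        """,
    )


def rank_within_category(rows):
    """rows: (category, product, sales). RANK restarts in each category."""
    return query(
        "CREATE TABLE sales (category TEXT, product TEXT, sales INTEGER)",
        "INSERT INTO sales VALUES (?, ?, ?)",
        rows,
        """
        SELECT category, product,
               RANK() OVER (PARTITION BY category ORDER BY sales DESC) AS rnk
        FROM sales
        ORDER BY category, rnk, product
        """,
    )


def sessionize(rows, gap_seconds=1800):
    """rows: (user_id, ts) with ts in epoch seconds.

    A click opens a new session when it is the user's first click or when
    more than gap_seconds passed since that user's previous click. A running
    SUM over those flags numbers the sessions per user, starting at 1.
    """
    return query(
        "CREATE TABLE clicks (user_id TEXT, ts INTEGER)",
        "INSERT INTO clicks VALUES (?, ?)",
        rows,
        """
        WITH lagged AS (
            SELECT user_id, ts,
                   LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) AS prev_ts
            FROM clicks
        ),
        flagged AS (
            SELECT user_id, ts,
                   CASE WHEN prev_ts IS NULL OR ts - prev_ts > ?
                        THEN 1 ELSE 0 END AS new_session
            FROM lagged
        )
        SELECT user_id, ts,
               SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_no
        FROM flagged
        ORDER BY user_id, ts
        """,
        (gap_seconds,),
    )
```

A funnel analysis builds on the same idea: once each click carries a session number and an order within that session, you can check whether the steps happened in the required sequence.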

[00:03:14.820] – Reena Leone

And I heard that the community was pretty excited about this one.

[00:03:17.500] – Hellmar Becker

Everyone is excited. So I’m standing with one foot in the community, doing the open source and developer-related stuff, and with the other foot I stand on the commercial side of things, where we see customers that actually use Druid for mission-critical use cases, and they are also excited about it.

[00:03:39.160] – Reena Leone

So one of the reasons I wanted to have you on the show today is actually more about the community aspect because you maintain a really popular Apache Druid blog, which I like to just refer to as the Druid Cookbook because sometimes you’ve called your posts that. What made you decide to start your own blog to talk about these things?

[00:03:56.920] – Hellmar Becker

That’s a really great question, Reena. I think what happened was that when I first joined Imply more than two years ago, I have to say our documentation left some things to be desired. So there were a lot of things that were undocumented or a bit obscure or were lacking practical examples. So oftentimes I came across a situation where I wanted to find out how something works and then I couldn’t find it in the documentation. That irked me. So I thought, okay, I’m going to try and find it out. So I set up an experiment, usually with a little kind of showcase, a little data set, and I try to figure out how things actually work. And then, because when I figure something out, two weeks later I don’t know anymore how it worked, and I don’t like that, I take notes. And then while I was at it anyway, I thought, why not publish those notes so that others can benefit from them as well. That’s how it all started. And then the other thing was, of course, also trying to kick myself in the rear and making sure that, once I started, I had the discipline to maintain a certain cadence or a certain frequency.

[00:05:09.130] – Hellmar Becker

And once you do that, then you get into this mode where every time you add a tiny little bit and it’s all tiny little bits and then you look at it at the end of the day and all of a sudden you have something.

[00:05:21.450] – Reena Leone

Well, how do you determine what experiments you want to do? Like how do you find what you’re going to write about next or what you want to try next. You’re kind of like a Druid scientist in a way.

[00:05:34.330] – Hellmar Becker

Yeah, but I wouldn’t take all the credit for that because mostly it comes from the field, it comes from customers, it comes from the community. People ask questions because like me, when I first started the blog, there are a lot of other people out there that just say I want to know if we can do this or that within Druid or I want to know how it works, or if I’m in this kind of scenario, what happens? How does the software behave? And they don’t know, and they don’t know how to figure it out. I mean, I’m kind of intrigued by many of these questions and then I set out and try to find out.

[00:06:10.790] – Reena Leone

What do you do if something you try just doesn’t work? Do you still document that?

[00:06:15.420] – Hellmar Becker

It depends. Well, first of all, I try to keep every blog post short, and I think we are probably talking about that a little bit later, how to set the bar at the right kind of height. Because when you start writing a blog and you aim at writing the next War and Peace, then there’s a danger that you never get started. So in every blog post I usually document the happy path, but for every blog post that I write, I should probably write a kind of side or companion blog where I document the other things: how they can go wrong, and how I went wrong in the first place and had to backtrack and try a different thing. So that is one thing that happens. The other thing that happens is that some of my experiments are actually at the bleeding edge. So I do not only take the released code, but I also pull the source code as it is right now from GitHub and I build the latest and greatest additions and I try to experiment with it. And then there is a chance that sometimes things don’t work at all.

[00:07:23.340] – Hellmar Becker

Now, the way I organize my blog, there are the posts, which are in one directory on my GitHub, and then there are the drafts. And sometimes drafts can be sitting there for quite a while because at the moment when I first had that idea, I hit a snag and I couldn’t resolve it because the software just wasn’t ready. And then sometimes I revisit these drafts two or three months later and make them into a published post.

[00:07:49.420] – Reena Leone

Then I think that’s important though, because you don’t want to just throw in the towel because you hit a snag. And that’s actually one of the beauties of open source software is that something might not be working right this very second, but there are folks working on it so that it does work in the future. It just might not be ready. And you’re testing things out early anyway, so you might run into that.

[00:08:08.910] – Hellmar Becker

Right. And the other thing, Reena, is that because I’m with Imply and have all these connections, I can file requests to our product and engineering teams. And then sometimes I write down, I’ve tried this and this and this and it didn’t work. And I get very, very positive feedback because then the folks in engineering say, oh yeah, we never had that idea, but this is actually a bug that needs to be fixed, and then they fix it.

[00:08:32.050] – Reena Leone

Essentially you’re helping make Druid better that way though, because you’re finding things and that’s like kind of another point of open source is to find the bugs, find the fixes, make the technology continuously better.

[00:08:43.450] – Hellmar Becker

I’d be very happy if it works out like this. Yeah.

[00:08:45.860] – Reena Leone

So if someone wants to, like they’re working in Druid or they’re working in another technology and they want to start writing about it, how do they get started? What tips would you recommend for them?

[00:08:55.950] – Hellmar Becker

Oh, there are various tips. I mean, I’ve been talking to some of my colleagues. Many of my Imply colleagues have now started their own blogs. Probably the lowest threshold approach is to use one of those ready-made platforms that are out there. You could use, well, if you’re old-fashioned, you would use something like WordPress. The more modern ways would be Medium or Substack or anything like that. There are also similar developer-related websites that just get you started. Now, of course, when you are on such a platform, the platform kind of owns the content. So that’s not how I do it. I run my own server. It’s a virtual server at a hosting provider. On that server I have my own web server. I’m running Jekyll, which is a static site generator. But you don’t have to go all that way. So that’s for the technical setup. Like I said, there are lots of platforms that you can start with without putting in a lot of effort. The other thing is it’s important to really keep the bar relatively low at the beginning. I think when I talk to my colleagues, many of them feel that they have to create something perfect and something really elaborate in the beginning, and that can be scary.

[00:10:11.710] – Hellmar Becker

I try to go back to the idea of what blogging was supposed to be when it first came up, like 15 years ago, when a blog, well, a web log as it’s literally translated was something like a kind of virtual diary where you wrote down short notes as they came to your mind.

[00:10:32.410] – Reena Leone

I had a LiveJournal. I get you. I’m from that generation.

[00:10:36.970] – Hellmar Becker

And you’re not trying to write War and Peace, you’re not trying to win the Pulitzer Prize. It’s just that you want to get something out that can be relatively raw in a way and that is also valid for a certain point in time only. So if you go back to some of my blog posts from one and a half years ago, then they might not be valid anymore because they describe a particular workaround that was necessary at that time and by now that has been overcome by more recent versions. But yeah, it is like a kind of time capsule from that point in time. And I don’t go back and change it. It’s a log. So it’s also an immutable thing.

[00:11:13.270] – Reena Leone

I mean, this kind of hurts my heart to say this, but there’s a difference between say, writing for your company and then maintaining your own personal blog. Why would you choose to maintain your own personal blog instead of putting all of your knowledge into a corporate blog?

[00:11:30.470] – Hellmar Becker

Well, I mean, I don’t want to.

[00:11:32.050] – Reena Leone

Hurt anyone, but spicy question for me.

[00:11:35.310] – Hellmar Becker

It’s a spicy... yeah, in a way. Well, usually with a corporate blog also comes a process of vetting and approvals and people who want to be asked and specific formatting rules and, well, basically lots of hoops to jump through before you even get started. And again, I want to keep it easy. And the second thing is, I’m also true to the open source idea. So all my blog content is in my blog for anybody to use. It’s also in GitHub if anybody wants to use the Markdown code. And actually, Imply has decided a few times in the past to cross-post my blog posts to the company blog, which I’m happy about when it happens.

[00:12:22.610] – Reena Leone

It’s usually like the larger stories or something that’s more detailed, rather than just the logs. And I think that’s like a great opportunity to find that balance. Right.

[00:12:34.230] – Hellmar Becker

Yeah.

[00:12:34.790] – Reena Leone

Here’s another question. So you mentioned this a little bit earlier. So what’s the difference between say, documentation and then documenting your work on a blog? Because I feel like a lot of developers are familiar with documentation and some of them have written it, or at least everyone has used it. But what would be the difference between focusing your writing and efforts on documentation versus blogging?

[00:12:58.170] – Hellmar Becker

Documentation is, I think, something that really has to be prim and proper. It has to be properly formatted, and it supposedly has to cover all the aspects. So there is a higher standard that you put to documentation, which is why there are full-time people, full-time specialists, who write documentation. For me, my blog is like a side gig, something that sometimes I put in like three or four hours on a Saturday afternoon to write.

[00:13:27.920] – Reena Leone

Well, I think that’s part of it too. If you’re writing for yourself, it’s to establish your own voice, you’re showcasing your work and it should be fun, right? It’s fun to figure things out, right?

[00:13:42.600] – Hellmar Becker

Yeah. Sometimes if there’s a gap in the documentation, then I write a blog and then somebody from the docs team approaches me and says, Hellmar, can we use your content, reformat it a bit, and put that in the documentation? I say, yeah, sure, I’m happy about that as well.

[00:13:57.980] – Reena Leone

Okay, so speaking about personal voice and shifting gears a little bit, I know that you don’t just write a blog and maintain a blog, but you also present and you speak at a lot of different meetups, not just the Druid ones. How do you go about presenting at different meetups and getting involved in that way?

[00:14:14.930] – Hellmar Becker

That is actually many questions in one question, Reena.  First thing is again, how to find out what is the right kind of content to present. That comes often also from my experience in the field, from my experience with community members, with customers, the questions that they ask and they often lead to something like creating a demo or creating a little showcase. And then again, it’s important to make this into a story. So that is one thing. The other thing is how to find opportunities. Well, once you have a network in that kind of community, then you also find the opportunities. So I look for meetups in my neighborhood. Well, there isn’t. I hope there will be more open source Druid meetups in the future.

[00:15:00.190] – Reena Leone

And you’re in Berlin for anybody listening, right?

[00:15:02.590] – Hellmar Becker

I’m in Berlin, but that is actually interesting. This is an open source infrastructure meetup that is organized by a company called Aiven, who offer Kafka as a cloud service. And yeah, I know the community manager that is organizing this meetup. And I said, hey, how about I speak at your meetup? And she says, yeah, that’s a great idea. And that is how it usually happens. Same with conferences.

[00:15:28.180] – Reena Leone

So you’re saying if somebody sees a meetup that seems interesting, even if it’s not exactly Druid related, just to reach out and see.

[00:15:36.030] – Hellmar Becker

Well, the thing is, some organizations or some meetup organizers actually try to keep themselves friendly and they write a note in the meetup description page saying that even if you are a first-time speaker, even if you don’t have a lot of experience, this is a friendly forum where you can try yourself out. So it’s not like you have to stand in front of 250 people and do a big show at whatever Kafka Summit, but you are within a small friendly space with 20 to 30 people where you can also practice a bit. And I would absolutely underwrite that. And yeah, it comes down to reaching out and making contact with the organizers. They are usually very friendly and many are happy to find and get new content, presenters, fresh faces.

[00:16:24.450] – Reena Leone

Do you have any meetups coming up that you want to promote here? Because I feel like you always have something going on.

[00:16:30.400] – Hellmar Becker

I have a lot going on, but yeah, like you said. So on the 24th of May, it’s not quite official yet, but I’ll be speaking in Berlin. There is probably something coming, a big data event in Budapest, in Hungary, and I’ve also got a conference talk in Lithuania set up for November. So that is what I have currently on the radar, and probably there’s one or another thing still coming within the next months.

[00:17:03.720] – Reena Leone

And a quick question for you. So for folks who are listening, who shockingly, maybe have never tried out Druid or don’t know how to use Druid, how would you recommend that they get started playing around with it and figuring it out?

[00:17:16.220] – Hellmar Becker

There is a lot of content nowadays, starting with Imply’s quickstart tutorial. We’ve got a Druid academy that is also sponsored by Imply, but that is completely open source. Yeah, just get started. Follow the tutorials. At some point, use your own data. So usually the simple tutorials start with one day’s worth of Wikipedia edits. There is a bunch of other free data sets out there. At some point you will want to play around with streaming, live streaming data ingestion. Then you may want to set up your own. There are free data generators out there and again, many of these community folks and DevRel folks that work in the data streaming space also have fun writing little scripts that do these data simulations.
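
A data simulation script of the kind mentioned here can be very small. The following is a minimal sketch, not one of the generators from the episode: it produces fake click events as newline-delimited JSON using only the Python standard library. The field names (`timestamp`, `user_id`, `page`), the user and page lists, and the function names are all made up for illustration. You could write the output to a file for batch ingestion or feed it to a stream producer of your choice.

```python
import json
import random
from datetime import datetime, timedelta, timezone


def generate_events(count, start, seed=0, max_gap_seconds=5):
    """Yield `count` fake click events as dicts, in timestamp order."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    users = [f"user-{n}" for n in range(1, 6)]  # hypothetical user ids
    pages = ["/home", "/search", "/product", "/cart", "/checkout"]
    ts = start
    for _ in range(count):
        # Advance the clock by a random 1..max_gap_seconds between events.
        ts += timedelta(seconds=rng.randint(1, max_gap_seconds))
        yield {
            "timestamp": ts.isoformat(),
            "user_id": rng.choice(users),
            "page": rng.choice(pages),
        }


def to_json_lines(events):
    """Serialize events as newline-delimited JSON, one event per line."""
    return "\n".join(json.dumps(event) for event in events)


if __name__ == "__main__":
    start = datetime(2023, 4, 25, tzinfo=timezone.utc)
    print(to_json_lines(generate_events(5, start)))
```

Seeding the random generator keeps the output reproducible, which helps when you want to rerun the same experiment and compare results.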

[00:18:07.550] – Reena Leone

Didn’t you just do one with like flight data for your blog?

[00:18:11.430] – Hellmar Becker

Exactly, that would have been the next thing. You are giving me a great segue here. So what I did is, like probably a lot of people, I’ve got this little Raspberry Pi that is sitting on my desk at my house and listening to flight radar data. So basically it sees whenever an airplane is flying over my house and collects the data and creates a little live stream that then goes into a Kafka Confluent service. And that is one of the sets of demo data. Actually, I think Darin [Director of Technology at Imply], our tech marketing guy, just asked me a couple of days ago, Hellmar, can I use your data stream? I said, yeah, of course.

[00:18:52.810] – Reena Leone

Oh yeah, I think that actually was used in a webinar. I think we used your data to do a demo.

[00:18:59.930] – Hellmar Becker

People might be a bit surprised why all these flight data are based around Munich in Germany. But that’s because it’s all collected at my house.

[00:19:09.340] – Reena Leone

I mean, I could do that. I’m under a flight path in Boston. Although Logan International Airport.

[00:19:15.970] – Hellmar Becker

You could. And if you have a Raspberry Pi, well, if you already have one, because right now I heard they are a bit hard to come by, unfortunately. But if you have one and if you have a flight data collector, then I can give you that little piece of software which I wrote that connects these flight data to the Kafka stream that we are running in Imply’s Confluent Cloud.

[00:19:36.700] – Reena Leone

But if for some reason you don’t have a Raspberry Pi and you don’t live under a main flight path, there are lots of other data streams and data sets available that you can use to test. Although that does seem fun, making your own. I think that’s going to do it for us today. Hellmar, this was great. I feel very inspired, and now I want to kind of map the planes above my house. But thank you for joining me today.

[00:20:01.170] – Hellmar Becker

Thank you for having me, and it was a pleasure talking to you.

[00:20:04.000] – Reena Leone

So if you want to know anything else about Druid or get started yourself, visit druid.apache.org. And if you want to learn more about Imply or check out our documentation on Druid, please visit imply.io. Until next time, keep it real.
