Accurate, Validated, and Real Time: Diving into Reddit’s Druid-powered Ad Platform
Let’s go behind-the-scenes on the “front page of the internet” to see why Reddit chose Apache Druid to help power their ad platform.
Reddit built an automated ad serving pipeline that analyzes campaign budgets and real-time plus historical user activity data to decide which ad to place, all within 30 milliseconds. The platform allows advertisers to create ad campaigns and set both daily and lifetime budgets for a campaign.
After undergoing a series of experiments with batch-only and streaming-to-batch solutions, Reddit opted for Druid as their real-time analytics database. With batch-only and streaming-to-batch solutions, they encountered challenges with over-delivery, where an advertiser’s budget was spent too quickly, and under-delivery, where the budget was spent too slowly. Achieving efficient delivery, which involves spending the appropriate amount at the optimal time, necessitates minimal delay in providing budget spend data to the ad serving infrastructure.
Learn how Reddit:
- Balanced the over-delivery and under-delivery of ad spend
- Tackled technical challenges associated with building a lifetime datasource that needed to store over four years of lifetime spend data while still maintaining low query latency
- Achieved eventual 100% accuracy, even in the case of Apache Kafka or Apache Flink outages
- Performs aggregations at ingestion time without slowing performance
- Druid Summit 2022: Low Latency Real Time Ads Pacing Queries with reddit
- Top Use Cases for Real-time Analytics in Advertising
- Stream big, think bigger—Analyze streaming data at scale in 2023
About the guest
Lakshmi Ramasubramanian is a Staff Software Engineer who has been working on the Reddit Ads platform since the inception of Ads at Reddit. She is interested in building data pipelines at scale, real-time streaming, and ingestion. She loves to hike and spend time with her cat.
Accurate, Validated, and Real Time: Diving into Reddit’s Druid-powered Ad Platform with Lakshmi Ramasubramanian
How do ads work on the “front page of the internet”? On today’s episode, Reddit staff software engineer Lakshmi Ramasubramanian discusses Reddit’s ad platform, including how it handles ad pacing, real-time data, and more. We’ll dive into the challenges her team needed to solve and why Apache Druid was the right database for the job.
[00:00:00.410] – Speaker 1
Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with analytics, way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. Now, I’m sure if you’re listening to this right now, you’re familiar with Reddit, a platform with forums and communities for pretty much everything. Whatever your interests might be, there’s probably a subreddit for you. I go there for the AMAs (Ask Me Anythings), travel tips, like how to get a reservation at the Kirby Cafe in Tokyo. True story. Cozy home decor and anime discussions. And, like most major social sites, there are ads. But what’s the platform like? How do ads work on Reddit? What does the scale look like in terms of events and reporting when you’re running ads on what’s known as the front page of the Internet? To talk us through how things work, I’m joined by Lakshmi, staff software engineer at Reddit. Lakshmi, welcome to the show.
[00:01:02.660] – Speaker 2
Thank you, Reena. Happy to be here.
[00:01:04.960] – Speaker 1
Awesome. So I like to start off with a little bit about you, the person behind the project. I think engineers and backend developers especially don’t always get the spotlight they deserve because they’re just out there making things work. After doing a little bit of research for this episode, you’ve been building analytics applications with databases for a while now, or just applications in general. Let’s talk about how you got to where you are today. What do you find most interesting about this space?
[00:01:32.620] – Speaker 2
Yeah, I started a long time ago. I’m a software engineer; I did my engineering degree and worked at many companies before Reddit. I was lucky that I got to work on from-scratch development projects from the beginning, and I worked with Postgres, MySQL, many databases, so it was a natural transition to become a back-end or a data engineer. What I find most interesting about this space is that it’s exciting to see something go live. It’s exciting to see that some people are clicking something somewhere remotely, or something is happening, and then we provide the data from somewhere else in the world, or the data is pre-populated. There’s a lot of technology that goes behind it. But it’s exciting to see that I’m able to do something that people can actually access at their fingertips. So that’s kind of interesting for me.
[00:02:28.800] – Speaker 1
No, that’s super cool. And you’re at Reddit right now. Can you talk to me a little bit about what you’re working on at Reddit?
[00:02:34.590] – Speaker 2
Yeah, I am with the ad events and platform team at Reddit. We are responsible for everything that happens within ads from the perspective of the data. Any data that comes from or goes out to the user, any reporting events that are shown to the advertiser, any impressions or clicks, anything that comes within Reddit. All of that we take into account; we validate it, we attribute it, and our team does a lot of data analytics and pacing and reporting work as well.
[00:03:09.050] – Speaker 1
I actually had a chance to catch your 2022 Druid Summit presentation with your colleague Brianna Greenberg on low latency real time ad pacing queries. I wanted to chat a little bit about the back end system you and your team built. Can you tell me a little bit more about it? I know that you switched over to Druid a few years ago, correct?
[00:03:27.190] – Speaker 2
Yes, that’s right. We switched over to Druid for the reporting use case, but the talk that I gave was more about pacing, so I’ll talk about what pacing is as part of that. Pacing comes in when a user comes on to Reddit and we decide that there are ad slots and we have to show ads to the user. What happens is a request to the ad server is made, and at a moment’s notice, we need to make sure that there are ads available and there is budget available from the advertisers. There’s a lot of data going on back there. So we compute all of it and we make it available for the ad server to make smart decisions. The ad server itself makes a lot of smart decisions; I’m not going to talk about that here. But what we do is provide the data to make those decisions happen. So when the request is received by the ad server, we want to make the data available to query, and it should have all the information that they want. How much is the advertiser’s budget? Are they within the advertiser’s budget? Have they served enough impressions before, or have they served enough clicks before?
[00:04:34.920] – Speaker 2
Can we serve more? You don’t want to show the same ad to the user ten or 15 times. So a lot of logic goes into the ad server. For all this logic and the auction process to happen, we need the data to be available. So what we do with the system is provide the data for the system.
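The per-request checks Lakshmi describes, budget availability plus a per-user frequency cap, can be sketched roughly as below. This is a hypothetical illustration, not Reddit’s actual code; all field names and the cap value are assumptions.

```python
# A minimal sketch (not Reddit's actual code) of the kind of budget and
# frequency checks an ad server might run per request. All names and
# thresholds here are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class CampaignState:
    daily_budget: float      # advertiser's budget for today
    daily_spend: float       # spend so far today (from the real-time pipeline)
    lifetime_budget: float
    lifetime_spend: float
    user_impressions: int    # times this user has already seen the ad

FREQUENCY_CAP = 10  # hypothetical per-user cap

def can_serve(state: CampaignState) -> bool:
    """Return True if the campaign still has budget and the user
    has not hit the frequency cap."""
    if state.daily_spend >= state.daily_budget:
        return False
    if state.lifetime_spend >= state.lifetime_budget:
        return False
    if state.user_impressions >= FREQUENCY_CAP:
        return False
    return True
```

The point of the episode is that the accuracy of `daily_spend` and `lifetime_spend` here is exactly what the real-time pipeline has to guarantee: stale values mean over- or under-delivery.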
[00:04:53.010] – Speaker 1
I imagine there is a lot of data that comes through.
[00:04:56.910] – Speaker 2
Yes, there is.
[00:04:59.090] – Speaker 1
Obviously, ads are functioning with real-time data. You’re managing spend, and real-time data has been a real hot topic on the show so far. I believe your setup used to have batch processing with validation and aggregation, and then you did a layer of stream processing on top, right?
[00:05:21.350] – Speaker 2
Yes, in a manner of speaking, real-time data is necessary. Like I said, the system is very, very real time. Somebody coming on to Reddit is very real time. We need to be able to present ads to them right that instant. So we want a real-time system for sure. We were making ends meet with separate validation pipelines and separate streaming pipelines. What we did with this particular system is transform it into one system. We did the validation and aggregation, and we used Druid because of its Kafka integration. We just made use of that and ingested data into Druid for real-time querying.
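One reason Druid suits this workload is ingestion-time rollup: raw events are pre-aggregated by time bucket and dimensions as they stream in, so queries scan far fewer rows. The sketch below illustrates the rollup idea in plain Python; the event schema and bucket size are invented for illustration, not Reddit’s.

```python
# A toy illustration of the "rollup" idea Druid applies at ingestion time:
# raw events are pre-aggregated by (time bucket, dimension) so queries scan
# far fewer rows. Field names are illustrative, not Reddit's schema.
from collections import defaultdict

def rollup(events, bucket_seconds=60):
    """Aggregate raw spend events into (time_bucket, campaign_id) rows."""
    rows = defaultdict(lambda: {"spend": 0.0, "impressions": 0})
    for e in events:
        bucket = e["ts"] - e["ts"] % bucket_seconds
        key = (bucket, e["campaign_id"])
        rows[key]["spend"] += e["spend"]
        rows[key]["impressions"] += 1
    return dict(rows)

events = [
    {"ts": 120, "campaign_id": "c1", "spend": 0.10},
    {"ts": 130, "campaign_id": "c1", "spend": 0.20},
    {"ts": 185, "campaign_id": "c1", "spend": 0.05},
]
agg = rollup(events)
# The (120, "c1") row now summarizes two raw events; (180, "c1") holds one.
```

In Druid this aggregation happens inside the ingestion task itself, which is why it doesn’t add a separate processing step that would slow queries down.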
[00:05:58.110] – Speaker 1
How did you go from kind of having like two systems to one system? How did you find the balance? So, I know one of the key things you were trying to solve for is the over-delivery or under-delivery of ad spend. So how did you kind of figure that out?
[00:06:12.250] – Speaker 2
Yeah, so what happens when we do not provide data, when it is only batch processing, is that by the inherent nature of batch processing, the data is available at specific intervals, and that’s not good enough because the data is stale. Or who knows, the advertiser might have changed the budget, or the user might have clicked on it multiple times and we may not have counted it. Presenting no data or stale data makes us over-deliver. So we definitely wanted a real-time system. But because of the huge volume of data that we discussed, processing this data in real time is very challenging. What we did before was batch processing for most of the window, and to make up the difference, only a small amount of data, maybe 15 minutes or half an hour or one hour, was real time. Instead, with this new application that we built with Kafka and Druid, we made everything real time.
[00:07:15.530] – Speaker 1
Maybe selfishly, as part of this podcast, it’s always me learning new things. I wanted to talk, can you talk to me a little bit about the architecture setup? I know that you had mentioned you use the Lambda architecture. I did a bit of a lookup, and it’s kind of like having two engines, I believe, that kind of run together. Can you talk to me a little bit more about your setup there?
[00:07:34.880] – Speaker 2
Yeah, the Lambda architecture is used mostly in big data. There is an API layer; for us, the API layer is ad serving, especially for the pacing use case. And for the query layer, we wanted to provide this data, and the data can come from both batch and real time. So it’s a hybrid model. It’s not just real time and it’s not just batch. How you use the batch depends on what the application wants. When you have these components together, it is called the Lambda architecture. For the pacing architecture within Ads, we definitely use this Lambda architecture. The data is definitely streamed in real time, and we also supplemented it with the Lambda architecture for different query needs.
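The Lambda pattern described above can be boiled down to a merge at query time: a batch view that is accurate but complete only up to the last batch run, plus a real-time delta covering everything after that cutoff. This toy sketch uses invented names and numbers, only the merge logic reflects the pattern.

```python
# A toy illustration of the Lambda pattern: the query layer merges a batch
# view (complete up to the last batch run) with a real-time view covering
# only events after that cutoff. All names here are hypothetical.

def total_spend(batch_view, realtime_events, batch_cutoff_ts, campaign_id):
    batch = batch_view.get(campaign_id, 0.0)   # accurate, but stale
    delta = sum(
        e["spend"]
        for e in realtime_events
        if e["campaign_id"] == campaign_id and e["ts"] > batch_cutoff_ts
    )
    return batch + delta

batch_view = {"c1": 50.0}  # computed by the hourly batch job
stream = [
    {"ts": 3700, "campaign_id": "c1", "spend": 1.5},  # after the cutoff
    {"ts": 3500, "campaign_id": "c1", "spend": 9.9},  # already in the batch view
]
print(total_spend(batch_view, stream, batch_cutoff_ts=3600, campaign_id="c1"))  # 51.5
```

Filtering the stream by the batch cutoff is what prevents double-counting events that the batch job has already absorbed.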
[00:08:17.040] – Speaker 1
And so, for your current setup, we talked about Druid. Obviously this podcast talks a lot about Druid, but I believe you’re also using Apache Flink, if I’m correct?
[00:08:29.460] – Speaker 2
Yes, Flink is our validation engine. The ad server responds to the user request, right? So there are a bunch of events that we know we served; that is one stream. But we don’t know when the user clicks on an ad or when the user watches a video or something like that. Those are user events. So the events that are served by the ad server are called server events, and the events that we get from the user are called user events. What we need to do is attribute these user events to the server events. We want to make sure that this is a click on this particular impression. We want to make sure that the impression was the correct impression for this ad we served. That is the validation pipeline, and we built it on Flink. So we read from these two streams, we do all the validation, we do the attribution. We make sure, for example, that there’s no point counting a click that happens after a long period of time; that is another example of validation we do on the Flink engine. So we use Flink to read from these two Kafka topics, and we stream them, validate them, attribute them, ignore the data that we think is irrelevant, and send this data back again to Kafka.
[00:09:45.950] – Speaker 2
And from then on, we let Druid ingest this data and present it for pacing and reporting.
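The validation step Lakshmi describes, matching user events to server events, enforcing an attribution window, and dropping duplicates, can be sketched like this. It is a simplified single-batch illustration of the idea, not Reddit’s Flink job; the field names and one-hour window are assumptions.

```python
# A simplified sketch of the validation/attribution step described above:
# user events (clicks) are kept only when a matching served impression
# exists and the click arrives within the attribution window, with
# duplicates dropped. This illustrates the idea, not Reddit's Flink job.

ATTRIBUTION_WINDOW = 3600  # hypothetical: 1 hour, in seconds

def validate(impressions, clicks):
    """Return clicks that match a served impression within the window,
    de-duplicated per impression."""
    served = {i["impression_id"]: i["ts"] for i in impressions}
    seen = set()
    valid = []
    for c in clicks:
        imp_ts = served.get(c["impression_id"])
        if imp_ts is None:                          # no matching impression
            continue
        if c["ts"] - imp_ts > ATTRIBUTION_WINDOW:   # click arrived too late
            continue
        if c["impression_id"] in seen:              # duplicate click
            continue
        seen.add(c["impression_id"])
        valid.append(c)
    return valid

impressions = [{"impression_id": "i1", "ts": 0},
               {"impression_id": "i2", "ts": 0}]
clicks = [
    {"impression_id": "i1", "ts": 60},    # valid
    {"impression_id": "i1", "ts": 90},    # duplicate -> dropped
    {"impression_id": "i2", "ts": 9999},  # outside window -> dropped
    {"impression_id": "i9", "ts": 10},    # no impression -> dropped
]
valid = validate(impressions, clicks)     # only the first click survives
```

In the real system both inputs are unbounded Kafka streams, so Flink does this as a stateful streaming join rather than an in-memory batch pass.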
[00:09:51.960] – Speaker 1
When you were looking for, basically, a new engine, how did you come across Druid? We actually had an Imply customer mention recently, I don’t know if I can name them, but they said that they Googled “OLAP database” and that’s how they came across it. A pretty basic term, and, well, good job on SEO there. But I’m always interested in how engineers and developers find and evaluate open source technology.
[00:10:20.950] – Speaker 2
That’s an interesting question. I want to say most of my answer might also be Googling or searching for it. What helped us is that whenever we start something as a POC, we do multiple tests. Like we ran it with Postgres, we did it with Redis. In this particular case, when we were choosing Flink, we tried Apache Beam. We were running reporting on a different engine before this. So we did try multiple POCs, we did proof of concepts, and we also heard from a lot of other colleagues. I have also used Druid before, and at Reddit as well, but it was managed in-house; they were doing their own Druid management. And I don’t think we were capable of doing that at the time, or we didn’t have the resources to do that. So we decided that Druid was the perfect fit, especially for reporting: a lot of data slicing and dicing by various dimensions. It fit the use case perfectly well. We knew that that’s what we wanted. We did a POC and then we reached out to Imply.
[00:11:26.270] – Speaker 1
I just like to hear how people find things because the internet is an infinite space.
[00:11:29.920] – Speaker 2
One other thing that helps us is attending Kafka Summit and Druid Summit. We do hear about a bunch of things there. As engineers, I think we are all looking forward to one or the other of the summits happening. Especially since real-time analytics and databases are hot now. So yes, that also helped.
[00:11:46.910] – Speaker 1
Analytics databases, so hot right now. Yes, we have a ton of events coming up, so hopefully we’ll get to see you there and chat for a while in person. That would be great. So you evaluated multiple POCs and you chose Druid. What were some of the challenges that you were looking to solve with your old system?
[00:12:03.340] – Speaker 2
Yes, the old system, like I said, was batched. It was delayed. We were not providing data for maybe an hour; let’s say the batch job runs once an hour. For one hour of data, we didn’t know what was happening, so we had to run another streaming mechanism to catch up with the data, which was probably not correct, because we were not running the exact same pipeline and it was under-attributed. Let’s say we want to attribute clicks for an hour, so we want to give users maybe one hour for the click to come back to us. But the catch-up real-time mechanism was not quite doing that; it would probably look for only ten minutes of click data. So we were getting most of the data but not all of the data that we wanted, and that was causing us to do a little bit of over-delivery. We would catch up eventually, right? As the day goes on, the Lambda architecture helps us figure it out. But it was not accurate. So this new system helped us get the data attributed, and it is accurate, it’s deduplicated, and we know that this is what we are going to use for any other processing, reporting, or billing.
[00:13:14.650] – Speaker 2
So the system is all powered by one pipeline, which is one streaming validation engine, not multiple ones. It’s accurate, it’s validated, and it is also real time. That’s an added advantage.
[00:13:27.180] – Speaker 1
And so I know that you said that you’ve evaluated Druid, but are you using Druid for any other use cases moving forward? I know it’s kind of involved in a couple of different things.
[00:13:45.830] – Speaker 2
Yes, we started with reporting. That was our first use case, and we realized that Druid has many more capabilities and we could perhaps use it for real-time querying as well, not just analytics and reporting. That’s why we ended up using it for pacing, so the ad server can make requests on a minutely basis or something super fast, not just reporting. So right now we are using it for reporting and pacing. Forecasting is another use case: we do use Druid HLL sketches for forecasting. That is something we are looking to expand as well. But yeah, that’s another use case that we are currently running in production.
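The sketches mentioned here are approximate distinct-count structures (Druid ships production HyperLogLog and Theta sketch implementations). For intuition only, here is a toy HyperLogLog in a few dozen lines; the register count, hash choice, and constants are standard textbook values, and this is not how Druid implements it internally.

```python
# A toy HyperLogLog, to illustrate the kind of approximate distinct-count
# sketch mentioned above. Druid ships its own HLL implementation; this
# simplified version is only for intuition.
import hashlib
import math

P = 10                      # 2**10 = 1024 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)  # standard bias-correction constant

def _hash64(value: str) -> int:
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

class HLL:
    def __init__(self):
        self.registers = [0] * M

    def add(self, value: str):
        h = _hash64(value)
        idx = h >> (64 - P)               # first P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)  # remaining bits give the rank
        rank = (64 - P) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        e = ALPHA * M * M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if e <= 2.5 * M and zeros:        # small-range (linear counting) fix
            e = M * math.log(M / zeros)
        return e

hll = HLL()
for i in range(1000):
    hll.add(f"user-{i}")
est = hll.estimate()  # close to 1000, within a few percent
```

The appeal for forecasting is that sketches are tiny, mergeable across segments, and rolled up at ingestion time, so distinct-user questions never require scanning raw events.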
[00:14:29.670] – Speaker 1
Okay, so, Druid aside, what other projects do you have coming up that your team is working on? I think before, when we were talking, you mentioned real-time reporting and analytics.
[00:14:41.230] – Speaker 2
Yes. So far, reporting has only been a batch process for us. We have not been doing real-time reporting, or maybe not completely, but right now we are onboarding a full real-time reporting view for the user on our dashboard. That is something that we are currently working on. Something else we’re working on is analytics: we have not been using Druid for analytics quite yet. Maybe internal analytics, but not at a larger scale, not dashboards, not fancy dashboards. So we are looking forward to using it for that as well.
[00:15:14.660] – Speaker 1
Were there any platform improvements? There have been a lot of different Druid releases that have come out, and we’ve got a lot of stuff coming out in the future. I actually just talked about some of the MSQ-enabled features that are coming this year. I don’t know how much you’ve heard about that, or anything that your team is looking forward to or that would help improve the performance of your platform.
[00:15:39.980] – Speaker 2
Yeah, I think we did have a presentation on how to use MSQ, basically. So we do want to use it. We are building some sort of a self-serve analytics platform, or data ingestion platform, so for that we are definitely looking forward to using MSQ in the future. At the moment it is all native ingestion and we have not been using MSQ, but that is definitely in the cards. MSQ especially will most likely be useful because it’s SQL-based; it should be fairly straightforward to do that. I can’t say that we are using Druid the way it is supposed to be used. There is a lot more optimization, a lot more platform optimization especially, like query tuning, that we have not looked into. So that’s definitely in the cards.
[00:16:37.840] – Speaker 1
Can you tell me a little more about the self-serve analytics? Because I think that’s really interesting. How much can you tell me?
[00:16:44.930] – Speaker 2
Okay, it’s all arbitrary right now. Nothing is written down, and we don’t have a vision of exactly how we want this to look. But what we are working toward is, eventually, somebody else building their own data processing pipeline, where we don’t have to worry about the data itself; we only have to worry about how we serve the data. They can point to where it is stored, maybe Kafka, maybe S3, any cloud storage, and we just use SQL to parse the data and ingest it into Druid so that they can query it however they want. So all we are providing is Druid-as-a-Service for that use case. We can be hands-off about saying, hey, is the ingestion spec right? Have you written it properly? Is this in the correct order? Are you using X amount of concurrency to ingest data? We can ignore all of that, and it can be a very simple thing: ingest this from this S3 pipeline or from GCS, anywhere, and just ingest it into Druid and give me the results. They may be looking to do something else with it.
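The self-serve idea sketched here amounts to: the user supplies only a source and a datasource name, and the platform generates the ingestion details on their behalf. The spec shape below is loosely inspired by Druid’s ingestion specs but is deliberately simplified and hypothetical, it is not a spec Druid would accept as-is.

```python
# A sketch of the self-serve idea described above: the user supplies only a
# source and a datasource name, and the platform fills in the ingestion
# details (spec layout, tuning, concurrency) for them. The spec shape is
# loosely modeled on Druid ingestion specs but is simplified/hypothetical.

def build_ingestion_spec(datasource: str, source_type: str, location: str) -> dict:
    """Generate a platform-managed ingestion spec from a minimal user config."""
    if source_type not in {"kafka", "s3", "gcs"}:
        raise ValueError(f"unsupported source: {source_type}")
    return {
        "type": source_type,
        "spec": {
            "dataSchema": {"dataSource": datasource},
            "ioConfig": {"source": location},
            # tuning chosen by the platform, not the user
            "tuningConfig": {"maxRowsInMemory": 100_000},
        },
    }

spec = build_ingestion_spec("ad_events", "kafka", "ads.validated.events")
```

The design point is the same one Lakshmi makes: users never touch the spec details, concurrency, or tuning, so the platform team can evolve those centrally.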
[00:17:53.190] – Speaker 1
Okay, so full disclosure: Reddit is an Imply customer. Can you talk us through a little bit of the work that your team does with, not my team, but my company, Imply, of course?
[00:18:08.010] – Speaker 2
One thing I have to say is we’ve learned a lot from talking to the solutions architects, various people. They help a lot; whenever we want some help, we definitely reach out to them. They’ve helped us in finding the right vCPU sizing for our team, especially for a use case, like when we say that this is what we want to do. Or when we were asking how to optimize: what do you think should go in the cold tier, and what do you think should go in the hot tier? All of these solutions architect hours have been very useful. We are still working with them, and we want to work with them more. Finding and analyzing query patterns, and logging and monitoring, are things we are looking to do in the future as well. But yeah, working with Imply, I can definitely say that I’ve learned a lot.
[00:18:59.760] – Speaker 1
That’s always great to hear because even if you’re at a big company like Reddit, when you’re dealing with open source technology, you’re kind of on your own a lot. And when troubles come up or you’re trying to figure something out, it’s just easier to have the support than having to do it all by yourself.
[00:19:17.170] – Speaker 2
Yes, that’s true. It is, definitely. And also it doesn’t feel like we are reinventing the wheel. We are learning from people who know or who have used this a lot before.
[00:19:27.960] – Speaker 1
So it’s low latency with support. We’re getting to you very quickly, helping you solve problems.
[00:19:34.860] – Speaker 1
Lakshmi, thank you so much for joining us today and talking us through what you’re working on at Reddit. If you would like to see Lakshmi’s presentation from Druid Summit along with all the other fantastic presentations, they’re all available now at druidsummit.org. If you want to learn more about Apache Druid or Imply, please visit imply.io. Thanks again for listening and until next time, keep it real.