Insights and Airwaves: How Global Delivers Ad Data Freshness with Apache Druid with Miguel Rodrigues

Jul 30, 2024
Reena Leone
 

On this episode, we’re diving into the challenges associated with providing real-time data access for digital advertising campaigns. Who better to weigh in on the importance of real-time data and analytics than Global, a major British media company with some of the biggest radio stations in the UK? And who better to explain how those data freshness challenges were solved than Miguel Rodrigues, Head of Engineering at Global?

Global needed a performant database solution to handle their digital ad exchange (DAX). Their previous database and data warehouse solutions faced performance issues due to the high volume of data. Apache Druid was chosen for its streaming capabilities, scalability, flexibility in schema evolution, and sub-second query performance. Adding Druid to their data architecture significantly improved their data freshness and query speeds.

Listen to learn more about:

  • How Global uses Apache Druid with other Apache projects including Kafka, Spark, and Iceberg
  • What you need to consider when getting started with Druid
  • How Imply helped Global achieve quicker time to market, easier upgrades, and better support


About the Author

Miguel Rodrigues is an accomplished and forward-thinking Head of Data Engineering with a wealth of experience driving data-driven initiatives across the financial, travel, and media sectors, having risen through the ranks from Data Engineer over the years. With a strong foundation in building scalable data platforms, Miguel excels in data integration, processing frameworks, and data contracts, enhancing data-driven decisions and business performance.

Currently leading the Data Engineering efforts at Global, Miguel has a proven track record of assembling high-performing teams, fostering innovation, and significantly improving data system reliability and efficiency. His technical expertise spans multiple languages and tools, including Python, Scala, Kafka, and AWS, to name a few. Miguel’s strategic approach ensures alignment between technology enhancements and business goals, making vital contributions in the realm of data technology.

Transcript

[00:00:00.000] – Reena Leone

Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data and analytics, but way beyond your basic BI. We’re talking about analytics applications that are taking data and insights to a whole new level.

[00:00:17.910] – Reena Leone

Global is a British media company that owns some of the largest commercial radio companies in Europe and operates seven core radio brands, including ones you may have heard of if you’re in the UK, like Capital, Heart, Gold, Classic FM, Smooth, and LBC. Now, we all know that radio and advertising go hand in hand, but so do ad spend and data availability and freshness. The digital agencies running campaigns on behalf of their clients and placing ads on Global’s channels needed access to their data in real time to understand their ad spend, which originally was a challenge. So how did Global solve that? Joining me to talk through what they did is Miguel Rodrigues, Head of Engineering at Global. Miguel, welcome to the show.

[00:00:56.010] – Miguel Rodrigues

Thank you. Thank you so much, Reena.

[00:00:57.670] – Reena Leone

Okay, so I always like to start with a little bit about my guests. So can you give me a little bit about your background and how you got to where you are today?

[00:01:05.150] – Miguel Rodrigues

Yeah. So I actually, long story short, started in academia. I got in touch with a lot of open-source projects, things of that sort, but then I wanted to have a more practical impact with what my projects were delivering. I ended up moving to mostly product companies, but also consultancy. I moved into data engineering quite early in my career, moving up as the years went by. I worked in different domains, from retail to banking and financial services to travel. And now I’ve been in media here at Global for a year and a half already.

[00:01:47.290] – Reena Leone

And what team are you part of? Because I know Global is a huge company. Yeah.

[00:01:51.620] – Miguel Rodrigues

So the team I’m heading is the data engineering team. It’s part of a wider data team. But here we have close to 20 engineers in different capacities, be they analytics engineers, warehousing engineers, data ops engineers, or data engineers, who end up working on different products across the company.

[00:02:16.220] – Reena Leone

Awesome. Let’s dive into what you work on day to day. Can you talk about your key use cases?

[00:02:23.660] – Miguel Rodrigues

Here at Global, the data engineering team works mainly on the data platform. We also end up delivering different analytics engineering insights into lots of products around the company that serve analytics teams and data science teams as well. So products like Global Player, or even DAX, which we’ll speak a little bit more about. We also provide all kinds of data for different kinds of reporting, be it usage, our listenership, where they’re located, et cetera. So we end up being the middlemen on a lot of those data sets. We also, as I mentioned before, are developing a data platform that runs on data contracts. And that is something we’ve been investing in for quite a while already.

[00:03:19.040] – Reena Leone

For your data platform, you obviously need a database. Obviously, if you’re here, we’re going to talk about Apache Druid, but were you familiar with Druid before or what prompted a search for a new database?

[00:03:35.140] – Miguel Rodrigues

Yes. For the DAX use case, the digital ad exchange use case, there was this need to record and follow how our digital advertising campaigns were doing, and also how our outdoor advertising was doing. This is something that comes in as logs every so often. So we had one of two options. We could either try to ingest these logs as they came in, so that we could provide very specific KPIs, or we could do more batch processes, which, when they failed, would be more disruptive. These logs give a lot of data regarding impressions and even revenue, which is very important for our clients, as well as listen-through rates of these ads, et cetera. So initially, even before I joined, we had a solution, if I’m not mistaken, in Postgres, or maybe even Snowflake. And the biggest difficulty with this was that the infrastructure itself was not performant enough when it came to querying this data, because it’s a lot of data.

[00:04:54.670] – Reena Leone

I was going to say, yeah, it’s pretty typical. We hear a lot of people start with Postgres or Snowflake. That’s the go-to when you’re looking for a database right out of the gate. You can’t go wrong with Postgres for a while. That’s the standard.

[00:05:10.020] – Miguel Rodrigues

Yeah. In our use case, we had, for instance, to know when we had served around 10,000 impressions. And we needed to provide these values quite quickly, in case we needed to owe a rebate, for instance, or if we had gone over what we had agreed to give away in free advertising. And what we saw was that querying took so long that we ended up going over on some of those occasions. We needed a database to support streaming-first use cases, given that’s how the team got their source data into S3 in this case.

[00:05:57.750] – Reena Leone

Yeah, actually, I was going to ask you what some of the key features that you were looking for? Obviously, streaming. Were there any other key factors?

[00:06:05.380] – Miguel Rodrigues

So streaming is definitely one of them; the ingestion we end up doing into Druid is through Kafka topics, which is what we’re using. Basically, we end up doing a lot of micro-batching in Spark. We end up having to process data as it comes in, across multiple kinds of usage logs. What we also find quite useful as a key feature is the scalability that Druid provides, and that the Imply service itself provides when we want to scale up our clusters. And also the flexibility in schema evolution. Of course, what ended up being the deciding factor for us was the sub-second query performance, which basically delivered on that very specific SLA.
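For readers who want to picture what Kafka-based streaming ingestion into Druid looks like, here is a minimal sketch of a Kafka supervisor spec submitted to the Overlord API. The topic, datasource, host, and field names are hypothetical illustrations, not Global’s actual configuration.

```python
import json
import requests  # assumes the requests library is available

# Minimal sketch of a Druid Kafka supervisor spec; all names are hypothetical.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "impression_delivery",
            "timestampSpec": {"column": "event_time", "format": "iso"},
            # Explicit dimensions shown for clarity; Druid can also auto-discover
            # new columns, which is part of what makes schema evolution painless.
            "dimensionsSpec": {"dimensions": ["campaign_id", "deal_id", "creative_id"]},
            "granularitySpec": {"segmentGranularity": "DAY", "queryGranularity": "NONE"},
        },
        "ioConfig": {
            "topic": "ad-impressions",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "useEarliestOffset": False,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submitting the spec to the Overlord starts (or updates) the streaming supervisor.
resp = requests.post(
    "http://druid-router:8888/druid/indexer/v1/supervisor",
    data=json.dumps(supervisor_spec),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
```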

[00:06:57.350] – Reena Leone

Okay, so you mentioned using Kafka for streaming, you mentioned S3, and Spark as well. Can you walk me through a little bit of how your data architecture is set up and where Druid sits in it?

[00:07:11.540] – Miguel Rodrigues

Yeah. We have a very specific S3 location where we receive logs 24/7. We use Spark Structured Streaming to ingest those logs through different layers. The default, at least, would be bronze, silver, and gold. That’s the standard we use, and it’s used across other companies too. We basically use those layers to prepare, transform, or enrich the data, and we use some other static sources of data to enrich it. And then we end up pushing all the data into Kafka, where it’s then ingested into Druid. We use multiple ingestion specs. We even have a framework to manage how the ingestion is done in Druid. And basically, the variety and flexibility of ingestion allows us to load from S3, so even if anything goes wrong with an update, for instance, and we need to backload a full day’s worth of data, we can use different sources across our tech stack to ingest data into Druid, not only Kafka.
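To make the S3-to-Kafka hop more concrete, here is a simplified Spark Structured Streaming sketch in Python. The bucket paths, schema, and topic name are hypothetical, and the real pipeline enriches the data across more layers than shown here.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("dax-log-ingest").getOrCreate()

# Hypothetical schema for the raw delivery logs landing in S3.
log_schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("campaign_id", StringType()),
    StructField("creative_id", StringType()),
    StructField("event_type", StringType()),
])

# Read raw logs as they land in S3 (the "bronze" layer).
raw = (
    spark.readStream
    .schema(log_schema)
    .json("s3://example-bucket/dax/raw-logs/")
)

# Light enrichment; the silver/gold layers would do considerably more.
enriched = raw.withColumn("event_date", F.to_date("event_time"))

# Publish each record as a JSON message to Kafka, where Druid picks it up.
query = (
    enriched
    .select(F.to_json(F.struct(*enriched.columns)).alias("value"))
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("topic", "ad-impressions")
    .option("checkpointLocation", "s3://example-bucket/dax/checkpoints/")
    .start()
)
```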

[00:08:26.600] – Reena Leone

And how much data are you ingesting into it on, say, a daily basis? And as a follow-up to that, what are you seeing in terms of your query speeds?

[00:08:36.780] – Miguel Rodrigues

On query speeds, I can give a quick update. Our median is around 170 milliseconds when it comes to query speeds, which is quite nice. I think P99 is around 110. I’m not sure of that number, but it’s close to that. Usually, it’s very rare for us to have any complaints about performance on these. It’s only when we make major changes to the clusters, like an upgrade, which is very rare, maybe annual or every six months, that something may come up, but even then the rollback capability is quite neat in that respect. When it comes to volumes, we have our three biggest tables, let’s say. We have an impression delivery table, which is sitting around 250 gigabytes a day. We have an inventory one, which is around 60 gigabytes a day. Then we have a deal health one, which is around 870 gigabytes a day. So this all totals close to 1.2 terabytes a day, with a lot of historical aggregations done beforehand in Spark. Most of the data ingested into Druid also ends up being rolled up on a daily basis, which reduces the amount of data that’s stored.

[00:09:59.310] – Miguel Rodrigues

So we were doing some calculations, and we have at least a 50% reduction in the data that ends up being in Druid, thanks to this roll-up capability.
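To illustrate where that reduction comes from, here is a toy, pure-Python illustration of what daily rollup does: events that share the same dimension values within a day collapse into a single pre-aggregated row. Druid performs this at ingestion time when rollup is enabled with a DAY query granularity; the event data below is made up.

```python
from collections import defaultdict

# Made-up raw events: one row per impression.
events = [
    {"day": "2024-07-01", "campaign_id": "c1", "impressions": 1, "revenue": 0.02},
    {"day": "2024-07-01", "campaign_id": "c1", "impressions": 1, "revenue": 0.03},
    {"day": "2024-07-01", "campaign_id": "c2", "impressions": 1, "revenue": 0.05},
    {"day": "2024-07-01", "campaign_id": "c1", "impressions": 1, "revenue": 0.02},
]

# Roll up: one stored row per (day, dimension combination), metrics summed.
rolled_up = defaultdict(lambda: {"impressions": 0, "revenue": 0.0})
for e in events:
    key = (e["day"], e["campaign_id"])
    rolled_up[key]["impressions"] += e["impressions"]
    rolled_up[key]["revenue"] += e["revenue"]

# 4 raw events collapse into 2 stored rows; daily-level queries lose nothing.
print(len(events), "events ->", len(rolled_up), "rows")
```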

[00:10:12.520] – Reena Leone

One thing I know that was a challenge was data freshness and data availability. Can you walk me through how you solve that?

[00:10:21.500] – Miguel Rodrigues

Yeah. We end up providing data, let’s say, 24/7. The whole process from end to end takes about 15 to 20 minutes, depending on the time intervals between new source batches arriving. The original data may take 2 to 10 minutes to arrive, and then there’s the whole processing and delivering of that data into Druid. That’s why we estimate it’s about 15 to 20 minutes for new data to arrive. Basically, as I said before, our median queries run around 170 milliseconds. Actually, I just checked here, and our query time is at or below 110 milliseconds 98% of the time. We have around 20,000 queries per five minutes. We can do the math of how many those would be per day.

[00:11:23.910] – Reena Leone

I’m not doing the math in my head right now. We don’t have time for that.
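For anyone who does want that math, here is the back-of-the-envelope calculation based on the figure Miguel quotes above (roughly 20,000 queries every five minutes):

```python
# Extrapolate a 5-minute query count to a full day.
queries_per_5_min = 20_000
windows_per_day = (24 * 60) // 5          # 288 five-minute windows per day
queries_per_day = queries_per_5_min * windows_per_day
print(f"~{queries_per_day:,} queries per day")   # ~5,760,000
```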

[00:11:28.970] – Miguel Rodrigues

We basically end up having 150 million events processed per day. I told you we had about a 50% reduction; it’s actually a little bit more than that. We ended up having around 400 output rows out of that.

[00:11:48.120] – Reena Leone

You chose the path of going to Druid with Imply. Can you talk me through that decision and why you chose to go with a vendor like Imply instead of open source Druid?

[00:12:01.580] – Miguel Rodrigues

Yeah. So this was a decision that was made beforehand.

[00:12:05.910] – Reena Leone

You’re like, “Yeah, it wasn’t me.”

[00:12:07.380] – Miguel Rodrigues

Yeah, it wasn’t me. But I was able to dig a little bit into why the decision was made. I also have my own opinion of why that was, of course. For starters, the fact that we’re using an open-source technology is really good. And then the fact that we have Imply helps mostly with how long it takes for us to provide any value through Druid, because if we had to deliver a solution ourselves, we’d have to manage all the Kubernetes deployments, all the Helm charts, security, et cetera. This is something that, when it comes to time to market, helped us immensely.

[00:12:53.360] – Reena Leone

Have there been additional benefits beyond not having to do all the configuration yourself? Please feel free to share your own opinion, too.

[00:13:04.840] – Miguel Rodrigues

Yeah, as I said, I think a quicker time to market definitely helps. Even managing upgrades and rollbacks is way easier through the Imply UI. I mean, it’s way easier, because otherwise we’d need a specific team to be able to handle this, and even the specific support issues that we sometimes raise to understand how Druid works. It’s good to be able to use Imply, but when it comes to the Druid configuration more specifically, we end up doing that a lot ourselves, given all the work that we did with ingestion specs and how we deploy and apply those ingestion specs. So I’d say it’s quite flexible, and we can still benefit from both sides: totally open source Druid, and then the managed service that Imply provides.

[00:14:01.480] – Reena Leone

Are there any additional use cases or integrations that you’re exploring?

[00:14:06.050] – Miguel Rodrigues

So, yeah, as I mentioned before, one of our use cases is a data platform where we want to standardize our processes in data engineering. But not only that, we want to standardize where people can analyze their data. We want to reduce the number of tools we’re using across the business. And with this, we’re looking towards using [Apache] Iceberg as a table format, which is quite rich and powerful. We would like to be able to use Glue with it, because that’s the catalog we’re using across this data platform. We have more than 300 tables already in it, so it would be quite a win for us if we could ingest directly from that, although for now we’re mostly ingesting from S3 when we’re not using Kafka. Marrying those would be a great win, I’d say, when it comes to the partnership. And also thinking about what the wider role of Druid across the whole organization would be: is it the best tool we can use for, let’s say, an analytical layer across all of Global or not? That would also be something we’d look at in the future.

[00:15:23.610] – Reena Leone

Do you have any advice for people who are just getting started with Druid?

[00:15:28.200] – Miguel Rodrigues

That’s a good one. I’d say don’t come to it thinking that it’s your regular database solution. A lot of people tend to minimize that aspect of these tools just because you can do SQL, assuming it’s just your run-of-the-mill warehouse, because it’s not. It’s way more than that. It’s a rather powerful tool to help you keep your data in place and well organized. Even the ingestion specs are very powerful, and what they end up doing is way more than you might expect.

[00:16:08.560] – Miguel Rodrigues

I’d say even things like rollups are a very interesting thing to have in a presentation layer such as this. So yeah, come with an open mind, but don’t oversimplify things. Come knowing that it is quite a flexible tool, so be very careful when you apply changes. Initially, when we were learning about Druid and how it works, we had some unpleasant surprises, but that was our own fault. So it’s been quite an interesting ride, I’d say. We’ve grown immensely when it comes to our knowledge of it.

[00:16:51.850] – Reena Leone

I feel like flexibility and complexity are just two sides of the same coin.

[00:16:58.220] – Miguel Rodrigues

It’s a double-edged sword. I like the fact that, for instance, you can have schema evolution and all of that in Druid, which is very helpful. But as you said, it’s a double-edged sword. You can even quote some cinema characters here about power and responsibility.

[00:17:22.710] – Reena Leone

I’m surprised that I… Have I done that? I may have done that on last week’s show.

[00:17:28.110] – Miguel Rodrigues

Yeah, it’s possible. But yeah, even the UI is quite helpful. One thing we would be looking at, now that I remember, is integrating with Grafana, which would be great for us to have even more insights. We use Clarity for some of these insights, the ones I’ve provided when it comes to query times and so on, but we’d like to expand on that. So don’t come at it in a very simplified way, thinking it just does one thing, a presentation layer or just a SQL engine. It’s way more than that, and you can integrate and empower what you provide to other teams way more than you initially think.

[00:18:12.030] – Reena Leone

I mean, at least we’re working, with every release, towards SQL standard compliance for folks who are familiar with SQL and are using it on a regular basis. And going back to what you mentioned about Iceberg, I know Iceberg support was launched a couple of releases ago and has been an ongoing thing. So that’s something to keep an eye on for you.

[00:18:32.890] – Reena Leone

Miguel, this was great. Thank you so much for joining me today and talking through your project.

[00:18:38.620] – Reena Leone

If you want to learn more about Global, please visit global.com. To learn more about Apache Druid, please visit druid.apache.org. And to hear more about what we’re working on at Imply, please visit imply.io. Until next time, keep it real.
