“Thanks to Apache Druid we feel not only that we’ve caught up with our competitors – the likes of Adobe, Facebook, Snapchat, and Google – but that we are poised to take it to the next level.”
Jeetal Shah (Director of Engineering) and Kevin Peng (Senior Technology Lead) are responsible for delivering a low-latency, high-throughput analytical pipeline at Amobee. The need to remain competitive in this market is obvious to Jeetal:
“The Steve Jobs philosophy is this: catching up is a losing game – you have to take the lead. Druid, as the centrepiece of our architecture, has allowed us to take that leap ahead. It has changed our life!”
Jeetal and Kevin’s teams are responsible for converging each week’s trillions of rows of millisecond-level data from multiple platforms, and for providing statistical information not only to their customers, but to other internal teams and systems (like intelligent bidding) as well as to third-party partners. All this must be accomplished quickly, efficiently, and cheaply.
Within a year of Kevin introducing Druid in 2019, it had more than proven itself as the “magical ingredient” for delivering fast, flexible insights.
“We had taken a standard pre-processing approach. We created different aggregated views of the underlying data tables, each with a different periodic aggregate – monthly, daily, and so on – that people could query. But this introduced massive problems when giving reports to users.”
“We were always constrained by that pre-computed view of ‘daily’. ‘Daily’ in the US Eastern, US Pacific, and UK time zones is not the same thing – it was impossible to give true ‘daily’ statistics to customers in international markets.”
“And if we wanted to add another field, it wasn’t just adding another field and you’re done: you had to add the field to all these other aggregates.”
Kevin describes a moment that many Druid adopters share: they were asking themselves, “Why do we have to do all this work just to add another column, or to give a specific market the daily aggregate they need? We’re paying a fortune for processing and for storage – why can’t we just have a simple, wide view of a table? There must be something to make this easier and more efficient.”
That something, they discovered, was Druid.
The road to scale is paved with good intentions
Amobee started by implementing Apache Spark on Apache Parquet-format data in a data lake. But they had to make some concessions. Accepting that it might not provide real-time insights or sub-second responses, they felt they could at least solve for offline, standardised reports where a response was required within an hour. “At least there were no additional storage costs, as we gained a single Apache Hive and Spark view of the data.”
But increasing adoption of Spark as a solution created another problem – concurrency:
“Five hundred requests would come in within a short period of time, and the YARN resources would get maxed out. We worked out the amount of money needed to increase resources to meet that demand and it was double, triple, quadruple what we were paying. I mean, that’s just not feasible.”
Enter our hero: a fellow Druid community member working in another area of Amobee where Druid was already deployed. Two brokers, three data servers and a master server later, Kevin’s team ingested the Parquet data and “snap - queries we waited hours for came back sub-second!”
Armed with a rolled-up view provided by Druid, Amobee were finally free to ask for statistics for any time period relevant to any market internationally, to have a fluid schema, and to do all this at much reduced cost. No more multiple aggregates by timezone, just one consistent view that they can slice and dice in any way they want.
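The timezone flexibility described above comes from computing “daily” at query time instead of at pre-aggregation time. As a hedged sketch (the datasource name `campaign_metrics` and the column names here are hypothetical, not Amobee’s actual schema), Druid SQL’s `TIME_FLOOR` function can bucket the same rolled-up data into any market’s local day:

```python
import json

# Hypothetical datasource name; Amobee's actual schema is not public.
DATASOURCE = "campaign_metrics"

def daily_report_sql(timezone: str) -> str:
    """Build a Druid SQL query that computes 'daily' totals in any
    market's timezone against one rolled-up datasource -- no
    per-timezone pre-aggregates needed."""
    return f"""
    SELECT
      TIME_FLOOR(__time, 'P1D', NULL, '{timezone}') AS report_day,
      campaign_id,
      SUM(impressions) AS impressions,
      SUM(clicks) AS clicks
    FROM "{DATASOURCE}"
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
    GROUP BY 1, 2
    ORDER BY 1
    """

# The same query serves any market: only the timezone string changes.
payload = {"query": daily_report_sql("Europe/London")}
# This payload would be POSTed to a Broker's SQL endpoint,
# e.g. http://broker:8082/druid/v2/sql
print(json.dumps(payload)[:80])
```

One wide, rolled-up table plus a query-time timezone parameter replaces the whole family of per-timezone daily aggregates.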
“We used to deliver reports globally at 9 a.m. Pacific time – now we deliver them at 9 a.m. local time in each market’s timezone. It’s a significant benefit for our customers – especially internationally.”
Not all heroes wear capes
Long term, Kevin and Jeetal are researching how to use Druid to move into a Lambda architecture, incorporating streaming data.
“With streaming, when a campaign is created, people will be able to get instant insights from Kafka alongside further insights given by batch data.”
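In Druid, that streaming half of a Lambda architecture is typically driven by a Kafka supervisor spec. The sketch below shows the general shape of such a spec; the topic, server address, datasource, and column names are all hypothetical placeholders, not details Amobee has shared:

```python
import json

# Sketch of a Druid Kafka supervisor spec for the streaming half of a
# Lambda architecture; topic, servers, and schema are hypothetical.
kafka_supervisor = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "campaign-events",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        },
        "dataSchema": {
            "dataSource": "campaign_metrics",
            "timestampSpec": {"column": "event_time", "format": "millis"},
            "dimensionsSpec": {"dimensions": ["campaign_id", "market"]},
            # Roll up streamed events at ingestion, just like batch data.
            "granularitySpec": {"queryGranularity": "minute", "rollup": True},
            "metricsSpec": [
                {"type": "count", "name": "count"},
                {"type": "longSum", "name": "impressions",
                 "fieldName": "impressions"},
            ],
        },
    },
}

# Submitting this to the Overlord (POST /druid/indexer/v1/supervisor)
# starts continuous ingestion: queries then see events within seconds
# of their arrival on the topic, alongside historical batch segments.
print(json.dumps(kafka_supervisor)[:40])
```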
Amobee has a fully penetration-tested API access layer that fields requests and returns data, whether that means outputting results into S3 or returning them directly as JSON. Upstream, the team has had great success using Apache Airflow for early data processing, Parquet file generation, and streamlined Druid ingestion spec creation and submission.
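An Airflow task in that pipeline might generate a native batch ingestion spec and submit it to Druid. As a hedged illustration (the datasource, S3 paths, and column names below are invented for the example, and Amobee’s actual specs are not public), the spec for parallel ingestion of Parquet files looks roughly like this:

```python
import json

def build_ingestion_spec(datasource: str, s3_uris: list, interval: str) -> dict:
    """Build a Druid native-batch (index_parallel) ingestion spec for
    Parquet files -- a sketch of what an Airflow task might generate."""
    return {
        "type": "index_parallel",
        "spec": {
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "uris": s3_uris},
                "inputFormat": {"type": "parquet"},
            },
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "event_time", "format": "millis"},
                "dimensionsSpec": {"dimensions": ["campaign_id", "market"]},
                # Rollup at ingestion: one summarised row per dimension
                # combination per queryGranularity bucket.
                "granularitySpec": {
                    "segmentGranularity": "day",
                    "queryGranularity": "minute",
                    "rollup": True,
                    "intervals": [interval],
                },
                "metricsSpec": [
                    {"type": "count", "name": "count"},
                    {"type": "longSum", "name": "impressions",
                     "fieldName": "impressions"},
                ],
            },
        },
    }

spec = build_ingestion_spec(
    "campaign_metrics",
    ["s3://example-bucket/dt=2020-06-01/part-0.parquet"],
    "2020-06-01/2020-06-02",
)
# An Airflow task would POST this JSON to the Overlord's task endpoint,
# e.g. http://overlord:8081/druid/indexer/v1/task
print(json.dumps(spec)[:60])
```

Generating specs programmatically like this is what makes the “fluid schema” practical: adding a column means changing one template, not rebuilding a family of aggregates.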
Kevin and Jeetal are heroes: through Druid they have truly revolutionised Amobee, no doubt contributing to Amobee’s move into the Leaders quadrant of Gartner’s Magic Quadrant for advertising technology.
We extend our greatest thanks for this chance to tell their story and for their involvement in the Druid community channels (especially Slack, where Kevin is a consistently active participant!), and we are sure the community looks forward to hearing more from them in the future.