Overview
Netflix is a leading subscription-based streaming service that allows its users to watch TV shows and movies on internet-connected devices. Founded in 1997 and headquartered in Los Gatos, California, Netflix is now a household name globally.
To ensure a consistently great experience for more than 100 million members in more than 190 countries, who together watch 125 million hours of TV shows and movies each day, Netflix built an analytics application powered by Apache Druid. By turning log streams into real-time metrics, Netflix is able to see how over 300 million devices (across four major UIs) are performing in the field at all times.
Netflix chose Druid because it meets their combined requirements: a high data ingestion rate, high-cardinality data, and fast queries.
Challenge
An ongoing challenge for Netflix is to consistently deliver a great streaming entertainment experience while continuously pushing innovative technology updates.
As Netflix’s adoption has skyrocketed, this challenge has grown more complex. With over 300 million devices spanning four major UIs (iOS, Android, Smart TVs, and the Netflix website), Netflix has a constant need to identify and isolate issues that may affect only a certain group, such as one version of the app, certain types of devices, or particular countries.
Netflix needed to be sure that the updates they rolled out didn’t interfere with or degrade the user experience, while also ensuring that changes, fixes, and improvements added to that experience in a meaningful and measurable way.
Solution
Netflix chose Apache Druid as the database to power their real-time analytics application because it can ingest event data at a high rate, handle high-cardinality data, and still answer queries quickly. To quantify how seamlessly users’ devices are handling browsing and playback, Netflix derives measurements using real-time logs from playback devices as a source of events.
Once they have these measures, Netflix feeds them into Druid. Every measure is tagged with anonymized details about the kind of device being used, for example, whether the device is a Smart TV, an iPad, or an Android phone. This enables Netflix to classify devices and slice the data along different dimensions. With Druid, this aggregated data is available immediately for querying, either via dashboards or ad-hoc queries.
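As a rough illustration of tagging measures with dimensions and then aggregating them, the following Python sketch groups events by time bucket and device attributes. It is not Druid's implementation or Netflix's pipeline; the field names (`device_type`, `app_version`, `startup_ms`) and the 60-second bucket are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    # All field names are hypothetical, chosen only for illustration.
    timestamp_s: int   # event time in epoch seconds
    device_type: str   # anonymized dimension, e.g. "smart_tv"
    app_version: str   # anonymized dimension
    startup_ms: float  # the measured value


def roll_up(measures, bucket_s=60):
    """Aggregate measures per (time bucket, device_type, app_version)."""
    totals = defaultdict(lambda: {"count": 0, "sum_ms": 0.0})
    for m in measures:
        # Truncate the timestamp to the start of its bucket.
        bucket = m.timestamp_s - (m.timestamp_s % bucket_s)
        key = (bucket, m.device_type, m.app_version)
        totals[key]["count"] += 1
        totals[key]["sum_ms"] += m.startup_ms
    return dict(totals)


def mean_by_device(rolled):
    """Slice the rolled-up data along one dimension: device_type."""
    acc = defaultdict(lambda: [0, 0.0])
    for (_, device, _), agg in rolled.items():
        acc[device][0] += agg["count"]
        acc[device][1] += agg["sum_ms"]
    return {device: total / count for device, (count, total) in acc.items()}


events = [
    Measure(1000, "smart_tv", "1.0", 900.0),
    Measure(1010, "smart_tv", "1.0", 1100.0),
    Measure(1075, "android_phone", "1.1", 400.0),
]
rolled = roll_up(events)
print(mean_by_device(rolled))
```

Storing a count and a sum per dimension combination, rather than every raw event, is what lets the same data be re-sliced by any tagged attribute at query time.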
Netflix also uses Druid for A/B testing to assess how updates and changes affect various user groups. It compares how the new version performs against the older version to decide whether users on different systems should receive the update.
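The per-group comparison described above might look something like the following Python sketch. It assumes a hypothetical error-rate metric and a simple 5% tolerance rule; the source does not describe Netflix's actual decision criteria, and a production A/B test would normally also check statistical significance.

```python
from statistics import mean


def compare_versions(samples, baseline, candidate, tolerance=0.05):
    """Per device group, report whether the candidate is no worse than baseline.

    samples: iterable of (device_type, app_version, error_rate) tuples.
    tolerance: hypothetical allowed relative regression (5% by default).
    """
    by_group = {}
    for device, version, value in samples:
        by_group.setdefault(device, {}).setdefault(version, []).append(value)

    verdicts = {}
    for device, versions in by_group.items():
        if baseline not in versions or candidate not in versions:
            continue  # not enough data to compare this group
        old = mean(versions[baseline])
        new = mean(versions[candidate])
        # Lower error rate is better; allow a small relative regression.
        verdicts[device] = new <= old * (1 + tolerance)
    return verdicts


samples = [
    ("smart_tv", "1.0", 0.020), ("smart_tv", "1.1", 0.018),
    ("android_phone", "1.0", 0.010), ("android_phone", "1.1", 0.030),
]
verdicts = compare_versions(samples, "1.0", "1.1")
print(verdicts)
```

In this made-up data the new version holds up on Smart TVs but regresses on Android phones, which is the kind of group-specific difference that would argue for rolling out to one group and not the other.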
Results
Two Million Events per Second and 1.5 Trillion Rows in Near Real Time:
“Druid can make some optimizations in how it stores, distributes, and queries data such that we’re able to scale the datasource to trillions of rows and still achieve query response times in the 10s of milliseconds.” - Ben Sykes, Senior Software Engineer, Netflix
By ingesting over 2 million events per second and querying over 1.5 trillion rows, Netflix engineers are able to pinpoint anomalies within their infrastructure, endpoint activity, and content flow. The ultimate benefit is speed, which is essential for a service that needs to react to a massive number of users in near real time.