Overview
Netflix is the world’s leading streaming platform, serving hundreds of millions of viewers across devices, regions, and networks. Its mission: provide a consistently smooth, high-quality viewing experience while continuously shipping new features and updates at global scale.
Challenge
Delivering uninterrupted streaming quality is critical to Netflix’s brand reputation. As Ben Sykes of Netflix explained, “Ensuring a consistently great Netflix experience while continuously pushing innovative technology updates is no easy feat.”
The scale of the problem was immense. Playback devices generate more than 115 billion rows of data every day, with ingestion running at over 2 million events per second. Viewers stream on smart TVs, mobile devices, consoles, and browsers, each with its own software version and network conditions. At the same time, Netflix pushes frequent app and platform updates, which must be validated in real time to avoid regressions. Traditional monitoring systems couldn’t keep pace with this volume or deliver the low-latency insights Netflix required.
Solution
To meet these demands, Netflix built a real-time observability platform using Kafka for event ingestion and Druid for analytics at scale. Real-time logs from playback devices capture startup times, buffering rates, error codes, and other performance signals, which flow through Kafka into Druid.
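As a rough illustration of that flow, the sketch below models a single playback event as it might be serialized for a Kafka topic, then builds a request body for Druid's SQL endpoint (`POST /druid/v2/sql`). The event field names, the `playback_events` datasource name, and the helper functions are assumptions made for this example; the source names only the kinds of signals captured, not Netflix's actual schema. Nothing is sent over the network here.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PlaybackEvent:
    """One playback measurement. All field names are hypothetical."""
    timestamp: str            # ISO-8601 event time
    device_type: str          # e.g. "smart_tv", "mobile"
    app_version: str
    country: str
    startup_ms: int           # time from play request to first frame
    rebuffer_count: int
    error_code: Optional[str] = None


def to_kafka_value(event: PlaybackEvent) -> bytes:
    """Serialize an event as UTF-8 JSON, a common Kafka message format."""
    return json.dumps(asdict(event)).encode("utf-8")


def build_druid_sql_payload(datasource: str, minutes: int) -> dict:
    """Build a JSON body for Druid's SQL API.

    The query averages startup time per minute, device type, and app
    version over the last `minutes` minutes.
    """
    query = f"""
        SELECT TIME_FLOOR(__time, 'PT1M') AS "minute",
               device_type,
               app_version,
               AVG(startup_ms) AS avg_startup_ms,
               COUNT(*) AS events
        FROM "{datasource}"
        WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '{int(minutes)}' MINUTE
        GROUP BY 1, 2, 3
        ORDER BY 1
    """
    # Collapse whitespace so the payload is compact and easy to log.
    return {"query": " ".join(query.split())}


if __name__ == "__main__":
    event = PlaybackEvent("2024-01-01T00:00:00Z", "smart_tv", "1.2.3", "US", 850, 0)
    print(to_kafka_value(event))
    print(json.dumps(build_druid_sql_payload("playback_events", 15), indent=2))
```

In a real deployment the serialized bytes would be handed to a Kafka producer, and the payload would be posted to a Druid broker or router. Both steps are omitted so the sketch stays self-contained.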
With Druid’s sub-second querying, engineers can drill into trillions of rows in tens of milliseconds, gaining near-instant visibility into playback metrics across devices, regions, and versions. Controlled rollouts allow new app versions to be compared against baseline metrics, with automatic rollbacks triggered if regressions are detected. Continuous anomaly detection helps identify and resolve issues before they reach large audiences.
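The rollout check described above can be sketched as a comparison between a baseline sample and a sample from the new version. This is a minimal illustration, not Netflix's method: the metric (startup time), the 10% threshold, and the names `evaluate_rollout` and `RolloutDecision` are all assumptions chosen for the example, and a production system would likely use a proper statistical test rather than a plain difference of means.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class RolloutDecision:
    rollback: bool
    baseline_mean: float
    canary_mean: float
    relative_change: float  # positive means the new version is slower


def evaluate_rollout(
    baseline_startup_ms: Sequence[float],
    canary_startup_ms: Sequence[float],
    max_regression: float = 0.10,  # hypothetical threshold: 10% slower
) -> RolloutDecision:
    """Flag a rollback when mean startup time regresses past the threshold."""
    if not baseline_startup_ms or not canary_startup_ms:
        raise ValueError("both samples must be non-empty")

    baseline = mean(baseline_startup_ms)
    canary = mean(canary_startup_ms)
    if baseline <= 0:
        raise ValueError("baseline mean must be positive")

    change = (canary - baseline) / baseline
    return RolloutDecision(
        rollback=change > max_regression,
        baseline_mean=baseline,
        canary_mean=canary,
        relative_change=change,
    )


if __name__ == "__main__":
    decision = evaluate_rollout([900, 1000, 1100], [1150, 1200, 1250])
    print(decision)
```

The threshold trades sensitivity against false alarms: a tighter value catches smaller regressions but reacts to ordinary noise more often.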
“Using real-time logs from playback devices as a source of events, we derive measurements in order to understand and quantify how seamlessly users’ devices are handling browsing and playback.”
— Ben Sykes, Netflix
Results
- Scale — Ingests 2M+ events per second and processes 115B new rows daily
- Speed — Queries return in tens of milliseconds, even across trillions of rows
- Granularity — Device, region, and version tags isolate issues with precision
- Reliability — 100% uptime during global streaming peaks ensures uninterrupted experiences
- Confidence — Engineers ship features faster, knowing regressions will be caught instantly
“Every measure is tagged with anonymized details about the kind of device being used… enabling us to isolate issues that may only affect a certain group, such as a version of the app, certain types of devices, or particular countries.”
— Ben Sykes, Netflix
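To show how such tags can isolate a problem to one group, the sketch below computes an error rate for each combination of tag values over a small in-memory sample. The tag names, event shape, and sample data are hypothetical; at Netflix's scale this grouping would run as a query in Druid rather than in application code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Sequence, Tuple


def error_rate_by_tags(
    events: Iterable[Mapping[str, object]],
    tag_keys: Sequence[str],
) -> Dict[Tuple[object, ...], float]:
    """Return the fraction of events with an error for each tag combination."""
    totals: Dict[Tuple[object, ...], int] = defaultdict(int)
    errors: Dict[Tuple[object, ...], int] = defaultdict(int)

    for event in events:
        group = tuple(event[key] for key in tag_keys)
        totals[group] += 1
        # An event counts as failed when it carries a non-null error code.
        if event.get("error_code") is not None:
            errors[group] += 1

    return {group: errors[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Made-up sample: only version 2.0 on smart TVs reports errors.
    sample = [
        {"device_type": "smart_tv", "app_version": "2.0", "error_code": "E42"},
        {"device_type": "smart_tv", "app_version": "2.0", "error_code": None},
        {"device_type": "smart_tv", "app_version": "1.9", "error_code": None},
        {"device_type": "mobile", "app_version": "2.0", "error_code": None},
    ]
    rates = error_rate_by_tags(sample, ["device_type", "app_version"])
    for group, rate in sorted(rates.items()):
        print(group, f"{rate:.0%}")
```

Grouping by more tags narrows each group, which sharpens isolation but leaves fewer events per group to draw conclusions from.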
Why It Matters
For Netflix, playback quality is everything. A single glitch can erode trust across millions of viewers. By scaling observability to trillions of rows and millions of events per second, Netflix detects problems in real time, protects viewing experiences worldwide, and continues innovating rapidly without sacrificing quality.
Netflix ensures streaming quality for hundreds of millions of viewers with real-time observability built on Apache Druid. That same Druid query engine powers Imply Lumi, the industry’s first Observability Warehouse—delivering proven speed, scalability, and efficiency for your mission-critical workloads.