Netflix Delivers Real-Time Observability for Playback Quality

Overview

Netflix is the world’s leading streaming platform, serving hundreds of millions of viewers across devices, regions, and networks. Its mission: provide a consistently smooth, high-quality viewing experience while continuously shipping new features and updates at global scale.

Challenge

Delivering uninterrupted streaming quality is critical to Netflix’s brand reputation. As Ben Sykes explained, “Ensuring a consistently great Netflix experience while continuously pushing innovative technology updates is no easy feat.”

The scale of the problem was immense. Playback devices generate over 2 million events per second and more than 115 billion rows of data every day. Viewers stream on smart TVs, mobile devices, consoles, and browsers, each with its own software version and network conditions. At the same time, Netflix pushes frequent app and platform updates, which must be validated in real time to avoid regressions. Traditional monitoring systems couldn’t keep pace with this volume or deliver the low-latency insights Netflix required.
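To put the two headline figures on a common scale, the short calculation below converts between a per-second rate and a daily row count. It is only a back-of-the-envelope sketch: it assumes one stored row per ingested event, which the source does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted in the text above.
events_per_second = 2_000_000
rows_per_day = 115_000_000_000

# Average rate implied by the daily row count
# (assumption: one stored row per ingested event).
avg_rows_per_second = rows_per_day / SECONDS_PER_DAY

# Daily volume if the 2M/s rate were sustained around the clock.
rows_per_day_at_2m = events_per_second * SECONDS_PER_DAY

print(f"{avg_rows_per_second:,.0f} rows/s on average")
print(f"{rows_per_day_at_2m:,} rows/day at a sustained 2M events/s")
```

Under that assumption the daily figure works out to roughly 1.33 million rows per second on average, which suggests the 2 million events per second figure is best read as a peak rate rather than a round-the-clock average.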

Solution

To meet these demands, Netflix built a real-time observability platform using Kafka for event ingestion and Druid for analytics at scale. Real-time logs from playback devices capture startup times, buffering rates, error codes, and other performance signals, which flow through Kafka into Druid.
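As an illustration of what one such playback event might look like on its way into Kafka, here is a minimal sketch. The field names, the JSON encoding, and the `to_kafka_value` helper are all assumptions made for this example; the source names only the kinds of signal captured, not the actual schema or serialization format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PlaybackEvent:
    # Hypothetical schema: the source lists startup times, buffering,
    # and error codes as signals but does not give field names.
    timestamp: str             # ISO 8601 event time
    device_type: str
    app_version: str
    country: str
    startup_ms: int            # time from play request to first frame
    rebuffer_count: int
    error_code: Optional[str]  # None when playback succeeded


def to_kafka_value(event: PlaybackEvent) -> bytes:
    """Serialize an event as compact UTF-8 JSON (an assumed message encoding)."""
    return json.dumps(asdict(event), separators=(",", ":")).encode("utf-8")


event = PlaybackEvent(
    timestamp="2024-01-01T12:00:00Z",
    device_type="smart_tv",
    app_version="8.2.0",
    country="BR",
    startup_ms=1240,
    rebuffer_count=0,
    error_code=None,
)
payload = to_kafka_value(event)
print(payload.decode("utf-8"))
```

The resulting bytes are what a Kafka producer would send as the message value. Keeping the device, version, and country tags on every event is what later allows queries to slice by those dimensions.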

With Druid’s sub-second querying, engineers can drill into trillions of rows in tens of milliseconds, gaining instant visibility into playback metrics across devices, regions, and versions. Controlled rollouts allow new app versions to be tested against baseline metrics, with automatic rollbacks triggered if regressions are detected. Continuous anomaly detection helps identify and resolve issues before they impact large audiences.
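The controlled-rollout idea can be sketched as a simple comparison between a baseline and a canary population. This is not Netflix's actual rollback logic, which the source does not describe; the metric names, thresholds, and the `should_roll_back` function are hypothetical and chosen only to show the shape of such a check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybackMetrics:
    sessions: int
    errors: int
    median_startup_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.sessions if self.sessions else 0.0


def should_roll_back(
    baseline: PlaybackMetrics,
    canary: PlaybackMetrics,
    max_error_rate_increase: float = 0.005,  # hypothetical: +0.5 percentage points
    max_startup_regression: float = 0.10,    # hypothetical: 10% slower startup
    min_sessions: int = 1000,                # hypothetical minimum sample size
) -> bool:
    """Return True if the canary looks worse than the baseline beyond tolerance."""
    if canary.sessions < min_sessions:
        return False  # too little data to judge either way
    error_regressed = (
        canary.error_rate - baseline.error_rate > max_error_rate_increase
    )
    startup_regressed = canary.median_startup_ms > baseline.median_startup_ms * (
        1 + max_startup_regression
    )
    return error_regressed or startup_regressed


baseline = PlaybackMetrics(sessions=500_000, errors=2_500, median_startup_ms=1200.0)
bad_canary = PlaybackMetrics(sessions=20_000, errors=260, median_startup_ms=1230.0)
good_canary = PlaybackMetrics(sessions=20_000, errors=110, median_startup_ms=1210.0)

print(should_roll_back(baseline, bad_canary))
print(should_roll_back(baseline, good_canary))
```

In a real system the two `PlaybackMetrics` values would come from queries filtered by app version, and a production check would likely use a statistical test rather than fixed thresholds.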

“Using real-time logs from playback devices as a source of events, we derive measurements in order to understand and quantify how seamlessly users’ devices are handling browsing and playback.”
Ben Sykes, Netflix

Results

  • Scale — Ingests 2M+ events per second and processes 115B new rows daily
  • Speed — Queries return in tens of milliseconds, even across trillions of rows
  • Granularity — Device, region, and version tags isolate issues with precision
  • Reliability — 100% uptime during global streaming peaks ensures uninterrupted experiences
  • Confidence — Engineers ship features faster, knowing regressions will be caught instantly

“Every measure is tagged with anonymized details about the kind of device being used… enabling us to isolate issues that may only affect a certain group, such as a version of the app, certain types of devices, or particular countries.”
Ben Sykes, Netflix
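The value of tagging every measure can be seen in a small in-memory sketch: grouping by progressively more tags narrows a problem down to the affected cohort. The event fields and the `error_rate_by` helper are hypothetical, and at Netflix's scale this grouping would be done by Druid queries rather than in application code.

```python
from collections import defaultdict


def error_rate_by(events, *tags):
    """Return {tag-value tuple: error rate} for the given tag names."""
    totals = defaultdict(lambda: [0, 0])  # key -> [errors, sessions]
    for e in events:
        key = tuple(e[t] for t in tags)
        totals[key][0] += 1 if e["error"] else 0
        totals[key][1] += 1
    return {key: errors / n for key, (errors, n) in totals.items()}


# Hypothetical sample data: errors occur only on TVs running version 8.2.
events = [
    {"device": "tv", "app_version": "8.1", "country": "BR", "error": False},
    {"device": "tv", "app_version": "8.1", "country": "US", "error": False},
    {"device": "tv", "app_version": "8.2", "country": "BR", "error": True},
    {"device": "tv", "app_version": "8.2", "country": "US", "error": True},
    {"device": "tv", "app_version": "8.2", "country": "US", "error": False},
    {"device": "mobile", "app_version": "8.1", "country": "US", "error": False},
    {"device": "mobile", "app_version": "8.2", "country": "BR", "error": False},
    {"device": "mobile", "app_version": "8.2", "country": "US", "error": False},
]

by_device = error_rate_by(events, "device")
by_device_version = error_rate_by(events, "device", "app_version")
worst = max(by_device_version, key=by_device_version.get)

print(by_device)
print(worst, by_device_version[worst])
```

Grouping by device alone shows only that TVs have elevated errors; adding the app version tag isolates the one device and version combination responsible.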

Why It Matters

For Netflix, playback quality is everything. A single glitch can erode trust across millions of viewers. By scaling observability to trillions of rows and millions of events per second, Netflix detects problems in real time, protects viewing experiences worldwide, and continues innovating rapidly without sacrificing quality.

Netflix ensures streaming quality for hundreds of millions of viewers with real-time observability built on Apache Druid. That same Druid query engine powers Imply Lumi, the industry’s first Observability Warehouse—delivering proven speed, scalability, and efficiency for your mission-critical workloads.

See how Imply Lumi can power observability at scale.
