Overview
As applications have become more complex and mission-critical, user expectations have grown with them. Applications must scale seamlessly to meet demand, whether it’s a travel booking site preparing for the holiday rush or a streaming media provider delivering content to viewers during peak hours.
Traditional application architectures were monolithic: a single code base unifying business functions such as search, checkout, and customer information. While monoliths are easy to deploy and monitor, they are less scalable, less resilient, and harder to maintain.
As a solution, architects and developers created microservices architectures, composed of an interlinked web of independent components. An ecommerce storefront, for example, might use separate services for search, shipping, and checkout, all connected by streaming data and APIs. Each service can be scaled, deployed, updated, and maintained without affecting other parts of the application.
However, an application stack with more moving parts also has more potential points of failure. To prevent errors where possible, and to quickly resolve those that do arise, DevOps teams need to monitor their environment. While an off-the-shelf product may provide good enough visibility, organizations with more complex architectures (and more resources) may prefer to build their own observability platform, especially if they require more flexibility or deeper insight into metrics and other data.
In fact, DevOps teams are not the only ones who find value in observability. Sysadmins need to identify underperforming endpoints or components for replacement, maintenance, or upgrades, ideally before downstream systems and end users feel the effects. Analysts and customers need to dig into metrics to better understand key data, inform decision making, and assess the effectiveness of campaigns and initiatives. Security administrators need to monitor and act on large streams of information, reacting within seconds to fraud, intrusions, and other threats.
Requirements
A DevOps team racing to restore service has to rapidly investigate huge amounts of data, often without knowing exactly what it is looking for. This means the team cannot predefine aggregations or prepare a data schema in advance, steps that some platforms require to ensure fast queries. Instead, engineers need to explore their data interactively: drilling down, executing ad hoc queries, filtering by dimensions, and comparing real-time data with historical performance side by side.
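As an illustration, this kind of interactive exploration maps naturally onto a SQL interface like Druid’s. The sketch below is a minimal example, assuming a Druid router at localhost:8888 and a hypothetical service_metrics datasource with service and http_status dimensions; the schema and query are illustrative, not prescriptive.

```python
import json
import urllib.request

# Ad hoc drill-down: error counts per service over the last hour, filtered
# by dimension, with no predefined aggregation or schema preparation.
# Assumes a Druid router at localhost:8888 and a hypothetical
# "service_metrics" datasource with "service" and "http_status" columns.
QUERY = """
SELECT service, COUNT(*) AS errors
FROM service_metrics
WHERE http_status >= 500
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY service
ORDER BY errors DESC
LIMIT 10
"""

req = urllib.request.Request(
    "http://localhost:8888/druid/v2/sql",  # Druid's SQL endpoint
    data=json.dumps({"query": QUERY}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for row in json.load(resp):
        print(row["service"], row["errors"])
```

Because the question changes with each incident, the value lies in being able to rewrite the WHERE clause and GROUP BY dimensions on the fly rather than committing to aggregations up front.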
Speed is always important. A timely resolution (or, better yet, a preventive action) can mean the difference between meeting and missing service-level agreements (SLAs) for uptime, reliability, and other criteria. Delays in restoring service can also lead customers to simply switch products, such as choosing a rival ride-hailing service or another online retailer. Repeated performance issues also damage an organization’s reputation, putting shareholders and analysts on guard.
For end users, fast results are essential for efficient reporting and analysis. Gone are the days of waiting hours or days for graphics to render and presentations to come together; today, organizations that can quickly analyze and act on data have a competitive advantage. Rapid insight shortens decision cycles and allows organizations to pivot away from unsuccessful plans, or to double down on successful initiatives.
Cost at scale is another concern. Some monitoring platforms charge by host (usually a single server or virtual machine in your environment) and/or by host hours (generally, the number of hours each host is monitored over a billing period). For an organization running hundreds of virtual machines with terabytes of data in throughput, this adds up to many hosts and host hours, and correspondingly higher monitoring costs. In some cases, it may be less expensive to build an in-house observability solution on an analytics database designed to maximize resource efficiency.
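For a back-of-the-envelope sense of how host-based pricing scales, consider the sketch below. The per-host-hour rate is purely hypothetical; the point is that costs grow linearly with fleet size, regardless of how much value each query delivers.

```python
# Back-of-the-envelope model of host-based monitoring pricing.
# All figures are hypothetical and for illustration only.
hosts = 500                  # virtual machines being monitored
hours_per_month = 730        # average hours in a month
rate_per_host_hour = 0.03    # assumed vendor rate in dollars

monthly_cost = hosts * hours_per_month * rate_per_host_hour
print(f"${monthly_cost:,.0f} per month")              # $10,950 per month

# Doubling the fleet doubles the bill, independent of query value:
print(f"${2 * monthly_cost:,.0f} at 1,000 hosts")     # $21,900
```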
Further, not all technologies perform well under load; an influx of simultaneous queries or users can cause slowdowns. Faced with this kind of high concurrency, for instance, some analytical databases (such as cloud data warehouses) return results in minutes or hours rather than seconds.
Solution
Given these requirements, many teams use Apache® Druid as the foundation of their in-house observability platforms. Druid is optimized for large volumes of high-cardinality, high-dimensionality streaming data.
The key to Druid’s success lies in a crucial design principle: maximum results for minimum effort. Druid actively seeks out computational efficiencies, such as avoiding unnecessary loads of data from disk into memory, reading indexes instead of entire datasets, and operating directly on encoded data rather than decoding it first. As a result, Druid can support users and queries at massive scale; after all, each click on a dashboard or heatmap is a query on the backend.
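To illustrate the principle of operating directly on encoded data, here is a toy sketch of dictionary encoding, a technique that columnar stores such as Druid employ. This is a simplified model of the idea, not Druid’s actual implementation.

```python
# Toy illustration of dictionary encoding: a string column is stored as
# small integer codes plus a dictionary. A filter like service = 'checkout'
# becomes one integer comparison per row -- no string decoding required.
dictionary = {"checkout": 0, "search": 1, "shipping": 2}
encoded_column = [0, 1, 1, 2, 0, 0, 2, 1]   # one code per row

target = dictionary["checkout"]             # translate the predicate once
matching_rows = [i for i, code in enumerate(encoded_column) if code == target]
print(matching_rows)                        # [0, 4, 5]
```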
For increased efficiency, Druid also reimagines the relationship between storage and compute, tightly integrating the two to minimize operational overhead. It partitions data into segments (which are automatically rebalanced after failures or scaling events), uses indexes and dictionaries to identify only the segments a given query needs, and delegates the bulk of the computation to individual data servers.
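The following toy model illustrates the partition-and-prune idea: time-bounded segments spread across data servers, with a query touching only the segments whose intervals overlap its own. It is a simplified sketch of the general approach, not a representation of Druid’s internals.

```python
from dataclasses import dataclass

# Simplified model of segment pruning and scatter-gather querying.
@dataclass
class Segment:
    server: str
    start_hour: int   # inclusive
    end_hour: int     # exclusive
    rows: list

segments = [
    Segment("data-server-1", 0, 6, [("search", 120)]),
    Segment("data-server-2", 6, 12, [("checkout", 95)]),
    Segment("data-server-3", 12, 18, [("checkout", 210)]),
]

def query(segments, query_start, query_end):
    # Prune: skip any segment whose time range cannot match the query.
    relevant = [s for s in segments
                if s.start_hour < query_end and s.end_hour > query_start]
    # "Scatter": each data server scans only its own relevant segments;
    # the broker then "gathers" and merges the partial results.
    return [(s.server, s.rows) for s in relevant]

print(query(segments, 6, 18))   # touches data-server-2 and data-server-3 only
```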
In addition, Druid can simultaneously manage real-time and historical data without sacrificing speed or efficiency, ensuring that teams can access and compare key data in one place without switching between platforms and breaking their concentration. This ability not only enables use cases like anomaly detection, but also shortens mean time to resolution for teams.
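To make that side-by-side comparison concrete, the sketch below shows one way it might be expressed in Druid SQL, comparing the live 15-minute window against the same window one week earlier in a single statement. The service_metrics datasource and its columns are hypothetical carryovers from the earlier example, and the anomaly heuristic is deliberately naive.

```python
# One statement compares real-time and historical windows side by side;
# "service_metrics" and its columns are hypothetical.
COMPARISON_QUERY = """
SELECT
  COUNT(*) FILTER (
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '15' MINUTE
  ) AS live_errors,
  COUNT(*) FILTER (
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY - INTERVAL '15' MINUTE
      AND __time <  CURRENT_TIMESTAMP - INTERVAL '7' DAY
  ) AS week_ago_errors
FROM service_metrics
WHERE http_status >= 500
  AND __time >= CURRENT_TIMESTAMP - INTERVAL '8' DAY
"""

def is_anomalous(live: int, baseline: int, factor: float = 2.0) -> bool:
    # Naive heuristic: flag when live errors far exceed the historical baseline.
    return live > factor * max(baseline, 1)
```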
Customer Examples
Netflix is one of the leading content providers today, with over 200 million customers across 190 countries enjoying nearly 250 million daily hours of TV and movies. The Netflix app is installed on 300 million devices spanning four major UIs and hundreds of device types, complicating efforts to quantify performance in areas such as browsing or playback.
To monitor its customer experience at massive scale, Netflix chose Apache® Druid. “We’re currently ingesting at over 2 million events per second, and querying over 1.5 trillion rows to get detailed insights into how our users are experiencing the service,” Senior Software Engineer Ben Sykes explains. Altogether, Sykes estimates that Netflix generates approximately 175 billion events per day.
After data is ingested and tagged with anonymized details (such as device type, like smartphone or tablet), Druid aggregates the data and makes it immediately available for querying and visualization. “Druid can make optimizations in how it stores, distributes, and queries data such that we’re able to scale the datasource to trillions of rows and still achieve query response times in the 10s of milliseconds,” Sykes continues. This storage structure speeds up queries, anomaly detection, and resolution times, ensuring that teams can stay ahead of problems, or find and fix them quickly. Ultimately, Netflix strives to provide the best experience to customers regardless of device, traffic, or many other factors, a key business advantage in the fiercely competitive streaming media market.
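While Netflix’s pipeline is far more elaborate, the sketch below shows the general shape of streaming ingestion into Druid: a minimal Kafka supervisor spec, with hypothetical topic, datasource, and dimension names standing in for the real configuration.

```python
import json

# Minimal sketch of a Druid Kafka ingestion (supervisor) spec. The topic,
# datasource, and dimension names are hypothetical stand-ins.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "playback-events",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "inputFormat": {"type": "json"},
        },
        "dataSchema": {
            "dataSource": "playback_metrics",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {
                # Anonymized tags such as device type become queryable dimensions.
                "dimensions": ["device_type", "ui", "country"],
            },
            "granularitySpec": {"segmentGranularity": "hour"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submitting the spec to the Overlord starts continuous ingestion:
# POST http://<overlord>:8081/druid/indexer/v1/supervisor
print(json.dumps(supervisor_spec, indent=2))
```

Once the supervisor is running, newly arrived events are queryable within moments of ingestion, which is what makes the real-time dashboards and anomaly checks described above possible.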
For more information on how Netflix uses Druid, check out this blog post or download this ebook.
To learn more about Druid, read our architecture guide. For the easiest way to get started with real-time analytics, start a free trial of Polaris, the fully managed, Druid database-as-a-service by Imply.