As businesses move more of their processes online, operational visibility has become increasingly critical to success. As they digitize their operations, however, organizations across sectors like manufacturing, transportation, and retail have to contend with a torrent of data from IoT devices, point-of-sale systems, and other sources.
This data usually takes the form of events such as clicks, telemetry, logs, and metrics, collected as time series or machine data. In contrast to transactional data, traditionally the most common form of data in applications, time series data often arrives as high-volume, real-time streams. It also needs to be analyzed quickly, because its value decays as it ages.
Streaming data has revolutionized several sectors. For instance, SRE teams now have more data and insight into how, where, and why issues arise—leading to more efficient operations on the back end and a better user experience on the front end.
Teams in many industries can also automate more of their workflows, setting triggers in response to certain events, such as selling stocks if prices rise above or fall below a specific threshold, or replenishing inventory if product stock drops past a certain limit.
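The trigger logic described above can be sketched in a few lines of Python. This is a hypothetical illustration, not any particular product's API; the names (`Event`, `check_triggers`) and the threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Event:
    metric: str   # e.g. "stock_price" or "inventory_level"
    value: float

def check_triggers(event, rules):
    """Return the actions fired by an incoming event.

    `rules` is a list of (metric, low, high, action) tuples: the action
    fires when the event's value falls below `low` or rises above `high`.
    """
    fired = []
    for metric, low, high, action in rules:
        if event.metric == metric and (event.value < low or event.value > high):
            fired.append(action)
    return fired

rules = [
    ("stock_price", 90.0, 110.0, "sell"),                 # sell outside the price band
    ("inventory_level", 20.0, float("inf"), "reorder"),   # replenish when stock runs low
]

print(check_triggers(Event("stock_price", 115.0), rules))    # ['sell']
print(check_triggers(Event("inventory_level", 12.0), rules)) # ['reorder']
```

In a real streaming deployment, a function like this would run against each event as it arrives from the pipeline, rather than against a batch at the end of the day.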
Lastly, executives and analysts can now easily discover what does (and doesn’t) work, from new product launches to feature updates. By capitalizing on real-time data, their organizations can quickly decide, act, and stay ahead of the competition.
To make the most of event data, engineering teams need the right database platform. Although this form of data has been around for almost a decade, even today few databases are optimized for collecting, storing, querying, and analyzing streaming data at speed and scale.
The first requirement is real-time pipelines. Traditionally, data pipelines used the batch model, where data was fed in through an intake, cleaned up, organized, and stored in files before being ingested into an analytics database for analysis and querying. This batch model is unsuitable for the speed and immediacy of real-time data: by the time the data is available for analysis, it may already be stale, and important, time-sensitive insights may be lost.
The solution is streaming pipelines, typically built on Apache Kafka or Amazon Kinesis. Both technologies are designed for high throughput and low latency, smoothly handling up to millions of events per second.
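The difference between the two models can be shown with a toy sketch in pure Python (no Kafka required). The function names and event shapes are illustrative only: the batch pipeline produces a result only after the whole batch is stored, while the streaming pipeline surfaces a result after every event.

```python
def batch_pipeline(events):
    # Batch model: collect everything first, then analyze in one pass.
    stored = list(events)                    # intake -> clean -> store
    return sum(e["value"] for e in stored)   # analysis happens only at the end

def streaming_pipeline(events, on_event):
    # Streaming model: each event is processed the moment it arrives.
    total = 0
    for e in events:
        total += e["value"]
        on_event(total)                      # downstream systems see results immediately
    return total

events = [{"value": v} for v in (3, 4, 5)]
running = []
streaming_pipeline(events, running.append)
print(running)                 # [3, 7, 12] -- insight after every event
print(batch_pipeline(events))  # 12 -- but only once the whole batch is done
```

Real streaming systems add durability, partitioning, and delivery guarantees on top of this idea, but the core shift is the same: analysis moves from "after the batch" to "per event".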
Size and concurrency are another consideration. By nature, streaming data often comes in large quantities, anywhere from thousands to millions of events per second, equating to trillions (or more) of events per week or month. In addition, analytics applications must handle many queries and users at once—a crisis situation may require multiple teams to coordinate, or an organization may have to provide data to customers on demand.
Teams may also need to compare historical and recent data side by side in order to identify anomalies. Few databases offer this feature, however, forcing teams to switch between a transactional database and a cloud data warehouse—an unwieldy process that risks delaying results when an organization can least afford it.
In fact, transactional (OLTP) and analytical (OLAP) databases remain the most common paradigms today. Teams use OLTP databases for fast access and strong performance under load (high rates of queries and users), or OLAP software for multidimensional analysis. However, OLTP products cannot perform analytical operations or scale massively, while OLAP databases cannot operate at real-time speed and are limited to a handful of users or queries per second.
Any database should also be future-proof. Migrations are among the most difficult and time-consuming projects for data teams, and they usually stem from a mismatch between a data architecture and an application’s evolving requirements. Perhaps the team chose a less scalable solution without anticipating massive growth, or a less flexible platform without accounting for the need to take systems offline for scaling, rebalancing traffic, or maintenance. Or perhaps the team needs to add intensive, interactive capabilities (such as analytics) to their application, but because their database was never designed to accommodate these features, the team now encounters query latency and other performance issues.
As a result, there is a need for a database that can ingest, process, and analyze large amounts of real-time data while accommodating many queries and users.
Apache® Druid is that database. Built for speed, scale, and streaming data, Druid ingests events and makes them immediately available for queries, analysis, and other operations, with no need to batch or prepare data beforehand. Druid can also power a wide range of features, including rich visualizations with drill-down capabilities, real-time analytics applications, and much more.
Druid possesses several key abilities that make it ideal for streaming event data. The first is query on arrival. Many databases require events to be ingested in batches and persisted before they can be queried. Druid, however, ingests streaming data event by event, directly into memory on data nodes, where it can be queried immediately. The data is later processed and persisted to deep storage for durability.
Another important characteristic is guaranteed consistency. Duplicate records pose a huge problem, especially in a streaming context where databases ingest millions or billions of events per day. Thanks to its design and native Kafka support, Druid ingests events exactly once, ensuring that developers no longer need to worry about creating complex workarounds to identify and remove duplicate data.
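The idea behind this guarantee can be illustrated with a conceptual sketch. This is not Druid's actual implementation (Druid commits Kafka partition offsets atomically together with the segments they produced); the class and method names here are invented for illustration. The key point is the same: because the store remembers how far into the stream it has already persisted, a redelivered event is recognized and skipped rather than counted twice.

```python
class ExactlyOnceStore:
    """Toy model of offset-based exactly-once ingestion."""

    def __init__(self):
        self.rows = []
        self.committed_offset = -1  # highest stream offset already persisted

    def ingest(self, offset, event):
        if offset <= self.committed_offset:
            return False            # replayed event: skip, no duplicate row
        self.rows.append(event)
        self.committed_offset = offset  # row and offset recorded together
        return True

store = ExactlyOnceStore()
store.ingest(0, "a")
store.ingest(1, "b")
store.ingest(1, "b")   # broker redelivers offset 1 after a restart
print(store.rows)      # ['a', 'b'] -- the duplicate was ignored
```

The atomicity matters: if the row and the offset were persisted separately, a crash between the two writes could still produce a duplicate or a gap on replay.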
Druid can also scale to keep up with the volume and velocity of streaming data. Druid splits up functions (such as querying and data management) into different nodes and automatically rebalances data as nodes are scaled up or down. This simplifies upgrades and scaling, ensures reliability and zero downtime, and provides the confidence to handle huge quantities of data.
Lastly, Druid provides built-in data ingestion for Apache Kafka, removing the need for connectors or other workarounds for streaming data intake. Druid is also compatible with Kafka-based technologies, like Confluent Cloud. To learn more about how Druid and Kafka go together, check out this blog post.
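Connecting Druid to Kafka amounts to submitting an ingestion supervisor spec. The sketch below follows the documented shape of that spec, but the topic, datasource, column names, and broker address are all placeholders, and real specs typically carry more tuning options.

```python
import json

# Hedged sketch of a Druid Kafka ingestion supervisor spec. A client would
# POST this JSON to the Overlord at /druid/indexer/v1/supervisor.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "clickstream",                        # assumed topic name
            "consumerProperties": {
                "bootstrap.servers": "kafka-broker:9092",  # placeholder address
            },
            "inputFormat": {"type": "json"},
        },
        "dataSchema": {
            "dataSource": "clickstream",                   # assumed datasource
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "page"]},
            "granularitySpec": {"segmentGranularity": "hour"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

print(json.dumps(supervisor_spec, indent=2))
```

Once the supervisor is running, Druid manages the Kafka consumers, offsets, and handoff to deep storage itself; no external connector process is involved.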
Customer example: Confluent Cloud
Confluent provides managed solutions based on Apache Kafka, the leading open source streaming data technology. Confluent uses Druid as a foundation for many of their key services that provide operational visibility, including Confluent Health+, a notification and alerting product, Confluent Cloud Metrics API for providing metrics to customers and external users, and Confluent Stream Lineage, a graphical UI that enables users to explore event streams and data relationships.
Prior to Druid, Confluent used a NoSQL database to store and query data. As data volumes grew, however, the Confluent team found that this option could not accommodate large amounts of high-cardinality metrics: it had no native support for time series data, could not return subsecond queries under load, and did not scale to keep up with growing volumes of data and users. Druid addressed these gaps and also provided a native Kafka integration, which removed the need for an external connector or additional workarounds.
Today, Confluent runs over 100 historical nodes (for older, batch data) and around 75 MiddleManager nodes (for real-time ingest and querying) deployed on managed Elastic Kubernetes Service (EKS) clusters. Using this setup, Confluent’s environment ingests over three million events per second and responds to over 250 queries per second. Confluent also keeps seven days of queryable data in their historical nodes and two years of data retention in S3 deep storage.
To learn more about Druid, read our architecture guide.
For the easiest way to get started with real-time analytics, start a free trial of Polaris, the fully managed Druid database-as-a-service from Imply.