Part 3: The technology behind operational analytics

by Fangjin Yang · August 16, 2018

This is the third blog post in our series describing operational analytics. The first post covered the need for operational analytics, the second post covered how operational analytics is used in practice, and this post will cover the technology behind a good operational analytics solution.

A good operational analytics solution must solve two primary technical challenges. The first is handling high-volume, complex, event-driven data. The second is rapidly explaining trends and patterns in that data. Solving both challenges enables users to quickly take correct actions based on data insights.

Sources of data

Operational data can come from many sources: it can be generated by users interacting with digital products (clickstreams), created as the digital products emit KPIs (APM metrics), or produced by the underlying infrastructure powering the product (server metrics/logs, netflows, etc). The data can be extremely high volume (millions of events per second), very complex (many dimensions, high cardinality, no fixed schema), and continuously produced (streaming data). In modern data architectures, raw data generated by users, applications, or machines is stored on either a message bus or a file system after it is created. File systems, often called data lakes, are built on technologies such as HDFS, Amazon S3, Azure Blob Storage, Google Cloud Storage, and many others, and are used to store static files. Message buses such as Apache Kafka, AWS Kinesis, and others are used to store and transport streams of events. From these storage systems, raw data is either processed (cleaned or transformed as part of an ETL process) or sent to downstream systems for further analysis.

Diagram of different products feeding information into Druid
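For example, a clickstream event might be published to a Kafka topic the moment it is generated. The sketch below illustrates the idea in Python using the kafka-python client; the broker address, topic name, and event fields are hypothetical.

```python
# Minimal sketch: publishing a clickstream event to a Kafka topic.
# Assumes the kafka-python client; the topic name and event fields are hypothetical.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "timestamp": int(time.time() * 1000),  # event time in milliseconds
    "user_id": "u-12345",
    "page": "/checkout",
    "action": "click",
    "country": "US",
}

# Each interaction becomes one event on the "clickstream" topic,
# where downstream systems (including Druid) can consume it.
producer.send("clickstream", event)
producer.flush()
```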

A good operational analytics solution should be able to integrate directly with either message buses or file systems to load raw and processed data. The system also needs to handle evolving schemas, nested data, and other semi-structured formats of operational data. Furthermore, many types of operational data, such as netflows or server metrics, can be generated at an extremely high rate. This data can't be sampled, as dropped events might mean missed anomalies or other important details. Thus, a good operational analytics solution needs to scale horizontally so it can ingest real-time data at any volume.
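As an illustration of such an integration, the sketch below submits a Kafka ingestion supervisor spec to Druid's indexing service over HTTP. The data source, topic, dimensions, and host are assumptions, and the exact spec layout can vary between Druid versions.

```python
# Minimal sketch: registering a Kafka ingestion supervisor with Druid.
# Host, data source, topic, and dimension names are hypothetical;
# spec details may differ between Druid versions.
import json
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "timestamp", "format": "millis"},
            "dimensionsSpec": {
                # Listing dimensions explicitly; schemaless discovery is also possible.
                "dimensions": ["user_id", "page", "action", "country"]
            },
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "none"},
        },
        "ioConfig": {
            "topic": "clickstream",
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# The Overlord exposes a supervisor endpoint for streaming ingestion.
resp = requests.post(
    "http://localhost:8090/druid/indexer/v1/supervisor",
    data=json.dumps(supervisor_spec),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print(resp.json())
```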

Explaining patterns

The second component important for operational analytics is explaining data patterns. Often, data patterns cannot be explained with a single query. To root-cause an issue, or to broadly understand a trend, multiple queries are required. Each query progressively narrows the amount of data being examined, until a meaningful insight is found. A good operational analytics system should support this workflow by enabling iterative slice-and-dice queries. Furthermore, each query should complete in less than a second. If each query took minutes to complete, operators would constantly lose focus and would not be able to chain queries together to build an understanding of the data.
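The sketch below illustrates what this iterative workflow might look like against Druid's SQL endpoint: each query filters further based on what the previous one revealed. The table, columns, timestamps, and host are hypothetical.

```python
# Minimal sketch of an iterative drill-down against Druid's SQL endpoint.
# Table, column names, and timestamps are hypothetical; each query narrows the previous one.
import requests

DRUID_SQL = "http://localhost:8888/druid/v2/sql"  # router or broker address is assumed

def run(query):
    resp = requests.post(DRUID_SQL, json={"query": query})
    resp.raise_for_status()
    return resp.json()

# 1. Broad view: error counts per hour over the last day.
hourly = run("""
  SELECT TIME_FLOOR(__time, 'PT1H') AS hour, COUNT(*) AS errors
  FROM server_logs
  WHERE status >= 500 AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
  GROUP BY 1 ORDER BY 1
""")

# 2. Narrow to the anomalous hour: which services are failing?
by_service = run("""
  SELECT service, COUNT(*) AS errors
  FROM server_logs
  WHERE status >= 500 AND TIME_FLOOR(__time, 'PT1H') = TIMESTAMP '2018-08-16 14:00:00'
  GROUP BY 1 ORDER BY errors DESC LIMIT 10
""")

# 3. Narrow again: which hosts within the worst service?
by_host = run("""
  SELECT host, COUNT(*) AS errors
  FROM server_logs
  WHERE status >= 500 AND service = 'checkout'
    AND TIME_FLOOR(__time, 'PT1H') = TIMESTAMP '2018-08-16 14:00:00'
  GROUP BY 1 ORDER BY errors DESC LIMIT 10
""")
```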

Druid

Druid is an open source solution developed by Imply from the ground up for operational analytics. Druid has a hybrid architecture that combines the best of search platforms, time series databases, and OLAP systems. Like many search platforms, Druid can ingest structured and semi-structured data, and it can create search indexes to quickly find relevant subsections of data. Like time series databases, Druid is highly optimized for timestamped data, with a range of time-based functions and queries. Like many OLAP systems, Druid stores data in a columnar orientation to enable rapid numeric aggregation, and it can support complex, multi-dimensional groupings of data.

Diagram of search platform, OLAP, and time series database capabilities combining into Druid

By combining log search and OLAP capabilities, Druid not only supports complex analytic and search queries, but also leverages both architectures to ensure all queries complete in less than a second. Druid provides both the ingestion and query capabilities required for operational analytics. To learn more about Druid, and for a more comprehensive dive into Druid's architecture and storage format, please read this white paper.
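As a rough illustration of how these capabilities come together in a single request, the sketch below posts a native topN query to a Druid broker that combines a search-style filter, hourly time bucketing, and a columnar count aggregation. The data source, dimension names, and host are assumptions.

```python
# Minimal sketch of a native Druid topN query that combines a search-style
# filter, time bucketing, and columnar aggregation. Data source, dimension,
# and host names are hypothetical.
import requests

query = {
    "queryType": "topN",
    "dataSource": "clickstream",
    "intervals": ["2018-08-15/2018-08-16"],
    "granularity": "hour",
    # Search-style filter: match any page whose value contains "checkout".
    "filter": {"type": "search", "dimension": "page",
               "query": {"type": "contains", "value": "checkout"}},
    "dimension": "country",
    "metric": "events",
    "threshold": 10,
    # Columnar aggregation: count matching events per country, per hour.
    "aggregations": [{"type": "count", "name": "events"}],
}

resp = requests.post("http://localhost:8082/druid/v2/", json=query)
resp.raise_for_status()
print(resp.json())
```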

Visualizing and interacting with data

Druid is a powerful engine for operational analytics, but it isn't a complete solution on its own. Operational analytics in practice is useful to many teams in an organization. User engagement data is useful to lines of business and product managers. Application data is interesting to developers and devops engineers. Machine data is useful to various operators and support staff. These users are not always technical, but can still benefit from the insights in this data. Many traditional BI tools focus on presenting static reports and visualizations to these same users, but they do not cater to the operational analytics workflow. An ideal application should make the rapid, iterative workflow behind operational analytics easy.

At Imply, we've created an application layer specialized for operations such as root cause analysis, sensitivity analysis, top K/heavy-hitter queries, behavioral analysis, and much more through a simple point-and-click interface. Our application provides self-service data access and brings operational analytics to all levels of an organization.

