The analytics space is filled with many technologies, each focused on solving a different piece of the myriad problems that come with working with data at scale. While many products tackle data transport, storage, compute, or visualization, few focus on explaining why trends or anomalies occur.
Imagine you are monitoring network traffic in a major enterprise and you suddenly detect a massive spike in network latency. The most immediate question is “why?” This anomaly may have many root causes. Perhaps one of your network devices is down. Perhaps the spike is caused by heavy usage from a single user. Perhaps the spike occurred right after you rolled out a new version of a network application. You know you must act because your operations are impacted, but each of these situations demands a very different response. The source of the problem isn’t always obvious, and finding it often requires asking many questions of the data. This is the crux of one of the most difficult problems in the data and analytics space: quickly understanding why something is happening, and taking the right action based on that insight.
Challenges of Working with Data at Scale
Simply asking questions of high-scale data presents difficult challenges. In the example of network flows, there are numerous metrics you have to track just to detect that there’s a problem. These metrics can be simple, such as packet count, bytes per second, or number of flows, or complex, such as quantiles on flow performance or unique active users per network device. Network data also includes many attributes that matter in any analysis, such as device location, port, protocol, source IP, and destination IP. With so many attributes and metrics, it can be extremely difficult to determine which combination of them is actually contributing to a spike. Furthermore, the data may have no defined schema, may be very complex (high dimensionality and cardinality), and may be continuously produced at a rapid rate.
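To make the distinction between simple and complex metrics concrete, here is a minimal sketch in Python. The flow records, field names, and values are hypothetical, purely for illustration; real flow data would arrive as a continuous stream with far more attributes.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical flow records; fields and values are illustrative only.
flows = [
    {"device": "sw-1", "app": "video", "user": "alice", "bytes": 9_000_000, "latency_ms": 120},
    {"device": "sw-1", "app": "email", "user": "bob",   "bytes": 4_000,     "latency_ms": 35},
    {"device": "sw-2", "app": "video", "user": "alice", "bytes": 7_500_000, "latency_ms": 140},
    {"device": "sw-2", "app": "email", "user": "carol", "bytes": 6_000,     "latency_ms": 30},
]

# Simple metric: total bytes per app (a plain sum over one attribute).
bytes_per_app = defaultdict(int)
for f in flows:
    bytes_per_app[f["app"]] += f["bytes"]

# Complex metric: unique active users per device (requires tracking
# distinct values, which is much harder at high cardinality).
users_per_device = defaultdict(set)
for f in flows:
    users_per_device[f["device"]].add(f["user"])
unique_users = {d: len(u) for d, u in users_per_device.items()}

# Complex metric: quartiles of flow latency.
latency_quartiles = quantiles([f["latency_ms"] for f in flows], n=4)
```

Note that the simple metric is trivially mergeable across machines, while distinct counts and quantiles are not, which is one reason these metrics become hard at scale.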
Explain and Understand Patterns
An ideal operational analytics solution should not only handle these data challenges, but also quickly explain why any trend is occurring in the data. This requires the solution to answer questions in rapid succession. For example, to diagnose a network issue, you may first look at how different apps on the network contribute to the traffic. If you find that the spike is caused by a certain app, you’ll need to drill into that app to examine its usage. Perhaps you’ll find that the app has many concurrent users, and one particular user is generating substantially more traffic than the others. Examining what that user is doing leads you to the root cause: they are streaming a large amount of multimedia content. An ideal operational analytics solution should make this workflow easy.
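The drill-down above can be sketched as a pair of successive aggregations. This is a simplified illustration with hypothetical records and a made-up helper, `top_contributor`, not any particular product’s API:

```python
from collections import defaultdict

# Hypothetical flow records; values are illustrative only.
flows = [
    {"app": "video", "user": "alice", "bytes": 9_000_000},
    {"app": "video", "user": "bob",   "bytes": 1_200_000},
    {"app": "email", "user": "carol", "bytes": 50_000},
    {"app": "video", "user": "alice", "bytes": 8_000_000},
]

def top_contributor(records, key, metric="bytes"):
    """Aggregate `metric` by `key` and return the heaviest group."""
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += r[metric]
    return max(totals, key=totals.get)

# Step 1: which app dominates overall traffic?
top_app = top_contributor(flows, "app")

# Step 2: drill into that app -- which user dominates it?
app_flows = [f for f in flows if f["app"] == top_app]
top_user = top_contributor(app_flows, "user")
```

Each question narrows the data before asking the next, which is why the solution must answer every intermediate query fast enough to keep the investigation interactive.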
Operational Analytics is a Different Workflow
Operational analytics is a workflow that is distinct from business intelligence or log search, two other approaches commonly applied to operational data. Operational analytics focuses on rapidly answering questions about data to explain why certain patterns exist. Business intelligence workflows focus on showing what has happened through static reports. Log search workflows focus on quickly finding data points that match a particular pattern.
The ability to explain data patterns has many applications beyond root cause analysis of network flows. In the next part of the series, we will cover the main use cases of operational analytics in practice.