Druid: Reflections at a Milestone

by Gian Merlino · December 10, 2019


Today, Imply announced a new round of growth funding raised from Andreessen Horowitz, Khosla Ventures and Geodesic Capital. First I want to thank these firms for their support and for sharing our vision as we continue to grow. Thank you, too, to all of our customers for joining us on this journey.

A milestone like this is a good time to reflect on the past and the future. Imply is a company built around Apache Druid, a project which is currently incubating at the Apache Software Foundation and was created to make true real-time analytics possible. It was originally developed at a startup called Metamarkets to power an all-in-one analytics solution for programmatic digital advertising. Ad-tech is an interesting market because even small players generate mountains of data, and large players generate almost unimaginable amounts, to the tune of hundreds of billions or even trillions of new records per day. Growing up in this world means that Druid can offer world-class efficiency and is battle-tested at scale.

Since then, Druid has expanded to a broad set of use cases that aren’t adequately addressed by classic analytics stacks. Application areas that fall far outside its original starting point include network flow analytics, product analytics, user behavior, metrics, APM, security, finance, and plenty more. It is used and trusted by major companies both inside and outside tech, such as NTT, WalkMe, Pinterest, Netflix, Airbnb, Lyft, and Walmart.

Real-time analytics

Real-time analytics is what happens when you reimagine analytics for a stream-based, real-time world. There are two sides to this coin: ingest and query. They can stand alone: real-time ingestion by itself helps simplify data pipelines, while real-time querying by itself powers interactive, exploratory workflows. But they are much more powerful together: the combination lets you go beyond simple reporting to ad-hoc data exploration and monitoring.

Let’s unpack these for a minute. By “monitoring”, I mean keeping an eye on real-time data. Monitoring workflows tend to start with a dashboard designed to give a bird’s-eye view and quickly surface issues. They crop up in a wide variety of areas: real-time dashboards are used to monitor supply chains, software rollouts, application performance, and revenue and spending.

By “data exploration”, I mean a workflow where you don’t know exactly what you’re looking for ahead of time. This is useful for diagnosis and optimization. In diagnosis, you’ve identified an issue, the “what” (perhaps revenue is down or error rates are elevated), but you aren’t sure about the root cause. In optimization, you want to improve a metric, but you aren’t sure how (perhaps you want to find opportunities to grow revenue or improve user engagement). In both cases, you need your analytical system to confirm or rule out your guesses, and you need it to show you additional possibilities by examining the problem from as many angles as possible. Exploratory workflows are iterative and require repeated cycling through question and answer, so it’s critical that answers come as quickly as possible. A 20-second delay slows you down considerably, and a delay of minutes leaves you at a huge disadvantage.

Druid is most powerful when you use it to blend monitoring and exploration. Monitoring tells you that you have a problem or an opportunity, and exploration gets you to “why” so you can fix your problems and exploit your opportunities. By ratcheting the speed of each of these as far as they can go — sub-second ingest and query latencies — you can take maximum advantage of your data.

We aren’t the first to realize the importance of these blended real-time workflows. In fact, Druid has some similarities to an internal product at Facebook called Scuba. They aren’t identical technically; unlike the version of Scuba discussed in Facebook’s paper, Druid is column-oriented rather than row-oriented, which is well known to improve compression and performance. But in terms of architecture and use cases there is a lot of overlap. The Scuba paper discusses key use cases in studying user behavior, supporting performance analysis, mining for patterns, and looking at trends — all instances where real-time analytics is important. Druid is commonly used for all of these things (and more).

Unlike Scuba, Druid is also a good fit for historical data, thanks to its tiered storage architecture and column orientation. Bringing in historical data is important since “reporting” is really just the historical version of monitoring, and this completes the trio of analytical use cases: exploration, monitoring, and reporting. Druid treats real-time and historical data similarly and can handle both. The main difference under the hood is that Druid offers tiered storage and time-based rollup features to enable storing large amounts of historical data at a substantially lower cost.
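To make the idea of time-based rollup concrete, here is a minimal sketch in plain Python. It is not Druid's implementation; the event fields (`country`, `revenue`), the hourly granularity, and the `rollup` helper are assumptions chosen purely for illustration. The point is only that raw events sharing a time bucket and dimension values collapse into one pre-aggregated row, which is where the storage savings come from.

```python
from collections import defaultdict
from datetime import datetime, timezone


def rollup(events, granularity_seconds=3600):
    """Pre-aggregate raw events into one row per (time bucket, country).

    Illustrative only: a hypothetical helper, not part of Druid.
    """
    totals = defaultdict(lambda: {"count": 0, "revenue": 0.0})
    for ts, country, revenue in events:
        epoch = int(ts.timestamp())
        # Floor the timestamp to the start of its bucket (hourly by default).
        bucket = epoch - epoch % granularity_seconds
        row = totals[(bucket, country)]
        row["count"] += 1
        row["revenue"] += revenue
    return dict(totals)


def utc(hour, minute):
    """Build a UTC timestamp on an arbitrary example day."""
    return datetime(2019, 12, 10, hour, minute, tzinfo=timezone.utc)


# Four raw events (made-up sample data).
raw_events = [
    (utc(9, 5), "US", 1.50),
    (utc(9, 40), "US", 2.25),
    (utc(9, 59), "DE", 0.75),
    (utc(10, 1), "US", 3.00),
]

rolled = rollup(raw_events)
```

In this sample the two US events in the 09:00 hour merge into a single row, so four raw rows become three stored rows. With high-volume data and coarser granularity for older time ranges, the reduction can be far larger, at the cost of no longer being able to query individual events.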

Imply Pivot

Imply Pivot, if you haven’t seen it before, is an analytics application built by Imply. It’s an application for exploration, monitoring, and reporting, in roughly that order. We built Pivot because we saw that the world had plenty of reporting apps — business intelligence tools do a pretty good job of this — but was lacking an application that supported fluid exploration and real-time monitoring in a scalable way. Pivot’s core experience is designed to power these sorts of workflows.

Pivot has some surface similarity to other analytics apps you may have used: it lets you filter, group, and visualize your data based on attributes you define. But the similarities end there. Pivot is designed to run on top of a real-time data engine like Druid and offers the richest experience possible. Every visualization is interactive and everything you see in the viewport is capable of drag-and-drop interaction. Everything you do in Pivot instantly leads to a fresh query. And yet, thanks to Druid, Pivot can scale to incredible amounts of data. The largest Pivot installations out there are running on top of thousand-server-plus Druid clusters, and still provide sub-second average response times even with thousands of users on the system. This is an extraordinary technical challenge, and in a very real sense, the purpose of Druid is to make applications like Pivot work.

Of course, even though real-time exploration and monitoring workflows are incredibly powerful, classic reporting is still quite useful. To complete the picture, we built reporting into Pivot too. It has a dashboard feature that can be used for either monitoring or reporting. It lets you query a blend of real-time and historical data. It lets you write SQL and send it directly to Druid. It offers the ability to download the results of any query.

The best analytical systems get you to the moment of insight as fast as possible. Together, the combination of Druid and Pivot — engine and application — will get you there.

The need for new tools

Why not use data warehouses and business intelligence tools to build these real-time experiences? This classic toolset is ubiquitous and seems to have a lot in common with Druid-based stacks: you can use data warehouses to run analytical queries and you can use BI apps to create visualizations.

The simple answer is that the classic toolset was not designed with interactive exploration and monitoring in mind. It was built for a world where datasets are loaded overnight, where most queries are written by analysts who can afford to wait five minutes for a response, and where the goal of most users of data is to prepare a report for someone else.

Many data warehouses on the market struggle with real-time ingest, or query, or both. This is especially true with larger amounts of data and larger numbers of concurrent users. Time after time, our customers tell us that Druid lets them run their existing apps better and more smoothly, and gives them the power to dream up new apps they never thought possible.

That being said, the future of Druid and of analytics in general is quickly becoming a much more interesting place. The lines between Druid and data warehouses are already starting to blur. Just like data warehouses, Druid speaks SQL. With the addition of a JOIN feature to Druid, many classic data warehousing use cases will become possible in Druid. Official support for Druid from popular applications like Looker and Apache Superset means that traditional business intelligence workflows can work on Druid, too.
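As a rough illustration of what "speaking SQL" looks like in practice, the sketch below builds the JSON body for a Druid SQL request. The datasource name `page_views` is hypothetical, and the query is just one plausible example (hourly event counts over the last day). Druid's SQL API accepts a JSON object with a `query` field sent via HTTP POST to the `/druid/v2/sql` endpoint; the sketch stops short of making the network call so that it runs without a cluster.

```python
import json

DATASOURCE = "page_views"  # hypothetical datasource name, for illustration

# Hourly event counts over the last day. __time is Druid's timestamp column
# and TIME_FLOOR is a Druid SQL function that buckets it by an ISO 8601 period.
sql = (
    "SELECT TIME_FLOOR(__time, 'PT1H') AS hour_start, COUNT(*) AS event_count "
    f'FROM "{DATASOURCE}" '
    "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY "
    "GROUP BY 1 ORDER BY 1"
)

# JSON body to POST to /druid/v2/sql with Content-Type: application/json.
request_body = json.dumps({"query": sql})
```

The same SQL text could equally be sent through a JDBC driver or typed into a SQL-capable tool such as the applications mentioned above.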

Next time you find yourself needing a system capable of more than the classic analytics stack, I encourage you to check out the Imply distribution, which includes both Druid and Pivot, and is available at https://imply.io/get-started either for download or as a cloud-hosted service on AWS (other public clouds coming soon).
