How WalkMe uses Druid and Imply Cloud to Analyze Clickstreams and User Behavior

by Yotam Spenser · April 3, 2019

This is a guest post from Yotam Spenser, Head of Data Engineering @ WalkMe

WalkMe is a Digital Adoption Platform (DAP) pioneer that offers a 360-degree solution to leading organizations worldwide. WalkMe helps employees and customers at some of the world’s largest companies engage with and adopt digital products, and ensures organizations of all sizes can undergo smooth digital transformations.

WalkMe works as an embedded solution that integrates into any web, mobile, or desktop host application and works seamlessly from within to engage, support, or convert users based on their behavior (combined with predetermined rules).

A key element of any intelligent embedded application is that it must monitor its own effectiveness (ideally, as discreetly as possible). This means any tracking should remain effectively invisible, even as the embedded application undergoes rapid evolution.

Figure: A WalkMe step-by-step Walk-Thru in action.

Migrating from Elasticsearch to Druid

The legacy analytics system that was originally used to track core product usage was Elasticsearch (ES), which we initially leveraged as a simple log management system. We primarily used this system for one purpose: to troubleshoot specific problems in our embedded application (as opposed to monitoring broad trends across host applications).

While ES functioned well as a simple query system to troubleshoot errors in the embedded application, our requirements grew more sophisticated as our product matured. Over time, we started monitoring various aspects of the application’s performance and usage (e.g., length of operations on the client, accuracy of operations, etc.), and as the data grew in both volume and complexity, we realized an ES-based stack wasn’t optimal for real-time arithmetic operations over time-series data. Simply put, our queries were primarily ad-hoc analytic queries that grouped and filtered on several dimensions and aggregated several complex metrics. As our queries evolved from simple troubleshooting queries to ones that measured complex engagement stats, they became less and less well-suited to the search-focused architecture of ES.

Once we realized the legacy architecture was not well suited to behavioural analytics and would not scale with our growth, we began searching for an alternative that would let us move from our classic log-search approach to a real-time OLAP one that scales linearly with our traffic. Druid met these criteria, so we started building on the open source project, testing specific use cases as well as load management in a cluster.

Productionalizing Druid

Our early evaluation of Druid quickly showed that complex distributed systems such as Druid require significant DevOps effort and fine-tuning to achieve optimal performance and resiliency. Early on, we partnered with Imply, whose founders developed Druid, which allowed us to focus on our business objectives, including monitoring and creating analytics applications. Today, we run our entire Druid cluster on Imply.

architecture diagram

High performance user analytics with Druid

Druid is now a big part of WalkMe’s internal and external analytics applications. Druid enables us to monitor performance across billions of client devices in real time. We can use Druid to compute arbitrary metrics over ad-hoc groups of users. We can track business-critical measures such as retention and attrition, plus many other engagement and usage metrics. As a result, we can now gain the insight we need to optimize and segment our code for different host platforms, applications, and websites, according to their specific needs.

Run Powerful User Behavior Queries Interactively

With Druid, we can run extremely powerful queries to analyze the behavior of our users, queries that would otherwise be impossible for us to run. For example, consider the following query:

SELECT ... FROM tbl WHERE userId IN (SELECT userId FROM tbl WHERE ...)

We issue many queries similar to this one where we first select a large group of users based on an ad-hoc set of criteria. We then further select a subset of these users and compute a wide range of metrics. For many traditional data warehouses, this can be a very expensive query. The first group of users may be in the millions, and to materialize and maintain such a list would be very expensive if we wanted the query to return interactively.
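To make the shape of this query concrete, here is a minimal Python sketch of the exact (fully materialized) approach described above. It is an illustration only, not how WalkMe or Druid executes the query: the sample rows, the field names (`platform`, `action`, `duration_ms`), and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical clickstream rows; every field name here is illustrative only.
events = [
    {"userId": "u1", "platform": "web", "action": "walkthru_start", "duration_ms": 120},
    {"userId": "u1", "platform": "web", "action": "walkthru_complete", "duration_ms": 340},
    {"userId": "u2", "platform": "mobile", "action": "walkthru_start", "duration_ms": 200},
    {"userId": "u3", "platform": "web", "action": "walkthru_start", "duration_ms": 90},
    {"userId": "u3", "platform": "web", "action": "walkthru_complete", "duration_ms": 410},
]


def users_matching(rows, predicate):
    """Inner query: materialize the set of userIds with a row satisfying predicate."""
    return {row["userId"] for row in rows if predicate(row)}


def average_duration_by_action(rows, user_ids):
    """Outer query: aggregate only the rows whose userId is in the materialized set."""
    totals = defaultdict(lambda: [0, 0])  # action -> [sum of durations, row count]
    for row in rows:
        if row["userId"] in user_ids:
            totals[row["action"]][0] += row["duration_ms"]
            totals[row["action"]][1] += 1
    return {action: total / count for action, (total, count) in totals.items()}


# Step 1: select the group of users by an ad-hoc criterion (here: platform is web).
web_users = users_matching(events, lambda row: row["platform"] == "web")

# Step 2: compute a metric over all events belonging to that group.
result = average_duration_by_action(events, web_users)
```

With five rows the intermediate `web_users` set is trivial, but its size grows with the number of matching users, which is what makes this pattern costly once the first group reaches the millions.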

In contrast, Druid can leverage Bloom filters to return all results in a single query. This means that we don’t need to materialize the full set of users that match the first set of criteria. This is extremely powerful because it allows us to explore the data interactively and cost-effectively, something we couldn’t do with ES. Although Bloom filters are approximate, we’ve found they have a high degree of accuracy and allow us to see all the important trends we need to follow. If we need 100% accuracy on a few key results, Druid also supports exact computations.
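As a rough illustration of why a Bloom filter is compact yet approximate, here is a minimal Python sketch. It is not Druid's implementation: the `BloomFilter` class, the bit-array size, the number of hash functions, and the `user-N` identifiers are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """A fixed-size bit array; each added item sets num_hashes bit positions."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive the bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means present or a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical userIds selected by the first set of criteria.
matching_users = [f"user-{n}" for n in range(100)]

bloom = BloomFilter(num_bits=4096, num_hashes=3)
for user_id in matching_users:
    bloom.add(user_id)

# Probe with ids that were never added and count how many slip through.
other_users = [f"other-{n}" for n in range(1000)]
false_positives = sum(bloom.might_contain(user_id) for user_id in other_users)
```

The filter occupies 512 bytes regardless of how many users are added, and it never reports an added user as absent. The trade-off is that a small fraction of users who were never added may be reported as present, which is the approximation mentioned above; a larger bit array lowers that rate.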

Figure: How Bloom filters work.

Conclusions

WalkMe is now using Druid as a part of our flagship customer-facing analytics product, WalkMe Insights, as well as internally for all of our monitoring purposes. If you have problems similar to ours around tracking users for digital products, we encourage you to evaluate Druid and Imply.
