Customer feedback is a critical ingredient for product growth—and organizational success. But in an increasingly digital world, where users and providers are separated by vast distances, soliciting this feedback isn’t always straightforward. Traditional tools, like surveys and focus groups, can only go so far.
To bridge the gap between customer opinions and organizational perceptions of a product, many teams mine user behavior data. By using both direct interactions (website clicks or mobile app swipes) and contextual metrics (bounce rates and dwell times), an organization can create a holistic view of user patterns, better understand product weaknesses and strengths, and build a better experience.
Analyzing this user-generated data can benefit a wide range of sectors. For instance, an ecommerce retailer can understand why shopping carts are being abandoned, whether it’s due to an unwieldy checkout process or a malfunctioning backend service. A marketing team can figure out why and how specific pieces of content convert well, and try to replicate the magic behind the popular assets. An online travel platform can provide personalized recommendations—for flights, hotels, and destinations—to users in real time.
Before organizations can extract actionable intelligence from user data, they first require a database that can efficiently collect, organize, and process raw product data. One challenge is the sheer speed and volume of events; while this varies with a product's popularity, an organization with many users and heavy traffic has to filter, aggregate, and analyze significant quantities of data. After all, each unique user session can generate hundreds of events; compounded across thousands of concurrent users, the event rate can quickly balloon into the hundreds of thousands (or even millions) per second.
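To make the scale concrete, here is a back-of-envelope estimate of event throughput. The traffic numbers are hypothetical, chosen only to illustrate how quickly per-session events compound.

```python
# Back-of-envelope event throughput estimate. All figures below are
# assumptions for illustration, not measurements from any real product.
events_per_session = 300        # assumed: clicks, swipes, and page views per session
concurrent_sessions = 200_000   # assumed: active sessions at peak
avg_session_seconds = 600       # assumed: 10-minute average session length

# Spread each session's events evenly over its lifetime.
events_per_second = events_per_session * concurrent_sessions / avg_session_seconds
print(f"{events_per_second:,.0f} events/sec")  # 100,000 events/sec
```

Even with these modest assumptions, sustained throughput lands in the hundreds of thousands of events per second once traffic doubles or triples at peak.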
In addition, some product insights have to be delivered instantaneously, or they risk going stale and forfeiting potential sales or revenue. In these situations, shortening the time to value is essential. For instance, a shopping suggestion might only be relevant during the current browsing session, such as when a customer without children is shopping for baby shower gifts. Alternatively, a customer might be planning a last-minute vacation abroad, and a travel platform needs to tailor offers to this individual (such as loyalty program points or a hefty discount) to encourage an immediate purchase.
To build the best possible product and experience, whether it’s through buyer recommendations or A/B testing new features, teams need to create a real-time analytics application to collect, process, and pull insights from their user-generated data. They will need a database that can:
Ingest, organize, and query behavioral data in real time and at scale. Because this data is both voluminous and perishable, teams need to automate the process of extracting value. Data needs to be available instantly for querying, and results have to be returned rapidly to teams and applications. Raw events also have to be filtered for bot-generated activity, such as web crawling or inventory hoarding.
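As a minimal sketch of the bot-filtering step, the snippet below drops events from self-identified crawlers and from sessions firing implausibly fast. The event shape, field names, and thresholds are assumptions for illustration; they are not part of any Druid API.

```python
from collections import Counter

# Hypothetical heuristics: user-agent hints and a per-session rate cap.
# Assumes `events` covers roughly a one-minute window.
BOT_AGENT_HINTS = ("bot", "crawler", "spider")
MAX_EVENTS_PER_WINDOW = 120  # assumed ceiling for a human user in the window

def filter_bots(events):
    """Drop events from self-identified crawlers or implausibly busy sessions."""
    per_session = Counter(e["session_id"] for e in events)
    return [
        e for e in events
        if not any(hint in e["user_agent"].lower() for hint in BOT_AGENT_HINTS)
        and per_session[e["session_id"]] <= MAX_EVENTS_PER_WINDOW
    ]

events = [
    {"session_id": "s1", "user_agent": "Mozilla/5.0", "action": "click"},
    {"session_id": "s2", "user_agent": "Googlebot/2.1", "action": "view"},
]
print(len(filter_bots(events)))  # 1: the crawler event is dropped
```

In practice this filtering would run during or just after ingestion, so that downstream queries only ever see human traffic.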
Answer ad-hoc questions across large, high-cardinality datasets. Many of the relevant uses of user behavior data (such as personalization) can require complex processes that execute a lot of open-ended queries. As a result, teams may not be able to prepare and pre-aggregate data ahead of time for querying, so they will need databases that can enable flexible exploration across a wide range of dimensions, filters, and aggregations.
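To illustrate what "flexible exploration" means here, the toy function below groups events by any arbitrary dimension and computes a conversion rate per group; in Druid this kind of slice would be an ad-hoc SQL GROUP BY over the raw events. The field names and data are hypothetical.

```python
from collections import defaultdict

def conversion_rate_by(events, dimension):
    """Group events by an arbitrary dimension and compute conversion rate per group."""
    totals, conversions = defaultdict(int), defaultdict(int)
    for e in events:
        key = e[dimension]
        totals[key] += 1
        conversions[key] += e["converted"]
    return {k: conversions[k] / totals[k] for k in totals}

events = [
    {"country": "US", "device": "mobile",  "converted": 1},
    {"country": "US", "device": "desktop", "converted": 0},
    {"country": "DE", "device": "mobile",  "converted": 1},
]
print(conversion_rate_by(events, "country"))  # {'US': 0.5, 'DE': 1.0}
print(conversion_rate_by(events, "device"))   # {'mobile': 1.0, 'desktop': 0.0}
```

The point is that the grouping dimension is chosen at query time, not at ingestion time; with high-cardinality dimensions (millions of users or SKUs), pre-aggregating every combination in advance is infeasible.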
Provide always-on reliability and durability. Successful organizations may have a highly active, globally distributed user base active at all hours. As such, any downtime (from maintenance, failures, or upgrades) or data loss will negatively impact personalization or optimization initiatives.
Access both immediate and historical data. To build a better understanding of user interaction patterns, it helps to analyze both real-time and historical data in a single platform. In the past, these two types of data were split between two types of databases, transactional and analytical, which had distinct strengths and no overlap. Transactional databases were built for rapid data access and speed under load but not complex analysis, while analytical databases could perform complex aggregations but could not scale easily, return results quickly, or accommodate high user traffic.
Built for scale, speed, and streaming data, Apache Druid provides an excellent foundation for building a real-time analytics application that can leverage user-generated data for product analytics. In fact, its roots lie in user engagement data, as Druid’s founders were engineers at an advertising technology and market intelligence company.
To begin, data in Druid is immediately available for querying. While other databases ingest data in batches and persist it to files before users can access it, Druid ingests streaming data event by event directly into memory at data nodes, so that users can query data on arrival. This ensures that time-sensitive data remains relevant—and can be utilized before going stale.
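The contrast with batch ingestion can be sketched with a toy in-memory buffer: each event is queryable the moment ingestion returns, with no batch-and-flush step in between. This is purely an illustration of the query-on-arrival idea, not Druid's actual API or internals.

```python
# Toy "query on arrival" buffer. Real systems add persistence,
# indexing, and concurrency; this only shows the visibility model.
class StreamBuffer:
    def __init__(self):
        self.events = []

    def ingest(self, event):
        """Append one event; it is queryable as soon as this returns."""
        self.events.append(event)

    def count_where(self, **filters):
        """Count events matching all given field=value filters."""
        return sum(
            all(e.get(k) == v for k, v in filters.items())
            for e in self.events
        )

buf = StreamBuffer()
buf.ingest({"user": "u1", "action": "add_to_cart"})
print(buf.count_where(action="add_to_cart"))  # 1, with no flush step needed
```

A batch-oriented store would instead accumulate events in a staging area and only expose them to queries after the next flush, which is exactly the delay that makes perishable insights go stale.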
In addition, Druid can power rapid, open-ended data exploration. Teams can drill down along a wide range of attributes, segmenting data by age, gender, location, user preferences, purchasing patterns, and any number of other key characteristics and dimensions. For those who want more customization options, Imply also provides Pivot, an engine for creating interactive visualizations such as heatmaps, bar charts, choropleth maps, stacked area charts, and much more.
Druid also includes built-in reliability and resilience. After ingestion, events are sorted into columns and segments before being persisted into a separate deep storage layer, a common data store that guarantees data durability and availability. If a node fails, Druid will automatically pull a copy of the data from deep storage and rebalance it across the other, still-functioning nodes, ensuring that data is never unavailable.
Deep storage also facilitates scaling. While Druid divides key processes (like data management, querying, and ingestion) into independent nodes for easier scaling, it also relies on deep storage for flexibility in the scaling process. For instance, if a data node is added to keep up with increased traffic, Druid will obtain the relevant data from deep storage and rebalance it across the new set of nodes to ensure that performance remains consistent.
Lastly, Druid manages both real-time and historical data in a single platform. Rather than maintaining the divide between transactional and historical databases, Druid provides the best of both worlds—the fast scaling and performance under load of transactional databases with the complex aggregations and operations of analytical databases. By removing the friction of jumping between platforms, Druid creates a better experience for teams managing behavioral data, enabling them to more easily build analytics, extract value, and optimize experiences.
WalkMe provides a no-code digital platform for product analytics and personalization. Their embedded application integrates into any web, mobile, or desktop application, customizing experiences and converting users based on their activity (and predetermined rules). Today, WalkMe has approximately 2,000 customers (including 31% of the Fortune 500), impacting a combined total of 35 million users across more than 42 countries.
As they grew, WalkMe needed a database that could manage user-generated data in real time and serve instantaneous insights to their large customer base. Initially, WalkMe used Elasticsearch's log management solution to troubleshoot problems in their embedded application—rather than to visualize and analyze broad customer trends. However, as their data grew in volume and complexity, so did the WalkMe team's needs: now, they required a database that could accommodate ad-hoc queries with groupings, filters, and dimensions, as well as scale to keep up with demand.
The solution was Apache Druid. “Druid enables us to monitor performance across billions of client devices in real time,” Yotam Spenser, WalkMe’s Head of Data Engineering, explains. “We can leverage Druid to compute any arbitrary metrics over any ad-hoc groups of users. We can track business critical measures such as retention and attrition, plus many other forms of engagement and usage metrics. As a result, we can now gain the type of insights we need to optimize and segment our code for different host platforms, applications, and websites, per their specific needs.”
To learn more about how WalkMe uses Apache Druid, read this blog post.
For more information about Druid, read our architecture guide. For the easiest way to get started with real-time analytics, start a free trial of Polaris, the fully managed, Druid database-as-a-service by Imply.