How Sift is accurately identifying anomalies in real time by using Imply Druid

About Sift

Sift is the leader in Digital Trust & Safety, empowering companies of all sizes to unlock revenue without risk. Sift prevents fraud with industry-leading technology and expertise.

Challenges

As the leader in Digital Trust & Safety and a pioneer in using machine learning to fight fraud, Sift regularly deploys new machine learning models into production. Sift’s customers use the scores generated by machine learning models to decide whether to accept, block, or watch events and transactions. An example could be blocking all events with a high risk score. It’s very important that the new ML model releases do not cause a shift in score distributions. Score shift can cause customers to suddenly start blocking legitimate transactions and accepting fraudulent ones.

Sift’s customers have also built thousands of automated workflows to automatically determine whether to block, accept or watch a specific transaction or an event. Even a slight change in the score distributions may impact decision rates or introduce anomalies.

These anomalies in decision rates can be caused by internal changes in models and system components. They can also be caused by changes on the customers’ side: a change in integration, or decisioning behavior. Sometimes a change in decision rates is desirable – such as when there’s a fraud attack, entering into a new market, or a seasonal event.

The most important thing is to identify and triage changes in decision rates immediately in realtime to ensure that customers continue to get accurate results with Sift.

Sift needed a scalable product for monitoring customer outcomes in real time, knowing that each customer has unique traffic patterns and risk tolerances.

Solution requirements and decision to go with Imply’s distribution of Druid

The solution had to meet the following requirements:

Accurately identify anomalies per customer level in realtime
Automated real time alerts
High availability
Support realtime root cause analysis

Because each customer has unique traffic and decision patterns, Sift needed a tool, which can automatically learn what “normal” looks like for each customer.

Sift set out to build an automated monitoring tool, Watchtower, a system that would use anomaly detection algorithms to learn from past data and trigger alerts in realtime on unusual changes.

To allow them to transition to real time anomaly detection, Sift chose to start with open source Druid on AWS.

But soon they started to run into deployment and upgrade related issues on Open Source Druid. Sift then decided to rely on Imply’s distribution of Druid for deployment, upgrade and overall cluster management.

Solution Architecture

Watchtower architecture can be categorized in below mentioned four components:

Scalable tooling for real time data collection
Sift built a library to intercept requests to service, reformat and send data to a distributed messaging system. They used Kafka because of its scalability and fault tolerant characteristics.
Data storage for aggregating and querying streaming data in near real time
Sift needed to aggregate data by a variety of dimensions from thousands of servers. They needed a system that allows querying across a moving time window with real time analysis and visualization.
They used Imply Druid for this realtime analysis and visualization, which in their case proved to be a good choice as a near-realtime OLAP engine.
Services that ingest aggregated time-series data, run the anomaly detection algorithm, and generate reports
Sift needed a service that can fit the following requirements:
- Run jupyter notebooks with dependencies in virtual environments
- Pull and cache time series data
- Snapshot and store input time-series data for future investigations and algorithm tuning
Tooling for Offline Training, Modeling, and Benchmarking Algorithms
The algorithm must learn from unlabeled data, but they want to evaluate it against known historical anomalies. This helps them tune the sensitivity parameters of their algorithms, and provides context for support engineers. To do this, they built tools to join data from multiple data sources and run backtesting on algorithms. Once the algorithm is tuned, it can be deployed into the model shelf.

User Interface – Imply Pivot

Sift uses plots generated by anomaly detection notebooks for alerting via various channels. Each alert contains the name of the customer with anomaly in their decision pattern, the metric which triggered the alert, and a link to a dashboard that updates in realtime.

Their support engineering teams use Imply Pivot to build out Dashboards and slice and filter data per decision type, product type, time of the day, etc. Support engineering team monitors statistics on decisions and score distributions.

Here’s an example of a dashboard for one of Sift’s internal test customers presented below.

Imply Pivot – Visualizing sub-second latency analytics

Benefits

From a business standpoint, Watchtower running on Imply Druid with Imply Pivot exceeded Sift’s expectations. In the first month of launch, they were able to detect several possible anomalies without human intervention. They analyzed the data and saw a variety of root causes such as:

Incorrect integration changes with Sift REST API on one of the customers side
A mis-calibrated model for one of their customers before they released the model
Severe fraud attack on one of their customers
Spikes related to expected change of patterns such as promotion event ad campaigns

With Watchtower running on Imply Druid, their support engineering team was able to proactively contact customers quickly when anomalies were spotted, which prevented any potential business impact for their customers. This became possible with Imply Pivot, which allows the data to be analyzed in real time thereby reducing the time to resolve issues.

Next Steps

Sift’s next steps will be focused on adoption of new use cases as well as improving anomaly detection algorithm performance. They have started testing a number of promising deep neural network algorithms, including variations of LSTM and CNN.
They plan to use Watchtower for new types of data besides the decisions, both in business and engineering. For example, they plan to use it to monitor score distributions and system loads.
They also want to make Watchtower a self-serve service , where engineers without machine learning and data science backgrounds can use Watchtower for anomaly detection in any type of application.

Other blogs you might find interesting

No records found...

Jun 16, 2025

10 Years of Imply: From Apache Druid to What’s Next in Real-Time Analytics

We’ve officially hit double digits! Ten years ago, a few Druid-obsessed engineers asked a radical question: What if analytics didn’t have to be slow, stale, or stuck in dashboards? Back then, we were just...

Learn More

May 07, 2025

Real-Time Observability Without Operational Overhead: What’s Next?

Observability is meant to provide clarity, speed, and confidence in modern systems. Yet for many organizations, it has become a source of complexity, cost, and operational drag. Managing pipelines, tuning...

Learn More

Apr 14, 2025

It’s Time to Rethink Observability: The Event-Driven Future

Observability has evolved. Forward-looking teams are already moving beyond static dashboards and fragmented telemetry—treating all observability data as events and unlocking real-time insights across their...

Learn More

By Functional Use

By Application

FEATURED

DRUID CASE STUDIES

Apache Druid

Content

Support

Other blogs you might find interesting

Let us help with your analytics apps