How Sift is accurately identifying anomalies in real time by using Imply Druid

by Amit Tomar · Constantine Gurnov · May 17, 2021

About Sift

Sift is the leader in Digital Trust & Safety, empowering companies of all sizes to unlock revenue without risk. Sift prevents fraud with industry-leading technology and expertise.

Challenges

As the leader in Digital Trust & Safety and a pioneer in using machine learning to fight fraud, Sift regularly deploys new machine learning models into production. Sift’s customers use the scores generated by machine learning models to decide whether to accept, block, or watch events and transactions. An example could be blocking all events with a high risk score. It’s very important that the new ML model releases do not cause a shift in score distributions. Score shift can cause customers to suddenly start blocking legitimate transactions and accepting fraudulent ones.

Sift’s customers have also built thousands of automated workflows to automatically determine whether to block, accept or watch a specific transaction or an event. Even a slight change in the score distributions may impact decision rates or introduce anomalies.

These anomalies in decision rates can be caused by internal changes in models and system components. They can also be caused by changes on the customers’ side: a change in integration, or decisioning behavior. Sometimes a change in decision rates is desirable – such as when there’s a fraud attack, entering into a new market, or a seasonal event.

The most important thing is to identify and triage changes in decision rates immediately in realtime to ensure that customers continue to get accurate results with Sift.

Sift needed a scalable product for monitoring customer outcomes in real time, knowing that each customer has unique traffic patterns and risk tolerances.

Solution requirements and decision to go with Imply’s distribution of Druid

The solution had to meet the following requirements:

  • Accurately identify anomalies per customer level in realtime
  • Automated real time alerts
  • High availability
  • Support realtime root cause analysis

Because each customer has unique traffic and decision patterns, Sift needed a tool, which can automatically learn what “normal” looks like for each customer.

Sift set out to build an automated monitoring tool, Watchtower, a system that would use anomaly detection algorithms to learn from past data and trigger alerts in realtime on unusual changes.

To allow them to transition to real time anomaly detection, Sift chose to start with open source Druid on AWS.

But soon they started to run into deployment and upgrade related issues on Open Source Druid. Sift then decided to rely on Imply’s distribution of Druid for deployment, upgrade and overall cluster management.

Solution Architecture

Watchtower architecture can be categorized in below mentioned four components:

  1. Scalable tooling for real time data collection
    Sift built a library to intercept requests to service, reformat and send data to a distributed messaging system. They used Kafka because of its scalability and fault tolerant characteristics.
  2. Data storage for aggregating and querying streaming data in near real time
    Sift needed to aggregate data by a variety of dimensions from thousands of servers. They needed a system that allows querying across a moving time window with real time analysis and visualization.
    They used Imply Druid for this realtime analysis and visualization, which in their case proved to be a good choice as a near-realtime OLAP engine.
  3. Services that ingest aggregated time-series data, run the anomaly detection algorithm, and generate reports
    Sift needed a service that can fit the following requirements:
    • Run jupyter notebooks with dependencies in virtual environments
    • Pull and cache time series data
    • Snapshot and store input time-series data for future investigations and algorithm tuning
  4. Tooling for Offline Training, Modeling, and Benchmarking Algorithms
    The algorithm must learn from unlabeled data, but they want to evaluate it against known historical anomalies. This helps them tune the sensitivity parameters of their algorithms, and provides context for support engineers. To do this, they built tools to join data from multiple data sources and run backtesting on algorithms. Once the algorithm is tuned, it can be deployed into the model shelf.
  5. User Interface - Imply Pivot

    Sift uses plots generated by anomaly detection notebooks for alerting via various channels. Each alert contains the name of the customer with anomaly in their decision pattern, the metric which triggered the alert, and a link to a dashboard that updates in realtime.

    Their support engineering teams use Imply Pivot to build out Dashboards and slice and filter data per decision type, product type, time of the day, etc. Support engineering team monitors statistics on decisions and score distributions.

    Here’s an example of a dashboard for one of Sift’s internal test customers presented below.

    Imply Pivot - Visualizing sub-second latency analytics

    Benefits

    From a business standpoint, Watchtower running on Imply Druid with Imply Pivot exceeded Sift’s expectations. In the first month of launch, they were able to detect several possible anomalies without human intervention. They analyzed the data and saw a variety of root causes such as:

    • Incorrect integration changes with Sift REST API on one of the customers side
    • A mis-calibrated model for one of their customers before they released the model
    • Severe fraud attack on one of their customers
    • Spikes related to expected change of patterns such as promotion event ad campaigns

    With Watchtower running on Imply Druid, their support engineering team was able to proactively contact customers quickly when anomalies were spotted, which prevented any potential business impact for their customers. This became possible with Imply Pivot, which allows the data to be analyzed in real time thereby reducing the time to resolve issues.

    Next Steps

    • Sift’s next steps will be focused on adoption of new use cases as well as improving anomaly detection algorithm performance. They have started testing a number of promising deep neural network algorithms, including variations of LSTM and CNN.
    • They plan to use Watchtower for new types of data besides the decisions, both in business and engineering. For example, they plan to use it to monitor score distributions and system loads.
    • They also want to make Watchtower a self-serve service , where engineers without machine learning and data science backgrounds can use Watchtower for anomaly detection in any type of application.
Back to blog

How can we help?