Oct 6, 2022

Approximate Distinct Counts in Imply Polaris

Overview

“Guess how many jelly beans are in this jar!” is a popular contest built on the art of estimation. If you’re familiar with the age-old game, you’ll know that while there is an exact number of jelly beans, it’s usually the person with the closest guess who wins the prize.

There are plenty of ways to guess the number of beans (seriously, a quick internet search will lead you down quite the rabbit hole). Of course, getting the exact number requires dumping out the jelly beans and counting them one by one, but who has time for that? A much faster method is approximation: count, say, the jelly beans on the bottom layer of the jar and multiply by the number of layers stacked in the jar. The result is only an approximate count, but it takes far less time to reach than an exact one.

This is the fundamental idea behind HyperLogLog (HLL) and Theta sketches in Imply Polaris.

How do I use sketches?

A common use case in data analytics is counting the number of distinct values within a group. For example, I may want to know how many unique customers bought an item at each of my stores so I can target my marketing more precisely. For BI and batch-based analytics, the following SQL expression would work well:

SELECT store, COUNT(DISTINCT customer_id) FROM table_name … GROUP BY store;

But in the world of real-time, modern data analytics, counting each jelly bean one by one is simply not fast enough. If I want speed and am willing to sacrifice a small and predictable amount of accuracy, I can use an HLL or Theta sketch. Sketches are a class of algorithms built for use cases that demand speed and low memory usage. Instead of storing every value, a sketch hashes each one and keeps only a small, bounded-size summary of those hashes, from which it estimates the distinct count within a known error bound. If your application ingests billions of events per day and serves many active users, sketches can dramatically reduce data volume and processing costs while keeping queries fast and results highly accurate.
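To make the speed-for-accuracy trade concrete, here is a minimal HyperLogLog sketch in Python. It is a toy written for illustration, not the implementation Polaris uses: the function name `hll_estimate`, the choice of SHA-256 as the hash, and the default of 4,096 registers are all assumptions made for this example.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct values with a basic HyperLogLog.

    Illustrative only: hypothetical helper, not the Polaris implementation.
    """
    m = 1 << p                          # number of registers (4096 when p=12)
    registers = [0] * m
    for value in values:
        digest = hashlib.sha256(str(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")    # 64-bit hash of the value
        idx = h >> (64 - p)                      # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - p) - rest.bit_length() + 1
        if rank > registers[idx]:
            registers[idx] = rank
    alpha = 0.7213 / (1 + 1.079 / m)             # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:                 # small-range correction
        return m * math.log(m / zeros)
    return raw


# 100,000 purchase events made by 20,000 distinct (made-up) customers
customer_ids = [f"cust-{i % 20000}" for i in range(100000)]
exact = len(set(customer_ids))          # counts every jelly bean
approx = hll_estimate(customer_ids)     # keeps only 4096 small registers
print(exact, round(approx))
```

The exact count has to remember every distinct customer_id, while the sketch only ever holds 4,096 small integers no matter how many events arrive. With that many registers, the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%.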

Polaris Support for Sketches

In our recent Shapeshift Milestone 2 announcement, we unveiled support for HLL and Theta sketches at query and ingestion time in Polaris.

You can use sketches in Polaris at query time with a SQL statement such as:

SELECT store, APPROX_COUNT_DISTINCT_HLL(customer_id) FROM table_name … GROUP BY store;

While this is already quite fast compared to the standard COUNT(DISTINCT customer_id), it still requires significant storage and processing at query time, because every raw customer_id value has to be stored and then sketched when the query runs. Using sketches at ingestion time summarizes the input data up front, which improves rollup, reduces memory footprint, and increases query performance. Additionally, sketches allow for set operations (Theta sketches in particular support unions, intersections, and differences), so you can go beyond the traditional COUNT(DISTINCT customer_id) and also conduct analyses involving intersections and unions of your data.
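The set operations mentioned above can be sketched with a simplified Theta-style ("k minimum values") sketch in Python. Treat this as a rough illustration of the idea rather than the actual sketch library behind Polaris: the helper names (`theta_sketch`, `union`, `intersect`, `estimate`), the SHA-256 hash, and the value of `K` are all assumptions chosen for this example, and the toy builder looks at all values at once instead of streaming them.

```python
import hashlib

K = 4096  # nominal number of hash values each sketch retains (assumed)


def _hash01(value):
    """Map a value to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def theta_sketch(values, k=K):
    """Return (theta, retained): the k smallest hashes, all below theta."""
    hashes = sorted({_hash01(v) for v in values})
    if len(hashes) <= k:
        return 1.0, set(hashes)        # small input: sketch is exact
    return hashes[k], set(hashes[:k])  # (k+1)-th smallest hash is theta


def estimate(sketch):
    """Distinct-count estimate: retained hashes divided by sampled fraction."""
    theta, retained = sketch
    return len(retained) / theta


def union(a, b, k=K):
    theta = min(a[0], b[0])
    merged = sorted(h for h in a[1] | b[1] if h < theta)
    if len(merged) > k:                # trim back down to k entries
        theta, merged = merged[k], merged[:k]
    return theta, set(merged)


def intersect(a, b):
    theta = min(a[0], b[0])
    return theta, {h for h in a[1] & b[1] if h < theta}


# Made-up data: two stores that share 10,000 customers
store_a = [f"cust-{i}" for i in range(0, 30000)]
store_b = [f"cust-{i}" for i in range(20000, 50000)]
a, b = theta_sketch(store_a), theta_sketch(store_b)

either = estimate(union(a, b))      # customers seen at store A or store B
both = estimate(intersect(a, b))    # customers seen at both stores
print(round(either), round(both))
```

Because both sketches sample the same region of the hash space, the hashes they share stand in for the customers the two stores share. Intersection estimates are noisier than union estimates, since fewer retained hashes survive the intersection.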

You can use sketches in Polaris at ingestion time by defining a sketch column in the table schema. For example, given a new column named user, you can assign it the HLL Sketch or Theta Sketch data type and map the input data to the table column using the appropriate SQL expression, such as DS_HLL("customer_id") or DS_THETA("customer_id"). With the pre-computed user column, you can now use the following query for faster results:

SELECT store, APPROX_COUNT_DISTINCT_HLL("user") FROM table_name … GROUP BY store;
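Why does sketching at ingestion help so much? Sketches are mergeable, so many raw events can be rolled up into one row per group, and those rows can be combined later at query time. The Python simulation below illustrates that idea under stated assumptions: the event layout (hour, store, customer_id), the helper names, the register count, and the randomly generated data are all hypothetical, and the code does not reflect how Polaris stores sketches internally.

```python
import hashlib
import math
import random
from collections import defaultdict

P = 11            # 2**11 = 2048 registers per sketch (assumed size)
M = 1 << P


def hll_add(registers, value):
    """Fold one value into an HLL register array (in place)."""
    digest = hashlib.sha256(str(value).encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    idx = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1
    registers[idx] = max(registers[idx], rank)


def hll_merge(a, b):
    """Merging two sketches is just a register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def hll_count(registers):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


# Made-up raw events: (hour, store, customer_id)
rng = random.Random(42)
events = [
    (rng.randrange(24), f"store-{rng.randrange(3)}", f"cust-{rng.randrange(9000)}")
    for _ in range(60000)
]

# "Ingestion": roll events up into one sketch per (hour, store) row
rollup = defaultdict(lambda: [0] * M)
for hour, store, customer_id in events:
    hll_add(rollup[(hour, store)], customer_id)

# "Query": GROUP BY store, merging the pre-computed sketches across hours
per_store = defaultdict(lambda: [0] * M)
for (hour, store), registers in rollup.items():
    per_store[store] = hll_merge(per_store[store], registers)
estimates = {store: hll_count(regs) for store, regs in per_store.items()}

# Exact answer, for comparison only
exact = defaultdict(set)
for _, store, customer_id in events:
    exact[store].add(customer_id)

print(len(events), "events rolled up into", len(rollup), "rows")
for store in sorted(estimates):
    print(store, len(exact[store]), round(estimates[store]))
```

The query step never touches a raw customer_id: it only merges the rolled-up sketches, which is where the faster results come from.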

Enabling real-time analytics with sketches in Polaris is one of the many ways in which we are powering the next generation of modern data analytics applications. To try it out yourself, sign up at signup.imply.io.