Approximate Distinct Counts in Imply Polaris

Overview

“Guess how many jelly beans are in this jar!” is a popular contest built on the art of estimation. If you’re familiar with the age-old game, you’ll know that while there is an exact number of jelly beans, it’s usually the person with the closest guess who wins the prize.

There are plenty of ways to guess the number of beans (seriously, a quick internet search will lead you down quite the rabbit hole). Of course, getting the exact number requires dumping out the jelly beans and counting them one by one, but who has time for that? A much faster method is approximation: count, say, the jelly beans on the bottom layer of the jar and multiply by the number of layers stacked on top. The result is only an approximate count, but it takes far less time to get than an exact one.

This is the fundamental idea behind HyperLogLog (HLL) and Theta sketches in Imply Polaris.

How do I use sketches?

A common use case in data analytics is to count the distinct number of values within a group. For example, I may want to know how many unique customers bought an item at each one of my stores so I can improve my marketing precision. For BI and batch-based analytics, the following SQL expression would work well:

SELECT store, COUNT(DISTINCT customer_id) FROM table_name … GROUP BY store;

But in the world of real-time, modern data analytics, counting each jelly bean one by one is simply not fast enough. If I want speed and am willing to sacrifice a small and predictable amount of accuracy, I can use an HLL or Theta sketch. Sketches are a class of probabilistic algorithms that summarize a large set of values in a small, bounded-size structure, which makes them a good fit wherever speed and low memory usage are required. If your application ingests billions of events per day and serves many active users, sketches can dramatically reduce data volume and processing costs while keeping queries fast and estimates accurate within a predictable margin of error.
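To make the trade-off concrete, here is a minimal Python illustration of a K-minimum-values estimator, the idea that Theta sketches build on. It is a toy for intuition only, not the implementation Polaris uses; the names hash01 and kmv_estimate, the use of SHA-256, and the default k=1024 are assumptions made for this example. The estimator hashes every value to a number between 0 and 1, keeps only the k smallest hashes, and infers the distinct count from how tightly those hashes are packed near zero.

```python
import hashlib
import heapq


def hash01(value: str) -> float:
    """Map a value to a stable pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(values, k: int = 1024) -> float:
    """Estimate the number of distinct values while retaining at most k hashes."""
    heap = []        # max-heap (stored negated) holding the k smallest hashes
    members = set()  # the same hashes, used to skip duplicates
    for value in values:
        h = hash01(str(value))
        if h in members:
            continue  # duplicate of a value already retained
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Smaller than the largest retained hash: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values seen: exact
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest


# Hypothetical event stream: 60,000 purchases by 20,000 distinct customers.
events = [f"customer-{i % 20000}" for i in range(60000)]
estimate = kmv_estimate(events)
print(f"exact: {len(set(events))}, estimated: {estimate:.0f}")
```

The memory used depends on k, not on the number of events, and the relative error shrinks roughly with the square root of k, which is where the "small and predictable" loss of accuracy comes from.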

Polaris Support for Sketches

In our recent Shapeshift Milestone 2 announcement, we unveiled support for HLL and Theta sketches at query and ingestion time in Polaris.

You can use sketches in Polaris at query time with a SQL statement such as:

SELECT store, APPROX_COUNT_DISTINCT_HLL(customer_id) FROM table_name … GROUP BY store;

While this is already quite fast compared to the standard COUNT(DISTINCT customer_id), it still requires significant storage and processing at query time. Using sketches at ingestion time summarizes the input data, which improves rollup, reduces memory footprint, and increases query performance. Additionally, sketches support set operations, so you can go beyond the traditional COUNT(DISTINCT customer_id) and run analyses involving intersections and unions of your data.
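As a rough illustration of why sketches lend themselves to set operations, the Python toy below builds two Theta-style sketches and combines them. It is a simplified model for intuition, not Polaris code; the class ToyThetaSketch, the helpers union and intersect, and the customer ranges are all hypothetical. Each sketch keeps the hashes that fall below a threshold theta, so two sketches can be compared hash by hash once they are aligned to the smaller threshold.

```python
import hashlib


def hash01(value: str) -> float:
    """Map a value to a stable pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToyThetaSketch:
    """Keeps at most k hashes, all strictly below the threshold theta."""

    def __init__(self, k: int = 1024):
        self.k = k
        self.theta = 1.0
        self.hashes = set()

    def update(self, value) -> None:
        h = hash01(str(value))
        if h < self.theta:
            self.hashes.add(h)
            if len(self.hashes) > self.k:
                # Too many hashes: lower theta to the largest one and drop it.
                self.theta = max(self.hashes)
                self.hashes.discard(self.theta)

    def estimate(self) -> float:
        return len(self.hashes) / self.theta


def union(a: ToyThetaSketch, b: ToyThetaSketch) -> ToyThetaSketch:
    out = ToyThetaSketch(min(a.k, b.k))
    out.theta = min(a.theta, b.theta)
    ordered = sorted(h for h in a.hashes | b.hashes if h < out.theta)
    if len(ordered) > out.k:
        out.theta = ordered[out.k]  # trim back down to k retained hashes
        ordered = ordered[:out.k]
    out.hashes = set(ordered)
    return out


def intersect(a: ToyThetaSketch, b: ToyThetaSketch) -> ToyThetaSketch:
    out = ToyThetaSketch(min(a.k, b.k))
    out.theta = min(a.theta, b.theta)
    out.hashes = {h for h in a.hashes & b.hashes if h < out.theta}
    return out


# Hypothetical data: 10,000 customers per store, 5,000 of them shared.
store_a, store_b = ToyThetaSketch(), ToyThetaSketch()
for i in range(10000):
    store_a.update(f"customer-{i}")
for i in range(5000, 15000):
    store_b.update(f"customer-{i}")

either = union(store_a, store_b).estimate()     # true value: 15,000
both = intersect(store_a, store_b).estimate()   # true value: 5,000
print(f"either store: {either:.0f}, both stores: {both:.0f}")
```

Intersections tend to be noisier than unions because fewer retained hashes contribute to the estimate, so a larger k helps when the overlap is small.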

You can use sketches in Polaris during ingestion by defining a column in the table schema. For example, given a new column named user, you can assign it to an HLL Sketch or Theta Sketch data type and map the input data to the table column using the appropriate SQL expression, such as DS_HLL("customer_id") or DS_THETA("customer_id"). With the pre-computed user column, you can now use the following query for faster results:

SELECT store, APPROX_COUNT_DISTINCT_HLL(user) FROM table_name … GROUP BY store;
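The reason a pre-computed sketch column speeds things up can be shown with a small Python model of HyperLogLog. This is a bare-bones version written for illustration, under the assumption of 4,096 registers and SHA-256 hashing; the function names (new_sketch, add, merge, estimate, ingest, distinct_customers) and the sample events are hypothetical and do not reflect how Polaris stores or merges sketches internally. At ingestion, raw events are rolled up into one sketch per store and day; at query time, only those few sketches are merged.

```python
import hashlib
import math
from collections import defaultdict

P = 12       # number of index bits
M = 1 << P   # 4,096 registers per sketch


def new_sketch():
    return [0] * M


def add(regs, value: str) -> None:
    x = int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")
    idx = x >> (64 - P)                       # first P bits pick a register
    rest = x & ((1 << (64 - P)) - 1)          # remaining 52 bits
    rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
    regs[idx] = max(regs[idx], rank)


def merge(a, b):
    """Merging two sketches is a register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(regs) -> float:
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # small-range (linear counting) correction
    return raw


def ingest(events):
    """Roll up raw (store, day, customer) events into one sketch per (store, day)."""
    table = defaultdict(new_sketch)
    for store, day, customer in events:
        add(table[(store, day)], customer)
    return table


def distinct_customers(table, store) -> float:
    merged = new_sketch()
    for (s, _day), regs in table.items():
        if s == store:
            merged = merge(merged, regs)
    return estimate(merged)


# Hypothetical data: 30,000 events from 3,000 customers spread over 7 days.
events = [("store-1", i % 7, f"customer-{i % 3000}") for i in range(30000)]
table = ingest(events)
result = distinct_customers(table, "store-1")
print(f"rows after rollup: {len(table)}, estimated customers: {result:.0f}")
```

Note that the daily counts cannot simply be added together, because the same customer shows up on several days; merging the sketches handles that overlap, which is what makes them safe to roll up.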

Enabling real-time analytics with sketches in Polaris is one of the many ways in which we are powering the next generation of modern data analytics applications. To try it out yourself, sign up at signup.imply.io.
