Using Upserts in Imply Polaris

Jul 03, 2024
Rohan Joshi

An upsert is a database operation that updates an existing row if a specified value already exists in a table, or inserts a new row if it does not.
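
To make the semantics concrete, here is what an upsert looks like in PostgreSQL-style SQL (a generic illustration only; the table and column names are hypothetical, and Polaris exposes upserts differently, as described below):

    -- Insert a new customer row, or update the address if the customer already exists
    INSERT INTO customers (customer_id, address)
    VALUES (42, '123 Main St')
    ON CONFLICT (customer_id)
    DO UPDATE SET address = EXCLUDED.address;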

Upserts are particularly useful for two key reasons:

  1. Data Consistency. Upserts ensure that data remains consistent by either updating existing records or adding new ones.
  2. Efficiency. Upserts combine the insert and update operations into a single action, reducing the number of database queries and improving performance.

In this blog, we’ll discuss how we support upserts in Imply Polaris for a variety of use cases.

Using Dimension Tables in Polaris

Today, Polaris supports a lighter-weight form of upsert that we are calling Dimension Tables. Fundamentally, these are database tables that store the latest snapshot of attribute data that changes over time. Uniquely, in Polaris, they can be connected to an event streaming pipeline like Kafka to upsert the latest attributes, then JOINed at query time so that changes are reflected in query results almost immediately.

They are distinct from fact/event tables in that they are not well-suited for transactions, but they are beneficial in two primary ways:

  1. They are intuitive to understand and implement, making it easy for end users to find the data they need.
  2. They are well-suited to simple queries.

Dimension Tables support ingestion from Apache Kafka, Amazon MSK, and Confluent Cloud, as well as near-instant application of update and delete events.

Let’s examine how Dimension Tables work.

  1. Dimension Tables are populated from Kafka via a change data capture (CDC) mechanism.
  2. Data arrives on a single topic; Imply Polaris reads the entire topic, creates the table, and keeps it up to date.
  3. Every 30 minutes, we write a snapshot to deep storage, and the Druid task cycles and starts again.
  4. A user can use a JOIN to combine data in that Dimension Table with event data in a separate table in Polaris for a wide range of tasks, such as building a line chart that provides a global, real-time view of all their IoT assets (see the sketch after this list).
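
For instance, a hypothetical Druid SQL query in Polaris might JOIN a stream of IoT events against a Dimension Table holding each device’s latest attributes (all table and column names below are illustrative):

    -- Per-minute average temperature, enriched with the latest device attributes
    SELECT
      d.device_name,
      d.region,
      TIME_FLOOR(e.__time, 'PT1M') AS "minute",
      AVG(e.temperature) AS avg_temperature
    FROM iot_events e
    JOIN device_dim d ON e.device_id = d.device_id
    GROUP BY 1, 2, 3

Because the Dimension Table always holds the latest snapshot, attribute changes published to Kafka show up in results like this without re-ingesting any event data.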

Today, we have a few customers using these new tables in a production environment. MetaCX, which builds an analytics platform for impact-driven organizations, uses Dimension Tables to perform identity normalization on customer-provided data without re-ingesting raw events.

“Polaris Dimension Tables make it possible for us to perform live identity normalization without having to re-ingest raw events when identity mappings are updated. With support for multiple columns, they’re the logical successor to Lookup functions.”

—Jason Schmidt, Software Architect, MetaCX

The advantages of using Lookups in Apache Druid 

Today, Lookups in Apache Druid are a way to manage changing dimension values over time. They are essentially key-value maps that replace a dimension value (the key) with another value at query time.

Lookups are useful in scenarios where the attribute values of a dimension can change, such as:

  • A customer’s address changes when they move, so an ecommerce platform uses a Lookup to map each customer ID to that customer’s current address.
  • A team uses Lookups to enrich data with external information, such as mapping a user ID to corresponding demographic data in a different table.
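
As a sketch, a Druid SQL query that enriches order events with product names might look like this (the “orders” table is illustrative):

    -- Enrich each product_id with its human-readable name via a registered Lookup
    SELECT
      product_id,
      LOOKUP(product_id, 'product_Lookup') AS product_name,
      COUNT(*) AS order_count
    FROM orders
    GROUP BY 1, 2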

In this example, the LOOKUP function retrieves the “product_name” from the “product_Lookup” Lookup based on the “product_id” dimension.

To put it simply, Dimension Tables have two primary advantages over Lookups: they’re easy to set up and maintain, and they can effectively leverage JOINs.

Making updates to data in near-real time with Kafka-based Lookups presents a few challenges. In this case, the customer sets up a Kafka topic (containing key-value pairs) for the Lookup data. An external system publishes the Lookup data to the Kafka topic, and Druid subscribes to that topic. Druid then continuously ingests the updates and makes them available at query time.
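
As a sketch, the Lookup definition for this setup uses the “kafka” Lookup type from Druid’s kafka-extraction-namespace extension (the topic name and broker address below are placeholders):

    {
      "type": "kafka",
      "kafkaTopic": "customer_address_updates",
      "kafkaProperties": {
        "bootstrap.servers": "kafka.example.com:9092"
      }
    }

Each Kafka record’s key becomes a Lookup key and its value becomes the corresponding Lookup value; Druid reads the topic from the beginning and applies new records as they arrive.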

The problem is that the larger the dataset is, the longer it takes for Druid to read the whole topic. There’s a risk that Druid serves queries with partial data and at high latency.

While Kafka-based Lookups work well for small Lookup tables, they do not scale well to large Lookup datasets. And if you have the same key but multiple dimensions you want to look up against it, you need a separate Lookup table for each value.

With Dimension Tables, a single table contains all the data that needs to be JOINed at query time, as the sketch below shows.
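
To make the contrast concrete, here is a sketch in Druid SQL (all names are illustrative). With Lookups, each attribute of a product requires its own registered Lookup; with a Dimension Table, one JOIN exposes every column:

    -- Lookups: one registered Lookup per attribute
    SELECT
      LOOKUP(product_id, 'product_name_Lookup') AS product_name,
      LOOKUP(product_id, 'product_category_Lookup') AS product_category,
      COUNT(*) AS order_count
    FROM orders
    GROUP BY 1, 2

    -- Dimension Table: a single JOIN returns all attributes
    SELECT
      p.product_name,
      p.product_category,
      COUNT(*) AS order_count
    FROM orders o
    JOIN product_dim p ON o.product_id = p.product_id
    GROUP BY 1, 2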

Dimension Tables are a different and improved way to achieve the same outcome as a Lookup table in open source Druid, with a simpler setup and full support for JOINs.

Looking to the Future

The addition of Dimension Tables is the first stage of our Upsert roadmap. We are currently working on these additional capabilities for upcoming releases:

  • Higher event throughput
  • CDC input sources via Debezium
  • Upserts for entire event tables

We have seen growing demand to support JOINing descriptive metadata with event transactions at scale. Polaris’s ability to support JOINs at query time, in conjunction with Upserts, can unlock analytics at hyperscale.
