Atomic Replace in Polaris

Mar 23, 2022
Jad Naous

We recently launched Polaris, our fully managed database-as-a-service built on top of Apache Druid. With Polaris, we wanted to make it easier for anyone to get started building modern analytics applications. More importantly, one of our core philosophies with Polaris is to make sure that every capability we make available is rock solid and holds no surprises for users. To that end, even though Apache Druid provides the ability to replace data atomically, we didn’t offer it because we considered its behavior “surprising.” Today we’re announcing that Imply has made it possible for both open-source Apache Druid users and Polaris users to do atomic replacements of data intervals without worrying about surprising quirks.

Apache Druid’s replacement functionality offered users the ability to atomically replace data… with a twist. As readers familiar with Apache Druid may know, data in Druid is partitioned by time, and many of the data management operations that Druid offers work on a time-partition by time-partition basis. When asked to replace an interval of data, Druid will replace whole partitions within that interval with new data, but, and here’s the twist, only for partitions that actually have replacement data. Partitions within the replacement interval for which there’s no replacement data are not touched.

As an example, consider the setup below for an interval of data the user wants to replace, containing four time partitions (also known as time chunks by Druid developers). Say the original data was [a0, b0, c0, d0], one segment per partition. The replacement data is [a1, &lt;nothing&gt;, c1, d1], meaning that for the second partition we didn’t have any data to replace the old data with. Users would generally expect the data available in Druid after the replacement to look like [a1, null, c1, d1]. Unfortunately, the actual result is [a1, b0, c1, d1]: the data in the untouched partition continues to be available.

| Partition time range | [t0, t1) | [t1, t2) | [t2, t3) | [t3, t4) |
|----------------------|----------|----------|----------|----------|
| Existing data        | a0       | b0       | c0       | d0       |
| Replacement data     | a1       |          | c1       | d1       |
| Expected output      | a1       | null     | c1       | d1       |
| Old behavior output  | a1       | b0       | c1       | d1       |
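The difference between the two behaviors can be sketched in a few lines of Python. This is a toy model, not Druid’s API: an interval is a list of time chunks, and `None` stands in for “no replacement data” (on input) or a tombstone (on output).

```python
# Toy model of replacing an interval of four time chunks.

def replace_old(existing, replacement):
    """Old Druid behavior: only chunks that actually have replacement
    data are overwritten; chunks with no replacement data keep their
    old segments."""
    return [new if new is not None else old
            for old, new in zip(existing, replacement)]

def replace_with_tombstones(existing, replacement):
    """New behavior: every chunk in the replaced interval is overwritten;
    chunks with no replacement data are covered by a tombstone
    (modeled here as None)."""
    return list(replacement)

existing    = ["a0", "b0", "c0", "d0"]
replacement = ["a1", None, "c1", "d1"]

print(replace_old(existing, replacement))              # ['a1', 'b0', 'c1', 'd1']
print(replace_with_tombstones(existing, replacement))  # ['a1', None, 'c1', 'd1']
```

The old behavior silently keeps `b0` alive; with tombstones, the second chunk returns no data, matching what users expect.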

There are many reasons for this behavior. Most revolve around the trade-offs Druid makes to put more control into the hands of experts so they can get peak performance at scale, and to simplify data management on a time-partition basis. However, we wanted the expected behavior to be the default behavior, to make Druid more accessible to newer users. This work required some architectural surgery to introduce the concept of “tombstones” into Druid. A more technical blog about tombstones will follow soon, but this new capability is exciting because it opens the door to many new data management capabilities. For example, it will help us make data management easier by reducing the strict dependence of data management operations on time partitioning.
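For reference, recent open-source Druid versions expose this behavior through the `dropExisting` flag in the native batch ingestion `ioConfig`: when it is `true` and `appendToExisting` is `false`, partitions in the replaced interval that receive no new data are covered by tombstones rather than left in place. A hedged sketch of the relevant fragment (field names per the Druid docs at the time of writing; verify against your version):

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "appendToExisting": false,
      "dropExisting": true
    }
  }
}
```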

This is just one of the many ways Imply is continuously working to make both open-source Druid and our Polaris offering easier to adopt and more approachable to users. On top of this improvement, we’ve also added the ability for Polaris users to upload and ingest CSV files, making it easier to load data from sources that do not export data in JSON. To learn more about Polaris, sign up for a free trial at https://imply.io/polaris-signup and get started building scalable modern analytics applications.
