Atomic Replace in Polaris

Mar 23, 2022
Jad Naous

We recently launched Polaris, our fully managed database-as-a-service built on top of Apache Druid. With Polaris, we wanted to make it easier for anyone to get started building modern analytics applications. More importantly, one of our core philosophies with Polaris is that every capability we make available is rock solid and holds no surprises for users. To that end, even though Apache Druid provides the ability to replace data atomically, we didn’t initially offer it in Polaris because we considered its behavior “surprising” to users. Today we’re announcing that Imply has made it possible for users of both open-source Apache Druid and Polaris to atomically replace intervals of data without worrying about surprising quirks.

Apache Druid’s replacement functionality has offered users the ability to atomically replace data… with a twist. As readers familiar with Apache Druid may know, data in Druid is partitioned by time, and many of the data management operations Druid offers work one time partition at a time. When asked to replace an interval of data, Druid replaces whole partitions within that interval with new data, but, and here’s the twist, only for partitions that actually have replacement data. Partitions within the replacement interval for which there is no replacement data are left untouched.

As an example, consider the setup below for an interval of data the user wants to replace, containing four time partitions (also known as time chunks by Druid developers). Say the original data was [a0, b0, c0, d0], one entry per partition. The replacement data is [a1, &lt;nothing&gt;, c1, d1], meaning that for the second partition we have no data to replace the old data with. Users would generally expect the data available in Druid after the replacement to look like [a1, null, c1, d1]. Unfortunately, the old behavior produced [a1, b0, c1, d1]: the data in the untouched partition continued to be available.

| Partition time range | [t0, t1) | [t1, t2) | [t2, t3) | [t3, t4) |
|----------------------|----------|----------|----------|----------|
| Existing data        | a0       | b0       | c0       | d0       |
| Replacement data     | a1       |          | c1       | d1       |
| Expected output      | a1       | null     | c1       | d1       |
| Old behavior output  | a1       | b0       | c1       | d1       |
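To make the difference concrete, here is a minimal, purely illustrative Python sketch of the two behaviors. It is not Druid code: the function name, the time-chunk labels, and the use of a plain dictionary are assumptions made for the example; it only simulates the semantics shown in the table above.

```python
# Illustrative simulation only -- not Druid's actual implementation.
# Each time chunk maps to the data it currently holds (None = no data visible).

def replace_interval(existing, replacement, tombstones=False):
    """Overwrite time chunks of `existing` with `replacement`.

    Old behavior (tombstones=False): only chunks that have replacement data
    are overwritten; chunks without replacement data keep their old data.
    New behavior (tombstones=True): every chunk in the replaced interval is
    overwritten; chunks without replacement data end up empty (None), which
    is what a tombstone represents at query time.
    """
    result = dict(existing)
    for chunk in existing:
        if chunk in replacement:
            result[chunk] = replacement[chunk]
        elif tombstones:
            result[chunk] = None  # tombstone: old data is no longer visible
    return result

existing = {"[t0,t1)": "a0", "[t1,t2)": "b0", "[t2,t3)": "c0", "[t3,t4)": "d0"}
replacement = {"[t0,t1)": "a1", "[t2,t3)": "c1", "[t3,t4)": "d1"}

print(replace_interval(existing, replacement))                   # old: b0 survives
print(replace_interval(existing, replacement, tombstones=True))  # new: b0 becomes None
```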

There are many reasons for this behavior. Most revolve around trade-offs Druid makes to put more control into the hands of experts so they can get peak performance at scale, and to keep data management simple by operating on whole time partitions. However, we wanted the expected behavior to be the default so that Druid is more accessible to newer users. This work required some architectural surgery to introduce the concept of “tombstones” into Druid: placeholder segments that overwrite an old time partition while carrying no data, so the old data is no longer visible. A more technical blog post about tombstones will follow soon, but this new capability is exciting because it opens the door to many new data management features. For example, it will help us make data management easier by reducing the strict dependence of data management operations on time partitioning.
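To give a rough intuition for how a tombstone fits into a versioned, time-partitioned timeline, here is a simplified conceptual sketch. It assumes a toy segment model with integer versions and string intervals; it is not Druid’s actual internal data structure.

```python
# Rough conceptual sketch of a versioned segment timeline with tombstones.
# A deliberately simplified model, not Druid's internals.
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Segment:
    interval: str          # time chunk this segment covers
    version: int           # a higher version overshadows lower versions
    rows: Optional[list]   # None marks a tombstone (overshadows, but has no rows)

def visible_rows(segments: List[Segment]) -> list:
    """For each time chunk, keep only the highest-version segment.
    A tombstone wins over older data but contributes no rows to queries."""
    latest = {}
    for seg in segments:
        cur = latest.get(seg.interval)
        if cur is None or seg.version > cur.version:
            latest[seg.interval] = seg
    out = []
    for seg in latest.values():
        if seg.rows is not None:   # skip tombstones
            out.extend(seg.rows)
    return out

timeline = [
    Segment("[t1,t2)", version=1, rows=["b0"]),  # original data
    Segment("[t1,t2)", version=2, rows=None),    # tombstone written by the replace
]
print(visible_rows(timeline))  # [] -- the old b0 data is no longer visible
```

The key design point the sketch tries to capture is that the tombstone participates in the same overshadowing rules as any other segment, so a replace can hide old data even in partitions that received no new rows.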

This is just one of the many ways Imply is continuously working to make both open-source Druid and our Polaris offering easier to adopt and more approachable to users. On top of this improvement, we’ve also added the ability for Polaris users to upload and ingest CSV files, making it easier to load data from sources that do not export data in JSON. To learn more about Polaris, sign up for a free trial at https://imply.io/polaris-signup and get started building scalable modern analytics applications.
