Community Discoveries: multi-value dimensions in Apache Druid

by Hellmar Becker · Peter Marshall · in Community · October 4, 2021

Hellmar Becker is an Imply solutions engineer based in Germany, where he has been delving into the nooks-and-crannies of multi-valued dimension support in Druid.  In this interview, Hellmar explains why he invested time understanding this area of untapped functionality.

Hellmar first came across Druid’s support for multi-value dimensions just after joining Imply in 2020 while working with would-be adoptees in telecommunications and multimedia.

“The Druid data model is, in its simplest form, a very, very wide table.  It is not so much a classic OLAP model, it’s more like a model you’d find in statistical tools and in machine learning.  This makes it challenging to model one-to-many data, like a row that is actually made up of many rows.  When I saw multi-value dimensions, I knew I had to look into it further to see whether this was part of the answer.”

Multi-value dimensions are defined at ingestion time.  When incoming dimensions are registered as multi-value, Druid knows to treat this dimension differently at query time.

“At query time using SQL, the values are known to be an array.  So when I GROUP BY that dimension, for example, I would get an answer for each value in that array.  If I want an answer just for the row – and not for the array – there are SQL functions to allow you to do that.  This includes  ones that manipulate the array and do operations on it like joining elements together to create a string, or to pick individual elements out of the array.”

To understand this a bit more, Hellmar and I dreamt up a wild example: an array of “keys pressed”.  Imagining that the multi-value dimension contained “a, b, f, w, l” then a query could bring back a COUNT of the number of times, across the time period in our Druid query, that each key was pressed.

“Of course!  But let’s be more serious, Peter!”

(If you have seen Hellmar and me in action, you know we keep each other in order!)

“A commercial example would be movie views.  Imagine you had a multi-value dimension called ‘tags’ that contained genres, actors, and so on.”

Or the attributes of a visitor to our website?

“Yes, precisely.  Or maybe the stations that a person stopped at on a journey by train.”

Hellmar has one note of caution around schema evolution.

“Take care that Druid can be consistent in its results by being consistent with your field naming.  Changing your ingestion specification to make an existing field a multi-value type will not affect previously ingested data in that field.  New data will have the new type – old data will not.”

If you’re running Druid right now, maybe you are struggling with modelling 1-to-N relationships – maybe you are having to do lots of JOIN operations at query time to build up an array of values.  Hellmar’s research is for you: it is worth reading his blog series:

The steps to using multi-value dimensions are fairly simple.

First, look for candidates in your source data model – remembering that you may be working with source data that someone has already attempted to flatten out for Druid, and you may need to revisit their work to see if multi-value dimensions would work better.

Second, take care with the source data type.  Multi-value dimensions in Druid can only be strings.

Third, make sure that the source data is ready for the parser in Druid.  For example, if you’re using CSV, be cautious around the separators. 

And finally, there is a decision that needs to be made up-front – something slightly less well-known and described.

“Do you want to keep the order? Say someone went to one page on a website and then to another. Do you want to remove duplicates? Should they be sorted somehow – like that I just want to know which stations someone went to on the train but we don’t care about the order?”

Based on those decisions, the dimensions specification in Druid needs to be amended.

The result? Enhanced SQL queries.

“It’s important to know how your user interfaces will react when Druid processes the SQL.  Druid is going to execute the SQL query it is given and provide results in the way it thinks it should.”

I could not believe I had not known about multi-value dimensions before.

“Me neither!  And I’m only just getting started, Peter!  I am excited to go even deeper into it.”

Do you want to know more about multi-value dimensions? Come along and join Hellmar and me in the Druid Forum [https://www.druidforum.org].

***

The community would love to hear about your research and use of particular parts of Druid!  Email community@imply.io to sign up for a Community Heroics interview, to discuss opportunities for blog posts and speaking slots, and to get the latest information about community activities across the world.  And we’re also here to help you get your name in lights on Apache Druid’s Powered By page.

Back to blog
Tags: #apachedruid

How can we help?