Imply Videos

Dec 9, 2022

Why & How We Built An Open-Source Spark Druid Connector – Spark Druid Segment Reader

Spark Druid Segment Reader is a Spark connector that extends the Spark DataFrame reader to support reading Druid data directly from PySpark. Using DataSource options, you can easily define the input path and the date range of the directories to read.

The connector detects segments for the given intervals based on the directory structure in Deep Storage (without accessing the metadata store). For each interval, only the latest version is loaded. The schema of each segment file is inferred automatically, and all inferred schemas are merged into a single compatible schema.
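To make the interval detection concrete, here is a minimal sketch of the idea in plain Python, assuming the typical Druid deep-storage layout of `<dataSource>/<intervalStart>_<intervalEnd>/<version>/<partitionNum>/index.zip`. The function name and layout details are illustrative, not the connector's actual code.

```python
from collections import defaultdict

def select_segments(paths, start, end):
    """For segment paths on deep storage, keep intervals that overlap
    [start, end) and, for each interval, only the latest version."""
    by_interval = defaultdict(set)  # interval string -> set of versions seen
    for p in paths:
        datasource, interval, version, rest = p.split("/", 3)
        ivl_start, ivl_end = interval.split("_")
        # ISO-8601 timestamps compare correctly as strings, so a plain
        # string comparison implements the interval-overlap test.
        if ivl_start < end and ivl_end > start:
            by_interval[interval].add(version)
    # Druid segment versions are ISO-8601 creation timestamps, so max()
    # picks the newest version for each interval.
    return {ivl: max(versions) for ivl, versions in by_interval.items()}

# Illustrative deep-storage listing: two versions of the same interval,
# one interval outside the requested range.
paths = [
    "events/2022-12-01T00:00:00.000Z_2022-12-02T00:00:00.000Z/2022-12-03T10:00:00.000Z/0/index.zip",
    "events/2022-12-01T00:00:00.000Z_2022-12-02T00:00:00.000Z/2022-12-05T09:30:00.000Z/0/index.zip",
    "events/2022-12-02T00:00:00.000Z_2022-12-03T00:00:00.000Z/2022-12-03T10:00:00.000Z/0/index.zip",
    "events/2022-11-30T00:00:00.000Z_2022-12-01T00:00:00.000Z/2022-12-03T10:00:00.000Z/0/index.zip",
]
chosen = select_segments(paths,
                         "2022-12-01T00:00:00.000Z",
                         "2022-12-03T00:00:00.000Z")
# Only the two overlapping intervals survive, each with its latest version.
```

Because the selection relies solely on path names, no call to Druid or its metadata store is needed, which is what lets the connector run against Deep Storage alone.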

At Deep.BI, we work heavily with Druid on a daily basis and it is a central component of our real-time stack for both Analytics and Machine Learning & AI use cases. Other key components of our stack include Kafka, Spark, and Cassandra.

One of the challenges we faced was that our data science team often needed to run ad hoc analyses on data from Druid, which put unnecessary pressure on the cluster; most of those jobs could have been done in Spark, an environment that is also more familiar to them.

To reduce the workload on our data science and data engineering teams, as well as the load on our Druid cluster, we created the Spark Druid Segment Reader.

It is an open-source (Apache 2.0 license) tool that acts as a raw Druid data extractor for Spark. It imports Druid data into Spark with no Druid involvement, letting data scientists use the data and run complex, ad hoc analyses without putting pressure on the Druid cluster.

The connector enables various teams (e.g. data scientists) to access all historical data on demand, going back years, and run any necessary analyses on it. They can reprocess historical data without involving the Druid cluster in the extraction and, as a result, greatly reduce data duplication between the data lake and Druid Deep Storage.

You can find the GitHub repository here: