Community Spotlight: Sparking that connection with Apache Druid

by Peter Marshall · in Community · September 6, 2021

Whether it’s editing documentation or contributing complex code, members of open source communities are free to come together and generate value that benefits all of us. The Imply community team spoke to an Apache Spark community member who is doing just that, working with other engineers to create connectivity between Spark and Apache Druid.  And along the way, I found that I had a mindset that needed changing...

Don’t laugh, but it was only last week that I first used git.

I know, I know… There’s no excuse: after all, how can I possibly be an employee of a Silicon Valley software company with a mission to “make it as easy as possible for people to use Druid and build awesome data applications on top of it” if I’ve never used git?

Charles Smith is a Senior Technical Writer at Imply, helping to create bigger and better documentation around Apache Druid. Git came up as I was talking to Charles about how I could help make improvements to the documentation myself. Armed with my suggested alterations around “time chunks”, I met with Charles to discuss what I should do.

But alas, Git, the place where all these docs are retained inside the Druid repository, was an alien concept. Despite much keenness, Git remained a dark art.

“So I edit it where?  And then… wait… how do I get things workflowed through to someone to approve?”, I ask.

“I don’t want to write a load of rubbish that’s not true.”

Charles looks quizzical.

“I mean, maybe I could write it in Google Docs, then share it with you and log a Jira? Then maybe your team could pick it up and review it.  And then I could work out the pull request thing later...”

A hand is raised into view on the Zoom call.

“Erm, Peter - I think you’re missing the point here. This is open source. This is the Apache way. This is what it’s all about.”

Blank British face.

“Peter – you write your thing – you contribute – you add it – a conversation ensues around your request – and then, if it’s gonna add value, it gets merged. This is the open source way.”

Eureka moment: I’m so used to The British Way: of hierarchy and Old Boys’ clubs, 32-page process maps, pink and orange and blue 5-part forms, and orderly queues that never end.  But the Apache Way – and by extension the Apache Druid Way – is driven by “consensus” and built on “trust”.  It’s a place where everyone is “equal irrespective of title”, where “votes hold equal weight”, and where people “help each other” to help everyone.

“Strong communities can always rectify problems with their code, whereas an unhealthy community will likely struggle to maintain a codebase in a sustainable manner.”

https://www.apache.org/theapacheway

It’s been nearly 10 years now since Druid was open sourced “to help other organizations solve their real-time data analysis and processing needs” (Tschetter), 3 years since incubation, and 2 years since becoming a top-level project. And this has happened not because of one person or one organisation or one country or one job title – but because people came together to help one another.

People like Julian Jaffe, a Pinterest alumnus recently on-boarded at Netflix, who has built and maintained near-real-time and batch pipelines ingesting billions of events per day for internal and external partners, training algorithms, finance teams, and analysts. He has integrated Apache Druid into entire ecosystems, tuning and scaling it out, and supervising and mentoring teams throughout.

Where I saw a gap in the docs, Julian saw a gap in functionality – functionality that would allow people to integrate Druid’s Deep Storage with Apache Spark.

“Druid is really, really good if you're giving it already processed data or aggregating raw data at ingestion. But if you have raw events coming in on which you're doing further enrichment and processing, then out of the box Druid wants you to run all your processing, produce some intermediate file, and only then ingest it into the Druid cluster. That adds both latency and a lot of unnecessary resources.

“That's the impetus behind the Spark Druid Writer, to enable Spark users to write directly to Druid’s deep storage, and then to update the metadata with information about those new files.”
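The two steps Julian describes — write segment files directly to deep storage, then update the metadata so Druid knows about them — can be sketched as a toy simulation. Everything below (function names, file layout, the list standing in for the metadata store) is invented for illustration; it is not the actual Spark Druid Writer API:

```python
import json
import pathlib
import tempfile

# Toy stand-ins: a temp directory plays deep storage, and a plain
# list plays Druid's metadata store (really a relational table).
deep_storage = pathlib.Path(tempfile.mkdtemp())
metadata_store = []

def write_segment(datasource, interval, rows):
    """Step 1: persist a segment file straight to deep storage,
    skipping any intermediate file and re-ingestion."""
    path = deep_storage / f"{datasource}_{interval}.json"
    path.write_text(json.dumps(rows))
    return path

def publish_segment(datasource, interval, path):
    """Step 2: update the metadata with information about the new file,
    so the Druid cluster can serve it."""
    metadata_store.append({
        "dataSource": datasource,
        "interval": interval,
        "location": str(path),
    })

# A Spark job would do this once per partition; here, one tiny "segment".
p = write_segment("events", "2021-09-06", [{"ts": 1, "clicks": 3}])
publish_segment("events", "2021-09-06", p)
print(len(metadata_store))  # 1
```

The point of the design is that the Druid cluster itself does no work during the write: Spark produces the files, and only the cheap metadata update touches Druid.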

That’s the first step.  But then there’s getting data out of Druid and into Spark:

“There are existing Spark Druid Readers, but they all get at data by sending queries to your Druid cluster. Why use resources that are dedicated to very low-latency, interactive analysis for a batch interface? There's not a human sitting there waiting for output, wanting to get something back quickly.

“This Spark Druid Reader’s intention is to talk to the Druid cluster to figure out where the data it cares about is on Deep Storage, and then to pull that into Spark for batch processing.
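That flow — consult the metadata for where the relevant segments live, then pull the files straight from deep storage without sending queries to the cluster — might look like this toy sketch. As before, the names and file layout are invented for illustration, not the real Reader API:

```python
import json
import pathlib
import tempfile

# Mock deep storage holding two daily segment files.
storage = pathlib.Path(tempfile.mkdtemp())
days = ("2021-09-05", "2021-09-06")
for day in days:
    (storage / f"events_{day}.json").write_text(
        json.dumps([{"day": day, "clicks": 3}]))

# Mock segment records, as the cluster's metadata might report them.
metadata = [
    {"dataSource": "events",
     "interval": day,
     "location": str(storage / f"events_{day}.json")}
    for day in days
]

def read_segments(datasource, intervals):
    """Ask the metadata where the data it cares about lives, then read
    those files directly from deep storage -- no query load on Druid."""
    rows = []
    for rec in metadata:
        if rec["dataSource"] == datasource and rec["interval"] in intervals:
            rows.extend(json.loads(pathlib.Path(rec["location"]).read_text()))
    return rows

batch = read_segments("events", {"2021-09-06"})
print(batch)  # [{'day': '2021-09-06', 'clicks': 3}]
```

Only the small metadata lookup touches the cluster; the heavy reading happens against deep storage, where batch throughput is what matters.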

“Druid does real time analytics very well, because that’s what it’s been designed and built to do.  It doesn't do large scale, offline, batch processing. It shouldn't do that: it's not its job.

“But once you can read from and write to Druid from Spark, you can bring these two technologies together to solve more data problems.

“You might have data streaming into Druid, and then write a daily batch that has additional information that wasn’t available in real-time.

“Or maybe you have more esoteric requirements for compaction-style work beyond dropping dimensions and changing the time granularity: you might want to reduce the granularity of values in the data when you don’t need that extra detail and want to reduce storage, like going from exact HTTP error codes for all requests over the last three days to just 200, 400, and 500 for older data.”
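Julian’s HTTP example boils down to a simple rollup rule: for older data, collapse each exact status code to its class. A minimal sketch of that transformation (which a Spark job could apply per row before rewriting older segments):

```python
def to_class(code: int) -> int:
    """Collapse an exact HTTP status code to its class: 2xx -> 200,
    4xx -> 400, 5xx -> 500, and so on."""
    return (code // 100) * 100

# Detailed codes kept for recent data; coarse classes for older data.
recent = [200, 201, 404, 418, 503]
older = [to_class(c) for c in recent]
print(older)  # [200, 200, 400, 400, 500]
```

After rollup, rows that previously differed only in the exact code can be aggregated together, which is where the storage saving comes from.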

Work progressed internally until Julian felt – like me – the call to give back.

“There are a lot of organisations out there building things for Druid. I work with people with the capacity to grab some random GitHub, pull it out and look at it, then adapt it to the company’s use case.  If I have to upgrade Druid and make that work compatible, we can do it. If we need to tweak it or adapt it, we can do it.

“A lot of people in the community don’t have the bandwidth to do that, so they get left out.”

Julian raised the Issue on GitHub on 28th April 2020 after a discussion with community members. The Pull Request followed in February 2021, bringing more developers into the conversation.

“More eyeballs are always good.  People who can say ‘hey, there's this or that other approach.’  It’s nice to have a dialogue instead of a monologue!”

“Spark connectors are hard to build. For a start, there’s an awkward fit in the Druid codebase because they’re written in a different language; they don’t run on a Druid cluster.

“And that has a knock-on in code review because there are fewer people who are familiar with both systems.”

This collaborative effort is all worth it.  I can’t remember a month without someone asking about Spark connectors for Druid, and seeing community members coming together in the Apache Way to make this happen for everyone’s benefit is inspiring. I guess it’s time for me to put my fear and processes and hierarchies back in the box where they belong, and go edit some Markdown!

*   *   *

Like me and like Julian, if you have something valuable to contribute, there is a community here to help.  And thanks to the Apache Way, no organisation, no person, no job title, will ever prevent us from “providing software for the public good”.

The “community” page on the Apache Druid project site suggests some things to help build and improve, people and places to connect with others, and links to processes and policies.

*   *   *

The community would love to hear your story!  Email community@imply.io to sign up for a 5-minute interview for your own Community Spotlight, and to discuss opportunities for blog posts and speaking slots, as well as to get the latest information about community activities across the world.  And we’re also here to help you get your name in lights on Apache Druid’s Powered By page.
