It’s an understatement to say 2022 was a year of much uncertainty, especially in the tech industry. The world is still grappling with how 2020 fundamentally changed the way we live and work, and a looming recession has a lot of folks on edge. But in the midst of all this, in a true testament to the strength of its technology and community, Apache Druid had a breakout year.
Many factors contributed to Druid’s success this year, but recognition needs to go to the Druid community, whose dedication to the open source project is the reason Druid is better than ever. An open source project is only as strong as the folks who work on it, so first and foremost: thank you to the more than 14,000 community members and 500+ active contributors.
In a crowded database space, Druid was able to break from the pack because of new analytics use cases that have emerged. Organizations are looking for technologies that can do more than a data warehouse or a transactional database can do on their own; they want options that take key technical traits from both. This need for a database that can take analytics beyond basic BI and enable applications on streaming data has created a new database category, one essential to powering a new use case: modern analytics applications. This is where Druid shines. As a real-time analytics database, Druid is designed for sub-second queries at TB to PB scale and high queries per second on real-time events.
Druid is getting noticed – and not just by organizations looking for a database purpose-built for these new analytics use cases. Apache Druid was named to InfoWorld’s Best Open Source Software of 2022, and more recently, Druid was voted Readers’ Choice for Best Data and AI Product or Technology: Analytics Database in BigDATAwire’s (formerly Datanami) 2022 Readers’ & Editors’ Choice Awards.
A New Shape for Apache Druid
Improvements and new features introduced in this year’s release cycle directly contributed to Druid’s rise. At the end of 2021, Gian Merlino, Apache Druid PMC chair and Imply co-founder, announced Project Shapeshift, an effort to make the Druid experience more cloud-native, simple, and complete.
Interactive slice-and-dice, at scale and under load, without precomputation, is a key reason people choose Druid. However, analytics applications also need other user-facing features, like data export and reports, that rely on much longer-running or more complex queries. Previously, Druid wasn’t well suited to these workloads at scale, so developers had to manage them in other systems alongside Druid, creating separate pipelines and adding cost and complexity.
To tackle all of this, a key focus for Druid in 2022 was building a multi-stage query engine. Doing so would not only open up the ability to run these types of queries but also simplify batch ingestion and enable SQL end-to-end. But let’s step back and look at all of the new features, capabilities, and improvements that arrived across two key releases.
Apache Druid 0.23, 24.0, and the Multi-Stage Query Engine
In 2022, Apache Druid had two major releases: Druid 0.23, which contained over 450 updates from 81 contributors, and Druid 24.0, which had over 324 updates from more than 63 contributors.
Druid 0.23 saw the addition of new, smaller-but-still-important capabilities such as better tag data analysis with GROUP BY on multi-valued dimensions without unnesting, and support for additional timestamps in time-series event data. A new partitioning scheme was also introduced: multi-dimensional (multi-dim) partitioning. Previously, Druid supported a few partitioning schemes that could be used to optimize query performance, each with its own limitations and tradeoffs. Additional features and improvements included:
- Atomic replacement and filtered deletion
- Tighter Kafka integration through an enhanced Kafka inputFormat
- Easier JSON data ingestion with extended JsonPath functions
- Improved SQL error messages
- Query cancellation API
- Safe divide function for SQL queries (see the sketch after this list)
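As one small taste of these improvements, here’s a minimal sketch of the safe divide function; the datasource and column names are hypothetical placeholders:

```sql
-- Hypothetical datasource and columns, for illustration only.
-- SAFE_DIVIDE (added in 0.23) returns NULL when the divisor is zero,
-- rather than failing the whole query.
SELECT
  channel,
  SAFE_DIVIDE(SUM(bytes_added), SUM(edit_count)) AS avg_bytes_per_edit
FROM wikipedia_edits
GROUP BY channel
```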
But September’s Druid 24.0 release was more than just a new decimal point placement; it was a significant leap forward for the Druid engine. This release introduced two groundbreaking features: the aforementioned multi-stage query (MSQ) engine and support for nested JSON columns.
The multi-stage query engine marks the first step toward a universal query engine that’s both high-performance and highly versatile. Initially, it simplifies batch ingestion into Druid using SQL and enables in-database data transformation, greatly reducing the data prep and tooling needed before ingestion. It’s also highly performant: based on our benchmarking, batch ingestion is at least 40% faster than with the original Druid batch ingestion engine. Before 24.0, loading data into Druid meant learning the Druid ingestion spec. With Druid 24.0, you can now load data into Druid with SQL queries, made possible by the multi-stage query engine.
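For example, here’s a minimal sketch of SQL-based batch ingestion; the datasource name, file URL, and schema are hypothetical placeholders:

```sql
-- Ingest an external JSON file into a Druid datasource using SQL.
-- EXTERN takes an input source, an input format, and a row signature.
INSERT INTO example_events
SELECT
  TIME_PARSE("timestamp") AS __time,
  "user",
  "action"
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/events.json"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"},
      {"name": "user", "type": "string"},
      {"name": "action", "type": "string"}]'
  )
)
PARTITIONED BY DAY
```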
Nested column support enables you to ingest nested JSON columns and retain the nested structure while keeping the fast query performance you expect from Druid. It’s as simple as specifying the data type as “json” in your data ingestion spec. This pairs well with the new SQL-based batch ingestion, where you simply declare the column type as COMPLEX<json>.
Once loaded, you can use functions like JSON_VALUE, JSON_QUERY, and others to query the data stored in the nested column. You can expect performance that matches or exceeds that of other Druid column types; for numerical values, you can expect 10-50% better performance when they are part of a nested JSON column.
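Here’s a minimal sketch of what that looks like end to end, again with hypothetical names:

```sql
-- Ingest a nested JSON field, preserving its structure.
INSERT INTO example_events
SELECT
  TIME_PARSE("timestamp") AS __time,
  "device"  -- kept as a nested JSON column
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/events.json"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"},
      {"name": "device", "type": "COMPLEX<json>"}]'
  )
)
PARTITIONED BY DAY

-- Then query it: JSON_VALUE extracts scalar values from the nested
-- structure by path (JSON_QUERY returns nested objects).
SELECT
  JSON_VALUE(device, '$.os.name') AS os_name,
  COUNT(*) AS events
FROM example_events
GROUP BY 1
```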
Druid’s Success Contributes to Imply’s Momentum
Druid’s breakout year also fueled Imply’s momentum. It was a huge year for us as well! Most notably, in May 2022, Imply cemented its unicorn status with a $100M Series D funding round led by Thoma Bravo Growth, with participation from OMERS and existing investors Bessemer Venture Partners, Andreessen Horowitz, and Khosla Ventures.
Another highlight coming out of Project Shapeshift was Imply Polaris. In March, Imply announced the fully managed database-as-a-service (DBaaS): developers get Apache Druid’s best-in-class speed and scale, and Imply takes care of the database and infrastructure. Imply Polaris eliminates the infrastructure concerns of managing your own instance of a distributed database like Druid. It fully automates infrastructure provisioning, setup, and deployment; dynamically scales data ingestion resources based on actual demand; and automates configurations and tuning parameters so you don’t have to bother with “turning knobs.” In addition, Polaris has a built-in push-based streaming service and a visualization engine integrated into a single UI, delivering what you need to start fast.
Since its launch, more than 500 accounts have been building with Polaris, including Zillow, who presented at the 2022 Druid Virtual Summit.
Imply also hit the road. Druid Summit on the Road made stops all over the globe, and we made our presence known at MongoDB World, Confluent’s Current, and AWS re:Invent. At Current 2022, which also featured Imply customers Reddit and Citrix, co-founder Gian Merlino took the stage with Confluent CEO Jay Kreps; after all, Imply and Confluent go together like Apache Druid and Apache Kafka. Thousands of businesses already pair streaming platforms like Kafka with Druid to build cutting-edge applications that analyze terabytes to petabytes of streaming data in milliseconds.
And if you caught the opening video, you might have spotted Field CTO Eric Tschetter being interviewed. Speaking of Eric, he was the first guest on Imply’s newly launched podcast, Tales at Scale, which is poised to take 2023 by storm.
Speaking of new programs (albeit a different type of program), Imply also introduced the Total Value Guarantee (TVG): for qualified participants, the total cost of ownership (TCO) with Imply, measured across software, support, and infrastructure, will be less than their current TCO to run Apache Druid.
Even though 2022 is coming to a close, we decided to go out in style with the Druid Virtual Summit. This year’s event saw presentations from Netflix, Reddit, Poshmark, and of course, our friends at Confluent.
Though it was a monumental year, it feels like it’s just the beginning. Both Druid and Imply have big plans for 2023 and we hope you’ll be there with us. Have a safe and happy New Year.