How to conduct your own Proof of Concept using Apache Druid

Preparing for a proof of concept

Introduction

The Apache® Druid database has been used around the world by application developers to deliver transformative analytics applications for leading organizations. Why Druid? Developers and architects choose Apache® Druid because of its unique capabilities to deliver:

Sub-second analytics at any data scale
Real-time queries on both streaming and batch data
High concurrency at the lowest cost
Scale up and down with ease

Built from the outset using a cloud-era design, this database for modern analytics applications continues to help organizations deliver resilient, scalable, and performant decision support solutions.

This guide for teams and their project managers will assist you in planning your first real test of Druid. In this guide, you will find a suggested POC approach and links with information to documentation and articles from Druid experts that will be useful along the way.

To illustrate this POC process, we’ll consider an example consisting of transactions from 5,000 retail stores over a two-year period. This approach attempts to minimize the scale at which tests are executed in order to minimize cost.

Join us in the Druid community, where you’ll find videos, articles and active Q&A forums to help in your journey.

Recommended Overall POC Process

Here’s our recommended approach:

Define your requirements and success criteria
Execute the POC
- Define the scope of the data, beginning with a small dataset, and then iterating while scaling up
- In each iteration, estimate cluster size, deploy, test and tune for efficiency
Complete your POC
- Achieve success criteria
- Extrapolate results to production level deployment plan

With all of that in mind, let’s dig in!

Guidelines for PoC Criteria

As an application developer, you care about how to use Druid to deliver the analytic experience for your users with the data, performance and reliability they need. We recommend that you start by identifying production-level requirements in three categories: Data Ingestion, Analytic Queries and System Behavior.

Data Ingestion

On a table-by-table basis, think about requirements for loading data in both batch and real-time, as well as retention needs:

On-going batch ingestion of <batch data rate> completed within <batch window>
- <batch data rate> the size of the data to load per batch
- <batch window> the maximum duration of the ingestion job
Backfill ingestion of <total backfill size> within <initial load period> covering <time horizon>
- <total backfill size> the size of the initial load data
- <initial load period> the maximum amount of time allocated for initial load
- <time horizon> the depth of the history being loaded
Real-time ingestion of <message rate> messages with an average size of <avg message size>
- <message rate> number of messages published per unit of time
- <avg message size> the average size of an individual message

Examples might be:

Ongoing batch ingestion of 2TB a day completed within 4 hours
Ongoing batch ingestion of 10GB an hour completed within 5 min
Backfill ingestion of 2PB within 2 weeks covering 2 years of events
Real-time ingestion of 100,000 msgs/minute with an average size of 2KB

Ad Hoc Analytics

Query workload requirements relate to query patterns and to performance targets. Query patterns describe the characteristics of a set of queries that are needed to provide application functionality.

Pattern definition: <query type> on <dataset> grouping by <grouping dimensions> over <data timeframe> and <expected filters>
- <query type> kinds of analytic queries: aggregation, topN, scan, search
- <dataset> target table
- <grouping dimensions> list of possible aggregation columns
- <data timeframe> the interval of time that the query should cover
- <expected filters> common WHERE clause columns
Performance target: <target query rate> with <target SLO>
- <target query rate> the number of queries during a time period
- <target SLO> the percentage of queries that complete below a given response time

Here are some examples:

Example 1:
- Pattern: Aggregate metrics (min/max/avg $ and units) on POS Data grouping by Vendor, Product Category, Subcategory, and Customer Type over the last 8 hours and filtering by store
- Performance: 50 queries per second (QPS) with 99% running under 1s
Example 2:
- Pattern: Find top 10 products by total sales on POS Data grouping by Customer Type over the last 30 days and filtering by store
- Performance: 1 queries per hour (QPS) with 99% running under 5s

System Behavior

In addition to driving data ingestion and analytic queries, Apache® Druid uses a design that helps solutions to be highly available and resilient to failures. Some teams may wish to test these characteristics during the POC, here are some examples of behavioral requirements:

High Availability
- Ingestion and query processing should automatically recover in case of any single component failures
Resilience
- Real-time ingestion survives task failures and continues to provide correct results
Integration
- Use JDBC to connect Tableau for visualization

Mapping Production Requirements to your POC

The POC should prove:

Apache® Druid meets your analytic query needs
Apache® Druid scales linearly with your data ingestion and analytic queries
You meet your success criteria

You can prove these points using a smaller version of your production requirements:

Start with a small version
Iterate to extrapolate your production needs
Stop iterating once you’ve confirmed scalability with your overall workload

To scale down your data, consider the following:

For your first iteration, and to keep cost low, use a small sample of your dataset (<1GB) that will fit in a single server deployment (even on a laptop)
For subsequent iterations, choose a subset using a specific dimension
- Ideally a dimension commonly used for filtering
- As an example, if your dataset covers 5000 stores, choose 1% or 50 stores as a starting point, and something larger for the next iteration, perhaps 10%
Make sure the time horizon supports the testing patterns
- As an example, if you’re doing year over year comparisons, you’ll need two years of data

When iterating, consider the following:

Tune your cluster to achieve your ingestion and analytic performance goals
- As an example, you may want to introduce rollup, additional hardware, or tune parameters
If you’re not seeing linear scalability, some additional tuning may be required
- As an example, you may need to employ approximations to achieve performance goals

Consider your POC complete when:

You have two or three iterations which scale at a predictable rate
You can confidently extrapolate to production level resources

Executing the POC

Your first iteration

The goal of the first iteration is testing functionality. Functional testing doesn’t necessarily require large amounts of data, so we recommend using one of the single server deployments with a small sample dataset (<1GB). The job here is to translate your query patterns into SQL and see that they run in Apache® Druid. This is what you’ll need to achieve this:

Single server deployment
Learn to load and query data
Load your own sample dataset
- Supported input formats
- Batch ingestion (pre-Druid 24.0)
- SQL-based ingestion (Druid 24.0 and above)
SQL Documentation

The SQL statements generated here can and should be used in subsequent iterations to drive concurrency testing.

Second iteration

Here’s where things get more real, testing with real timeframes and required ingestion methods, as well as higher concurrency in analytic queries.

Estimate required cluster size
Deploy the cluster (or deploy on K8s)
Load a significant portion of any historic data using batch ingestion (or SQL-based ingestion)
- 1%-5% is a good starting point
- Make sure it covers the full timeframe of your production needs
- With the retail example, we would load the data for two years for 50 stores
Initiate ongoing incremental loads
- Use 1%-5% of expected production incremental data volume
  - With the retail example, use the new transactions for the same 50 stores that were loaded in the historic data
- In scheduled batches
  - Create incremental batch files
  - Use cron, airflow, or some similar tool to schedule the ingestion API calls
  - Using the API to run the ingestion
- Or streaming ingestion (if applicable)
  - A good resource to estimate compute resource needs is this benchmark
Test and tune the queries produced in the initial iteration.
Drive concurrency using JMeter, or some similar tool, using the SQL API
Search for other tuning opportunities
- Higher concurrency may drive the need for further tuning of queries or increases in the scale of the deployment to achieve the performance goals
- It’s important to Configure the Connection Pool
- Metrics and monitoring may help you further tune your deployment
Before considering this iteration complete:
- Tune both ingestion and queries such that they meet success criteria
- Try to minimize the resources required to achieve performance goals

Subsequent iterations

Subsequent iterations seek to achieve performance goals at increasingly larger scales. Each iteration will entail:

Increasing ingested data volume
- I.e., number of rows in the tables
Increasing the throughput of the real-time data pipeline
- Assuming that you’re streaming data
Increasing concurrency of the analytic queries

At the start of each iteration, you’ll need to:

Estimate required size at new data volume
Resize your cluster (you can also resize on K8s)
Load more historic data using batch ingestion (or SQL-based ingestion)
- 2X or 10X would likely be good increments
- Make sure it covers the full timeframe of production needs
- With the retail example, we might load the data for two years for 500 stores
Scale ongoing incremental loads
- Use 2X or 10X of the previous iteration’s data volume
  - With the retail example, use the new transactions for the same 500 stores that were loaded in the historic data
Drive concurrency using JMeter, or some similar tool, using the SQL API
Search for other tuning opportunities
- Higher concurrency may drive the need for further tuning of queries or increases in the scale of the deployment to achieve the performance goals
- Remember to Configure the Connection Pool if changes are needed for the new scale
- Metrics and monitoring may help further tune the deployment

High Availability considerations

Apache^® Druid is configurable for high availability. Once configured for high availability, automatic recovery will occur for both ingestion and queries:

Batch ingestion
- Automatically retries failed tasks
Real-time ingestion
- Exactly once streaming ingestion
- Replicas in ingestion spec for automatic task failover and increased query parallelism
Queries
- Architecting Druid for failure

To test for failures, shut down individual components while the full workload is executing, and observe the recovery process.

Completing The POC

Have you achieved the success criteria? Is it possible to extrapolate the results to a production level deployment plan? Success!

Given that you have achieved your performance requirements in each iteration at different scales, It is now possible to extrapolate the results to a production level deployment plan. System behavior requirements are now achieved by configuring and testing high availability of the solution. Time to use Apache^® Druid to drive analytic applications at any scale!

Appendix

Community Resources

The Apache® Druid community is welcoming and active. Whether it’s coming along to a meetup, sitting in on a virtual Druid Drop-In, or asking a question in Druid Forum or the Apache® Druid Slack Workspace, we are all here to help one another. Here’s a sample of content that Druid’s community members have published around POCs.

Watch the “POC Fireside Chat with Rachel Pedreschi and Jon King” by Charles Smith

If you would like to hear more about any aspect of this process, book time with Imply’s Developer Relations team. We’ll gladly share our understanding and experience with you (without prejudice!), offering our thoughts on how you might want to approach evaluating Druid through a POC.

Schedule a Druid POC Clinic with a Developer Advocate

Training

Imply offers an Imply Faculty with a catalog of courses and labs for Apache® Druid. To accelerate your POC implementation and get the most out of Druid, we recommend that you take advantage of these course offerings:

What is Imply

Imply was founded by the creators of Druid to help developers, analysts, architects, and others in the Druid community get more from the database. Many of Imply’s employees are active contributors to open source projects, including Apache® Druid. Imply has sponsored this document because we are passionate about helping developers be successful with Apache® Druid.

One easy way to get started with Druid is Imply Polaris, a fully-managed Druid-as-a-Service. A free trial gets you a ready-to-use Druid environment in under 5 minutes.

APACHE DRUID

IMPLY PRODUCTS

INTEGRATIONS

By Functional Use

By Application

FEATURED

DRUID CASE STUDIES

Apache Druid

Content

Support