Every day we talk to companies about using Apache Druid (incubating) for real-time analytics, and inevitably each will ask why they can’t get the job done using their data warehouse, whether it’s Teradata, Snowflake, Redshift, BigQuery or something else.
This short post is my attempt to describe how Druid compares with enterprise data warehouses. To be clear up front: Druid is not a data warehouse, and it isn’t designed to replace a data warehouse in every use case.
Druid is a new type of database that’s a great fit if you are powering a user-facing analytics application and low latency is important. Druid is really good at ingesting data extremely fast (millions of events per second) and then also answering ad-hoc analytic queries with low latency, even when there are many concurrent users.
This kind of workload is not what data warehouses are designed for. Most data warehouses are built to answer large, complex SQL queries from professional analysts. These queries may take minutes to hours to complete, and that’s fine because they aren’t driven by a real-time requirement.
In contrast, Druid can complete most queries against very large data sets in under a second. The tradeoff between using Druid and using a data warehouse comes down to what is important for your use case: do you need the full flexibility of a data warehouse to answer every arbitrary query an analyst can devise, or do you need a real-time, responsive end-user experience where users can creatively explore data through iterative ad-hoc queries and get sub-second results? In the latter case a data warehouse just isn’t enough.
Use cases where ad-hoc analysis is important usually sit on the operational side of analytics. They include quickly understanding anomalies and patterns in clickstreams for digital marketing and product interactions, detecting and diagnosing network traffic issues, and others. This type of analysis has a different flavor from the advanced analytics performed by the BI group: the queries are simple and build on each other in an unplanned fashion as users explore the data creatively.
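To make "simple queries that build on each other" concrete, here is a minimal sketch of a drill-down session expressed as Druid SQL strings. The datasource name `clickstream` and the columns `country`, `page` and `device` are hypothetical, and `build_query` is an illustrative helper, not part of any Druid client. The sketch only builds the JSON body that would be POSTed to Druid's SQL endpoint (`/druid/v2/sql`); it does not send anything.

```python
import json


def build_query(datasource, group_by, filters=None):
    """Build a top-10 count query over the last hour (illustrative helper).

    Values are interpolated directly for readability; a real application
    should validate or parameterize user-supplied input instead.
    """
    where = ["__time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR"]
    for column, value in (filters or {}).items():
        where.append(f"{column} = '{value}'")
    return (
        f"SELECT {group_by}, COUNT(*) AS events "
        f"FROM {datasource} "
        f"WHERE {' AND '.join(where)} "
        f"GROUP BY 1 ORDER BY events DESC LIMIT 10"
    )


# Step 1: which countries are producing the most events right now?
q1 = build_query("clickstream", "country")

# Step 2: one country looks odd, so narrow to it and break down by page.
q2 = build_query("clickstream", "page", {"country": "US"})

# Step 3: one page stands out, so narrow again and break down by device.
q3 = build_query("clickstream", "device", {"country": "US", "page": "/checkout"})

# JSON body for the last step, in the shape the SQL endpoint accepts.
payload = json.dumps({"query": q3})
```

Each step reuses the previous step's filters and adds one more, which is the unplanned, exploratory pattern described above: no query is complex on its own, but the user issues many of them in quick succession.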
Although Druid incorporates architectural concepts from data warehouses such as column-oriented storage, it also includes designs from search systems and time series databases, which makes it a great fit for analyzing various types of event-driven data.
In a nutshell, Druid’s architecture offers the following advantages over traditional data warehouses:
- Low latency streaming ingest
- Integration with Apache Kafka, Storm, Spark Streaming, Kinesis and other big data stream processors
- Time-based partitioning, which enables performant time-based queries
- Fast search and filtering, for quick ad-hoc slice and dice
- Minimal schema design and native support for semi-structured and nested data
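As a rough illustration of why time-based partitioning helps time-based queries, the toy sketch below groups events into hourly chunks and skips every chunk that falls outside the queried interval. This is a simplified model for intuition only, not Druid's actual storage format or query engine; `TimePartitionedStore`, `hour_bucket` and the `country` field are names made up for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


class TimePartitionedStore:
    """Toy event store that keeps one list of rows per hourly chunk."""

    def __init__(self):
        self.chunks = defaultdict(list)

    def ingest(self, ts, event):
        self.chunks[hour_bucket(ts)].append((ts, event))

    def count(self, start, end, **filters):
        """Count events in [start, end) matching filters.

        Returns (matching_events, chunks_scanned).
        """
        matched = 0
        scanned = 0
        for bucket, rows in self.chunks.items():
            # Prune: skip chunks that cannot overlap the query interval.
            if bucket + HOUR <= start or bucket >= end:
                continue
            scanned += 1
            for ts, event in rows:
                in_range = start <= ts < end
                if in_range and all(event.get(k) == v for k, v in filters.items()):
                    matched += 1
        return matched, scanned


# One day of synthetic events, one every 10 minutes (24 hourly chunks).
store = TimePartitionedStore()
base = datetime(2024, 1, 1, tzinfo=timezone.utc)
for minute in range(0, 24 * 60, 10):
    country = "US" if minute % 20 == 0 else "DE"
    store.ingest(base + timedelta(minutes=minute), {"country": country})

# Query a two-hour window: only 2 of the 24 chunks are read.
count, scanned = store.count(base + 2 * HOUR, base + 4 * HOUR, country="US")
```

In this toy run the two-hour query touches 2 of the 24 chunks, so the work scales with the size of the queried interval rather than with the total amount of data stored.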
You should consider using Druid to augment your data warehouse if your use case has one or more of the following requirements:
- involves streaming data
- requires low-latency ingest at scale
- expects low-latency query response with high concurrency
- needs ad-hoc analytics
Druid is great for OLAP-style slice and dice and drill downs in these situations.
To summarize, data warehouse technology is better for use cases where the end user is a technical analyst and query flexibility takes precedence over performance. Druid shines when the use case involves real-time data and the end user (technical or not) wants to run numerous simple queries through an application. In those cases, query response time and data freshness take precedence over the ability to write complex queries.