Druid is a real-time analytics database designed expressly for fast slice-and-dice OLAP queries on large data sets for technical, operational and business users. Druid powers use cases requiring real-time ingest, sub-second query performance, and high uptime.
Raw data is first loaded into HDFS and cleaned or transformed (ELT) using MapReduce. This data is then loaded into Druid for queries. Druid loads data by converting it, or “indexing” it, into Druid segments. Druid has a built-in Hadoop connector that uses MapReduce to create these segments.
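To make this concrete, a Hadoop batch ingestion is typically kicked off by submitting an `index_hadoop` task to the Druid Overlord, which then launches MapReduce jobs to build segments. The sketch below is a minimal example only; the Overlord host, HDFS path, datasource name, columns, and interval are placeholders, not values from this article.

```python
import json
import requests

# Minimal index_hadoop task spec (classic parser-style schema). The datasource
# name, HDFS path, timestamp column, dimensions, and interval are placeholders.
hadoop_task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "pageviews",
            "parser": {
                "type": "hadoopyString",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "timestamp", "format": "auto"},
                    "dimensionsSpec": {"dimensions": ["url", "user", "country"]},
                },
            },
            "metricsSpec": [{"type": "count", "name": "count"}],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "NONE",
                "intervals": ["2018-01-01/2018-01-02"],
            },
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {
                "type": "static",
                "paths": "hdfs://namenode:8020/data/pageviews/2018-01-01.json",
            },
        },
        "tuningConfig": {"type": "hadoop"},
    },
}

# Submit the task to the Overlord. MapReduce jobs run on the configured Hadoop
# cluster, and the resulting segments are handed off to Druid for querying.
resp = requests.post(
    "http://overlord.example.com:8090/druid/indexer/v1/task",
    data=json.dumps(hadoop_task),
    headers={"Content-Type": "application/json"},
)
print(resp.json())  # e.g. {"task": "index_hadoop_pageviews_..."}
```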
Druid is often used as part of an end-to-end streaming analytics stack. In the pure streaming (data river) architecture, the components include Kafka to deliver raw data to downstream ETL and query systems, an optional stream processor to process and clean the data, Druid as the query system to answer queries on that data, and HDFS as deep storage for Druid.
Druid is capable of supporting both batch ingestion from HDFS and streaming data from Kafka for the same data source. Raw data is sent to Kafka, from which it can be delivered to both Druid and HDFS. Of course, raw data can also be written directly to HDFS.
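For the streaming side of such a setup, a Kafka supervisor is registered with the Overlord and continuously manages the indexing tasks that read from the topic. The sketch below is a minimal example under assumed values: the broker address, topic, datasource name, and columns are placeholders, and in a lambda-style setup this supervisor and the `index_hadoop` task above could feed the same datasource.

```python
import json
import requests

# Minimal Kafka supervisor spec (classic parser-style schema). Topic, brokers,
# datasource, and columns are placeholders; the supervisor continuously reads
# from Kafka and hands finished segments off to deep storage (e.g. HDFS).
kafka_supervisor = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "pageviews",
        "parser": {
            "type": "string",
            "parseSpec": {
                "format": "json",
                "timestampSpec": {"column": "timestamp", "format": "auto"},
                "dimensionsSpec": {"dimensions": ["url", "user", "country"]},
            },
        },
        "metricsSpec": [{"type": "count", "name": "count"}],
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "HOUR",
            "queryGranularity": "NONE",
        },
    },
    "ioConfig": {
        "topic": "pageviews",
        "consumerProperties": {"bootstrap.servers": "kafka.example.com:9092"},
        "taskCount": 1,
        "replicas": 1,
        "taskDuration": "PT1H",
    },
    "tuningConfig": {"type": "kafka"},
}

# Register the supervisor with the Overlord; it spawns and supervises the
# Kafka indexing tasks that ingest into the "pageviews" datasource.
resp = requests.post(
    "http://overlord.example.com:8090/druid/indexer/v1/supervisor",
    data=json.dumps(kafka_supervisor),
    headers={"Content-Type": "application/json"},
)
print(resp.json())  # e.g. {"id": "pageviews"}
```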
The Imply solution is the industry’s most complete real-time analytics offering, developed by the authors of Druid. Imply surrounds Druid with drag-and-drop visualization, cluster and query management and monitoring, and enterprise-grade security. Visit the Imply product page to learn more.
Imply is available both on-prem and as a managed cloud service deployed to your AWS VPC (you control your data).
In this tutorial, you'll learn how to load files into Druid using Hadoop in local standalone mode and how to automatically parallelize ingestion using a remote Hadoop cluster.
This tutorial walks you through configuring Druid to use Dataproc (GCP’s managed Hadoop offering) for Hadoop indexing.