Supercharge Hadoop with real-time analytics

A Hadoop-based data lake can do a lot of things; real-time analytics is not one of them. Connecting Imply and Apache Druid to Hadoop lets you visually explore and interact with your big and fast operational data to find, fix and prevent problems.

The Hadoop mascot is an elephant, not a jackrabbit

Hadoop was designed to be a powerful general compute framework to process enormous amounts of data, but it was not built for speed.

  • Latency is created from initiating MapReduce jobs.
  • ELT adds delay as raw data is prepared and normalized for analysis.
  • Real-time exploratory analytics demand high-speed ingestion and sub-second query response.

Druid provides instant analytics for everyone

Druid is a real-time analytics database designed expressly for fast slice-and-dice OLAP queries on large data sets for technical, operational and business users. Druid powers use cases requiring real-time ingest, sub-second query performance, and high uptime.

  • Ingests millions of events per second via its native HDFS connector to ensure data freshness.
  • Delivers sub-second query response without pre-computation, for interruption-free, exploratory analysis.
  • Used at scale at data-driven companies like Lyft, Netflix and Airbnb.
  • Used to analyze operational data from clickstreams, system logs, netflows and IoT device streams.

Choose your architecture

There are three architectures for combining Druid with Hadoop depending on your data sources and infrastructure.

Real-time data lake

Raw data is first loaded into HDFS and cleaned or transformed (ELT) using MapReduce. This data is then loaded into Druid for queries. Druid loads data by converting it, or “indexing” it, into Druid segments. Druid has a built-in Hadoop connector that uses MapReduce to create these segments.

Data river

Druid is often used as part of an end-to-end streaming analytics stack. In the pure streaming (data river) architecture the components include Kafka to deliver raw data to downstream ETL and query systems, an optional stream processor to process/clean data (optional), a query system (Druid) to answer queries on data and HDFS as deep storage for Druid.

Lambda architecture

Druid is capable of supporting both batch ingests from HDFS and streaming data from Kafka for the same data source. Raw data is sent to Kafka, where it can write it to both Druid and HDFS. Of course, raw data can also be written directly to HDFS.

Imply completes Druid for the enterprise

The Imply solution is the industry’s most complete real-time analytics offering, developed by the authors of Druid. Imply surrounds Druid with drag-and-drop visualization, cluster and query monitoring management and enterprise-grade security. Visit the Imply product page to learn more.

Imply is available as both on-prem and as a managed cloud service deployed to your AWS VPC (you control your data).

Loading data from Hadoop

In this tutorial, you'll load files into Druid using Hadoop in local standalone mode and how to automatically parallelize ingestion using a remote Hadoop cluster.

Learn more

Hadoop indexing for Google Cloud Dataproc

This tutorial walks you through how to Druid to use Dataproc (GCP’s managed Hadoop offering) for Hadoop Indexing.

Learn more

How can we help?