Loading data

Choosing an ingestion method

Imply supports all of Druid's real-time and batch ingestion methods. The most popular configurations are:

  • From Files — Batch ingestion from HDFS, S3, local files, or any filesystem supported by Hadoop. We recommend this method if your dataset is already in flat files.

  • From Kafka — Streaming ingestion from Kafka can be done with either Tranquility or the Kafka indexing service. See our Loading from Kafka page for suggestions on what to choose.

  • From other streams — You can push any data stream into Druid in real-time using Tranquility, a client library for sending streams to Druid. We recommend this method if your dataset originates in a streaming system like Kafka, Storm, Spark Streaming, or your own system. This method only works on "real-time" data, and cannot be used to ingest historical data.

Getting started

The easiest way to get started with loading your own data is to follow one of the four included tutorials.

Hybrid batch/streaming

You can combine batch (file-based) and streaming methods in a hybrid batch/streaming architecture, sometimes called a "lambda architecture". In a hybrid architecture, you use a streaming method to do initial ingestion, and then periodically re-ingest older data in batch mode (typically every few hours, or nightly).
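
As a rough illustration of the periodic batch step, the sketch below computes the time range a nightly job would re-ingest and wraps it in a small task description. The function names, the datasource name `events`, and the layout of the dictionary are assumptions made for this example. It is not a complete Druid ingestion spec, only a way to show that each batch run targets one explicit slice of time.

```python
import json
from datetime import date, timedelta


def previous_day_interval(today):
    """Return an ISO 8601 interval string covering the full day before `today`."""
    start = today - timedelta(days=1)
    return f"{start.isoformat()}/{today.isoformat()}"


def nightly_reingest_request(datasource, today):
    """Build a simplified, hypothetical description of a nightly batch re-ingest.

    A real batch ingestion spec needs more than this (input location,
    parser, dimensions, metrics); only the time range is modeled here.
    """
    return {
        "dataSource": datasource,
        "intervals": [previous_day_interval(today)],
    }


if __name__ == "__main__":
    # "events" is a placeholder datasource name.
    request = nightly_reingest_request("events", date(2016, 6, 2))
    print(json.dumps(request, indent=2))
```

A job that runs every few hours instead of nightly would follow the same pattern with a shorter interval.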

Hybrid architectures are simple with Druid, since batch-loaded data for a particular time range automatically replaces streaming-loaded data for that same time range. All Druid queries seamlessly access historical data together with real-time data. We recommend this kind of architecture if you need real-time analytics but also need the ability to reprocess historical data. Common reasons for reprocessing historical data include:

  • Most streaming ingestion methods currently supported by Druid introduce the possibility of dropped or duplicated messages in certain failure scenarios. Batch re-ingestion eliminates this potential source of error for historical data.

  • You can re-ingest your data in batch mode when necessary, for example because you missed some data the first time around or because you need to revise your data. Because Druid's batch ingestion operates on specific slices of time, you can run a historical batch load and a real-time streaming load simultaneously.
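
The replacement behavior described in this section can be pictured with a small model. The sketch below is a deliberate simplification and not Druid's implementation: it assumes each time slice is identified by an exact interval string and that data loaded later carries a higher integer version. The `Segment` class and `visible_segments` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    interval: str  # time slice, e.g. "2016-06-01/2016-06-02"
    version: int   # assumption: a higher number means the data was loaded later
    source: str    # "streaming" or "batch"


def visible_segments(segments):
    """For each interval, keep only the segment with the highest version."""
    latest = {}
    for seg in segments:
        current = latest.get(seg.interval)
        if current is None or seg.version > current.version:
            latest[seg.interval] = seg
    return latest


if __name__ == "__main__":
    loaded = [
        Segment("2016-06-01/2016-06-02", version=1, source="streaming"),
        Segment("2016-06-02/2016-06-03", version=1, source="streaming"),
        # Nightly batch re-ingest of the older day.
        Segment("2016-06-01/2016-06-02", version=2, source="batch"),
    ]
    for interval, seg in sorted(visible_segments(loaded).items()):
        print(interval, "->", seg.source)
```

In this model the batch load for the older day takes over that slice, while the newer day is still served from streaming data. This is why the two kinds of load can run side by side without interfering with each other.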

Note that with the experimental Kafka indexing service, it is possible to reprocess historical data in a pure streaming architecture by migrating to a new stream-based datasource whenever you want to reprocess historical data. This is sometimes called a "kappa architecture".

Realtime nodes

Imply supports using Realtime nodes to load data, but we generally do not recommend this. Realtime nodes are a legacy streaming ingestion mechanism that does not offer an easy way to achieve redundancy, durability, and high availability. They can also be difficult to manage at scale. We believe that in most cases, Tranquility or the Kafka indexing service is a more suitable choice.

Imply does not include built-in configurations for Realtime nodes.
