Loading files

Getting started

We recommend performing Druid batch loads with the Hadoop-based indexing task.

To load batch data, you'll submit an indexing task that describes your input files and the datasource to create.

If you've never loaded data files into Druid before, we recommend trying out the quickstart first and then coming back to this page.

Connecting to a Hadoop cluster

Connecting Druid to a Hadoop cluster allows Druid to load data from files on HDFS, S3, or other filesystems via Map/Reduce jobs. Those Map/Reduce jobs scan through your raw data and produce optimized Druid data segments in your configured deep storage. The data is then loaded by Druid Historical Nodes. Once loading is complete, Hadoop is not involved in the query side of Druid in any way.

The main advantage of connecting to a Hadoop cluster is that it automatically parallelizes the batch data loading process.

To connect Druid to a Hadoop cluster:

  • Update druid.indexer.task.hadoopWorkingPath in conf/druid/middleManager/runtime.properties to a path on HDFS that you'd like to use for temporary files required during the indexing process. druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing is a common choice.

  • Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into conf/druid/_common/core-site.xml, conf/druid/_common/hdfs-site.xml, and so on.

  • Ensure that you have configured a distributed deep storage. Note that while you do need a distributed deep storage in order to load data with Hadoop, it doesn't need to be HDFS. For example, if your cluster is running on Amazon Web Services, we recommend using S3 for deep storage even if you are loading data using Hadoop or Elastic MapReduce.
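The configuration touched by the steps above can be sketched as follows. This is an illustrative fragment assuming S3 deep storage; the bucket name and paths are placeholders you would replace with your own values.

```properties
# conf/druid/_common/common.runtime.properties (illustrative values)
druid.extensions.loadList=["druid-s3-extensions"]
druid.storage.type=s3
druid.storage.bucket=your-druid-bucket
druid.storage.baseKey=druid/segments

# conf/druid/middleManager/runtime.properties
druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing
```

For HDFS deep storage you would instead load the `druid-hdfs-storage` extension and set `druid.storage.type=hdfs` with a storage directory.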

Elastic MapReduce setup

If your cluster is running on Amazon Web Services, you can use Elastic MapReduce (EMR) to index data from S3 or HDFS. To do this:

  • Create a persistent, long-running cluster.
  • When creating your cluster, enter the following configuration. If you're using the wizard, this should be in advanced mode under "Edit software settings".

    classification=yarn-site,properties=[mapreduce.reduce.memory.mb=6144,mapreduce.reduce.java.opts=-server -Xms2g -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,mapreduce.map.memory.mb=758,mapreduce.map.java.opts=-server -Xms512m -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps,mapreduce.task.timeout=1800000]
  • If you're loading data from S3, then in the jobProperties field in the tuningConfig section of your Hadoop indexing task, add:

    "jobProperties" : {
     "fs.s3.awsAccessKeyId" : "YOUR_ACCESS_KEY",
     "fs.s3.awsSecretAccessKey" : "YOUR_SECRET_KEY",
     "fs.s3.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
     "fs.s3n.awsAccessKeyId" : "YOUR_ACCESS_KEY",
     "fs.s3n.awsSecretAccessKey" : "YOUR_SECRET_KEY",
     "fs.s3n.impl" : "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
     "io.compression.codecs" : "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"

    This method uses Hadoop's built-in S3 filesystem rather than Amazon's EMRFS, and is not compatible with Amazon-specific features such as S3 encryption and consistent views. If you need to use those features, you will need to make the Amazon EMR Hadoop JARs available to Druid through one of the mechanisms described in the Using other Hadoop distributions section.
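As a minimal sketch, the S3 credential properties above can be merged into a task spec programmatically before submission. The spec layout (`spec` → `tuningConfig` → `jobProperties`) follows the Hadoop indexing task format; the helper function name and the placeholder credentials are our own, not part of Druid.

```python
# Hypothetical helper: inject the S3 credential jobProperties shown above
# into a Hadoop indexing task spec before submitting it to the Overlord.

def add_s3_job_properties(task, access_key, secret_key):
    """Merge S3 filesystem credentials into spec.tuningConfig.jobProperties."""
    job_props = (task.setdefault("spec", {})
                     .setdefault("tuningConfig", {})
                     .setdefault("jobProperties", {}))
    # Hadoop's built-in S3 filesystems are addressable as fs.s3 and fs.s3n.
    for scheme in ("fs.s3", "fs.s3n"):
        job_props[scheme + ".awsAccessKeyId"] = access_key
        job_props[scheme + ".awsSecretAccessKey"] = secret_key
        job_props[scheme + ".impl"] = "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
    return task

task = {"type": "index_hadoop", "spec": {}}
task = add_s3_job_properties(task, "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")
print(task["spec"]["tuningConfig"]["jobProperties"]["fs.s3n.impl"])
# → org.apache.hadoop.fs.s3native.NativeS3FileSystem
```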

Using other Hadoop distributions

Druid works out of the box with many Hadoop distributions. If you are having dependency conflicts between Imply and your version of Hadoop, you can try reading the Druid Different Hadoop Versions documentation, searching for a solution in the Druid user groups, or contacting us for help.

Loading additional data

When you load additional data into Druid using follow-on indexing tasks, the behavior depends on the intervals of the follow-on tasks. Batch loads in Druid act in a replace-by-interval manner, so if you submit two tasks for the same interval, only data from the later task will be visible. If you submit two tasks for different intervals, both sets of data will be visible.

This behavior makes it easy to reload data that you have corrected or amended in some way: just resubmit an indexing task for the same interval, but pointing at the new data. The replacement occurs atomically.
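The replace-by-interval behavior can be modeled with a few lines of Python. This is an illustrative simulation, not Druid code: the datasource is a dict keyed by interval, and a later batch task for the same interval replaces the earlier one wholesale.

```python
# Illustrative model of Druid's replace-by-interval batch semantics
# (a simulation, not actual Druid code).

def submit_batch(datasource, interval, segments):
    """A batch load atomically replaces all data for its interval."""
    datasource[interval] = segments
    return datasource

ds = {}
submit_batch(ds, "2018-01-01/2018-01-02", ["seg-v1"])
submit_batch(ds, "2018-01-02/2018-01-03", ["seg-a"])   # different interval: both visible
submit_batch(ds, "2018-01-01/2018-01-02", ["seg-v2"])  # same interval: replaces seg-v1
print(ds)
# → {'2018-01-01/2018-01-02': ['seg-v2'], '2018-01-02/2018-01-03': ['seg-a']}
```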

Loading without Hadoop

If you do not configure a connection to a Hadoop cluster, your indexing jobs will run in local mode. This means that by default, they will read files from your Data servers' local disks. Each indexing task you submit will run single-threaded. To parallelize the data loading process, you can partition your data by time (e.g. hour, day, or some other time bucketing) and then submit an indexing task for each time partition. Indexing tasks for different intervals can run simultaneously.
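The partition-by-time approach can be sketched as follows. This is a hypothetical helper, not part of Druid: it groups input files by the day of their event timestamps, so each day's files can become a separate local-mode indexing task, and tasks for different days can run in parallel.

```python
from collections import defaultdict

# Hypothetical sketch: bucket input files by day so each day can be
# submitted as its own single-threaded indexing task.

def bucket_by_day(files):
    """files: iterable of (path, iso_timestamp) pairs -> {day: [paths]}."""
    buckets = defaultdict(list)
    for path, ts in files:
        buckets[ts[:10]].append(path)  # "YYYY-MM-DD" prefix of the ISO timestamp
    return dict(buckets)

files = [
    ("events-a.json", "2018-01-01T03:00:00Z"),
    ("events-b.json", "2018-01-01T17:30:00Z"),
    ("events-c.json", "2018-01-02T09:15:00Z"),
]
print(bucket_by_day(files))
# → {'2018-01-01': ['events-a.json', 'events-b.json'], '2018-01-02': ['events-c.json']}
```

Each resulting bucket would then be referenced by one indexing task whose interval covers that day.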

For large datasets, connecting to Hadoop is recommended as it allows automatic parallelization of the data loading process.
