Clustering

Imply is designed to be deployed as a horizontally scalable, fault-tolerant cluster.

In this document, we'll set up a simple cluster and discuss how it can be further configured to meet your needs. This simple cluster will feature scalable, fault-tolerant Query and Data servers, and a single Master server. In production, we recommend deploying the Master servers in a fault-tolerant configuration as well.

Select hardware

It's possible to run the entire Imply stack on a single machine with only a few GB of RAM, as you've done in the quickstart. For a clustered deployment, typically larger machines are used in order to handle more data. In this section we'll show you recommended hardware for a moderately sized cluster.

For this simple cluster, you will need one Master server, one or two Query servers (depending on whether or not you want redundancy), and as many Data servers as necessary to index and store your data.

Master servers coordinate data ingestion and storage in your Druid cluster. They are not involved in queries. They are responsible for starting new ingestion jobs and for handling failover of the Druid Historical Node and Druid MiddleManager processes running on your Data servers. The equivalent of an AWS m3.xlarge is sufficient for most clusters. This hardware offers:

  • 4 vCPUs
  • 15 GB RAM
  • 80 GB SSD storage

Data servers store and ingest data. Data servers run Druid Historical Nodes for storage and processing of large amounts of immutable data, Druid MiddleManagers for ingestion and processing of data, and optional Tranquility components to assist in streaming data ingestion. These servers benefit greatly from CPU, RAM, and SSDs. The equivalent of an AWS r3.2xlarge is a good starting point. This hardware offers:

  • 8 vCPUs
  • 61 GB RAM
  • 160 GB SSD storage

For clusters with complex resource allocation needs, you can break apart the pre-packaged Data server and scale the components individually. This allows you to scale Druid Historical Nodes independently of Druid MiddleManagers, as well as eliminate the possibility of resource contention between historical workloads and real-time workloads.

Query servers are the endpoints that users and client applications interact with. Query servers run a Druid Broker that routes queries to the appropriate Data servers. They include Pivot as a way to directly explore and visualize your data, Plywood as a data visualization API for applications, Druid's native JSON-over-HTTP query support, and PlyQL, a SQL-like interface for Druid. These servers benefit greatly from CPU and RAM, and can also be deployed on the equivalent of an AWS r3.2xlarge. This hardware offers:

  • 8 vCPUs
  • 61 GB RAM
  • 160 GB SSD storage

Select OS

We recommend running your favorite Linux distribution. You will also need:

  • Java 7 or better
  • Node.js 4.x or better

Your OS package manager should be able to help with both Java and Node.js. If your Ubuntu-based OS does not have a recent enough version of Java, WebUpd8 offers packages for it. If your Debian, Ubuntu, or Enterprise Linux OS does not have a recent enough version of Node.js, NodeSource offers packages for those OSes.
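
You can verify that the installed versions meet these requirements from a shell:

java -version
node --version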

Download the distribution

First, download Imply 2.0.0 from imply.io/download and unpack the release archive. It's best to do this on a single machine at first, since you will be editing the configurations and then copying the modified distribution out to all of your servers.

tar -xzf imply-2.0.0.tar.gz
cd imply-2.0.0

In this package, you'll find:

  • bin/ - run scripts for included software.
  • conf/ - template configurations for a clustered setup.
  • conf-quickstart/ - configurations for the single-machine quickstart.
  • dist/ - all included software.
  • quickstart/ - files related to the single-machine quickstart.

We'll be editing the files in conf/ in order to get things running.

Configure Master server address

In this simple cluster, you will deploy a single Master server running a Druid Coordinator, a Druid Overlord, a ZooKeeper server, and an embedded Derby metadata store.

In conf/druid/_common/common.runtime.properties, update these properties by replacing "master.example.com" with the IP address of the machine that you will use as your Master server:

  • druid.zk.service.host
  • druid.metadata.storage.connector.connectURI
  • druid.metadata.storage.connector.host
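
For example, if the IP address of your Master server were 10.0.0.1 (a hypothetical address; substitute your own), the edited lines would look like this, with the exact connectURI path depending on the template in your distribution:

druid.zk.service.host=10.0.0.1
druid.metadata.storage.connector.connectURI=jdbc:derby://10.0.0.1:1527/var/druid/metadata.db;create=true
druid.metadata.storage.connector.host=10.0.0.1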

In conf/tranquility/server.json and conf/tranquility/kafka.json, if using those components, also replace "master.example.com" in these properties with the IP address of your Master server:

  • zookeeper.connect
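
If you prefer, a single command run from the distribution directory can make all of these substitutions at once. This assumes the same hypothetical address 10.0.0.1 and requires GNU sed:

sed -i 's/master.example.com/10.0.0.1/g' \
  conf/druid/_common/common.runtime.properties \
  conf/tranquility/server.json \
  conf/tranquility/kafka.json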

In production, we recommend deploying redundant Master servers, each containing a Druid Coordinator and a Druid Overlord. We also recommend running a ZooKeeper cluster on its own dedicated hardware, as well as a replicated metadata store such as MySQL or PostgreSQL on its own dedicated hardware. In this case, your common configurations should point to your ZooKeeper cluster and your MySQL or PostgreSQL metadata store rather than to your Master server.

Configure deep storage

Druid relies on a distributed filesystem or large object (blob) store for data storage. The most commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if you already have a Hadoop deployment).

S3

In conf/druid/_common/common.runtime.properties,

  • Set druid.extensions.loadList=["druid-s3-extensions"].

  • Comment out the configurations for local storage under "Deep Storage" and "Indexing service logs".

  • Uncomment and configure appropriate values in the "For S3" sections of "Deep Storage" and "Indexing service logs".

After this, you should have made the following changes:

druid.extensions.loadList=["druid-s3-extensions"]

#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments

druid.storage.type=s3
druid.storage.bucket=your-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=...
druid.s3.secretKey=...

#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs

druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=your-bucket
druid.indexer.logs.s3Prefix=druid/indexing-logs

HDFS

In conf/druid/_common/common.runtime.properties,

  • Set druid.extensions.loadList=["druid-hdfs-storage"].

  • Comment out the configurations for local storage under "Deep Storage" and "Indexing service logs".

  • Uncomment and configure appropriate values in the "For HDFS" sections of "Deep Storage" and "Indexing service logs".

After this, you should have made the following changes:

druid.extensions.loadList=["druid-hdfs-storage"]

#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments

druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode.example.com:9000/druid/segments

#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs

druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=hdfs://namenode.example.com:9000/druid/indexing-logs

Also,

  • Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into conf/druid/_common/.
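
For example, assuming your Hadoop configuration files live in /etc/hadoop/conf (a common but not universal location):

cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml \
   /etc/hadoop/conf/yarn-site.xml /etc/hadoop/conf/mapred-site.xml \
   conf/druid/_common/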

Configure Tranquility Server (optional)

Imply supports sending data streams to Druid through a simple HTTP API powered by Tranquility Server. If you will be using this functionality, then at this point you should configure Tranquility Server.

Configure Tranquility Kafka (optional)

Imply supports consuming streams from Kafka through Tranquility Kafka. If you will be using this functionality, then at this point you should configure Tranquility Kafka.

Configure for connecting to Hadoop (optional)

Imply supports loading batch data from HDFS, S3, and other filesystems via a Hadoop cluster. If you will be using this functionality, then at this point you should configure for connecting to your Hadoop cluster.

Tune Data server configuration

The Druid Historical Node and Druid MiddleManager that run on the Data server benefit greatly from being tuned to the hardware they run on. If you are using r3.2xlarge EC2 instances, or similar hardware, the configuration in the distribution is a reasonable starting point.

If you are using different hardware, we recommend adjusting configurations for your specific hardware. The most commonly adjusted configurations are:

  • -Xmx and -Xms
  • druid.server.http.numThreads
  • druid.processing.buffer.sizeBytes
  • druid.processing.numThreads
  • druid.query.groupBy.maxIntermediateRows
  • druid.query.groupBy.maxResults
  • druid.server.maxSize and druid.segmentCache.locations on Historical Nodes
  • druid.worker.capacity on MiddleManagers
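
As an illustration only, a Data server with 16 vCPUs and 122 GB RAM (for example, an AWS r3.4xlarge) might use Historical and MiddleManager settings along these lines. The specific numbers here are hypothetical starting points to validate against your own workload, not definitive values.

In conf/druid/historical/jvm.config:

-Xms12g
-Xmx12g
-XX:MaxDirectMemorySize=20g

In conf/druid/historical/runtime.properties:

druid.server.http.numThreads=40
druid.processing.buffer.sizeBytes=1073741824
druid.processing.numThreads=15
druid.server.maxSize=300000000000
druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":300000000000}]

In conf/druid/middleManager/runtime.properties:

druid.worker.capacity=8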

Please see the Druid configuration documentation for a full description of all possible configuration options.

Tune Query server configuration

The Druid Broker that runs on the Query server benefits greatly from being tuned to the hardware it runs on. If you are using r3.2xlarge EC2 instances, or similar hardware, the configuration in the distribution is a reasonable starting point.

If you are using different hardware, we recommend adjusting configurations for your specific hardware. The most commonly adjusted configurations are:

  • -Xmx and -Xms
  • druid.server.http.numThreads
  • druid.cache.sizeInBytes
  • druid.processing.buffer.sizeBytes
  • druid.processing.numThreads
  • druid.query.groupBy.maxIntermediateRows
  • druid.query.groupBy.maxResults
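
Again as an illustration only, a Query server of the same class (16 vCPUs, 122 GB RAM) might use Broker settings along these lines, with the specific numbers being hypothetical starting points rather than recommendations:

In conf/druid/broker/jvm.config:

-Xms12g
-Xmx12g
-XX:MaxDirectMemorySize=5g

In conf/druid/broker/runtime.properties:

druid.server.http.numThreads=40
druid.cache.sizeInBytes=2000000000
druid.processing.buffer.sizeBytes=536870912
druid.processing.numThreads=7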

Please see the Druid configuration documentation for a full description of all possible configuration options.

Open ports (if using a firewall)

If you're using a firewall or some other system that only allows traffic on specific ports, allow inbound connections on the following:

Master Server

  • 1527 (Derby; not needed if you are using a separate metadata store like MySQL or PostgreSQL)
  • 2181 (ZooKeeper; not needed if you are using a separate ZooKeeper cluster)
  • 8081 (Druid Coordinator)
  • 8090 (Druid Overlord)

In production, we recommend deploying ZooKeeper and your metadata store on their own dedicated hardware, rather than on the Master Server.

Query Server

  • 8082 (Druid Broker)
  • 9095 (Pivot)

Data Server

  • 8083 (Druid Historical)
  • 8091 (Druid Middle Manager)
  • 8100–8199 (Druid Task JVMs, spawned by Middle Managers)
  • 8200 (Tranquility Server)
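
As one example of what this might look like with iptables (run as root on a Query server; adapt to whatever firewall tooling you actually use):

iptables -A INPUT -p tcp --dport 8082 -j ACCEPT
iptables -A INPUT -p tcp --dport 9095 -j ACCEPT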

Start Master server

Copy the Imply distribution, and your edited configurations, to your new Master server. If you have been editing the configurations on your local machine, you can use rsync to copy them:

rsync -az imply-2.0.0/ MASTER_SERVER:imply-2.0.0/

On your Master server, cd into the distribution and run this command to start a Master:

bin/supervise -c conf/supervise/master-with-zk.conf

You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory from another terminal.
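
For example, assuming the supervise script places each service's output at var/sv/<service>/current and that one of your Master services is named coordinator, you could follow its log with:

tail -f var/sv/coordinator/current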

Start Data servers

Copy the Imply distribution, and your edited configurations, to your Data servers. On each one, cd into the distribution and run this command to start a Data server:

bin/supervise -c conf/supervise/data.conf

The default Data server configuration launches a Druid Historical Node, a Druid MiddleManager, and optionally Tranquility components. New Data servers will automatically join the existing cluster. These services can be scaled out as much as necessary, simply by starting more Data servers.
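
One way to verify that a new Data server has joined the cluster is to ask the Coordinator which servers it can see, using Druid's Coordinator API (replace MASTER_SERVER with your Master server's address):

curl http://MASTER_SERVER:8081/druid/coordinator/v1/servers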

As noted in the hardware section above, clusters with complex resource allocation needs can break apart the pre-packaged Data server and scale Druid Historical Nodes independently of Druid MiddleManagers, eliminating the possibility of resource contention between historical and real-time workloads.

Start Query servers

Copy the Imply distribution, and your edited configurations, to your Query servers. On each one, cd into the distribution and run this command to start a Query server:

bin/supervise -c conf/supervise/query.conf

The default Query server configuration launches a Druid Broker and a Pivot service. New Query servers will automatically join the existing cluster. These services can be scaled out as much as necessary, simply by starting more Query servers.
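
Similarly, you can confirm that a Query server is responding by listing the datasources its Broker knows about (replace QUERY_SERVER with the server's address; the list will be empty until you load data):

curl http://QUERY_SERVER:8082/druid/v2/datasources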

Loading data

Congratulations, you now have an Imply cluster! The next step is to load your data.
