Loading data (other streams)

Imply can connect to any streaming data source through Tranquility, a package for pushing streams to Druid in real-time. If you haven't used Tranquility before, we recommend trying out the HTTP push tutorial first and then coming back to this page.

Note that with all Tranquility-based options, you must ensure that incoming data is recent enough (within a configurable windowPeriod of the current time). Older messages will not be processed in real-time. Historical data is best processed with batch ingestion or with the new, experimental Kafka indexing service.

If your streaming data source is Kafka, Imply additionally supports the Druid Kafka indexing service, which does not require Tranquility.

Server

Imply includes Tranquility Server, which lets you send data to Druid without developing a JVM app. You only need an HTTP client. By default, it is enabled in the single-machine quickstart configuration, but disabled in a clustered setup. If enabled in a clustered setup, it runs on your Data servers.

To enable Tranquility Server on your Data servers:

  • In conf/supervise/data.conf, uncomment the tranquility-server line.
  • In conf/tranquility/server.json, customize the properties and dataSources.
  • If your Data servers were already running, stop them (CTRL-C or bin/service --down) and start them up again.

For tips on customizing server.json, see the Tranquility Server documentation.

Kafka

Imply includes Tranquility Kafka, which lets you load data from Kafka into Druid without writing any code. You only need a configuration file. By default, it is disabled. If enabled, it runs on your Data servers.

In addition to Tranquility Kafka, Imply also includes the ability to load data from Kafka using the Kafka indexing service. See the Kafka indexing service page for a comparison of the two options.

To enable Tranquility Kafka in the single-machine quickstart configuration:

  • In conf/supervise/quickstart.conf, uncomment the tranquility-kafka line.
  • In conf-quickstart/tranquility/kafka.json, customize the properties and dataSources.
  • If your supervise command was already running, stop it (CTRL-C or bin/service --down) and start it up again.

To enable Tranquility Kafka on your Data servers:

  • In conf/supervise/data.conf, uncomment the tranquility-kafka line.
  • In conf/tranquility/kafka.json, customize the properties and dataSources.
  • If your Data servers were already running, stop them (CTRL-C or bin/service --down) and start them up again.

For tips on customizing kafka.json, see the Tranquility Kafka documentation.

JVM apps and stream processors

Tranquility can also be embedded in JVM-based applications as a library. You can do this directly in your own program using the Core API, or you can use the connectors bundled in Tranquility for popular JVM-based stream processors such as Storm, Samza, Spark Streaming, and Flink.

These modules are not included in the Imply distribution, but are available from Maven Central for inclusion in your own applications.

Concepts

Task creation

Tranquility automates creation of Druid realtime indexing tasks, handling partitioning, replication, service discovery, and schema rollover for you, seamlessly and without downtime. You never have to write code to deal with individual tasks directly. But, it can be helpful to understand how Tranquility creates tasks.

Tranquility spawns relatively short-lived tasks periodically, and each one handles a small number of Druid segments. Tranquility coordinates all task creation through ZooKeeper. You can start up as many Tranquility instances as you like with the same configuration, even on different machines, and they will send to the same set of tasks.

See the Tranquility overview for more details about how Tranquility manages tasks.

segmentGranularity and windowPeriod

The segmentGranularity is the time period covered by the segments produced by each task. For example, a segmentGranularity of "hour" will spawn tasks that create segments covering one hour each.

The windowPeriod is the slack time permitted for events. For example, a windowPeriod of ten minutes (the default) means that any events with a timestamp older than ten minutes in the past, or more than ten minutes in the future, will be dropped.

These are important configurations because they influence how long tasks will be alive for, and how long data stays in the realtime system before being handed off to the historical nodes. For example, if your configuration has segmentGranularity "hour" and windowPeriod ten minutes, tasks will stay around listening for events for an hour and ten minutes. For this reason, to prevent excessive buildup of tasks, it is recommended that your windowPeriod be less than your segmentGranularity.

Append only

Druid streaming ingestion is append-only, meaning you cannot use streaming ingestion to update or delete individual records after they are inserted. If you need to update or delete individual records, you need to use a batch reindexing process. See the Loading with files page for more details.

Druid does support efficient deletion of entire time ranges without resorting to batch reindexing. This can be done automatically through setting up retention policies.

Guarantees

Tranquility operates under a best-effort design. It tries reasonably hard to preserve your data, by allowing you to set up replicas and by retrying failed pushes for a period of time, but it does not guarantee that your events will be processed exactly once. In some conditions, it can drop or duplicate events:

  • Events with timestamps outside your configured windowPeriod will be dropped.
  • If you suffer more Druid Middle Manager failures than your configured replicas count, some partially indexed data may be lost.
  • If there is a persistent issue that prevents communication with the Druid indexing service, and retry policies are exhausted during that period, or the period lasts longer than your windowPeriod, some events will be dropped.
  • If there is an issue that prevents Tranquility from receiving an acknowledgement from the indexing service, it will retry the batch, which can lead to duplicated events.
  • If you are using Tranquility inside Storm or Samza, various parts of both architectures have an at-least-once design and can lead to duplicated events.

Under normal operation, these risks are minimal. But if you need absolute 100% fidelity for historical data, we recommend a hybrid batch/streaming architecture.

How can we help?