Quickstart

The easiest way to evaluate Imply is to install it on a single machine. In this quickstart, we'll set up the platform locally, load some example data, and visualize the data.

Prerequisites

You will need:

  • Java 7 or better
  • Node.js 4.x or better
  • Linux, Mac OS X, or other Unix-like OS (Windows is not supported)
  • At least 4GB of RAM

On Mac OS X, you can install Java with Oracle's JDK 8 and Node.js with Homebrew.

On Linux, your OS package manager should be able to help with both Java and Node.js. If your Ubuntu-based OS does not have a recent enough version of Java, WebUpd8 offers packages for those OSes. If your Debian, Ubuntu, or Enterprise Linux OS does not have a recent enough version of Node.js, NodeSource offers packages for those OSes.
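
Once both are installed, you can confirm that they meet the minimum versions:

java -version
node --version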

Getting started

First, download Imply 2.0.0 from imply.io/download and unpack the release archive.
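
If you prefer to stay in the terminal, you can fetch the archive directly. The URL below follows Imply's release naming but is an assumption, so confirm the canonical link on imply.io/download:

curl -O https://static.imply.io/release/imply-2.0.0.tar.gz   # URL is an assumption; verify at imply.io/download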

tar -xzf imply-2.0.0.tar.gz
cd imply-2.0.0

In this package, you'll find:

  • bin/* - run scripts for included software.
  • conf/* - template configurations for a clustered setup.
  • conf-quickstart/* - configurations for this quickstart.
  • dist/* - all included software.
  • quickstart/* - files useful for this quickstart.

Start up services

Next, you'll need to start up Imply, which includes Druid, Pivot, and ZooKeeper. You can use the included supervise program to start everything with a single command:

bin/supervise -c conf/supervise/quickstart.conf

You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
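
For example, to follow a service's log from another terminal (the exact layout under var/sv/ can vary by release, so list the directory first):

ls var/sv/                   # discover the per-service log locations
tail -f var/sv/zk/current    # example path only; substitute what ls shows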

Later on, if you'd like to stop the services, CTRL-C the supervise program in your terminal. If you want a clean start after stopping the services, remove the var/ directory.

Congratulations, now it's time to load data!

Load data file

We've included a sample of Wikipedia edits from June 27, 2016 to get you started with batch ingestion.

This section shows you how to load data in batches, but you can skip ahead to learn how to load streams in real time. Druid's streaming ingestion can load data with virtually no delay between events occurring and becoming available for queries.

The dimensions (attributes you can filter and split on) in the Wikipedia dataset, other than time, are:

  • channel
  • cityName
  • comment
  • countryIsoCode
  • countryName
  • isAnonymous
  • isMinor
  • isNew
  • isRobot
  • isUnpatrolled
  • metroCode
  • namespace
  • page
  • regionIsoCode
  • regionName
  • user
  • commentLength
  • deltaBucket
  • flags
  • diffUrl

The metrics (values you can aggregate to make measures) in the Wikipedia dataset are:

  • count
  • added
  • deleted
  • delta
  • user_unique

To load this data into Druid, you can submit an ingestion task pointing to the file. We've included a task that loads the wikiticker-2016-06-27-sampled.json file included in the archive. To submit this task, run the following script from your Imply directory:

bin/post-index-task --file quickstart/wikiticker-index.json

This will print something like:

Task started: index_hadoop_wikiticker_2016-06-27T22:51:29.465Z
Task log:     http://localhost:8090/druid/indexer/v1/task/index_hadoop_wikiticker_2016-06-27T22:51:29.465Z/log
Task status:  http://localhost:8090/druid/indexer/v1/task/index_hadoop_wikiticker_2016-06-27T22:51:29.465Z/status
Task index_hadoop_wikiticker_2016-06-27T22:51:29.465Z still running...
Task index_hadoop_wikiticker_2016-06-27T22:51:29.465Z still running...
Task index_hadoop_wikiticker_2016-06-27T22:51:29.465Z still running...
Task finished with status: SUCCESS
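
The task itself is a standard Druid batch ingestion spec. Abridged and lightly paraphrased (an illustrative sketch; see the bundled quickstart/wikiticker-index.json for the exact spec), it ties together the dimensions and metrics listed above:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "wikiticker",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "time", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["channel", "cityName", "..."] }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "longSum", "name": "added", "fieldName": "added" },
        { "type": "longSum", "name": "deleted", "fieldName": "deleted" },
        { "type": "longSum", "name": "delta", "fieldName": "delta" },
        { "type": "hyperUnique", "name": "user_unique", "fieldName": "user" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": ["2016-06-27/2016-06-28"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": { "type": "static", "paths": "quickstart/wikiticker-2016-06-27-sampled.json" }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}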

You can see more information about ingestion tasks in your cluster in the overlord console: http://localhost:8090/console.html.

After your ingestion task finishes, the data will be loaded by historical nodes and available for querying within a minute or two. You can monitor loading progress in the coordinator console at http://localhost:8081/#/: the "wikiticker" datasource is ready when it shows a blue circle indicating "fully available".
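
You can also check availability from the command line through the coordinator's load-status API, which returns a JSON map of datasource names to the percentage of data loaded:

curl http://localhost:8081/druid/coordinator/v1/loadstatus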

Once the data is fully available, you can immediately query it.

Query data

We've included several different ways you can interact with the data you've just ingested.

Direct Druid queries

Druid supports a rich family of JSON-based queries. We've included an example topN query in quickstart/wikiticker-top-pages.json that will find the most-edited articles in this dataset:

curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/wikiticker-top-pages.json http://localhost:8082/druid/v2/
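
For reference, the query in that file is a Druid topN; a representative version (reconstructed here, so treat field values as illustrative) looks like this:

{
  "queryType": "topN",
  "dataSource": "wikiticker",
  "intervals": ["2016-06-27/2016-06-28"],
  "granularity": "all",
  "dimension": "page",
  "metric": "edits",
  "threshold": 25,
  "aggregations": [
    { "type": "longSum", "name": "edits", "fieldName": "count" }
  ]
}

A topN behaves like a GROUP BY on a single dimension with an ORDER BY and LIMIT, computed approximately for speed.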

PlyQL

PlyQL is a SQL-like query language for Druid, included in the Imply distribution. You can run PlyQL queries with the included plyql command-line tool:

bin/plyql -h localhost:8082 -q "SELECT page, SUM(count) AS Edits FROM wikiticker WHERE '2016-06-27T00:00:00' <= __time AND __time < '2016-06-28T00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 5"

This will return the five most frequently edited pages for the day:

page                                                                  Edits
Copa América Centenario                                               29
User:Cyde/List of candidates for speedy deletion/Subpage              16
Wikipedia:Administrators' noticeboard/Incidents                       16
2016 Wimbledon Championships – Men's Singles                          15
Wikipedia:Administrator intervention against vandalism                15
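
The same pattern works for any dimension in the dataset. For example, this illustrative query (not one of the bundled examples) ranks channels by edit count:

bin/plyql -h localhost:8082 -q "SELECT channel, SUM(count) AS Edits FROM wikiticker GROUP BY channel ORDER BY Edits DESC LIMIT 5"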

Pivot

Pivot is a web-based exploratory visualization UI for Druid. Imply includes a 30-day trial evaluation of Pivot, which you can visit at http://localhost:9095.

With Pivot, you explore a dataset by filtering and splitting it across any dimension. For each filtered split of your data, Pivot can show you the aggregate value of any of your measures. For example, on the wikiticker dataset, you can see the most frequently edited pages by splitting on "page" (drag "Page" to the "Split" bar) and sorting by "Edits" (this is the default sort; you can also click on any column to sort by it).

Pivot offers different visualizations based on how you split your data. If you split on a string column, you will generally see a table. If you split on time, you can see either a timeseries plot or a table.

Pivot can be customized through settings, as described on our Pivot configuration page.

Other options

There are many more query tools for Druid than we've included here, including other UIs, other SQL engines, and libraries for various languages like Python and Ruby. Please see the list of libraries at the Druid site for more!

Next steps

So far, you've loaded a sample data file into an Imply installation running on a single machine. Next, you can:

  • load streaming data in real time using Druid's streaming ingestion
  • move beyond a single machine by setting up a cluster, starting from the templates in conf/*
