Clustered Apache Druid® on your Laptop – Easy!

Mar 07, 2022
Sergio Ferragut

Introduction

If you are ever in need of a Druid cluster deployment on your laptop, here’s how to do it. This is a quick guide to deploy Apache Druid ® using k8s minikube and min.io on your laptop. Why not use standalone or docker-compose versions instead, you ask? Good point, for the most part it is probably easier to do that. But if you are learning Apache Druid® and at some point you’ll want to scale out on some Kubernetes service in the cloud or on-premises, then it is good to understand how to deploy and manage in a Kubernetes context. This blog post is a starting point that lets you work with Kubernetes but without any cloud fees or extra hardware needed. These instructions are for MacOS with at least 6 CPUs and 8.5GB of memory to spare.

Helm Chart

We use the Apache Druid Helm Chart to deploy and manage the Druid database on Kubernetes. Our intent is to get you from zero to running Druid in just a few minutes and without any prior knowledge. These instructions will help you deploy Apache Druid as a distributed cluster on minikube adding min.io on k8s that uses local disk for deep storage. In the end, you have a deployment that does not require access to the internet.

Throughout this post, the kubernetes namespace is set to “dev”, you can change it to whatever you want as long as you change it everywhere. For all such parameters, we’ve color coded them like this, to indicate they are changeable.

So without further ado, the step by step instructions: 

Install a VM Manager

Minikube can use different VM management engines like docker and hyperkit. In this test we used hyperkit.

brew install hyperkit

Install Minikube, Kubectl and Helm and start minikube

Step 1– You’ll need homebrew to do this step. If you don’t already have it installed, you can find how to do that here.

Step 2 – Open a terminal window and run:

 brew install minikube

 brew install kubectl

 brew install helm

Step 3 – Create the minikube cluster with the following commands from the same terminal window. We’ll need enough memory and CPUs:

 minikube start --memory 8192 --cpus 6 —vm-driver=hyperkit

This will take some time and the output should look like this:

😄  minikube v1.23.2 on Darwin 11.3.1
✨  Automatically selected the hyperkit driver
👍  Starting control plane node minikube in cluster minikube
🔥  Creating hyperkit VM (CPUs=6, Memory=8192MB, Disk=20000MB) ...
🐳  Preparing Kubernetes v1.22.2 on Docker 20.10.8 ...
    ▪ Generating certificates and keys ...
    ▪ Booting up control plane ...
    ▪ Configuring RBAC rules ...
🔎  Verifying Kubernetes components...
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🌟  Enabled addons: storage-provisioner, default-storageclass
🏄  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default

Deploy min.io on Minikube

Step 1 – Open another terminal for min.io setup.

Step 2 – Add the min.io chart repository for helm:

 helm repo add minio https://charts.min.io/

Step 3 – Create a minio_values.yaml file with the following content:

mode: standalone
replicas: 1
rootUser: rootuser
rootPassword: rootpass123
persistence:
  size: 50Gi
  enabled: true
resources:
  requests:
    memory: 512M
environment:
  MINIO_SITE_REGION: “us-west-1”

Step 4 – Install min.io using helm:

 helm install -n dev --create-namespace -f minio_values.yaml minio minio/minio

NAME: minio
LAST DEPLOYED: Fri Feb 18 15:04:35 2022
NAMESPACE: dev
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
MinIO can be accessed via port 9000… 
...

Step 5 – Setup port forwarding from localhost into the min.io console service:

 export POD_NAME=$(kubectl get pods --namespace dev -l "release=minio" -o jsonpath="{.items[0].metadata.name}")

 kubectl port-forward $POD_NAME 9001 --namespace dev

Forwarding from 127.0.0.1:9001 -> 9001
Forwarding from [::1]:9001 -> 9001

This kubectl command will run as a foreground process, the terminal will continue to display port forwarding activity for as long as you need access to the Min.io UI.

Step 6 – Configure Deep Storage bucket – On a web browser go to http://localhost:9001

  • Logon with credentials rootUser, rootpass123
  • In Buckets, Create a bucket “druidlocal” with default settings.

Step 7 – In Users, select the console user, then:

  • Add a service account, use Custom Credentials to specify access key=access123, and secret key=secret1234567890 which simplifies the copy/pasting later in this doc.
  • In Settings / Region, set Server location to “us-west-1“, and click Save.

Step 8 – Restart the min.io service using the UI.

Deploy Apache Druid on Minikube

Now that we have an S3 API compatible storage manager running in the kubernetes cluster, we can deploy Apache Druid. All other dependencies of Druid are already a part of the project’s helm chart. So let’s start another terminal window and execute the following:

Step 1 – Clone the latest Apache Druid code:

git clone https://github.com/apache/druid

cd druid

Step 2 – Retrieve chart dependencies for zookeeper and postgresSQL: 

helm dependency update helm/druid

Getting updates for unmanaged Helm repositories...
...Successfully got an update from the "https://charts.helm.sh/incubator" chart repository
...Successfully got an update from the "https://charts.helm.sh/stable" chart repository
...Successfully got an update from the "https://charts.helm.sh/stable" chart repository
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "minio" chart repository
Update Complete. ⎈Happy Helming!⎈
Saving 3 charts
Downloading zookeeper from repo https://charts.helm.sh/incubator
Downloading mysql from repo https://charts.helm.sh/stable
Downloading postgresql from repo https://charts.helm.sh/stable
Deleting outdated charts

Step 3 – Create the parameter overrides needed to deploy locally by creating a file called k8s_minikube.yaml with the following content :

configVars:
  druid_extensions_loadList: '["druid-histogram","druid-datasketches","druid-lookups-cached-global","postgresql-metadata-storage","druid-s3-extensions"]'
  druid_storage_type: s3
  druid_storage_bucket: druidlocal
  druid_storage_baseKey: k8s-minikube/segments
  druid_s3_accessKey: access123
  druid_s3_secretKey: secret1234567890
  AWS_REGION: "us-west-1"
  druid_s3_forceGlobalBucketAccessEnabled: "false"
  druid_storage_disableAcl: "true"
  druid_indexer_logs_type: s3
  druid_indexer_logs_s3Bucket: druidlocal
  druid_indexer_logs_s3Prefix: k8s-minikube/logs
  druid_indexer_logs_disableAcl: "true"
  druid_s3_endpoint_signingRegion: "us-west-1"
  druid_s3_endpoint_url: "http://minio:9000"
  druid_s3_protocol: "http"
  druid_s3_enablePathStyleAccess: "true"

One other useful property is the druid image version that’s being deployed. Add this section to the yaml file to change it to any version you need to work with:

image:
  tag: 0.22.1

Step 4 – Deploy the Druid cluster:

helm install druid helm/druid --namespace dev -f k8s_minikube.yaml

W0218 15:36:46.492231   25538 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W0218 15:36:46.754347   25538 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
NAME: druid
LAST DEPLOYED: Fri Feb 18 15:36:46 2022
NAMESPACE: dev
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:  the notes include port forwarding instructions, but we are using a slightly different parsing to find the router pod (see below)

Step 5 – We’ll need to redirect port 8888 from localhost into the cluster’s router service which provides the Apache Druid UI:

export PODNAME=$(kubectl get po -n dev | grep router | cut -d" " -f1)

kubectl port-forward pod/$PODNAME 8888 -n dev

Step 6 – You may need to wait a bit until all pods are ready. Some may restart as they timeout waiting for dependencies to get started, but they will ultimately all end up in READY 1/1 state. You can monitor the startup with the following command:

kubectl get pods -n dev

We’re done! Now you can use Druid by accessing http://localhost:8888 and run through some tutorials.

Some Additional Kubernetes Tools

We’ll leave you with a few useful commands to manage the cluster.

Pods

You can see which pods are running on the cluster by using:

kubectl get pods -n dev

NAME                                 READY   STATUS    RESTARTS       AGE
druid-broker-744c5f46b7-crxbg        1/1     Running   1 (4m8s ago)   7m28s
druid-coordinator-7c79f9c6c9-4wg67   1/1     Running   1 (4m8s ago)   7m28s
druid-historical-0                   1/1     Running   0              7m28s
druid-middle-manager-0               1/1     Running   0              7m28s
druid-postgresql-0                   1/1     Running   0              7m28s
druid-router-84d7cc6d87-qkl9m        1/1     Running   0              7m28s
druid-zookeeper-0                    1/1     Running   0              7m28s
druid-zookeeper-1                    1/1     Running   0              4m26s
druid-zookeeper-2                    1/1     Running   0              3m54s
minio-844b956d8b-tfcq6               1/1     Running   0              39m

Notice that middle managers and historicals are deployed as stateful sets and therefore have a predetermined naming convention while router, broker, coordinator are all using ephemeral names. If you feel adventurous and want to see what’s inside, you can open a shell into any of the pods using:

kubectl exec -it <podname> -n dev -- /bin/sh

Logs

You can retrieve the logs of any of the pods with:

kubectl logs <podname> -n dev

Add a -f to tail the log indefinitely, very useful when debugging!

Minikube Cluster Operations

minikube stop     – stops the whole kubernetes cluster, use this if you want to be able to use the same deployments later. Any loaded data will still be available at restart.

minikube start – restarts the cluster and brings back any install deployments (minio and druid).

minikube delete – this is destructive, it will wipe out everything including the kubernetes cluster, restarting will require a new set of helm install commands to redeploy. 

If you want to have a Druid cluster handy but without using resources while you aren’t using it, you can use minikube stop to pause and minikube start to resume. You will need to restart the port-forwarding commands to gain access to the corresponding UIs after restart completes.

Chart Operations

To remove a deployment you can use helm uninstall:

helm uninstall druid -n dev  removes druid deployment.

helm uninstall minio -n dev  removes min.io deployment.

If you want to make changes to the configuration, update the yaml file and apply the changes live with:

helm upgrade druid helm/druid --namespace dev -f k8s_minikube.yaml

helm upgrade minio minio/minio --namespace dev -f minio_values.yaml

What’s Next…

We’re planning a series of Druid on Kubernetes related blog posts. Coming soon, we’ll delve into how to use the helm chart to deploy to the cloud using more CPU, memory and storage. Also feel free to reach out with comments or suggestions at sergio.ferragut@imply.io.

Other blogs you might find interesting

No records found...
Jun 12, 2024

Why I Joined Imply

After reviewing the high-level technical overview video of Apache Druid and learning about how the world's leading companies use Apache Druid, I immediately saw the immense potential in the product. Data is...

Learn More
Jun 12, 2024

AWS IoT Core and Imply: Simplifying IoT Management and Data Analytics

The Internet of Things (IoT) is rapidly expanding, with IoT connections growing by 18% in 2022. AWS IoT Core, used by giants like NASA and Volkswagen, offers robust solutions for managing this surge. Discover...

Learn More
May 13, 2024

Tuning into Success: Streaming Music Done the Imply Way

Learn how a top streaming music service uses Imply to provide visibility into audience engagement and other key metrics.

Learn More

Let us help with your analytics apps

Request a Demo