With on-premises setups, compute/storage separation is often implemented using a NAS or similar storage appliance that exposes an S3 API endpoint.
I want to emulate S3-related behavior in a self-contained demo that I can run on my laptop without an internet connection. This is conveniently done using MinIO as my S3-compatible storage.
Let’s deploy MinIO using this Docker Compose file:
version: "3"
services:
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      minio_net:
        aliases:
          - druid.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      minio_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/indata;
      /usr/bin/mc mb minio/indata;
      /usr/bin/mc policy set public minio/indata;
      /usr/bin/mc rm -r --force minio/deepstorage;
      /usr/bin/mc mb minio/deepstorage;
      /usr/bin/mc policy set public minio/deepstorage;
      tail -f /dev/null
      "
networks:
  minio_net:
Save this file as docker-compose.yaml in your working directory and run the command
docker compose up -d
This gives us a MinIO instance and the mc client. It also automatically creates the two buckets in MinIO, named indata and deepstorage, that we will need for this tutorial. If you point your browser to the MinIO console at localhost:9001 and log in with admin/password, you can verify that the buckets have been created.
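You can also verify this from the command line by listing the buckets through the mc container. Here is a quick sketch that reuses the minio alias set up by the entrypoint script:
# the mc container stays alive (tail -f /dev/null), so we can exec into it
docker exec mc /usr/bin/mc ls minio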
(Kudos to Tabular, from whose GitHub repository I adapted the Docker Compose file.)
Configuring MinIO as deep storage and log target
I am using the standard Druid 27.0 quickstart. If you start Druid using the new start-druid
script, you will find the relevant configuration settings in conf/druid/auto/_common/common.runtime.properties
under your Druid installation directory.
First of all, we need to load the S3 extension by adding it to the load list – it should look similar to this:
druid.extensions.loadList=["druid-s3-extensions", "druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches", "druid-multi-stage-query"]
Also configure the S3 default settings (endpoint, authentication):
druid.s3.accessKey=admin
druid.s3.secretKey=password
druid.s3.protocol=http
druid.s3.enablePathStyleAccess=true
druid.s3.endpoint.signingRegion=us-east-1
druid.s3.endpoint.url=http://localhost:9000/
To use MinIO as deep storage, comment out the default settings for druid.storage.*
and insert this section instead:
druid.storage.type=s3
druid.storage.bucket=deepstorage
druid.storage.baseKey=segments
Likewise, change the default configuration for the indexer logs to:
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=deepstorage
druid.indexer.logs.s3Prefix=indexing-logs
Then start Druid like this:
bin/start-druid -m5g
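To confirm that Druid came up with the new configuration, you can query the health endpoint on the router. A quick check, assuming the quickstart default router port 8888:
# returns true once the router is up
curl http://localhost:8888/status/health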
Ingesting data from MinIO
By default, Druid uses the same settings in common.runtime.properties
for ingestion from S3, too. So, for instance, you can upload the wikipedia
data sample to the indata
bucket in your MinIO instance and take advantage of the same settings as for deep storage. Just use s3://indata/
as the S3 prefix in the ingestion wizard, and it should work out of the box.
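One way to do the upload without installing any extra tooling is to go through the mc container from the compose setup. Here is a sketch, assuming you run it from your Druid installation directory and that the sample file is in its usual quickstart location:
# copy the quickstart sample file into the mc container
docker cp quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz mc:/tmp/
# push it to the indata bucket via the minio alias configured in the entrypoint script
docker exec mc /usr/bin/mc cp /tmp/wikiticker-2015-09-12-sampled.json.gz minio/indata/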
Here is my example JSON ingestion spec:
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "prefixes": [
          "s3://indata/"
        ]
      },
      "inputFormat": {
        "type": "json"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "dynamic"
      }
    },
    "dataSchema": {
      "dataSource": "wikipedia_s3_2",
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "granularitySpec": {
        "queryGranularity": "none",
        "rollup": false,
        "segmentGranularity": "day"
      },
      "dimensionsSpec": {
        "dimensions": [
          "channel",
          "cityName",
          "comment",
          "countryIsoCode",
          "countryName",
          "isAnonymous",
          "isMinor",
          "isNew",
          "isRobot",
          "isUnpatrolled",
          "metroCode",
          "namespace",
          "page",
          "regionIsoCode",
          "regionName",
          "user",
          {
            "type": "long",
            "name": "delta"
          },
          {
            "type": "long",
            "name": "added"
          },
          {
            "type": "long",
            "name": "deleted"
          }
        ]
      }
    }
  }
}
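If you prefer to submit the spec outside the web console, you can POST it to the task endpoint. A sketch, assuming the spec has been saved to a (hypothetical) file named wikipedia-s3.json and the router is listening on the quickstart default port 8888:
# submit the native ingestion spec to the task API (proxied by the router)
curl -X POST -H 'Content-Type: application/json' \
  -d @wikipedia-s3.json \
  http://localhost:8888/druid/indexer/v1/task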
Or in SQL (using the web console's automatic conversion feature):
REPLACE INTO "wikipedia_s3_2" OVERWRITE ALL
WITH "source" AS (SELECT * FROM TABLE(
EXTERN(
'{"type":"s3","prefixes":["s3://indata/"]}',
'{"type":"json"}'
)
) EXTEND ("time" VARCHAR, "channel" VARCHAR, "cityName" VARCHAR, "comment" VARCHAR, "countryIsoCode" VARCHAR, "countryName" VARCHAR, "isAnonymous" VARCHAR, "isMinor" VARCHAR, "isNew" VARCHAR, "isRobot" VARCHAR, "isUnpatrolled" VARCHAR, "metroCode" VARCHAR, "namespace" VARCHAR, "page" VARCHAR, "regionIsoCode" VARCHAR, "regionName" VARCHAR, "user" VARCHAR, "delta" BIGINT, "added" BIGINT, "deleted" BIGINT))
SELECT
TIME_PARSE("time") AS "__time",
"channel",
"cityName",
"comment",
"countryIsoCode",
"countryName",
"isAnonymous",
"isMinor",
"isNew",
"isRobot",
"isUnpatrolled",
"metroCode",
"namespace",
"page",
"regionIsoCode",
"regionName",
"user",
"delta",
"added",
"deleted"
FROM "source"
PARTITIONED BY DAY
In either case, you can easily verify that both the segment files and the indexer logs end up in MinIO.
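For example, the mc container comes in handy once more. A quick sketch that lists what has been written to the deepstorage bucket:
# recursively list segment files and indexing logs in deep storage
docker exec mc /usr/bin/mc ls --recursive minio/deepstorage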
Changing the endpoint settings in the ingestion command
Now let’s go back to local deep storage, so that we can no longer rely on the endpoint settings baked into the service properties file. Instead, we need to establish those settings directly in the ingestion spec.
Restore the common properties to their default values and restart Druid. (You still need the S3 extension loaded.)
JSON version
Start the wizard as for a standard S3 ingestion. Then switch to the JSON view and edit the S3 settings in the ingestion spec:
"inputSource": {
  "type": "s3",
  "prefixes": [
    "s3://indata/"
  ],
  "properties": {
    "accessKeyId": {
      "type": "default",
      "password": "admin"
    },
    "secretAccessKey": {
      "type": "default",
      "password": "password"
    }
  },
  "endpointConfig": {
    "url": "http://localhost:9000",
    "signingRegion": "us-east-1"
  },
  "clientConfig": {
    "disableChunkedEncoding": true,
    "enablePathStyleAccess": true,
    "forceGlobalBucketAccessEnabled": false
  }
}
Note: In this case, because we are using plain HTTP, we need to include the http://
scheme in the endpoint URL. If we put it in clientConfig.protocol
instead, as the sample in the documentation might suggest, it is not recognized.
SQL version
In the SQL version, we copy the same settings into the EXTERN statement, like so:
REPLACE INTO "wikipedia_s3_2" OVERWRITE ALL
WITH "source" AS (SELECT * FROM TABLE(
EXTERN(
'{ "type": "s3", "prefixes": [ "s3://indata/" ], "properties": { "accessKeyId": { "type": "default", "password": "admin" }, "secretAccessKey": { "type": "default", "password": "password" } }, "endpointConfig": { "url": "http://localhost:9000", "signingRegion": "us-east-1" }, "clientConfig": { "disableChunkedEncoding": true, "enablePathStyleAccess": true, "forceGlobalBucketAccessEnabled": false } }',
'{"type":"json"}'
)
) EXTEND ("time" VARCHAR, "channel" VARCHAR, "cityName" VARCHAR, "comment" VARCHAR, "countryIsoCode" VARCHAR, "countryName" VARCHAR, "isAnonymous" VARCHAR, "isMinor" VARCHAR, "isNew" VARCHAR, "isRobot" VARCHAR, "isUnpatrolled" VARCHAR, "metroCode" VARCHAR, "namespace" VARCHAR, "page" VARCHAR, "regionIsoCode" VARCHAR, "regionName" VARCHAR, "user" VARCHAR, "delta" BIGINT, "added" BIGINT, "deleted" BIGINT))
SELECT
TIME_PARSE("time") AS "__time",
"channel",
"cityName",
"comment",
"countryIsoCode",
"countryName",
"isAnonymous",
"isMinor",
"isNew",
"isRobot",
"isUnpatrolled",
"metroCode",
"namespace",
"page",
"regionIsoCode",
"regionName",
"user",
"delta",
"added",
"deleted"
FROM "source"
PARTITIONED BY DAY
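As with the JSON spec, you can submit the SQL statement programmatically through the SQL-based ingestion task endpoint. A sketch, assuming the statement is wrapped in a (hypothetical) file query.json of the form {"query": "REPLACE INTO ..."} and the router listens on the quickstart default port 8888:
# submit the SQL statement as an SQL-based ingestion task
curl -X POST -H 'Content-Type: application/json' \
  -d @query.json \
  http://localhost:8888/druid/v2/sql/task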
Conclusion
- You can use MinIO or another S3-compatible storage with Druid. You configure the endpoint, protocol, and authentication settings in the common properties file.
- If you need to ingest from a different MinIO instance, or you want to use MinIO for ingestion only, you can set or override the S3 settings in the ingestion spec. This works in both JSON and SQL mode.
- Either way, make sure you have the S3 extension loaded.