Working with Nested JSON in Druid

Complex Data in a Simple Format

JSON has evolved from an adjunct JavaScript format to a widely-adopted and widely-used method for storing and sharing data. It’s readable, lightweight, and supported by every commonly-used programming language.

To store more complex concepts, JSON uses nested objects and nested arrays. While this provides an easy (and human-readable) way to store data, it doesn’t work well with relational databases that store data in tables of rows and columns. While nested data can be stored in a relational database field, it’s slow to query and difficult or impossible to index. The usual process is to pre-process and flatten nested arrays before inserting the data into database tables, adding time and complexity.

It’s much easier in Apache Druid®, which supports storing nested data structures as a column in a table. Druid can read and query nested columns with the same high performance as other data types.

An Example of Nested JSON

A typical example:

{
"time":"2023-6-14T10:32:08Z",
"MainId": 1701,
"firstName": "Sherlock",
"lastName": "Homes",
"categories": [
{
	"CategoryID": 16,
	"CategoryName": "Detective"
},
{
	"CategoryID": 96,
	"CategoryName": "British”
}
]
}

In this case, the JSON contains five fields, of which one (“categories”) is an array of JSON objects.

Using Nested JSON

One option is to flatten the nested data during ingestion, changing the nested JSON into multiple table rows, using either Druid flattenspec or SQL ingestion commands. This will create a denormalized table:

timeMainIdfirstNamelastNamecategoryIDcategoryName
2023-6-14T10:32:08Z1701SherlockHolmes16Detective
2023-6-14T10:32:08Z1701SherlockHolmes96British

This has the advantage of making query SQL simpler, but at the cost of creating a larger table. It may also cause challenges with schema auto-detection if the format of the arrays change in ways that no longer match the flattenspec.

Another option: use Druid’s COMPLEX<json> data type. Some or all of the nested JSON data will be stored as a single field, simplifying ingestion. For the example here, it might make sense to keep the “categories” as a COMPLEX<json> field:

timeMainIdfirstNamelastNamecategories
2023-6-14T10:32:08Z1701SherlockHolmes{
“CategoryID”:16,
“CategoryName”:”Detective”
},
{
“CategoryID”:96,
“CategoryName”:”British”
}

Queries can retrieve the full JSON object or any subset using Druid SQL JSON functions.

Because of the indexing and segmentation used by Druid, there is no performance penalty incurred by querying data stored as COMPLEX<json> fields. In some cases, query performance can even be slightly faster than querying flattened tables.

Newsletter Signup

Let us help with your analytics apps

Request a Demo