The Significance of Schema Auto-Discovery in Apache Druid

Druid can now automatically keep table schemas in sync with changing data, without the usual operational headaches.

Many developers will tell you that managing schema changes is a constant challenge, one that cuts directly into their productivity.

That’s because developers are slowed down whenever they need to communicate and coordinate with database administrators or other teams. Sometimes the needs of an application require a change to the data that is collected and processed, and if making that change takes a chain of emails and meetings, it eats up time and leaves everyone a bit frustrated. Today, we’re making it much easier for developers to make whatever changes are needed, because the database now keeps up with new requirements automatically.

In this blog article, I’ll unpack schema auto-discovery, a new feature in Druid 26.0 that enables Druid to automatically discover data fields and data types and update tables to match changing data. It gives Druid the performance of a strongly-typed data structure with the flexibility of a schemaless data structure.

What is a strongly-typed data structure?

A strongly-typed data structure is one that enforces strict type checking. For a database, this means enforcing a strict, pre-defined structure for organizing and accessing data.
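In Druid terms, enforcing the schema traditionally means declaring every column and its type up front in the ingestion spec’s dimensionsSpec. A minimal sketch (the column names here are just illustrative):

    "dimensionsSpec": {
      "dimensions": [
        { "type": "string", "name": "EventType" },
        { "type": "long",   "name": "ProductID" },
        { "type": "double", "name": "Price" }
      ]
    }

Fields that aren’t declared in this list simply aren’t ingested, and a field whose type changes needs a schema change before it can be loaded.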

As you likely know, this is where a database’s schema comes into play: it defines the name, type, and format of the data stored in each table. The schema is part of what helps turn data into useful information, and it is critical for driving query performance.

Schema changes are changes to tables, columns, or anything else that can be modified within a schema. Often they involve adding or deleting a column, or changing a column’s requirements, such as its data type or nullability. And of course, when a schema changes, it ripples through every application that depends on that schema.

Any schema change has to be carefully planned, tested, and communicated to the team to minimize the risk of errors or data loss. That’s why, with a strongly-typed data structure, a schema change can take developers weeks to deal with as they adapt their code to the new model.

The whole process of planning, managing, and executing schema changes is so difficult that many developers have considered the benefits of “schemaless” alternatives.

What is a schemaless data structure?

A schemaless database gives developers more flexibility, at the cost of poor performance for analytics queries. It sidesteps the schema problem by changing how data is stored: instead of a rigid structure of tables and rows, data is stored in a variety of formats such as key-value pairs, documents, graphs, or wide-column stores.

Since each document or record can have its own unique structure, schemaless databases are especially useful when you are dealing with data with a lot of variety: every record in the database can have its own set of fields and, in effect, its own mini-schema. The data model can also evolve over time to accommodate new fields or attributes. As the data structure changes or new data types emerge, there is no need to modify an existing schema or perform complex migrations.
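For example, two records in the same schemaless collection can legitimately look nothing alike (the field names here are made up for illustration):

    { "ProductID": 4567129, "Price": 5.29 }
    { "ProductID": "8790456M", "Price": 10.5, "CarbonNeutral": true }

The second record adds a new field and even changes the type of ProductID, and the database accepts both without complaint.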

While not having to define a schema when loading data increases flexibility, it comes at the expense of query performance and data consistency. The database spends more time processing and scanning data during query execution because of metadata lookups, potential full table scans, and limited opportunities for index optimization and query planning. And since documents in a table or collection aren’t required to contain the same data fields, inconsistencies can creep in and produce inaccurate results.

Bottom line: without a predefined schema, the database must determine the structure of the data on the fly, resulting in additional overhead, slower query performance, and the potential for misleading results.

Why should developers have to choose between the performance of a strongly-typed data structure and the flexibility of a schemaless one? With the release of Druid 26.0, there is now a better option.

The best of both worlds with schema auto-discovery

Druid is the first analytics database that can provide the performance of a strongly-typed data structure with the flexibility of a schemaless data structure. 

Schema auto-discovery, introduced in Druid 26.0, enables Druid to automatically discover data fields and data types and update tables to match changing data. This means Druid looks at the ingested data and identifies which dimensions need to be created and the data type for each dimension’s column. Even better, as schemas change – when dimensions or data types are added, dropped, or changed in the source data – Druid automatically discovers the change and adjusts its tables to match the new schema, without requiring existing data to be reprocessed.

Bottom line: when ingesting from batch files or streams, developers have the choice of defining the schema explicitly or letting Druid detect and define the schema for them.
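In the native ingestion spec, opting in is a single property. A minimal sketch of the relevant fragment:

    "dimensionsSpec": {
      "useSchemaDiscovery": true
    }

With useSchemaDiscovery set to true and no dimensions listed, Druid discovers every field and its type; any dimensions you do list explicitly keep their declared types.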

Schema auto-discovery example in a retail environment

To better understand how schema auto-discovery works for Druid, let’s look at an example of a large retail store selling groceries.

Step 1: Auto detection for new tables

Druid can auto-discover column names and data types during ingestion.  

The table below highlights key information for two different items for sale. In this example, Druid looked at the ingested data and identified which dimensions needed to be created for this retail store. These become the columns of the table: __time, EventType, ProductID, and Price. Druid also auto-detected the right data type for each column: for example, the data type for ProductID is Long, while the data type for Price is Double.

__time (Timestamp)   | EventType (String) | ProductID (Long) | Price (Double)
2023-05-15T12:23:17Z | Price Increase     | 4567129          | 5.29
2023-05-15T14:12:49Z | New Product        | 6784590          | 7.85
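The raw events behind this table could have arrived as JSON along these lines (the source field name "time" is illustrative):

    { "time": "2023-05-15T12:23:17Z", "EventType": "Price Increase", "ProductID": 4567129, "Price": 5.29 }
    { "time": "2023-05-15T14:12:49Z", "EventType": "New Product", "ProductID": 6784590, "Price": 7.85 }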

This significantly simplifies data ingestion because developers can now simply “throw their data at Druid” and start querying it sooner for faster access to insights.

But what about Day 2 and beyond, when the data starts to evolve and change? Imagine that this large retail store needs to keep up with food trends by introducing carbon neutral foods.

Step 2: Maintenance of existing tables as data sources change

As the schema changed for this retail store, Druid automatically discovered the change and adjusted Druid tables to match the new schema.  

For example, carbon neutral foods had never been sold in this grocery store before. Here is the same table after the first carbon neutral product is ingested:

__time (Timestamp)   | EventType (String) | ProductID (String) | Price (Double) | CarbonNeutral (Long)
2023-06-11T12:05:17Z | Price Increase     | 4567129            | 5.29           | null
2023-06-11T14:10:23Z | New Product        | 6784590            | 7.85           | null
2023-06-11T16:14:32Z | New Product        | 8790456M           | 10.5           | 1

As you can see from the above table, Druid automatically evolved the schema to match the incoming raw data.  This involved two things:  

Auto-detecting data type changes

Unlike the existing ProductIDs, which contain only numbers, the ProductID for the new carbon neutral product contains the letter “M.” To accommodate this, Druid changed the data type of the ProductID dimension from Long to String.

Modifying Druid tables when dimensions or data types are added, dropped, or changed in the source data

Druid also automatically discovered the new field in the incoming data and added a “CarbonNeutral” column with the Long data type (which is how Druid stores boolean values).

To keep the table from breaking, Druid filled the new column with null values for all previously existing rows, so that every cell has a value.
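Put together, a single new event like this sketch (the source field names are illustrative) is enough to trigger both adjustments:

    { "time": "2023-06-11T16:14:32Z", "EventType": "New Product", "ProductID": "8790456M", "Price": 10.5, "CarbonNeutral": 1 }

Whether CarbonNeutral arrives as 1 or as true, it lands in the table as a Long, since that is how Druid represents boolean values.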

Druid is uniquely built for analyzing streams

This enhancement also reinforces Apache Druid’s leadership position as the best database for real-time analytics, where streaming data is ingested and queried at subsecond speed and scale. From day one, Druid was designed and built for real-time analytics on streaming data. With native, connector-less support for the leading streaming platforms, including Kafka, Kinesis, and Pulsar, Druid ingests data event by event with exactly-once semantics, ensuring each event is immediately available for querying with the highest data reliability.

And now with support for schema auto-discovery, developers are assured every row of every Druid table will have the dimensions and metrics that match incoming streaming data, even as the streams evolve.
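For a streaming pipeline, that choice lives in the supervisor spec. Here is a sketch of a Kafka supervisor spec with auto-discovery enabled (the topic, broker address, and datasource name are placeholders):

    {
      "type": "kafka",
      "spec": {
        "ioConfig": {
          "type": "kafka",
          "topic": "retail-events",
          "consumerProperties": { "bootstrap.servers": "localhost:9092" },
          "inputFormat": { "type": "json" }
        },
        "dataSchema": {
          "dataSource": "retail",
          "timestampSpec": { "column": "time", "format": "iso" },
          "dimensionsSpec": { "useSchemaDiscovery": true }
        }
      }
    }

From there, every new field that appears on the topic simply shows up as a queryable column.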
