Working with changing data

Everything Changes

Real events are the core of real-time analytics. Apache Druid® ingests data streams from Kafka topics and other sources. What happens when the structure of those streams changes?

Changing Data Structures Checklist

  • Create Tables from Streaming Data
  • Use Schema Auto-discovery

Create Tables from Streaming Data

Druid ingests data from Apache Kafka streams, Kafka-compatible streams (such as Confluent and Redpanda), and Amazon Kinesis streams using an ingestion specification. See Ingesting Stream Data for details about planning and using stream ingestion.

The ingestion specification, once submitted for activation, will create a Druid table. As each new event enters the source stream, the event will become a new row in the table, immediately available for query.
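One way to submit a specification is to POST it as JSON to Druid's supervisor endpoint. The following is a minimal sketch, not a definitive client: the base URL http://localhost:8888 is a placeholder for wherever your Druid Router or Overlord is reachable, and the helper names build_supervisor_request and submit_spec are hypothetical.

```python
import json
import urllib.request

# Path of the supervisor endpoint used for streaming ingestion specs.
SUPERVISOR_PATH = "/druid/indexer/v1/supervisor"


def build_supervisor_request(spec, base_url="http://localhost:8888"):
    """Wrap an ingestion spec (a dict) in an HTTP POST request.

    base_url is an assumption; point it at your own Router or Overlord.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + SUPERVISOR_PATH,
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_spec(spec, base_url="http://localhost:8888"):
    """Send the spec to Druid and return the parsed JSON response."""
    request = build_supervisor_request(spec, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything reaches a live cluster.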

Use Schema Auto-discovery

When incoming data changes over time, Druid can automatically update tables to match the changing data structure.

To activate this capability, include "useSchemaDiscovery": true in the dimensionsSpec section of the ingestion specification. By default, this is set to false.
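To show where the flag sits, here is a rough sketch that builds a Kafka ingestion specification as a Python dictionary. The datasource name, topic, broker address, and granularity settings are placeholders chosen for illustration, and the function name kafka_spec_with_discovery is hypothetical. Check the layout against the ingestion documentation for your Druid version before using it.

```python
import json


def kafka_spec_with_discovery(datasource, topic, bootstrap_servers,
                              timestamp_column="time"):
    """Return a Kafka supervisor spec with schema auto-discovery enabled."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": timestamp_column, "format": "iso"},
                # No explicit dimension list: Druid discovers columns and types.
                "dimensionsSpec": {"useSchemaDiscovery": True},
                # Illustrative values only; tune for your own workload.
                "granularitySpec": {"segmentGranularity": "day", "rollup": False},
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


# Placeholder names, not taken from any real deployment.
spec = kafka_spec_with_discovery("product_events", "product-events", "localhost:9092")
print(json.dumps(spec, indent=2))
```

Python's True is serialized as JSON true, so the printed dimensionsSpec contains "useSchemaDiscovery": true.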

For example, if an incoming stream includes:

{"time":"2023-05-15T12:23:17Z", "EventType":"Price Increase","ProductID":4567129,"Price":5.29}
{"time":"2023-05-15T14:12:49Z", "EventType":"New Product","ProductID":6784590,"Price":7.85}

… then Druid will auto-detect data types and create a table:

__time (Timestamp)      EventType (String)   ProductID (Long)   Price (Double)
2023-05-15T12:23:17Z    Price Increase       4567129            5.29
2023-05-15T14:12:49Z    New Product          6784590            7.85
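The idea behind type detection can be illustrated with a few lines of Python. This is a simplified sketch of the concept, not Druid's actual implementation: the functions infer_type and infer_schema are hypothetical, and the type names simply mirror the column types in the table above.

```python
import json


def infer_type(value):
    """Map a parsed JSON value to a simplified column type name."""
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_schema(raw_events, timestamp_column="time"):
    """Build a column -> type mapping from a list of JSON event strings."""
    schema = {"__time": "timestamp"}  # the event timestamp becomes __time
    for raw in raw_events:
        event = json.loads(raw)
        for name, value in event.items():
            if name != timestamp_column:
                schema.setdefault(name, infer_type(value))
    return schema


events = [
    '{"time":"2023-05-15T12:23:17Z","EventType":"Price Increase","ProductID":4567129,"Price":5.29}',
    '{"time":"2023-05-15T14:12:49Z","EventType":"New Product","ProductID":6784590,"Price":7.85}',
]
print(infer_schema(events))
# {'__time': 'timestamp', 'EventType': 'string', 'ProductID': 'long', 'Price': 'double'}
```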

If a future event in the stream is:

{"time":"2023-06-11T16:14:32Z", "EventType":"New Product","ProductID":"8790456M","Price":7.85,"CarbonNeutral":true}

… then Druid will realize that two things have changed:

First, the ProductID field can no longer be a "long" data type, because the new value contains both numbers and letters. The table structure will be changed automatically, with ProductID updated from "long" to "string".

Second, there is a whole new field in the stream JSON: CarbonNeutral. Druid will add a new column to the table. Since this field didn't exist for earlier entries, Druid will assign the "null" value to those rows in the new column. The new column has the "long" type, so the value true appears as 1:

__time (Timestamp)      EventType (String)   ProductID (String)   Price (Double)   CarbonNeutral (Long)
2023-05-15T12:23:17Z    Price Increase       4567129              5.29             null
2023-05-15T14:12:49Z    New Product          6784590              7.85             null
2023-06-11T16:14:32Z    New Product          8790456M             7.85             1

As the structure of the incoming stream changes over time, Druid will continue to automatically change the table to match.
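The two changes described above, widening a column's type and filling a new column with nulls for older rows, can be mimicked in a small Python model. This is only a conceptual sketch under simplifying assumptions: it is not how Druid stores or evolves segments, timestamp handling is omitted (the time field is treated as an ordinary string), and all function names are hypothetical.

```python
CASTS = {"string": str, "long": int, "double": float}


def infer_type(value):
    """Map a parsed JSON value to a simplified column type name."""
    if isinstance(value, int):  # bool is a subclass of int, so true/false land here
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def widen(old, new):
    """Pick a type able to hold values of both types."""
    if old == new:
        return old
    if {old, new} == {"long", "double"}:
        return "double"
    return "string"  # fall back to string when nothing narrower fits


def apply_event(schema, rows, event):
    """Update the schema for one event, then store the event."""
    for name, value in event.items():
        if name in schema:
            schema[name] = widen(schema[name], infer_type(value))
        else:
            schema[name] = infer_type(value)  # new column
    rows.append(event)


def materialize(schema, rows):
    """Render rows under the current schema; missing fields become None."""
    return [
        {name: None if row.get(name) is None else CASTS[schema[name]](row[name])
         for name in schema}
        for row in rows
    ]


events = [
    {"time": "2023-05-15T12:23:17Z", "EventType": "Price Increase",
     "ProductID": 4567129, "Price": 5.29},
    {"time": "2023-05-15T14:12:49Z", "EventType": "New Product",
     "ProductID": 6784590, "Price": 7.85},
    {"time": "2023-06-11T16:14:32Z", "EventType": "New Product",
     "ProductID": "8790456M", "Price": 7.85, "CarbonNeutral": True},
]

schema, rows = {}, []
for event in events:
    apply_event(schema, rows, event)
table = materialize(schema, rows)

print(schema)
for row in table:
    print(row)
```

After the third event, ProductID is typed as string in the model, the first two rows carry None for CarbonNeutral, and the third row carries 1.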
