Everything Changes
Real events are the core of real-time analytics. Apache Druid® ingests data streams from Kafka topics and other sources. What happens when the structure of those streams changes?
Changing Data Structures Checklist
- Create Tables from Streaming Data
- Use Schema Auto-discovery
Create Tables from Streaming Data
Druid ingests data from Apache Kafka streams, Kafka-compatible streams (such as Confluent and Redpanda), and Amazon Kinesis streams using an ingestion specification. See Ingesting Stream Data for details about planning and using stream ingestion.
The ingestion specification, once submitted for activation, will create a Druid table. As each new event enters the source stream, the event will become a new row in the table, immediately available for query.
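As a rough illustration of what submitting a specification can look like, the Python sketch below prepares an HTTP POST to Druid's supervisor endpoint (/druid/indexer/v1/supervisor). The datasource name, topic name, broker address, and Router URL are all hypothetical, and the specification is a minimal outline under those assumptions rather than a production-ready one.

```python
import json
import urllib.request

# Assumption: a Druid Router on its default local port; adjust for your cluster.
DRUID_URL = "http://localhost:8888"

# Hypothetical datasource, topic, and broker address.
# The column names follow the example events used later in this article.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "product_events",
            "timestampSpec": {"column": "time", "format": "iso"},
            "dimensionsSpec": {
                "dimensions": [
                    "EventType",
                    {"type": "long", "name": "ProductID"},
                    {"type": "double", "name": "Price"},
                ]
            },
            "granularitySpec": {
                "segmentGranularity": "day",
                "queryGranularity": "none",
                "rollup": False,
            },
        },
        "ioConfig": {
            "type": "kafka",
            "topic": "product-events",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}


def build_supervisor_request(spec, base_url=DRUID_URL):
    """Prepare (but do not send) the POST that submits a supervisor spec."""
    return urllib.request.Request(
        url=f"{base_url}/druid/indexer/v1/supervisor",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_supervisor_request(supervisor_spec)
# To actually submit the spec to a running cluster:
# urllib.request.urlopen(request)
```

The request is built but not sent, so the sketch can be inspected without a running cluster.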
Use Schema Auto-discovery
When incoming data changes over time, Druid can automatically update tables to match the changing data structure.
To activate this capability, include "useSchemaDiscovery": true in the dimensionsSpec section of the ingestion specification. By default, this is set to false.
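To show where the flag sits, here is a small Python sketch that builds the dataSchema portion of a specification and prints it as JSON. The datasource name and the granularity settings are illustrative assumptions; only the "useSchemaDiscovery" setting inside dimensionsSpec comes from the text above.

```python
import json

# Hypothetical datasource name and granularity settings, for illustration only.
data_schema = {
    "dataSource": "product_events",
    "timestampSpec": {"column": "time", "format": "iso"},
    # No explicit dimension list: Druid is asked to discover columns and types.
    "dimensionsSpec": {"useSchemaDiscovery": True},
    "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": False,
    },
}

# Python's True serializes to JSON true, as the specification expects.
spec_json = json.dumps(data_schema, indent=2)
print(spec_json)
```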
For example, if an incoming stream includes:
{"time":"2023-05-15T12:23:17Z","EventType":"Price Increase","ProductID":4567129,"Price":5.29}
{"time":"2023-05-15T14:12:49Z","EventType":"New Product","ProductID":6784590,"Price":7.85}
… then Druid will auto-detect data types and create a table:
__time (timestamp) | EventType (string) | ProductID (long) | Price (double) |
2023-05-15T12:23:17Z | Price Increase | 4567129 | 5.29 |
2023-05-15T14:12:49Z | New Product | 6784590 | 7.85 |
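The type detection above can be approximated with a toy model. The Python sketch below is not Druid's implementation; it only maps parsed JSON values to the type names used in the table, and it assumes the "time" field is the one that a timestampSpec would map to __time.

```python
import json

TIMESTAMP_FIELD = "time"  # assumption: the field a timestampSpec maps to __time


def infer_type(value):
    """Map a parsed JSON value to a Druid-style type name (toy model)."""
    if isinstance(value, (bool, int)):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


events = [
    '{"time":"2023-05-15T12:23:17Z","EventType":"Price Increase","ProductID":4567129,"Price":5.29}',
    '{"time":"2023-05-15T14:12:49Z","EventType":"New Product","ProductID":6784590,"Price":7.85}',
]

schema = {}
for line in events:
    for field, value in json.loads(line).items():
        if field == TIMESTAMP_FIELD:
            schema["__time"] = "timestamp"
        else:
            # Keep the first type seen for each column.
            schema.setdefault(field, infer_type(value))

print(schema)
# {'__time': 'timestamp', 'EventType': 'string', 'ProductID': 'long', 'Price': 'double'}
```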
If a future event in the stream is:
{"time":"2023-06-11T16:14:32Z","EventType":"New Product","ProductID":"8790456M","Price":7.85,"CarbonNeutral":true}
… then Druid will realize that two things have changed:
First, the ProductID field can no longer be a long, because the new value contains both digits and letters. Druid automatically changes the type of the ProductID column from long to string.
Second, the stream JSON contains an entirely new field, CarbonNeutral, so Druid adds a new column to the table. Because this field did not exist in earlier events, Druid assigns null to the new column for those rows. In this example, the boolean value true is stored as the long value 1:
__time (timestamp) | EventType (string) | ProductID (string) | Price (double) | CarbonNeutral (long) |
2023-05-15T12:23:17Z | Price Increase | 4567129 | 5.29 | null |
2023-05-15T14:12:49Z | New Product | 6784590 | 7.85 | null |
2023-06-11T16:14:32Z | New Product | 8790456M | 7.85 | 1 |
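The two changes can be modeled with a short Python sketch. It is a simplified illustration, not Druid's actual algorithm: the widening rule (long and double widen to double, anything else to string) is an assumption, and ProductID in the third event is written as a quoted string so that the line parses as JSON.

```python
import json


def infer_type(value):
    """Map a parsed JSON value to a type name (toy model)."""
    if isinstance(value, (bool, int)):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def widen(current, incoming):
    """Pick a type that can hold both; an assumed rule for illustration."""
    if current == incoming:
        return current
    if {current, incoming} == {"long", "double"}:
        return "double"
    return "string"


def ingest(schema, rows, event):
    """Add one event, widening existing columns and adding new ones."""
    for column, value in event.items():
        new_type = infer_type(value)
        schema[column] = widen(schema[column], new_type) if column in schema else new_type
    rows.append(event)


def render(schema, rows):
    """View all rows through the current schema; missing values become None."""
    casts = {"string": str, "long": int, "double": float}
    return [
        {
            column: None if row.get(column) is None else casts[col_type](row[column])
            for column, col_type in schema.items()
        }
        for row in rows
    ]


events = [
    '{"time":"2023-05-15T12:23:17Z","EventType":"Price Increase","ProductID":4567129,"Price":5.29}',
    '{"time":"2023-05-15T14:12:49Z","EventType":"New Product","ProductID":6784590,"Price":7.85}',
    '{"time":"2023-06-11T16:14:32Z","EventType":"New Product","ProductID":"8790456M","Price":7.85,"CarbonNeutral":true}',
]

schema, rows = {}, []
for line in events:
    ingest(schema, rows, json.loads(line))

table = render(schema, rows)
for row in table:
    print(row)
```

After the third event, ProductID is rendered as a string in every row, and CarbonNeutral is None for the two earlier rows and 1 for the newest one.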
As the structure of the incoming stream changes over time, Druid will continue to automatically change the table to match.