Streaming data is generated continuously, in real time, from many sources and at a rapid pace, and it presents a significant challenge: efficiently ingesting and querying ever-changing data as it flows through platforms like Apache Kafka and Amazon Kinesis.
At first glance, one might assume that a strongly typed database can keep pace with streaming data. In reality, the data definition process is a common obstacle: managing schema changes is a significant drag on developer productivity.
Strongly typed databases have rigid structures, so adapting to changes in streaming event data requires coordinated effort between developers and database administrators. That coordination often happens over cumbersome channels such as emails and meetings, resulting in widespread frustration.
Apache Druid: Balancing Performance and Flexibility
Now let’s look at how Apache Druid tackles the challenge of keeping up with changing schemas in streaming data: it offers the performance of a strongly typed data structure with the flexibility of a schemaless one.
Understanding Data Structure Paradigms
To better grasp the significance of this advancement, let’s review the two data structure paradigms:
Strongly Typed Data Structure
| __time (timestamp) | IP (string) | Code (long) | Action (string) |
| --- | --- | --- | --- |
| 1695063046 | 34.121.120.87 | 0 | null |
| 1695063048 | 147.75.40.150 | -1 | connect |
| 1695063049 | 3.163.24.77 | 42 | null |
A strongly typed structure enforces strict type checking, imposing a predefined organization for storing and accessing data. In a database, that structure is the schema, which defines the name, type, and format of each field in a table. Schemas are pivotal for efficient query performance, providing a structured framework for interpreting the data.
| Pros | Cons |
| --- | --- |
| Fast to query | Data that doesn’t fit the schema is dropped, or the whole event fails to ingest |
| Easy to identify all data fields | Most schema changes require taking the table offline |
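To make the rigidity concrete, here is a minimal sketch of how the table above might be declared as an explicit, strongly typed schema in a Druid ingestion spec. This is an illustrative fragment, not a complete spec; the column names simply mirror the table:

```json
"dimensionsSpec": {
  "dimensions": [
    { "type": "string", "name": "IP" },
    { "type": "long",   "name": "Code" },
    { "type": "string", "name": "Action" }
  ]
}
```

Every column must be enumerated up front, which is exactly why a field that isn’t listed here gets dropped on ingestion.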
Schemaless Data Structure
```json
{"time":1695063046, "IP":"34.121.120.87", "code":"0"}
{"time":1695063048, "IP":"147.75.40.150", "code":"-1", "action":"connect"}
{"time":1695063049, "IP":"3.163.24.77", "code":"42", "status":"active"}
```
In contrast, a schemaless database offers more flexibility to developers by storing data in various formats, such as key-value pairs, documents, graphs, or wide-column stores. Each record can have its unique structure, allowing for adaptability to changing data without the need for complex schema modifications. However, this flexibility comes at the cost of analytics query performance and data consistency.
| Pros | Cons |
| --- | --- |
| Every event can have whatever data it needs | Slow to query |
| Data included in each event can change as needs change | Hard to figure out which fields exist in the data set |
Schema Auto-Discovery with Apache Druid
Traditionally, developers faced a difficult choice between these two data structure paradigms. Druid, a real-time analytics database, now combines the performance of a strongly typed data structure with the flexibility of a schemaless one. Schema auto-discovery, introduced in Druid 26.0, plays a pivotal role in achieving this balance. It automates the process of identifying data fields, data types, and schema changes, ensuring that Druid tables evolve seamlessly to accommodate new data without requiring reprocessing of existing data.
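As a sketch of what this looks like in practice, the following Kafka ingestion spec turns on schema auto-discovery by setting useSchemaDiscovery in dimensionsSpec instead of enumerating columns. The topic name, broker address, and datasource name are illustrative assumptions:

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "topic": "ecommerce-events",
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "ecommerce",
      "timestampSpec": { "column": "time", "format": "iso" },
      "dimensionsSpec": { "useSchemaDiscovery": true },
      "granularitySpec": { "segmentGranularity": "hour", "rollup": false }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

Note that dimensionsSpec lists no columns at all: with useSchemaDiscovery set to true, Druid infers both column names and column types from the incoming events.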
Example: Schema Auto-Discovery in a Retail Environment
To showcase how Druid excels in automatically discovering column names and data types as streaming data is ingested in real-time, let’s examine a live stream of events from an e-commerce platform. In this scenario, customer interactions are constantly being recorded.
Step 1: Auto Detection for New Tables
Druid can auto-discover column names and data types during ingestion. Let’s examine a snapshot of the data stream:
```json
{"time":"2023-05-15T12:23:17Z", "Event Type":"Price Increase", "ProductID":4567129, "Price":5.29}
```
| __time (timestamp) | Event Type (string) | Product ID (long) | Price (double) |
| --- | --- | --- | --- |
| 2023-05-15T12:23:17Z | Price Increase | 4567129 | 5.29 |
In this scenario, Druid identifies the dimensions required for analysis: Time, Event Type, Product ID, and Price. Moreover, it intelligently assigns the appropriate data types to each column. For instance, ‘Product ID’ is recognized as a Long integer, while ‘Price’ is identified as a Double.
This streamlined approach simplifies the data ingestion process dramatically. Developers can now seamlessly feed their streaming data into Druid, eliminating the need for extensive manual schema definitions.
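One way to confirm what was discovered is Druid’s native segmentMetadata query, which reports each column’s name and inferred type. The datasource name here carries over from the hypothetical spec above, and the interval is arbitrary:

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "ecommerce",
  "intervals": ["2023-05-15/2023-05-16"]
}
```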
Step 2: Maintenance of Existing Tables as Data Sources Change
Let’s assume this e-commerce platform now has access to real-time location data from customers’ mobile devices. As customers browse the platform and interact with products, the company wants to leverage this new data by offering them promotions and discounts based on their current geographic location.
```json
{"time":"2023-06-15T22:02:51Z", "Event Type":"New Product", "ProductID":"8790456M", "Price":7.85, "Latitude":40.7128, "Longitude":74.0060}
```
| __time (timestamp) | Event Type (string) | Product ID (string) | Price (double) | Latitude (double) | Longitude (double) |
| --- | --- | --- | --- | --- | --- |
| 2023-05-15T12:23:17Z | Price Increase | 4567129 | 5.29 | null | null |
| 2023-06-15T22:02:51Z | New Product | 8790456M | 7.85 | 40.7128 | 74.0060 |
As you can see from the table above, Druid automatically evolved the schema to match the incoming streaming data. This involved two things:

1. Auto-detecting data type changes: Druid changed the data type of the ProductID dimension from long to string to accommodate the new product’s identifier format.
2. Modifying the table when dimensions or data types are added, dropped, or changed in the source data: Druid also automatically discovered the new location data and added two new columns, Latitude and Longitude, with the appropriate double data type.
Why is this important?
By adding columns for customer location on the fly, the e-commerce platform can analyze location data in real time and offer customers relevant discounts or promotions based on where they are, enhancing the shopping experience.
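As an illustrative sketch, a native scan query against the same hypothetical ecommerce datasource could pull just the location-tagged events by filtering on the auto-discovered Latitude column; the interval and row limit are arbitrary:

```json
{
  "queryType": "scan",
  "dataSource": "ecommerce",
  "intervals": ["2023-06-15/2023-06-16"],
  "columns": ["__time", "Event Type", "ProductID", "Price", "Latitude", "Longitude"],
  "filter": {
    "type": "not",
    "field": { "type": "selector", "dimension": "Latitude", "value": null }
  },
  "limit": 100
}
```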
Druid: Uniquely Built for Analyzing Streams
Apache Druid is the leading database for real-time analytics, ingesting and querying streaming data at subsecond speed and scale. From its inception, Druid was designed for real-time analytics on streaming data. With native, connector-less support for all leading streaming platforms, including Kafka, Kinesis, and Pulsar, Druid makes each event available for querying the moment it arrives, with high data reliability. And with schema auto-discovery, developers can trust that every Druid table will match the incoming stream, even as it evolves.