Streaming data is generated continuously, in real time, from many sources and at a rapid pace, and it presents a significant challenge: efficiently ingesting and querying ever-changing data as it flows through platforms like Apache Kafka and Amazon Kinesis.
At first glance, one might assume that a strongly typed database can keep pace with streaming data. In reality, the data definition process is a common obstacle: managing schema changes is a significant drag on developer productivity.
Strongly typed databases have rigid structures, so adapting to changes in streaming event data requires coordinated effort between developers and database administrators. That coordination often happens over cumbersome channels such as emails and meetings, resulting in widespread frustration.
Apache Druid: Balancing Performance and Flexibility
Now let’s look at how Apache Druid tackles the challenge of keeping up with changing schemas in streaming data: it offers the performance of a strongly typed data structure with the flexibility of a schemaless one.
Understanding Data Structure Paradigms
To better grasp the significance of this advancement, let’s review the two data structure paradigms:
Strongly Typed Data Structure
| __time (timestamp) | IP (string) | Code (long) | Action (string) |
| --- | --- | --- | --- |
| 1695063046 | 34.121.120.87 | 0 | null |
| 1695063048 | 147.75.40.150 | -1 | connect |
| 1695063049 | 3.163.24.77 | 42 | null |
A strongly typed structure enforces strict type checking, imposing a predefined organization for storing and accessing data. In a database, that structure is the schema, which defines the name, type, and format of each field in a table. Schemas are pivotal for efficient query performance, providing a structured framework for interpreting the data.
| Pros | Cons |
| --- | --- |
| Fast to query | Data that doesn’t fit the schema is dropped, or the whole event fails to ingest |
| Easy to identify all data fields | Most schema changes require taking the table offline |
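To make the rigidity concrete, here is a minimal sketch of how the table above might be declared as an explicit, strongly typed schema in a Druid ingestion spec. This is an illustrative fragment, not a complete spec; the column names simply mirror the table:

```json
"dimensionsSpec": {
  "dimensions": [
    { "type": "string", "name": "IP" },
    { "type": "long",   "name": "Code" },
    { "type": "string", "name": "Action" }
  ]
}
```

Every column must be enumerated up front, which is exactly why a field that isn’t listed here gets dropped on ingestion.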
Schemaless Data Structure
```json
{"time":1695063046, "IP":"34.121.120.87", "code":"0"}
{"time":1695063048, "IP":"147.75.40.150", "code":"-1", "action":"connect"}
{"time":1695063049, "IP":"3.163.24.77", "code":"42", "status":"active"}
```
In contrast, a schemaless database offers more flexibility to developers by storing data in various formats, such as key-value pairs, documents, graphs, or wide-column stores. Each record can have its unique structure, allowing for adaptability to changing data without the need for complex schema modifications. However, this flexibility comes at the cost of analytics query performance and data consistency.
| Pros | Cons |
| --- | --- |
| Every event can have whatever data it needs | Slow to query |
| Data included in each event can change as needs change | Hard to figure out which fields exist in the data set |
Schema Auto-Discovery with Apache Druid
Traditionally, developers faced a difficult choice between these two data structure paradigms. Druid, a real-time analytics database, now combines the performance of a strongly typed data structure with the flexibility of a schemaless one. Schema auto-discovery, introduced in Druid 26.0, plays a pivotal role in achieving this balance. It automates the process of identifying data fields, data types, and schema changes, ensuring that Druid tables evolve seamlessly to accommodate new data without requiring reprocessing of existing data.
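As a sketch of what this looks like in practice, the following Kafka ingestion spec turns on schema auto-discovery by setting useSchemaDiscovery in dimensionsSpec instead of enumerating columns. The topic name, broker address, and datasource name are illustrative assumptions:

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "topic": "ecommerce-events",
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "ecommerce",
      "timestampSpec": { "column": "time", "format": "iso" },
      "dimensionsSpec": { "useSchemaDiscovery": true },
      "granularitySpec": { "segmentGranularity": "hour", "rollup": false }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

Note that dimensionsSpec lists no columns at all: with useSchemaDiscovery set to true, Druid infers both column names and column types from the incoming events.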
Example: Schema Auto-Discovery in a Retail Environment
To showcase how Druid excels in automatically discovering column names and data types as streaming data is ingested in real-time, let’s examine a live stream of events from an e-commerce platform. In this scenario, customer interactions are constantly being recorded.
Step 1: Auto Detection for New Tables
Druid can auto-discover column names and data types during ingestion. Let’s examine a snapshot of the data stream:
```json
{"time":"2023-05-15T12:23:17Z", "Event Type":"Price Increase", "ProductID":4567129, "Price":5.29}
```
| __time (timestamp) | Event Type (string) | Product ID (long) | Price (double) |
| --- | --- | --- | --- |
| 2023-05-15T12:23:17Z | Price Increase | 4567129 | 5.29 |
In this scenario, Druid identifies the dimensions required for analysis: Time, Event Type, Product ID, and Price. Moreover, it intelligently assigns the appropriate data types to each column. For instance, ‘Product ID’ is recognized as a Long integer, while ‘Price’ is identified as a Double.
This streamlined approach simplifies the data ingestion process dramatically. Developers can now seamlessly feed their streaming data into Druid, eliminating the need for extensive manual schema definitions.
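One way to confirm what was discovered is Druid’s native segmentMetadata query, which reports each column’s name and inferred type. The datasource name here carries over from the hypothetical spec above, and the interval is arbitrary:

```json
{
  "queryType": "segmentMetadata",
  "dataSource": "ecommerce",
  "intervals": ["2023-05-15/2023-05-16"]
}
```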
Step 2: Maintenance of Existing Tables as Data Sources Change
Let’s assume this e-commerce platform now has access to real-time location data from customers’ mobile devices. As customers browse the platform and interact with products, the company wants to leverage this new data by offering them promotions and discounts based on their current geographic location.
```json
{"time":"2023-06-15T22:02:51Z", "Event Type":"New Product", "ProductID":"8790456M", "Price":7.85, "Latitude":40.7128, "Longitude":74.0060}
```
| __time (timestamp) | Event Type (string) | Product ID (string) | Price (double) | Latitude (double) | Longitude (double) |
| --- | --- | --- | --- | --- | --- |
| 2023-05-15T12:23:17Z | Price Increase | 4567129 | 5.29 | null | null |
| 2023-06-15T22:02:51Z | New Product | 8790456M | 7.85 | 40.7128 | 74.0060 |
As you can see from the table above, Druid automatically evolved the schema to match the incoming streaming data. This involved two things:

1. Auto-detecting data type changes: Druid changed the data type of the ProductID dimension from long to string to accommodate the new product’s identifier format.
2. Modifying the table when dimensions or data types are added, dropped, or changed in the source data: Druid also automatically discovered the new location data and added two new columns, Latitude and Longitude, with the appropriate double data type.
Why is this important?
By adding columns for customer location on the fly, the e-commerce platform can analyze location data in real time and offer customers relevant discounts or promotions based on where they are, enhancing the shopping experience.
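As an illustrative sketch, a native scan query against the same hypothetical ecommerce datasource could pull just the location-tagged events by filtering on the auto-discovered Latitude column; the interval and row limit are arbitrary:

```json
{
  "queryType": "scan",
  "dataSource": "ecommerce",
  "intervals": ["2023-06-15/2023-06-16"],
  "columns": ["__time", "Event Type", "ProductID", "Price", "Latitude", "Longitude"],
  "filter": {
    "type": "not",
    "field": { "type": "selector", "dimension": "Latitude", "value": null }
  },
  "limit": 100
}
```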
Druid: Uniquely Built for Analyzing Streams
Apache Druid is the leading database for real-time analytics, ingesting and querying streaming data at subsecond speed and scale. From its inception, Druid was designed for real-time analytics on streaming data. With native, connector-less support for all leading streaming platforms, including Kafka, Kinesis, and Pulsar, Druid makes each event available for querying the moment it arrives, with high data reliability. And with schema auto-discovery, developers can trust that every Druid table will match the incoming stream, even as it evolves.