Overcome tradeoffs with schemaless databases

Schemaless databases, such as MongoDB and other NoSQL variants, give developers a great deal of flexibility, but that flexibility often comes with trade-offs, particularly in query performance. In this article, we explore the challenges posed by schemaless databases and introduce Apache Druid, a database that combines schema flexibility with high query performance, so you no longer have to choose between the two.

The Emergence of Schemaless Databases

Challenge	Description
Flexibility vs. Performance	Schemaless databases can deliver suboptimal query performance due to factors like metadata lookups and the absence of strongly-typed data
Complex Analytical Queries	The absence of a fixed schema can complicate analytical queries, requiring additional metadata lookups and potentially resulting in slower query processing
Full Table Scans	Schemaless databases may resort to full table scans in some scenarios, especially with large datasets, causing significant performance bottlenecks
Indexing and Query Planning	Optimizing indexing and query planning can be more challenging in schemaless databases than in relational databases with fixed schemas

Figure 1: Summary of Challenges of Schemaless Databases

Schemaless databases emerged as a response to the limitations of traditional relational databases. They store data in a variety of models, such as key-value pairs, documents, graphs, or wide columns. Each record can have its own structure, so the database adapts to changing data without complex schema modifications. This flexibility in data modeling supports dynamic or evolving data structures and makes schemaless databases well suited to development environments where requirements change rapidly.

Unfortunately, this flexibility has costs. First, because there is no rigid schema, data can vary in structure, which makes analytical queries more complex and computationally intensive. Queries often require additional metadata lookups to determine the data’s structure, introducing overhead that slows them down compared with databases that have fixed schemas. Second, schemaless databases may resort to full table scans in some cases, where every document or record in the dataset must be examined to find relevant information; this leads to significantly slower queries, particularly on large datasets. Lastly, the less structured nature of the data makes it hard to optimize indexing and query planning, whereas relational databases can rely on column-specific indexes.
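To make the scan-versus-index point concrete, here is a minimal Python sketch. The records, field names, and helper functions are hypothetical and do not model any particular database; the sketch only contrasts checking every record with a lookup in a prebuilt index.

```python
# Hypothetical records with varying structure, as a schemaless store might hold them.
records = [
    {"id": 1, "event": "view", "product_id": 4567129},
    {"id": 2, "event": "purchase", "product_id": "8790456M", "price": 7.85},
    {"id": 3, "event": "view"},  # this record has no product_id at all
]


def full_scan(records, field, value):
    """Examine every record; each one must first be checked for the field."""
    return [r["id"] for r in records if field in r and r[field] == value]


def build_index(records, field):
    """Build a value -> record ids lookup for one field, similar in spirit to a column index."""
    index = {}
    for r in records:
        if field in r:
            index.setdefault(r[field], []).append(r["id"])
    return index


# The index is built once; later lookups avoid touching every record.
event_index = build_index(records, "event")
scan_hits = full_scan(records, "event", "view")
index_hits = event_index.get("view", [])
```

Both approaches return the same record ids here, but the full scan does work proportional to the number of records on every query, while the index lookup does not.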

Benefits of Schema Auto-Discovery in Druid

Druid is the first analytics database that can provide the performance of a strongly-typed data structure with the flexibility of a schemaless data structure. Schema auto-discovery, introduced in Druid 26.0, enables Druid to automatically discover data fields and data types and to update tables to match changing data. Druid inspects the ingested data, identifies which dimensions need to be created, and determines the data type for each dimension’s column. Even better, when the schema of the source data changes (dimensions or data types are added, dropped, or changed), Druid automatically detects the change and adjusts its tables to match the new schema, without requiring existing data to be reprocessed.
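As a rough illustration of how this might be switched on, the sketch below builds the `dataSchema` portion of an ingestion spec as a Python dictionary and serializes it to JSON. It assumes the `useSchemaDiscovery` flag inside `dimensionsSpec`, as described in the Druid documentation for version 26.0 and later; the datasource name `ecommerce_events` and the timestamp column name `timestamp` are hypothetical, and the rest of a complete spec (input source, tuning, and so on) is omitted. Check the field names against the documentation for your Druid version before relying on them.

```python
import json


def build_data_schema(datasource, timestamp_column):
    """Return a dataSchema fragment that lists no dimensions explicitly.

    Assumption: with useSchemaDiscovery enabled, Druid discovers column
    names and types from the ingested data instead of a declared list.
    """
    return {
        "dataSource": datasource,  # hypothetical table name
        "timestampSpec": {"column": timestamp_column, "format": "iso"},
        "dimensionsSpec": {"useSchemaDiscovery": True},
    }


data_schema = build_data_schema("ecommerce_events", "timestamp")
data_schema_json = json.dumps(data_schema, indent=2)
```

Note that no dimension names or types appear anywhere in the fragment; that omission is the point of the feature.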

Example: Real-time Analytics with Schema Auto-Discovery

Step 1: Auto Detection for New Tables

To showcase how Druid delivers the flexibility of a schemaless database for real-time analytics, let’s look at how an e-commerce company can make fast, informed decisions from a live stream of data. In this scenario, customer interactions are constantly being recorded.

Druid can auto-discover column names and data types during ingestion. Let’s examine a snapshot of the data stream:

__time (Timestamp)	Event Type (String)	Product ID (Long)	Price (Double)
2023-05-15T12:23:17Z	Price Increase	4567129	5.29

In this scenario, Druid identifies the dimensions required for analysis: Time, Event Type, Product ID, and Price. Moreover, it intelligently assigns the appropriate data types to each column. For instance, ‘Product ID’ is recognized as a Long integer, while ‘Price’ is identified as a Double.
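The idea of assigning a type to each column can be sketched in a few lines of Python. This is an illustration only, not Druid's implementation: the `infer_type` and `infer_schema` helpers are hypothetical, the type names simply mirror the labels in the snapshot above, and the `__time` column is left out because a timestamp column is handled separately from ordinary dimensions.

```python
def infer_type(value):
    """Map a Python value to one of the type labels used in the snapshot."""
    # bool is a subclass of int in Python, so handle it before the int check.
    if isinstance(value, bool):
        return "String"
    if isinstance(value, int):
        return "Long"
    if isinstance(value, float):
        return "Double"
    return "String"


def infer_schema(record):
    """Infer a column -> type mapping from one record, skipping null values."""
    return {name: infer_type(value) for name, value in record.items() if value is not None}


# The event from the snapshot, minus the timestamp column.
event = {"Event Type": "Price Increase", "Product ID": 4567129, "Price": 5.29}
schema = infer_schema(event)
```

Here `schema` maps `Event Type` to String, `Product ID` to Long, and `Price` to Double, matching the table.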

This streamlined approach simplifies the data ingestion process dramatically. Developers can now seamlessly feed their streaming data into Druid, eliminating the need for extensive manual schema definitions.

Step 2: Maintenance of Existing Tables as Data Sources Change

Let’s assume this e-commerce platform now has access to real-time location data from customers’ mobile devices. As customers browse the platform and interact with products, the company wants to leverage this new data by offering them promotions and discounts based on their current geographic location.

__time (Timestamp)	Event Type (String)	Product ID (String)	Price (Double)	Latitude (Double)	Longitude (Double)
2023-05-15T12:23:17Z	Price Increase	4567129	5.29	null	null
2023-06-15T22:02:51Z	New Product	8790456M	7.85	40.7128	74.0060

As you can see from the above table, Druid automatically evolved the schema to match the incoming streaming data. This involved two things:

Auto-detecting data type changes

Druid changed the data type of the Product ID dimension from “Long” to “String” to accommodate the new product’s identifier format (8790456M contains a letter, so it can no longer be stored as a Long).

Modifying Druid tables when dimensions or data types are added, dropped, or changed in the source data

Druid also automatically discovered the new location data and added two new columns for latitude and longitude with the appropriate “Double” data type.
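Both behaviors can be sketched together in Python. This is a simplified model under stated assumptions, not Druid's actual type-resolution logic: the widening order (Long, then Double, then String) and the helpers `infer_type`, `widen`, `evolve_schema`, and `read_row` are all hypothetical and exist only to show how a type can be widened and how older rows read as null for newly added columns.

```python
# Assumed widening order for this sketch: a later entry can represent an earlier one.
WIDENING_ORDER = ["Long", "Double", "String"]


def infer_type(value):
    """Map a Python value to a type label (bool is treated as String here)."""
    if isinstance(value, bool):
        return "String"
    if isinstance(value, int):
        return "Long"
    if isinstance(value, float):
        return "Double"
    return "String"


def widen(current, incoming):
    """Return whichever of the two types comes later in the widening order."""
    return max(current, incoming, key=WIDENING_ORDER.index)


def evolve_schema(schema, record):
    """Return a new schema that also accommodates the given record."""
    evolved = dict(schema)
    for name, value in record.items():
        if value is None:
            continue  # a null carries no type information
        incoming = infer_type(value)
        evolved[name] = widen(evolved[name], incoming) if name in evolved else incoming
    return evolved


def read_row(schema, record):
    """Project a record onto the schema; columns it lacks read as None (null)."""
    return {name: record.get(name) for name in schema}


old_schema = {"Event Type": "String", "Product ID": "Long", "Price": "Double"}
old_event = {"Event Type": "Price Increase", "Product ID": 4567129, "Price": 5.29}
new_event = {
    "Event Type": "New Product",
    "Product ID": "8790456M",
    "Price": 7.85,
    "Latitude": 40.7128,
    "Longitude": 74.0060,
}

new_schema = evolve_schema(old_schema, new_event)
old_row = read_row(new_schema, old_event)
```

After the second event, `new_schema` types `Product ID` as String and gains Double columns for `Latitude` and `Longitude`, while `old_row` shows null for both location columns, as in the first row of the table.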

Why is this important?

By adding columns for customer location on the fly, this e-commerce platform can analyze location data as it arrives and offer customers relevant discounts or promotions based on where they are, significantly enhancing their shopping experience.

Druid: Uniquely built for real-time analytics

Apache Druid is the leading database for real-time analytics, ingesting and querying streaming data with subsecond latency at scale. From its inception, Druid was designed to enable real-time analytics on streaming data. With native, connector-less support for all leading streaming platforms, including Kafka, Kinesis, and Pulsar, Druid makes each event available for querying immediately and reliably. And with support for schema auto-discovery, developers are assured that every Druid table will match incoming streaming data, even as the streams evolve.

Schema Auto-Discovery in Apache Druid: A Developer’s Advantage

In summary, Apache Druid’s schema auto-discovery feature allows developers to leverage the flexibility of schemaless databases without sacrificing query performance. This capability facilitates real-time analytics and data-driven decision-making, making it easier for developers to work with dynamic data structures. With Druid, you get both data flexibility and performance, making it a practical choice for agile development.