Overview
As digital technologies embed themselves into more aspects of everyday life, the Internet of Things (IoT) is becoming increasingly important. A catchall term for sensors, monitors, and other smart machines, the IoT bridges the gap between the digital and physical worlds, enabling data to be analyzed, processed, and acted on.
Today, IoT data provides value to every sector of the economy, from manufacturing to logistics to retail. By collecting telemetry and metrics from drones, delivery trucks, medical devices, construction equipment, security cameras, and much more, IoT devices offer a real-time picture of operating environments.
For instance, sensors can detect the angle of the sun and move solar panels into the optimal position to catch the most sunlight. Smart meters on streetlights can assess pedestrian traffic, analyze electricity usage, and alter the brightness accordingly. Devices on assembly lines can detect outliers (such as high temperatures) and trigger events in response (such as production stoppages or fire suppression).
In all of these scenarios, IoT machines enable clearer insights and more efficient operations. By more easily understanding usage patterns, companies can predict shifts in customer behavior, design the next generation of products, automate routine tasks (thus freeing up human employees for more interesting duties), and prevent issues before they occur.
Requirements
However, IoT data also brings unique challenges for organizations. This data tends to be real-time in nature, arriving in high volumes and in a variety of formats, including metrics, telemetry, and logs. To complicate matters, IoT data is often ingested through massive, high-speed streams and is highly perishable, especially if teams don’t have the proper data infrastructure in place.
In order to leverage IoT data to the fullest, organizations require a database that can:
Support ad-hoc queries across massive time series datasets. IoT devices generate huge quantities of timestamped data. However, many traditional databases are not optimized for collecting, organizing, storing, and analyzing time series data. Some products may lack key features such as densification and gap filling, while others may ingest or structure their time series data inefficiently, causing query latency or duplicating data points.
Query data on arrival in order to preserve the freshness of IoT data. To extract maximum value from IoT data, teams need to act on events in seconds or less, as anomalies can quickly escalate into more serious issues without prompt intervention. As a result, teams need a data platform that can ingest IoT events at high speed, make them immediately available for querying, and return subsecond results—regardless of user traffic.
Scale elastically. IoT data can fluctuate wildly depending on time of day, location, movement, or many other variables. In order to keep pace with changing data volumes, databases have to scale seamlessly and automatically, ideally without human intervention or downtime.
Ensure speed, regardless of scale and concurrency. Even as IoT datasets grow into terabytes and petabytes, databases still need to provide subsecond performance. Likewise, as the number of concurrent users and queries increases, databases must continue to return fast results under load.
Provide robust reliability and durability. IoT devices continuously generate data, which means that databases cannot afford any downtime or data loss. In addition, IoT data accumulates over time, so databases also need an efficient, long-term storage layer.
Solution
Built for speed, scale, and streaming data, Apache Druid can easily handle the challenges of IoT data. Druid provides native support for streaming data technologies like Apache Kafka and Amazon Kinesis, removing the need for external connectors or additional workarounds.
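As a rough illustration, the sketch below uses Python to submit a Kafka ingestion supervisor to Druid’s REST API. The datasource name, topic, column names, and router URL are hypothetical, and the exact spec fields can vary by Druid version, so treat this as a minimal sketch rather than a drop-in configuration.

```python
import requests

# Hypothetical supervisor spec for streaming IoT telemetry from a Kafka topic
# into a Druid datasource. Field names and values are illustrative only.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "iot-telemetry",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {
                "dimensions": ["device_id", "sensor_type", "location", "source_ip"]
            },
            "granularitySpec": {
                "segmentGranularity": "hour",
                "queryGranularity": "none",
            },
        },
        "ioConfig": {
            "topic": "iot-telemetry",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "kafka-broker:9092"},
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submit the spec to the supervisor endpoint (assumed here to be reachable
# through the Druid Router on port 8888).
response = requests.post(
    "http://localhost:8888/druid/indexer/v1/supervisor",
    json=supervisor_spec,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```

Once the supervisor is running, Druid continuously reads new events from the topic and makes them queryable without any separate batch loading step.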
Druid makes data immediately available for querying, a key advantage given the time-sensitive nature of IoT data. While many databases ingest events in batches and persist them to files before users can access them, Druid ingests streaming data event by event, directly into memory on its data nodes, where it can be queried right away.
Druid is also designed for both reliability and durability. After ingestion, events are processed and organized into columns and segments, then persisted to a deep storage layer that serves as a continuous backup. If a node (server) goes offline, its workload is distributed across the remaining nodes, and the data previously stored on the failed node is pulled from deep storage and loaded onto other nodes. This ensures that data remains available even during failures.
While Druid’s architecture incorporates concepts from time series databases (such as time-based partitioning and fast aggregation), Druid goes further with its extensive analytics capabilities. For instance, teams can use Druid to slice and dice operations across multiple dimensions, group data by parameters such as location or sensor type, and filter results by criteria such as IP address. This flexibility in aggregating data makes it easier for data engineers to find the answers they need.
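For example, such a slice-and-dice query can be expressed in Druid SQL and submitted over HTTP. The sketch below assumes the same hypothetical "iot-telemetry" datasource with location, sensor_type, temperature, and source_ip columns; none of these names come from the text above.

```python
import requests

# Hypothetical Druid SQL query: bucket the last hour of readings by minute,
# group by location and sensor type, and filter on a single source IP.
sql = """
SELECT
  TIME_FLOOR(__time, 'PT1M') AS minute,
  location,
  sensor_type,
  AVG(temperature) AS avg_temperature,
  COUNT(*) AS readings
FROM "iot-telemetry"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
  AND source_ip = '10.0.0.42'
GROUP BY 1, 2, 3
ORDER BY 1
"""

# Druid's SQL endpoint, assumed here to be reachable through the Router on port 8888.
response = requests.post(
    "http://localhost:8888/druid/v2/sql",
    json={"query": sql},
    timeout=30,
)
response.raise_for_status()
for row in response.json():
    print(row)
```

The same pattern works for ad-hoc exploration: changing the grouping columns or filters requires only a new SQL string, not a new ingestion pipeline.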
Lastly, Druid can scale seamlessly, thanks to its unique relationship between storage and compute. Druid processes are divided into master nodes, which manage data availability and ingestion; query nodes, which execute queries and return results; and data nodes, which store data and run ingestion tasks. Each node type can be scaled independently depending on usage.
Druid also includes deep storage, a common data store for increased data durability and streamlined scaling. When nodes are scaled up or down to keep pace with demand, Druid simply pulls data from deep storage, rebalancing the data segments on each node to maintain performance. In addition, deep storage serves as a safeguard, ensuring that data will be persisted, readily available to all Druid nodes, and protected from loss.
Customer story
The cloud intelligence arm of Cisco Systems, ThousandEyes provides visibility into network, application, cloud, and Internet performance and traffic. ThousandEyes counts 180 of the Fortune 500 companies, all of the top 10 US banks, and innovative organizations including DigitalOcean, Twitter, DocuSign, and eBay among its clients.
Nearly everything in a modern enterprise ecosystem, such as network switches, wireless access points, routers, and load balancers, generates large amounts of streaming data, which must be ingested, analyzed, queried, and compared against historical baselines in real time to keep performance reliable and consistent. ThousandEyes users need to explore their data in an open-ended manner, such as isolating network traffic by country, determining which regions or sources are causing issues, and drilling down to identify root causes.
At its inception, ThousandEyes used MongoDB, a transactional database that proved unsuitable for running analytics on these massive quantities of IoT sensor data. With MongoDB, ThousandEyes dashboards took 15 minutes or more to load, complicating troubleshooting and making it impossible to access and analyze data with the speed that customers require.
After examining alternatives, the ThousandEyes team adopted Druid. “Druid is basically the best ecosystem for handling large amounts of data,” Lead Software Engineer Gabe Garcia says. “We have between five to 20 requests per second, and we have subsecond latency for most of our queries (our 98th percentile latency is about a second).”
For ThousandEyes customers, the move to Druid provided a tenfold reduction in dashboard latency and, more importantly, enabled them to visualize their entire network topology across all of their devices, so they could track down interface problems in seconds and resolve (or even prevent) network issues.
For more information about Druid, read our architecture guide.
For the easiest way to get started with real-time analytics, start a free trial of Polaris, the fully managed Druid database-as-a-service from Imply.