Overview
As the world becomes increasingly digital, data becomes increasingly significant. Its growth opens up endless possibilities for optimizing business operations, building advanced algorithms, informing decision making, and much more.
There are many scenarios where teams need to explore large amounts of data in an open-ended manner. For instance, an SRE team troubleshooting a microservices architecture will need to dissect their data in order to diagnose and debug the problem. Data engineers, analysts, and scientists need to investigate anomalies, find unusual patterns, and mine data for insights. Technical support teams need to parse reams of data from IoT sensors in order to isolate and investigate outages and performance issues.
Each scenario requires robust dashboards for investigating data from multiple perspectives, using different visualizations such as choropleth maps, stacked area charts, and bar charts. The best dashboards enable teams to isolate by dimensions, filter by measures, and slice and dice as needed.
However, many dashboards still run on legacy technologies and cloud data warehouses, which were not designed for speed, flexible data exploration, or high numbers of users and queries. In the past, product designers assumed that many dashboard uses (such as executive reports or data science analyses) were not time-sensitive, and so users could end up waiting 20 minutes (or more) to run queries or other operations on their data.
Many dashboards also offer little adaptability when it comes to unrestricted data exploration. Because they are limited by their database architecture, these dashboards tend to be templated, with a limited range of widgets and finite drill-down options; what functionality does exist is often slow and hard to use.
Requirements
Time is of the essence, especially since today's organizations succeed (or fail) based on how quickly they can act on fast-paced, real-time data.
Users are not always sure what they need to do or look for, which makes it difficult to prepare data ahead of time using common workarounds for traditional databases, such as pre-aggregation, precomputation, or summarizing and rolling up data in advance. For instance, if a hypothetical SRE is optimizing the performance of a microservices environment, they cannot know in advance what data to aggregate.
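To make this concrete, the short sketch below shows why pre-aggregation only works when the question is known in advance: once a dimension has been rolled away, the summary table cannot answer a new question about it. The events, field names, and the follow-up question are all invented for this illustration.

```python
# Hypothetical illustration of why pre-aggregation breaks down for
# open-ended exploration. All data and field names are invented.
from collections import defaultdict

raw_events = [
    {"service": "checkout", "endpoint": "/pay",  "hour": "2024-01-01T10", "latency_ms": 120},
    {"service": "checkout", "endpoint": "/cart", "hour": "2024-01-01T10", "latency_ms": 45},
    {"service": "search",   "endpoint": "/q",    "hour": "2024-01-01T10", "latency_ms": 310},
]

# The workaround: roll events up ahead of time by (service, hour) so that the
# summary table stays small and queries against it stay fast.
rollup = defaultdict(lambda: {"count": 0, "total_latency_ms": 0})
for event in raw_events:
    key = (event["service"], event["hour"])
    rollup[key]["count"] += 1
    rollup[key]["total_latency_ms"] += event["latency_ms"]

# The rollup answers the question it was built for: average latency per service.
for (service, hour), agg in rollup.items():
    print(service, hour, agg["total_latency_ms"] / agg["count"])

# ...but a new question ("which endpoint is slow?") cannot be answered,
# because "endpoint" was aggregated away when the rollup was designed.
```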
In order to successfully explore their data, users require a wide variety of flexible visualizations, including the ability to filter by time, isolate by variables such as location, and zoom in on specific subsets of data. Dashboards also need to accommodate different data types, such as parent-child relationships, nested columns, and more.
They also need depth of insight to successfully complete their tasks. For instance, an online retailer may want to see the aggregate pricing of a product (or product category) over time to analyze profits and losses, adjust profit margins, and improve existing products or develop new ones. Similarly, engineers investigating why a wind turbine broke down need to dissect thousands of variables and analyze data without constraints on filters or dimensions.
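As an illustration, the retailer scenario above might be expressed as a Druid SQL query sent over Druid's HTTP SQL API (a sketch only; the host, the sales datasource, and the product_category, revenue, and cost columns are hypothetical).

```python
# A minimal sketch of the retailer example as a Druid SQL query submitted to
# Druid's SQL endpoint. Datasource and column names are hypothetical.
import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"  # Router's default port

query = """
SELECT
  TIME_FLOOR(__time, 'P1D') AS "day",
  product_category,
  SUM(revenue) AS total_revenue,
  SUM(revenue) - SUM(cost) AS gross_profit
FROM sales
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY 1, 2
ORDER BY 1, 2
"""

response = requests.post(DRUID_SQL_URL, json={"query": query})
response.raise_for_status()
for row in response.json():
    print(row["day"], row["product_category"], row["gross_profit"])
```

Because the grouping and aggregation happen at query time, the same datasource can immediately answer a different question (say, margins by region) without any advance preparation.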
In addition, many more people, such as data scientists, product owners, and even customers, are now empowered to explore data. Dashboards need to handle high numbers of users and queries, since each user action (a zoom, a drag and drop, a drill down) is another query on the backend. Databases that were not designed to handle analytics under load will struggle to return results in a rapid, near real-time manner.
Any such database must also scale cheaply and effectively. If an organization generates billions of events per day, that equates to terabytes (or petabytes) of data over weeks or months, which is a challenge to ingest, store, and query. Many database architectures are not equipped to deal with high data and query volumes at low latency. For instance, transactional (OLTP) databases have speed but not scale, whereas analytical (OLAP) databases have scale but not speed.
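As a rough back-of-the-envelope illustration of that volume (the event rate and per-event size below are assumptions chosen only to show the scale, not figures from any specific deployment):

```python
# Back-of-the-envelope estimate of how "billions of daily events" becomes
# terabytes over weeks. Both inputs are assumptions for illustration only.
events_per_day = 2_000_000_000   # assume 2 billion events per day
bytes_per_event = 500            # assume ~500 bytes per event on average

daily_bytes = events_per_day * bytes_per_event
print(f"~{daily_bytes / 1e12:.1f} TB per day")         # ~1.0 TB per day
print(f"~{daily_bytes * 30 / 1e12:.0f} TB per month")  # ~30 TB per month
print(f"~{daily_bytes * 365 / 1e15:.2f} PB per year")  # ~0.37 PB per year
```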
Solution
To succeed, organizations need a database that can offer the best of both worlds—the speed and concurrency of OLTP with the scale of OLAP.
Apache Druid is that database. Built for speed, scale, and streaming data, Druid ingests data and makes it immediately available for queries, analysis, and other operations—no need for batching or preparing data beforehand. Druid can power a wide range of visualizations, enabling greater dimensionality, more flexible exploration, and subsecond responses—regardless of the number of users or queries. With native support for Apache Kafka and Amazon Kinesis, Druid can also support streaming data without relying on additional workarounds or costly connectors.
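To sketch what native streaming ingestion looks like in practice, the snippet below submits a Kafka ingestion supervisor spec to Druid's supervisor API. The endpoint path is Druid's standard supervisor API, but the host, topic, datasource, and column names are hypothetical, many optional fields are omitted, and exact fields vary by version; consult the Druid documentation for the full spec.

```python
# A minimal sketch of a Kafka ingestion supervisor spec for Druid.
# Topic, datasource, and column names are hypothetical.
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "page", "country"]},
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "none"},
        },
        "ioConfig": {
            "topic": "clickstream-events",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "kafka-broker:9092"},
            "taskCount": 1,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submit the spec; Druid begins consuming from Kafka, and rows become
# queryable as they arrive, with no separate batch-preparation step.
resp = requests.post("http://localhost:8888/druid/indexer/v1/supervisor", json=supervisor_spec)
resp.raise_for_status()
print(resp.json())
```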
Thanks to its unique architecture, Druid can also scale independently and seamlessly. Druid processes are divided into master nodes, which manage data availability and coordinate ingestion; query nodes, which receive queries and return results; and data nodes, which store data and execute ingestion tasks. Each node type can be scaled independently (and traffic is rebalanced automatically), ensuring that performance keeps pace with demand.
In contrast, cloud data warehouses (CDWs) are not designed for subsecond performance at scale. While CDWs can execute complex queries, they cannot do so rapidly, in real time, with high volumes of data, or under heavy user and query load, making them a suboptimal choice for a highly interactive, exploratory frontend.
To enhance the Druid experience, Imply provides Pivot, an engine that streamlines the creation of dashboards and graphics such as heat maps, stacked area charts, bar graphs, sunbursts, and more. Built on the Druid architecture, Pivot dashboards are optimized for interactivity, ease of use, and speed under load. In addition, Pivot can accommodate different data types and automatically optimize schemas for faster, more efficient query results.
Customer story
One of the first (and largest) CRM software-as-a-service providers, Salesforce has operated for over two decades and serves more than 150,000 customers worldwide. In 2022, Salesforce was ranked #1 for CRM applications, customer service applications, and marketing campaign management applications by IDC, a leading market intelligence firm, and continues to report billions of dollars in revenue each year.
Within Salesforce, the Edge Intelligence team is responsible for ingesting, processing, filtering, aggregating, and querying billions to trillions of log lines per day. In total, Salesforce ingests 200 million metrics per minute and five billion daily events, much of which is geo-distributed. Each month, Salesforce accumulates dozens of petabytes in its transactional store, five petabytes of log data in its data centers, and almost 200 petabytes in Hadoop storage.
With Druid, Salesforce teams can monitor the product experience and explore vast amounts of data in real time. Engineers, product teams, and account representatives can query a wide variety of dimensions, filters, and aggregations for performance analysis, trend analysis, and troubleshooting.
Using Druid's compaction, Salesforce teams also cut the number of Druid rows by 82%, reducing their storage footprint by 47% and speeding up query times by 30%.
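For context, the sketch below shows roughly how auto-compaction with rollup can be enabled for a datasource through Druid's Coordinator API; this is the general mechanism behind row-count and storage reductions of this kind, not Salesforce's actual configuration. The datasource name and granularity choices are hypothetical, and the exact fields available vary by Druid version.

```python
# A minimal sketch of enabling auto-compaction for a Druid datasource.
# Datasource name and granularity settings are hypothetical.
import requests

compaction_config = {
    "dataSource": "clickstream",
    # Leave the most recent day alone, since it may still be receiving data.
    "skipOffsetFromLatest": "P1D",
    "granularitySpec": {
        "segmentGranularity": "DAY",  # merge small segments into daily segments
        "queryGranularity": "HOUR",   # coarsen timestamps to hourly buckets
        "rollup": True                # pre-aggregate identical dimension combinations
    },
}

resp = requests.post(
    "http://localhost:8888/druid/coordinator/v1/config/compaction",
    json=compaction_config,
)
resp.raise_for_status()
```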
To learn more about Druid, read our architecture guide.
For the easiest way to get started with real-time dashboards, start a free trial of Imply Pivot, an intuitive engine for building interactive visualizations.