Today, analytics is everywhere—and it’s no longer solely for internal stakeholders. A personal finance portal may provide spending data to consumers, while a fitness app may help users understand and improve their fitness journey. On an organizational level, an aerospace manufacturer may provide performance and maintenance analytics to airlines, while a utility can provide usage and billing data to homeowners.
Clearly, companies are starting to see that external-facing analytics should be a core part of their business. But this customer-facing data has to be interactive and contextual to each user—after all, if they can’t quickly understand this information or explore it flexibly and in real time, they can’t derive value from it.
These new analytics requirements also demand completely new ways of thinking and building. Previously, the majority of analytics were for internal users, usually other engineers, data analysts, or data scientists on different teams within the same company. This meant looser service-level agreements (SLAs) around query times and performance, higher data latency, and, for the most part, lower concurrency due to a smaller user base (larger organizations being the exception). All of this equated to less pressure to rapidly deliver data or to enable open-ended exploration.
On the surface, transitioning an in-house analytics platform into a product for paying customers may seem simple. After all, all the tooling and code is likely already there, and one may think that it’s only a matter of building a new UI to enable outside users to query data using SQL or other popular data tools.
In truth, if a company decides to expand into external analytics, they will face a completely new landscape. First, any external data product has to perform under load, regardless of how large a dataset may be or how many users are accessing analytics, because latent or incomplete queries are unacceptable. While internal stakeholders may be fine with leaving a report to compile over the course of a day, an outside user will likely be frustrated by such delays and consider switching to a competitor.
Another consideration is reliability. Not only will an outage affect a data product provider, but it will also impact their client organizations as well. For example, if a security platform goes down, its users (like banks or government institutions) will now be left vulnerable as well. As a result, many data products have stringent SLAs for availability and uptime, with downtime incurring monetary consequences (in addition to any customer dissatisfaction or churn).
Ultimately, data products-as-a-service are a completely different animal than internal analytics. Because revenue is tied directly to performance, the stakes are far higher than an application for in-house use.
External-facing data products: an overview
Today, these data products encompass a diverse array of analytics applications across many different fields. While all customer-facing data products share the core function of providing data and analytics to external customers, they can be broadly classified into two groups. The first is a standalone application, such as a monitoring solution or a cybersecurity platform. These SaaS offerings are centered entirely around dissecting and leveraging data, and are generally packaged and sold independently. Depending on their niche, they are often compatible with various data sources and environments.
The second category is embedded analytics, which could include the charts and graphs within a fitness tracker, a calendar widget charting time spent in meetings, recommendations for related products and services, or a tool for measuring reader behavior for a wiki. In contrast to standalone data products, this type is part of a larger solution, and can best be described as individual features.
Overall, data products for paying users may have several (or all) of the following capabilities:
Advanced analytics, such as aggregations, GROUP BY, segmentation, window functions, and other operations to extract insights from data. These functions facilitate detailed, granular queries, enabling teams to draw conclusions by indicators such as geographic area, time intervals, and more.
Simulations for modeling different scenarios based on data. This could take the form of a stock trading algorithm forecasting market prices over the next thirty days, or a navigational data product modeling upcoming wind speeds and trajectories for a tropical storm.
Dashboards populated with detailed graphics that can break down vital information in various ways, such as pie charts for percentage, maps for geographical regions, and heatmaps for user attention and activity. Ideally, these dashboards would be interactive, enabling users to zoom in, drill down, or drag and drop in real time.
APIs for integrating data directly into a customer’s website or tools, powering features such as personalization algorithms. For instance, when a user interacts with an application, they receive recommendations on anything from relevant products (for an ecommerce platform) to purchase to possible acquaintances (for a social networking site).
Customizable reporting that dissects results and insights in a variety of formats. Unlike dashboards, these reports can have varying levels of interactivity, but what is important is that users can tailor these graphics to represent the data that matters most to them.
Figure 1: A sample dashboard from Cisco ThousandEyes, a prominent user of Apache Druid.
Collaboration and sharing. Customers should be able to annotate or comment on insights surfaced by analytics features. Similarly, they should be able to share widgets and graphics with their own clients through web browser links, embeddable graphics, or guest accounts. In some cases, customers (or their clients) may want to white label data and graphics with their own branding.
Alerting and automated workflows. In some use cases, a customer needs to be alerted on anomalies or to set triggers for automatic actions. For example, a cybersecurity platform will alert a bank’s fraud prevention team when it detects anomalous credit card activity, while facility operations teams may configure a smart asset management tool to activate climate control when temperatures exceed or drop below specific thresholds.
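Under the hood, several of these capabilities reduce to the same pattern: group events by a dimension, aggregate a metric, and compare the result against a threshold. Here is a minimal sketch in Python; the event fields, values, and threshold are invented for illustration, not taken from any particular product:

```python
from collections import defaultdict

# Hypothetical event stream: each event carries a region and a metric
# (here, failed login attempts); field names are invented for illustration.
events = [
    {"region": "us-east", "failed_logins": 3},
    {"region": "us-east", "failed_logins": 9},
    {"region": "eu-west", "failed_logins": 1},
]

ALERT_THRESHOLD = 10  # alert once a region's total crosses this line

def aggregate_and_alert(events, threshold):
    """Group events by region (a GROUP BY analogue), sum the metric,
    and flag any segment whose total meets or exceeds the threshold."""
    totals = defaultdict(int)
    for event in events:
        totals[event["region"]] += event["failed_logins"]
    alerts = [region for region, total in totals.items() if total >= threshold]
    return dict(totals), alerts

totals, alerts = aggregate_and_alert(events, ALERT_THRESHOLD)
print(totals)  # {'us-east': 12, 'eu-west': 1}
print(alerts)  # ['us-east']
```

In a production data product, the grouping and summation would run inside the database as a query, with the application layer only evaluating the alert condition.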
Building for scale: challenges and opportunities
Because these data products are for external customers, performance measures such as SLAs and user satisfaction are of the utmost importance. For consumers (such as a digital marketing team or an SRE department), data products provide insights that are vital to operational efficiency, high performance, and ultimately, profitability.
Without timely access, data products for external customers lose much of their value. Lacking the real-time element, data product users are forced into a reactive stance. For instance, a bank can’t preemptively stop fraudulent credit card transactions; it has to issue chargebacks after the event occurs, hurting the merchant, inconveniencing the cardholder, and damaging its brand credibility.
Further, these success criteria become much harder to fulfill whenever scale is involved—whether it’s data volume, query traffic, or user numbers. In fact, a high rate of queries is particularly tricky, as it can lead to issues such as coordinating multiple parallel query operations, expensive increases in resource usage (particularly if a data product relies on vertical scaling), and additional overhead for data teams. Ultimately, scale impacts end users, leading to latency, a poor user experience, and in the worst-case scenario, loss of revenue from customer churn.
Here are the requirements for any database intended to power an external analytics product.
Scalable and affordable
Scalability has to be achieved in a cost-effective manner. While adding hardware is the easy way to scale a database, it is also expensive—and thus not an option for every organization. Therefore, a database should scale horizontally through sharding or partitioning (a more complex but cheaper approach), but also do so in an intuitive manner—no small task.
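The core idea behind sharding can be sketched in a few lines of Python. This is a simplified stand-in, not any database’s actual implementation: the key names and shard count are invented, and real systems layer replication and rebalancing on top of this routing step:

```python
import hashlib

NUM_SHARDS = 4  # invented for the example; real systems rebalance as shards change

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a row to a shard by hashing its partition key.
    A stable hash (unlike Python's salted built-in hash()) keeps routing
    consistent across processes and restarts."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# The same key always lands on the same shard, so a customer's rows colocate.
keys = ["customer-17", "customer-42", "customer-17", "customer-99"]
placements = {key: shard_for(key) for key in keys}
assert placements["customer-17"] == shard_for("customer-17")
```

The intuitiveness problem mentioned above shows up precisely here: choosing a partition key that spreads load evenly, and moving data when `NUM_SHARDS` changes, is where the real complexity lives.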
Performant under load
When data streams, user numbers, and query rates spike, data products still have to maintain rapid responses to ensure that end users can access data in real time. Any latency means that dashboards and other tools are no longer interactive, which may lead to customer frustration. As an example, a cybersecurity analyst using a data product to investigate a possible intrusion needs answers quickly, or else they risk allowing in a malicious hacker or delaying critical business transactions.
In fact, heavy traffic and large volumes of queries are inevitable, either during periods of rapid growth or amidst crises. Whatever the case may be, a data product has to consistently perform under load, as customers themselves have revenue, SLAs, or other matters at stake.
One example could be shopping seasons such as Singles’ Day in Asia or Black Friday in North America, or sporting events like the Super Bowl or the World Cup. In the weeks and months leading up to these events, marketers and advertisers may be intensely busy buying and personalizing ads, running targeted campaigns, segmenting audiences, and much more.
During this time, many team members will log onto their marketing platforms to iterate on campaigns so that they can hit various metrics, such as conversions, click-through rates, and more. It’s very likely that marketing analytics solutions will encounter heavy traffic as marketers, executives, and clients all log on to view the successes (or failures) of their initiatives, and review what could be changed or improved.
Reliable and durable
Resilience and high availability are core requirements, given the increasingly global nature of commerce today, as well as the high stakes inherent in some fields, such as cybersecurity or banking. Durability is another important requirement—in sectors that require continuity of data for safety or compliance reasons, such as healthcare or finance, data loss can have serious, even life-threatening consequences.
For all organizations, downtime (both planned and otherwise) is a problem. A data product whose database cannot keep pace with the demands required to run analytics, respond to queries promptly, and accommodate user traffic will have negative impacts on provider revenue. This is especially the case when the data product is a core company offering, and something on which the company stakes their reputation.
Supports streaming data
Streaming remains the most efficient way to ingest data in real time, and to analyze and act upon time-sensitive data and insights. By relying on data streams as their primary ingestion method, organizations can effectively fulfill use cases that depend heavily on highly perishable data, such as observability, personalization, cybersecurity, and more.
Easy to use
Automation and simplification are critical for any database upon which a data product is built, as the less human intervention required, the more time that teams have to focus on building or improving their data product.
Often, a team will prototype and build a commercially viable data product using one database—only to find that this database, which could handle the limited user traffic and query activity of the data product’s early days, will struggle to keep up with growth in customers, queries, and data. These databases are often limited by design: PostgreSQL, for example, features a single-node architecture, which presents liabilities for scaling and resilience.
By adopting a scalable, distributed database like Apache Druid from the beginning, teams can lay a foundation for future success, ensuring that their database (and the data product) will be able to keep up with customer growth. Equally important, using the right database from the beginning will remove the need for costly, time-consuming migrations and any associated impacts, including time spent onboarding team members or planned downtime.
Ultimately, creating an external, customer-facing data product for scale (whether that scale comes from queries, data volumes, user traffic, or all of the above) is a completely different beast than building analytics for internal use. Not only is there greater complexity involved, but there are also much higher stakes associated with paying users.
Why Apache Druid for external data products?
Apache Druid is the database built for speed, scale, and streaming data. Not only was Druid created specifically to overcome the challenges of massive datasets, but it was also created to provide reliability, scalability, and availability—an ideal match for companies building data products for outside consumption.
One key feature is Druid’s subsecond query response for datasets of virtually any size—a vital advantage for data products, which are often tasked with tracking real-time trends or providing data to users instantaneously. While some databases, including cloud data warehouses such as Snowflake or Databricks, can store and query vast volumes of data, they may not necessarily return results as rapidly as Druid.
To achieve this, Druid uses parallel processing via scatter/gather. Queries are split up and routed to the data nodes that store the relevant segments (scatter), with partial results computed simultaneously and then pieced back together by broker nodes (gather). For the most performance-sensitive queries, Druid also preloads data into local caches so that it does not have to fetch data anew for each query.
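The scatter/gather pattern can be sketched in plain Python. This is a toy stand-in, not Druid’s implementation: each inner list below plays the role of a segment held on a data node, and the final merge plays the role of the broker:

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for a segment on a data node; the rows and
# column names are invented for illustration.
segments = [
    [{"country": "US", "clicks": 5}, {"country": "DE", "clicks": 2}],
    [{"country": "US", "clicks": 1}],
    [{"country": "DE", "clicks": 7}, {"country": "FR", "clicks": 4}],
]

def scan_segment(segment):
    """Partial aggregation computed where the data lives (the 'scatter')."""
    partial = {}
    for row in segment:
        partial[row["country"]] = partial.get(row["country"], 0) + row["clicks"]
    return partial

def gather(partials):
    """Merge partial results, as a broker node would (the 'gather')."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged

with ThreadPoolExecutor() as pool:
    result = gather(pool.map(scan_segment, segments))
print(result)  # {'US': 6, 'DE': 9, 'FR': 4}
```

The key property is that each segment scan is independent, so adding data nodes increases scan parallelism without any change to the query itself.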
Druid can scale elastically and seamlessly. Relational databases like MySQL or PostgreSQL are ill-suited for the complexities of horizontal scaling approaches such as sharding or partitioning, both of which may cause issues with critical operations such as JOINs. Other databases, such as MongoDB, may struggle to provide analytics on massive datasets.
In contrast, Druid’s architecture is inherently scalable. By design, Druid separates tasks by different node types: data nodes for ingestion and storage, query nodes for executing queries and retrieving results, and master nodes for handling metadata and coordination. Each of these can be scaled up or down depending on demand. To facilitate scaling up or down, Druid can also pull data from deep storage and rebalance it across servers as they are added or removed.
Deep storage (which can take the form of either Hadoop or cloud object storage like Amazon S3) also provides an extra layer of durability. Data is continuously (and automatically) copied into deep storage. When nodes are lost due to errors, outages, or failures, the data stored on the lost node is automatically retrieved from deep storage and loaded onto other, surviving nodes. Druid also does not require any planned downtime for upgrades or maintenance, so that data is always available.
Druid is also uniquely suited to the challenges of real-time, streaming data. Because it is natively compatible with Amazon Kinesis and Apache Kafka, the two most common streaming platforms today, Druid can ingest data without any additional workarounds or connectors. Data is ingested exactly once, so events are not duplicated, and it is instantly made available for queries and analytics.
Finally, Druid provides schema autodiscovery to remove manual maintenance. Rather than defining and maintaining rigid database schema (this includes updating schema to match any changes in data sources), Druid can automatically adjust to changes, modifying fields as needed. For Druid users, schema autodiscovery can also provide the best of both worlds: the performance and organization of a strongly-typed database with the flexibility of a schemaless data structure—simplifying both data ingestion and database upkeep.
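The core idea of schema autodiscovery can be illustrated with a toy inference routine. Druid’s actual autodiscovery handles far more (nested columns, type coercion, and so on), so treat this as a sketch of the concept only, with invented field names:

```python
def infer_schema(events):
    """Build a column -> type mapping from raw events, widening the schema
    as new fields appear (a toy analogue of schema autodiscovery)."""
    schema = {}
    for event in events:
        for field, value in event.items():
            observed = type(value).__name__
            if field not in schema:
                schema[field] = observed
            elif schema[field] != observed:
                schema[field] = "mixed"  # fall back to a permissive type
    return schema

# Field names and values are invented for illustration.
events = [
    {"user": "ana", "latency_ms": 120},
    {"user": "bo", "latency_ms": 95, "country": "DE"},  # a new field appears
]
print(infer_schema(events))  # {'user': 'str', 'latency_ms': 'int', 'country': 'str'}
```

Note how the `country` field is absorbed without any operator intervention; that is the maintenance burden autodiscovery removes.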
Customer story: Cisco ThousandEyes
Cisco ThousandEyes is a pioneering provider of intelligence and monitoring for physical network infrastructure such as routers, switches, load balancers, wireless access points, and more. As proof of their success, ThousandEyes counts many economic powerhouses among their customers, including 180 of the Fortune 500 companies, all of the top 10 American banks, and leading tech companies like DigitalOcean, X (formerly known as Twitter), DocuSign, and eBay.
From the beginning, ThousandEyes’ business model was built around dashboards that provided analytics for external customers. In this capacity, ThousandEyes dashboards had to support versatile, open-ended exploration on large amounts of data. Customers needed to analyze data for insights by drilling down, filtering by dimensions such as time or location, executing operations such as GROUP BY or WHERE clauses, and visualizing metrics. Granular detail is of great importance, as teams need as much information as possible in order to troubleshoot or optimize operations.
Initially, they built their data product using MongoDB, a transactional (OLTP) database that stores data as documents. However, as their data and user volumes grew, MongoDB was unable to support rapid analytics at scale. This resulted in significant latency and a lack of timely responses, forcing the ThousandEyes team to cap the amount of data that any single widget could provide.
Unfortunately, these widgets were also the primary avenue for customers to interact with their data. For instance, a network engineering team might execute a GROUP BY to see which regions are experiencing connectivity issues, and from there, drill down deeper to smaller areas to isolate malfunctioning infrastructure. As a result, these data restrictions limited the customizability of these widgets, and further, hindered the flexible, open-ended exploration that these organizations required.
Figure 2. An example of a Cisco ThousandEyes dashboard monitoring user experience.
Today, ThousandEyes uses Apache Druid to manage a massive data infrastructure, which consists of Apache Kafka for streaming and Amazon S3 for cloud storage. First, nearly 1.5 billion daily events are streamed into Druid, equating to 10 terabytes of Druid-compressed data (with the raw data being some 20 to 30 times larger). Under load, Druid can accommodate 20 requests per second (with an average of five requests per second), with an average latency of 200 milliseconds and a 98th percentile latency of one second.
This has significant benefits for ThousandEyes’ external end users. Now, companies can use ThousandEyes dashboards to quickly comb through massive amounts of data, wring out insights through interactive operations like drilling down, filtering, or aggregating, and pinpoint and resolve problems before they escalate. This means shorter mean time to detection and resolution (MTTD/R), better experiences for downstream customers, and a smoother process for DevOps and SRE teams.
“Druid is the best ecosystem for handling large amounts of data,” Cisco ThousandEyes Lead Software Engineer Gabe Garcia explains. “Druid democratizes resources much better, and you can be more cost effective by just deploying less clusters.”
Customer story: VMware
Founded in 1998, VMware played a pioneering role in the virtualization of software, the technology that enabled users to run multiple virtual machines (VMs) on a single computer—and which paved the way for the rise of cloud computing. Headquartered in Palo Alto, California, VMware generated $13.4 billion in revenue in 2022, and that same year, was acquired by Broadcom in a transaction valued at $61 billion. Today, VMware’s customers include the City of Zurich, the East London NHS Foundation Trust, Carhartt, Purdue University, and many more.
One of their flagship products is NSX Intelligence, a security and network monitoring platform built with Apache Druid. NSX Intelligence ingests, processes, and analyzes network traffic for customers, providing security capabilities such as automatic recommendations for firewall rules and security groups, and anomaly detection and alerting. Further, NSX Intelligence can also visualize real-time or historical network topology, plotting out the flow of data across a network in an intuitive interface for drilling down, filtering, and other operations.
Figure 3. A VMware NSX Intelligence dashboard for alerting and monitoring.
By nature, NSX Intelligence handles a significant volume of data. In each five-minute interval, the platform streams up to one million unique network flows. In addition, NSX Intelligence also has to instantly ingest configuration changes for VMs (such as creation or deletion times), updates for security groups, and alterations to firewalls (such as modifying the source or destination, or rules for blocking or allowing access).
VMware considered several databases to power NSX Intelligence, including ClickHouse and Presto. Ultimately, they decided to go with Apache Druid for several reasons. For one, several of their data types (flow and configuration) were time-series data, which Druid is optimized for, as it partitions data by time and supports a wide variety of time-based aggregations and queries.
The NSX team also liked Druid’s distributed architecture, which delegates different tasks (like ingestion or querying) to separate node types, thereby streamlining the scaling process. The NSX team found that as ingestion increased, they could simply add more MiddleManager and Historical nodes, whereas if queries increased, they could add more Brokers.
Further, Druid provided excellent support for all the different types of queries that VMware customers utilized. For example, interactive queries were used for visualizations, enabling immediate, real-time access to data, while batch queries were used for anomaly detection and rule recommendations. In addition, there were heavy query patterns, which had to comb through millions of rows, while light query patterns covered far less ground.
As a result of building their NSX Intelligence platform on Druid, VMware was able to increase customer adoption, leading to more revenue, faster sales cycles, and reduced churn. In addition, within four years, NSX Intelligence also doubled its revenue. Today, NSX Intelligence ingests 1.25 million events per five-minute interval and provides subsecond latency for interactive exploration, with 30 days of data retention.
To learn more about VMware and their decision to power their NSX platform with Apache Druid, check out this video.
Customer story: Nielsen Media
Founded in 1923, Nielsen was one of the first companies to provide quantifiable sales and marketing data, pioneering such concepts as market share. Today, Nielsen Media is one of the largest market data firms worldwide, providing audience analytics and data for studios and broadcasters in over 100 countries. In 2021, Nielsen employed 44,000 people and earned $3.5 billion in revenue.
Nielsen’s Marketing Cloud solution provides advertisers and marketers with important performance data, including segmentation, consumption, demographics, and more, across every phase of a campaign or a brand journey.
At first, Nielsen Media used Elasticsearch to power Marketing Cloud, storing, querying, and organizing the raw data. However, as Marketing Cloud started ingesting data at massive scale—from up to 10 billion devices as of 2019—Elasticsearch revealed its limitations.
One example was sampling. Each day, Nielsen’s Big Data Group would sample their data, ingesting about 250 GB from their entire daily intake (which ran into the terabytes) into Elasticsearch. However, indexing this amount of data took almost 10 hours, and significantly slowed down queries—to the point where some would fail to complete.
As the number of concurrent queries increased, so too did the response time: by the time the Nielsen environment processed 120 queries per second, average retrieval times were up to two seconds long. When compounded across many different widgets on various dashboards, this delay made it impossible to provide an interactive end user experience.
There were several reasons for this poor performance. First, the same Elasticsearch nodes used for ingesting the data were also used for querying the data, adding to the load and slowing down clients even further. Also, due to the way that data was structured, queries had to scan every single shard on the corresponding index in order to retrieve data.
After transitioning to Druid, Nielsen’s new data pipeline ingests dozens of terabytes of data. By using features like rollup, Nielsen’s teams can compact a year’s worth of data into about 40 terabytes (with an interval of one day), preserving an acceptable granularity for end users while reducing resource usage.
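Conceptually, rollup collapses raw rows that share the same dimension values into one pre-aggregated row at ingestion time. A simplified Python sketch of the idea, with invented campaign data (Druid performs this inside its ingestion pipeline, not in application code):

```python
from collections import defaultdict

# Invented raw events; in practice these would arrive at ingestion time.
raw_events = [
    {"day": "2023-05-01", "campaign": "spring", "impressions": 100},
    {"day": "2023-05-01", "campaign": "spring", "impressions": 250},
    {"day": "2023-05-01", "campaign": "summer", "impressions": 40},
    {"day": "2023-05-02", "campaign": "spring", "impressions": 75},
]

def rollup(events, dimensions, metric):
    """Collapse raw events that share the same dimension values into a
    single pre-aggregated row, the way rollup compacts data on ingestion."""
    totals = defaultdict(int)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        totals[key] += event[metric]
    return [dict(zip(dimensions, key), **{metric: total})
            for key, total in totals.items()]

compacted = rollup(raw_events, ["day", "campaign"], "impressions")
print(len(compacted))  # 3 -- four raw rows collapse into three rolled-up rows
```

The granularity trade-off Nielsen made is visible here: with a one-day interval, individual event timestamps are gone, but per-day, per-campaign totals survive intact at a fraction of the storage cost.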
Figure 4. An example of a Nielsen Marketing Cloud dashboard, courtesy of MarTech.
Another helpful capability is the Theta sketch. To discover intersections of attributes (such as all female customers who use mobile phones and are interested in technology), Theta sketches combine these characteristics in an approximate COUNT DISTINCT and rapidly surface the results. Customers can quickly build a Boolean formula in Marketing Cloud’s web application, which then executes the required queries on the backend in milliseconds.
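The set algebra behind this can be illustrated with exact Python sets; an actual Theta sketch replaces these full ID sets with a compact probabilistic summary, trading a small approximation error for bounded memory and fast set operations. The attributes and user IDs below are invented:

```python
# Each attribute maps to the set of (hypothetical) user IDs carrying it.
# A real Theta sketch stores a compact probabilistic summary instead of
# the full ID set, trading a small approximation error for bounded memory.
female = {1, 2, 3, 5, 8}
mobile_users = {2, 3, 5, 7}
interested_in_tech = {3, 5, 9}

# "Female AND mobile AND interested in tech" as a Boolean intersection:
segment = female & mobile_users & interested_in_tech
print(len(segment))  # 2 -- an exact COUNT DISTINCT over the intersection
```

The Boolean formulas customers build in the Marketing Cloud UI map onto exactly these union, intersection, and difference operations, evaluated over sketches rather than raw ID sets.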
In particular, Theta sketches are very helpful for highly specific segmentation and targeting—for both media buyers, who purchase ad space from studios and creators, and sellers, such as streaming platforms, broadcast channels, or content creators. As an example, a podcast studio (whose portfolio includes multiple shows and podcasts) could use Marketing Cloud to analyze audience data, build different intersections with Theta sketches, and more effectively market ad space to buyers.
In doing so, this studio might find that the audience of an unsolved mysteries podcast consists largely of women with high school diplomas (60%), with the majority of those (70%) being older listeners. In contrast, they might discover that the fanbase of an aviation podcast consists of young men and women aged 18-29. With these intersections in mind, the podcast studio can more precisely target ads: given the young age of the aviation podcast’s fandom, it is a safe assumption that some listeners are curious about career changes—and thus, this podcast could be an ideal channel for flight colleges or mechanic schools.
The process is similar for a media buyer that wants to buy ad space. For instance, a cruise line’s marketing department might run Theta sketch queries on their traveler base and discover that a high percentage of their trips are purchased by women aged 45-60 years—either individually or in groups, such as couples or friends. As a result, they decide to buy ads on various channels, including the unsolved mysteries podcast, whose following matches these demographics.
Itai Yaffe, then Nielsen’s Tech Lead for their Big Data Group, explains it best. “When Elasticsearch could no longer meet our requirements, we switched to Druid. The proof of concept was great in terms of scalability, concurrent queries, performance, and cost. We went with Druid and never looked back.”
Customer story: Atlassian
Since its founding in 2002, Atlassian has grown to be a leader in the field of business software, providing best-in-class products for collaboration, project management, software engineering, and much more. Today, Atlassian serves more than 240,000 customers across 190 nations, with almost 10 million monthly active users.
One of Atlassian’s flagship products is Confluence, a wiki-as-a-service. Confluence enables organizations of all sizes to share knowledge across teams and time zones, and alongside popular products like Jira and Crucible, it forms part of a unified family of products for collaboration and cooperation.
One key feature of Confluence is its robust analytics offering, which provides corporate customers with real-time metrics on their content, such as how their employees use, consume, and edit their wiki pages. This analytics suite includes interactive dashboards for dissecting data on the most popular blogs, spaces, and wiki pages, details on the most active users and contributors, and even embedded API-driven recommendations for related pages that readers may be interested in.
Figure 5. An example of an Atlassian Confluence dashboard that provides rapid, open-ended data exploration.
At first, Confluence Analytics was built on PostgreSQL. However, as their customer base grew, so too did the amount of data it generated—which created some complications for the Confluence team. These included the inability to scale beyond two years of data retention (at various levels of granularity), as well as the need to write a lot of custom code in order to run aggregations on large datasets.
In addition, there were issues with latency. For instance, pre-aggregating data in PostgreSQL took so long that it would expire before it could be used. Data also took too long to retrieve—to the point that application response times were in the minutes, even as users needed results in milliseconds.
“We moved our analytics to Druid a couple of years ago, and it is now the crowning jewel of our architecture,” software engineer Gautam Jethwani explains. “It has reduced our query times by many orders of magnitude.”
By switching to Apache Druid, the Atlassian team was able to improve the performance of Confluence Analytics in several ways. First, they were able to reduce query latency to under 100 milliseconds, a fivefold improvement in performance and a massive boon for time-sensitive use cases, such as real-time activity feeds and notifications. With Druid, the Confluence Analytics team also increased data retention to five years or longer, and added the ability to analyze funnels for up to two years of data—without impacting ingestion or aggregation.
Equally important, Druid helped the Atlassian team achieve a long-held goal: platform APIs, which each serve a single purpose, such as listing update groups by user or displaying the top users for a space.
“Because of robust query times, we’re able to add new APIs and accommodate feature requests at a whim,” Jethwani says. “New developers can come in, simply decide which query is best for the use case, and populate function parameters. In fact, Druid has improved our performance so much that we were able to expose platform APIs that were able to take some of the load off Confluence itself.”
Druid’s native streaming support is also very helpful. Data is ingested via Amazon Kinesis streams directly into Druid, taking some of the workload off their Amazon EC2 clusters and leading to more performance improvements. In contrast, PostgreSQL required the Confluence Analytics team to process the events themselves and manually insert this data into the database.
Customer story: GameAnalytics
Founded in 2011, GameAnalytics was the first purpose-built analytics provider for the gaming industry, designed to be compatible with all major game engines and operating systems. Today, it collects data and provides insights from over 100,000 games played by 1.75 billion people, totaling 24 billion sessions (on average) every month. Each day, GameAnalytics ingests and processes data from 100 million users.
Figures 6 and 7: Two examples of GameAnalytics dashboards
Prior to switching to Druid, GameAnalytics (GA) utilized a wide variety of products in their data architecture. Data, in the form of JSON events, was delivered to and stored in Amazon S3 buckets before being enriched and annotated for easier processing. For analytics, the GA team utilized an in-house system built on Erlang/OTP for real-time queries, while relying on Amazon DynamoDB for historical queries. Lastly, they created a framework (similar to MapReduce) for computing real-time queries on hot data, or any events generated over the last 24 hours, while also creating and storing pre-computed results in DynamoDB.
While this solution was very effective at first, stability, reliability, and performance challenges arose as GA scaled. “We also quickly realized that although key/value stores such as DynamoDB are very good at fast inserts and fast retrievals, they are very limited in their capability to do ad-hoc analysis as data complexity grows,” CTO Ramon Lastres Guerrero explained.
In addition, pre-computing results for faster queries created its own issues. Each new attribute or dimension multiplied the set of possible queries, and past a certain point it became impossible to pre-compute and store every combination of queries and results. Therefore, “we limited users to be able to only filter on a single attribute/dimension in their queries,” Guerrero explains, “but this became increasingly annoying for our clients as they could not do any ad-hoc analysis on their data.”
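The combinatorics behind this limitation are easy to see. A short sketch, with invented dimension names, of how the number of pre-computable rollups grows:

```python
from itertools import combinations

# Hypothetical dimensions a game studio might want to filter on.
dimensions = ["platform", "country", "game_version", "device", "channel"]

# Every non-empty subset of dimensions is a distinct pre-computed rollup,
# before even counting the distinct values within each dimension.
subsets = [combo for r in range(1, len(dimensions) + 1)
           for combo in combinations(dimensions, r)]

print(len(subsets))  # 31 rollups for just 5 dimensions (2^5 - 1)
```

Each added dimension doubles that count, and multiplying by the value cardinalities within each dimension makes exhaustive pre-computation intractable—which is why GA had to cap filters at a single dimension.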
This ad-hoc analysis is crucial to game design, especially where it comes to iterating and improving games based on audience feedback. Without the ability to explore their data flexibly, teams can miss out on insights, forfeiting opportunities to optimize gameplay, improve the player experience, and in the worst-case scenario, even lose players to more responsive competitors.
Druid helped resolve many of these problems. Because it was built to power versatile, open-ended data exploration, Druid did not require pre-computing or pre-processing for fast query results. Instead, Druid’s design, which could act directly on encoded, compressed data, avoid unnecessary movement of data from disk to memory to CPU, and support multi-dimensional filtering, enabled rapid data retrieval—and powered the interactive dashboards so important to GA customers.
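In practice, that means an ad hoc query can filter and group on several dimensions at once with no pre-computed rollup behind it. A hypothetical Druid SQL query of the kind a GA dashboard might issue—table and column names are invented for illustration:

```python
# Hypothetical Druid SQL; queries like this are submitted to the
# /druid/v2/sql endpoint. Table and column names are illustrative.
query = """
SELECT game_version,
       COUNT(*) AS sessions,
       AVG(session_length) AS avg_session_length
FROM game_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '24' HOUR
  AND country = 'DE'
  AND platform IN ('ios', 'android')
GROUP BY game_version
ORDER BY sessions DESC
"""
print(query.strip())
```

Under the old key/value approach, this three-dimension filter (time, country, platform) would have required its own pre-computed result set; in Druid it is just another query.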
In addition, Druid’s separation of ingestion and querying onto distinct node types, rather than lumping every function into a single server, provided extra flexibility in scaling and improved resilience. “It allows more fine-grained sizing of the cluster,” backend lead Andras Horvath explains, while the “deep store is kept safe and reprocessing data is relatively easy, without disturbing the rest of the DB.”
Druid provided ancillary benefits as well. First, it was a single database that unified real-time and historical analysis, helping GA avoid a siloed data infrastructure. Druid was also compatible with a wide range of GA’s preferred AWS tools, including S3, Kinesis, and EMR. Lastly, Druid’s native streaming support helped GA convert their analytics infrastructure to be wholly real-time.
After switching to Druid, the GA environment ingests 57 million events daily, improving performance by 17 percent and reducing engineering hours by 20 percent. To learn more, read this guest blog by CTO Ramon Lastres Guerrero, or watch this video presentation.
Customer story: Innowatts
Founded in 2013, Innowatts provides AI-powered analytics for power plants, utilities, retailers, and grid operators, who collectively serve over 45 million customers globally. Headquartered in Houston, Texas, Innowatts counts major power companies as its customers, including Consolidated Edison and Pacific Gas and Electric.
Because power demand is elastic, ebbing and flowing based on time of day, weather conditions, and even holidays or weekends, it can be difficult for utilities and power plants to anticipate usage. Further, bulk energy storage solutions are not yet reliable enough to ensure a steady, uninterrupted supply of power—which necessitates the balancing of supply and demand across various sources, plants, and transformers.
Figure 8. A sample Innowatts dashboard, designed by Asif Kabani.
This is where Innowatts comes in. Built around customer-facing analytics, Innowatts empowers utilities to rightsize resources to match demand. Utility analysts use Innowatts to analyze consumption, build machine learning models to forecast energy requirements, and inform their bidding for additional power to fill any gaps in generation.
At first, Innowatts used several solutions in their data environment. Data from smart meters was submitted to Innowatts either directly or through a third-party integration, then uploaded into Amazon S3 buckets, where engineers created Amazon Athena tables atop the data. Lastly, the data was pre-aggregated in Apache Spark, exported into MongoDB for analytics, and finally displayed for analyst customers in a custom UI.
As Innowatts grew, expanding its data ingestion from 40 million to 70 million smart meters, this data architecture encountered challenges. Not only was it incapable of scaling, it also could not handle high concurrency as more analysts ran more parallel queries within a very limited window of time. In addition, Innowatts could not provide granular, meter-level data to their analyst customers, nor could they support open-ended data exploration by dimensions such as geography or zip code—a feature often requested by customers.
High latency would render the data unusable. As software engineer Daniel Hernandez explained, “We go off this notion that if the client has it and it’s not fast, then it might as well be broken, because they need to be able to create insights really quickly off of that data.”
By transitioning to Apache Druid, the Innowatts team was able to achieve several milestones. They simplified their data architecture, removing Apache Spark and instead ingesting smart meter data directly into Druid. Their engineers were also able to streamline or automate their workflows, cutting out time-consuming tasks such as constant schema changes and updates. As an example, because Druid could automatically detect new columns, it removed the bottleneck of having to configure a new model for each client.
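The column auto-detection described above corresponds to Druid’s schema auto-discovery. As a hedged sketch—available in recent Druid versions, with exact flags varying by release—the relevant fragment of an ingestion spec looks roughly like this (the excluded column name is invented):

```python
# Hypothetical dimensionsSpec fragment: with useSchemaDiscovery enabled,
# Druid infers new columns from incoming data instead of requiring a
# manually maintained dimension list for every client model.
dimensions_spec = {
    "useSchemaDiscovery": True,
    # Columns to ignore even if discovered (illustrative name).
    "dimensionExclusions": ["raw_payload"],
}
print(dimensions_spec)
```

With discovery enabled, onboarding a client whose meters emit an extra field no longer requires a schema change—the new column simply appears in the datasource.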
Most importantly, utility analysts could now enjoy interactive dashboards with detailed visualizations and ad hoc aggregations, enabling operations like zooming in, drilling down, dragging and dropping, and much more. This versatility enabled clients to study their data from a variety of perspectives—and in some unexpected ways.
One interesting approach was competitive forecasting. In essence, utilities would use different models or aggregation types to create a variety of predictive models, gauging them for accuracy and selecting the one that consistently delivered the most precise results. “Druid allows clients to be able to do all these competitions really, really fast,” Hernandez explains, thus enabling utilities to better pinpoint customer trends, predict usage, purchase extra power, and match resources to consumption accordingly.
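A minimal sketch of such a model competition, with invented numbers: score each candidate forecast against observed load and keep the most accurate one.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute deviation between observed and forecast values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Illustrative hourly load observations (MWh) and two candidate forecasts;
# all figures are made up for the example.
actual = [120.0, 135.0, 128.0, 150.0]
candidates = {
    "hourly_avg":  [118.0, 140.0, 125.0, 147.0],
    "weather_adj": [121.0, 134.0, 129.0, 151.0],
}

# The "competition": pick the model with the lowest error.
best = min(candidates, key=lambda name: mean_absolute_error(actual, candidates[name]))
print(best)  # weather_adj
```

The actual work—aggregating meter data into the comparison windows each model needs—is what Druid accelerates; the selection step itself is this simple.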
Innowatts engineers were also able to tailor aggregations in unique ways, even creating a filter to break down event data by specific days while excluding anomalies like weekends or hours with missing data. This enabled a greater degree of customization to better address client needs: for instance, a Texas-based utility will have different holidays or peak periods than a New York-based one, and so each customer can create different filters for their own specific situation.
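A toy version of such a day-based filter, assuming a hand-maintained holiday set (a real deployment would use each utility’s own regional calendar, as the Texas-versus-New York example suggests):

```python
from datetime import date

# Illustrative holiday calendar; each utility would supply its own.
holidays = {date(2023, 7, 4)}  # e.g., Independence Day

def is_business_day(d: date) -> bool:
    """Keep weekdays that are not holidays; drop weekends and holidays."""
    return d.weekday() < 5 and d not in holidays

# Hypothetical event dates: Monday, the July 4 holiday, a Saturday, a Monday.
events = [date(2023, 7, 3), date(2023, 7, 4), date(2023, 7, 8), date(2023, 7, 10)]
kept = [d for d in events if is_business_day(d)]
print(kept)  # only the two ordinary Mondays survive the filter
```

The same predicate pattern extends to excluding hours with missing data—drop any bucket whose reading count falls below the expected number of intervals.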
Ultimately, Innowatts was able to reduce operational costs by $4 million, improve forecast accuracy by 40 percent, and enhance the lifetime value of customers by $3,000.
Apache Druid: the database for customer-facing data products
Built for scale, speed, and streaming data, Apache Druid is ideal for data products that provide external analytics to paying customers. Druid delivers subsecond response times even under the most challenging conditions—massive flows of data (hundreds of thousands to millions of events per hour), vast datasets (terabytes to petabytes), high rates of queries per second, and heavy user traffic.
Alongside its open source software, Imply (founded by the original inventors of Apache Druid) also offers paid products including Polaris, the Druid database-as-a-service—and the easiest way to get started with Druid. Another popular product is Pivot, an intuitive GUI for creating rich, interactive visualizations and dashboards—an important pillar of external-facing data products and a key generator of revenue.
For the easiest way to get started with real-time analytics, start a free trial of Polaris, the fully managed, Druid database-as-a-service by Imply.