A faster database: what developers are building with Apache Druid
For decades, the focus of analytics has been historical BI reporting, using batch-oriented data warehouses for executive dashboards and reports. Now, developers are taking analytics further by building modern analytics applications that power new use cases and deliver interactive data experiences on real-time and historical data at massive scale.
Build better cloud applications with full visibility into the health and performance across the entire application.
Observability allows teams to monitor modern systems more effectively and helps them to find and connect effects in a complex chain and trace them back to their cause. It gives system administrators, IT operations analysts, and developers visibility into their entire architecture. This allows them to drill into how different components of an application are performing, identify bottlenecks, and troubleshoot issues.
Building an observability app powered by a real-time analytics database provides the ability to handle high-cardinality metrics at high volume, giving fine-grained visibility across internet-scale services.
Answer ad-hoc questions across rapidly changing data at any scale
With today’s internet-scale services, it’s impossible to anticipate every question with predefined aggregations. With high-cardinality data, you never face the same issue over and over.
Access to both transactional histories as well as real-time data
Immediate ingestion of real-time data lets you see what’s happening now while easily comparing it to historical data.
Debugging only works if you can take one step quickly after the next
Queries should return in under a second, because when you are debugging it is important not to break your state of flow.
Operating Internet-scale services requires fine-grained visibility down to the individual user, tenant, or application behavior while also providing visibility across the entire application. Most traditional off-the-shelf monitoring options fail to scale or become very cost-prohibitive when used at scale.
When you’re building an analytics application for effective observability at any scale, a real-time analytics database is critical for high volume and high data throughput. High-cardinality metrics can be ingested in milliseconds, making them immediately available for monitoring analytics. This enables you to rapidly visualize and explore both real-time stream data and historical data with subsecond query response times.
Real-time analytics databases enable rapid analysis of application events with thousands of attributes and compute complex metrics on load, performance, and usage. For example, it’s easy to rank API endpoints based on 95th percentile query latency, then slice and dice how these metrics change based on any ad-hoc set of attributes such as time of day, user demographic, or datacenter location.
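As a rough illustration of the ranking described above, the following minimal Python sketch computes a 95th percentile latency per endpoint and then re-ranks after applying an ad-hoc attribute filter. It runs in memory on a handful of made-up events; the field names (endpoint, datacenter, latency_ms) and the helper names are assumptions for illustration only, and in a real application this work would be pushed down to the database rather than done in application code.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n); gives a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def rank_endpoints(events, **filters):
    """Rank endpoints by p95 latency, slowest first.

    Keyword arguments are ad-hoc attribute filters,
    for example datacenter="eu-west".
    """
    latencies = defaultdict(list)
    for event in events:
        if all(event.get(key) == value for key, value in filters.items()):
            latencies[event["endpoint"]].append(event["latency_ms"])
    ranked = [(endpoint, p95(samples)) for endpoint, samples in latencies.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample events; names and values are illustrative only.
events = [
    {"endpoint": "/search", "datacenter": "us-east", "latency_ms": 120},
    {"endpoint": "/search", "datacenter": "us-east", "latency_ms": 480},
    {"endpoint": "/search", "datacenter": "eu-west", "latency_ms": 90},
    {"endpoint": "/login", "datacenter": "us-east", "latency_ms": 40},
    {"endpoint": "/login", "datacenter": "eu-west", "latency_ms": 60},
    {"endpoint": "/login", "datacenter": "eu-west", "latency_ms": 75},
]

print(rank_endpoints(events))                        # all traffic
print(rank_endpoints(events, datacenter="eu-west"))  # one slice
```

The filter is passed as keyword arguments so that any attribute present on the events can be used to slice the ranking without changing the function.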
To ensure a consistently great experience for more than 200 million members in more than 190 countries enjoying 250 million hours of TV shows and movies each day, Netflix built an observability analytics application powered by Apache Druid.
By turning log streams into real-time metrics, Netflix is able to see how over 300 million devices (across 4 major UIs) are performing at all times in the field. Netflix chose Apache Druid as their real-time analytics database to power their analytics application because it uniquely meets their high ingestion rate of data, high cardinality, and fast query requirements. By ingesting over 2 million events per second and executing subsecond queries over 1.5 trillion rows, Netflix engineers are able to pinpoint anomalies within their infrastructure, endpoint activity, and content flow.
“Druid is our choice for anything where you need subsecond latency, any user interactive dashboarding, any reporting where you expect somebody on the other end to actually be waiting for a response. If you want super fast, low latency, less than a second, that’s when we recommend Druid.”
Harness user activity data to optimize all aspects of the user experience as people interact with your web, mobile, and other applications.
Businesses are now collecting, analyzing, and aggregating user activity data, known as behavioral data, to gain insights such as customer traffic analysis, marketing campaign effectiveness, market segmentation, sales funnel analysis, and more. Behavioral data includes direct interaction (such as website clicks and mobile app swipes), views, and related context, including page load time, loiter time, browser or device used by the visitor, and more.
Building an analytics application powered by a real-time analytics database for behavioral data is critical to analyze the product experience, understand user intent, and personalize the product experience for different customer segments.
Answer ad-hoc questions across long sequence lengths with high cardinality
The number of unique events for a modern website can range from thousands to tens of thousands. It’s common for each session to generate hundreds of events.
Consume raw behavioral data quickly and efficiently
Data needs to be aggregated, filtered, and enriched. Raw events need to be filtered for bot-generated traffic. Handling such data at scale is extremely challenging.
Access to both transactional histories as well as real-time data
For several behavioral use cases, such as targeted personalization for a better user experience, the analysis must combine real-time activity with a historical understanding of the user’s past actions.
Real-time analytics databases are designed for subsecond queries of complex data at high concurrency that combine real-time streams with historical data. Their search and filter capabilities enable rapid, easy drill-downs of data along any set of attributes, enabling measurement and segmentation by age, gender, location, user preferences, purchasing patterns, and any other desired characteristics.
Off-the-shelf applications such as Google Analytics and Adobe SiteCatalyst help with clickstream analysis. However, these applications have scale limitations and lack access to raw data. As data sets quickly balloon to massive scale, developers need a database that enables their analysts to explore always-fresh data in real-time, feeding their curiosity and enabling proactive decision-making. They cannot act quickly and effectively with stale, pre-packaged data.
Real-time analytics databases can be used for funnel analysis, measuring how many users took one action but did not take another. Analysis of the funnel is critical to learning from design decisions: how long did it take to get from the top to the bottom? What kinds of people abandoned their journey halfway through? When we made a change to page X, did it improve conversion rates from one step to the next?
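To make the funnel idea concrete, here is a minimal Python sketch that counts how many users reached each step of an ordered funnel and derives step-to-step conversion rates. The step names, user IDs, and the (user_id, action) event shape are hypothetical, and the events are assumed to arrive already sorted by time; a production application would express this as a query against the database instead.

```python
def funnel_counts(events, steps):
    """Count users who completed each funnel step, in order.

    events: iterable of (user_id, action) pairs, assumed sorted by time.
    steps:  ordered list of action names that make up the funnel.
    """
    progress = {}  # user_id -> number of funnel steps completed so far
    for user_id, action in events:
        done = progress.get(user_id, 0)
        # Only the next expected step advances a user through the funnel.
        if done < len(steps) and action == steps[done]:
            progress[user_id] = done + 1
    return [
        sum(1 for done in progress.values() if done > index)
        for index in range(len(steps))
    ]


def conversion_rates(counts):
    """Fraction of users moving from each step to the next."""
    return [
        after / before if before else 0.0
        for before, after in zip(counts, counts[1:])
    ]


# Hypothetical sample data; action names are illustrative only.
steps = ["view", "add_to_cart", "purchase"]
events = [
    ("u1", "view"),
    ("u2", "view"),
    ("u1", "add_to_cart"),
    ("u3", "view"),
    ("u1", "purchase"),
    ("u2", "add_to_cart"),
    ("u4", "add_to_cart"),  # skipped "view", so never enters the funnel
]

counts = funnel_counts(events, steps)
print(counts)
print(conversion_rates(counts))
# Users who took the first action but not the last one:
print(counts[0] - counts[-1])
```

Comparing the conversion rates before and after a change to a page is one simple way to answer the "did it improve conversion" question posed above.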
Real-time analytics databases can be used to compute impressions, interactions, and key conversion metrics, filtered on publisher, campaign, user information, and dozens of other dimensions supporting full slice and dice capability.
WalkMe is a Digital Adoption Platform pioneer that offers a 360-degree solution to leading organizations worldwide. WalkMe helps employees and customers at some of the world’s largest companies engage and adopt digital products, ensuring organizations of all sizes can undergo smooth digital transformations.
The legacy analytics system originally used to track core product usage was Elasticsearch, which was initially leveraged as a simple log management system. As WalkMe analysts’ queries evolved from simple troubleshooting queries to ones that measured complex engagement stats, these queries became less and less suited to the search-focused architecture of Elasticsearch.
“Once we realized the legacy architecture was not well suited to behavioral analytics, and would not scale with our growth, we began searching for an alternative to transition our classic log search approach to a real-time analytics database that scales linearly with our traffic. Apache Druid met these criteria. Druid enables us to monitor performance across billions of client devices in real-time. We can leverage Druid to compute any arbitrary metrics over any ad-hoc groups of users. We can track business critical measures such as retention and attrition, plus many other forms of engagement and usage metrics. As a result, we can now gain the type of insight we need to optimize and segment our code for different host platforms, applications, and websites, per their specific needs.”
Search, detect and investigate data in real-time to quickly find anomalies to prevent malicious attacks.
Security and fraud analytics is a proactive approach to cybersecurity that uses data collection, aggregation, and analysis capabilities to perform vital security functions that detect, analyze and mitigate cyber threats. Security and fraud analytic tools such as threat detection and security monitoring are deployed with the aim of identifying and investigating security incidents or potential threats such as external malware, targeted attacks, and malicious insiders. With the ability to detect these threats at early stages, security professionals have the opportunity to stop them before they infiltrate network infrastructure, compromise valuable data and assets, or otherwise cause harm to the organization.
Security and fraud analytic solutions aggregate data from numerous sources, including endpoint and user behavior data, business applications, operating system event logs, firewalls, routers, virus scanners, external threat intelligence, and contextual data. Combining and correlating this data gives organizations one primary data set to work with, allowing security professionals to apply appropriate algorithms and create rapid searches to identify early indicators of an attack. In addition, machine learning technologies can be used to conduct threat and data analysis in near real-time.
For analyst teams to locate attacks, they need to be able to examine every area in the security landscape in real time. Security landscapes can comprise billions of endpoints, across hundreds of regions and hundreds to thousands of users. A single attack can span the entire digital landscape, leaving traces in network activity, hashes, log files, and connection requests, all of which are maintained in separate databases or datasets with vastly different schemas.
To keep detection from slipping from seconds to days, security analytics applications need to query events the moment that they occur. As each customer’s landscape is unique, getting data from a multitude of sources and formats into an effective platform for analysis can be onerous using traditional data engineering methods.
This is why developers are building modern analytics applications powered by real-time analytics databases for security and fraud analysis. These analytics apps use real-time analytics databases to collect and aggregate both real-time streams and historical batch data, monitor events continuously with subsecond responses, and provide the context required to distinguish true threats from false positives.
Answer ad-hoc questions across massive amounts of data
Security landscapes can comprise billions of endpoints, across hundreds of regions and hundreds to thousands of users.
Query real-time data on arrival to immediately detect threats
To keep detection from slipping from seconds to days, security applications need to query events continuously, with incoming events included immediately.
Protection that never sleeps with an always-on application
Threats don’t take vacations, and neither can security analytics. Databases designed for zero downtime (planned and unplanned) and zero data loss are critical.
Real-time analytics databases enable developers to build security analytics apps for analyst teams to examine every area in the security landscape in real time to locate attacks.
Security landscapes can comprise billions of endpoints, across hundreds of regions and hundreds to thousands of users. A single attack can span the entire digital landscape, leaving traces in network activity, hashes, log files, and connection requests.
To keep detection from slipping from seconds to days, real-time analytics databases enable security applications to query events the moment they occur. With real-time drill-down analytics capabilities, developers can build the right security app to close the gaps in threat detection and remediation.
Sift prevents fraud with industry-leading technology and expertise and regularly deploys new machine learning models into production. Sift’s customers use the scores generated by machine learning models to decide whether to accept, block, or watch events and transactions. Since each customer has unique traffic and decision patterns, Sift needed a tool that could automatically learn what “normal” looks like for each customer.
Sift built an automated monitoring tool, Watchtower, a system that would use anomaly detection algorithms to learn from past data and trigger alerts in real time on unusual changes. Watchtower is powered by Apache Druid, a real-time analytics database for interactive experiences with data. With Druid, they are able to aggregate data by a variety of dimensions from thousands of servers. They can then query this data across a moving time window with real-time analysis and visualization.
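The source does not describe which anomaly detection algorithms Watchtower uses, so the following Python sketch should not be read as Sift's method. It only illustrates the general idea of learning "normal" from a moving window of past data and flagging unusual changes, using a simple trailing z-score as a stand-in technique. The function name, window size, threshold, and sample values are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that deviate sharply from the trailing window.

    A value is flagged when it differs from the mean of the previous
    `window` values by more than `threshold` standard deviations.
    """
    history = deque(maxlen=window)  # the moving window of past values
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            baseline = mean(history)
            spread = pstdev(history)
            # A perfectly flat window has no spread to compare against.
            if spread > 0 and abs(value - baseline) > threshold * spread:
                anomalies.append(index)
        history.append(value)
    return anomalies


# Hypothetical per-minute event counts for one customer.
counts = [100, 102, 98, 101, 99, 100, 250, 101]
print(detect_anomalies(counts))
```

Note that in this sketch a flagged value still enters the window, which temporarily widens the baseline; a real system would likely treat flagged points differently and keep a separate baseline per customer and dimension.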
Sift is now able to proactively contact customers when anomalies are detected, preventing potential business impact for their customers.
“As the leader in Digital Trust & Safety, we enable online businesses to prevent fraud and abuse while streamlining customer experiences. We built an anomaly detection engine called Watchtower, which uses machine learning models to detect unusual activity. Apache Druid and Imply help us analyze data with an interactive experience that provides us with on-demand analysis and visualization.”
Analyze sensor data for a wide range of use cases including supply chain optimization, predictive maintenance, safety, and process efficiency management.
The Internet of Things (IoT) drives value across nearly every sector, spanning from manufacturing and logistics to retail and resource management. Data from a network of connected “things” that include drones, delivery trucks, medical devices, security cameras, construction equipment, thermostats, gaming consoles, and almost every other device that can be manufactured provides constant telemetry information on operating environments and metrics.
IoT sensors provide tons of valuable data. By understanding usage patterns, companies can predict shifts in customer behavior, design the next generation of products, automate routine tasks, and prevent issues before they occur. Unfortunately, harnessing IoT data is no easy task, as IoT devices generate massive high-speed streams that are difficult to process, analyze, store, and secure. IoT data is also highly perishable, and without the right tools, organizations miss opportunities to act.
This is why developers are building modern analytics applications powered by real-time analytics databases. Extracting value from IoT sensors and systems in real time enables companies to create better products, services, and experiences.
Answer ad-hoc questions across massive amounts of time series data
IoT devices and sensors continuously generate massive quantities of timestamped data. Traditional databases are not designed to manage time series data: they store each data point separately, creating a massive amount of duplication.
Query real-time data on arrival as IoT is real-time by nature
With IoT, actions must be driven in seconds or less to be useful. This requires a database that combines high-speed ingestion with subsecond queries, keeping latency low between when an event occurs and when it is available for query.
Protection that never sleeps with an always-on application
Sensor data is continuously being generated and IoT devices need to operate in real time. A database designed for zero downtime (planned and unplanned) and zero data loss is critical.
Effective analytics for IoT and other telemetry data requires very high-speed ingestion of events, with immediate visibility of insights to the end user or automated process. Also required is ingesting data from databases and files to provide historical context and combining data from both sources with often complex queries.
Even when datasets grow to petabytes, IoT analytics needs consistent subsecond performance. High concurrency is also a requirement: many dimensions and metrics need constant monitoring for both external and internal end users, so hundreds or thousands of queries need to execute simultaneously.
Downtime is not an option, so IoT analytics needs databases that are designed for zero planned downtime, self-healing clusters to avoid unplanned downtime, and durable storage to prevent data loss.
Cisco ThousandEyes enables organizations to visualize any network as if it were their own, quickly surface actionable insights, and collaborate and solve problems with service providers.
They combine a variety of active and passive monitoring techniques to give their customers deep insight into user experience across applications and services delivered over the Internet. Monitoring the health of WAN network devices, such as wireless access points, routers, switches, firewalls, and load balancers, requires a tremendous amount of data that must be collected in real time.
To quickly analyze network device issues, ThousandEyes built an analytics application powered by Apache Druid to ingest and query large amounts of sensor data.
Druid powers ThousandEyes’ customer-facing dashboards, which can be configured with many group-bys and filters and can visualize many metrics. This enables ThousandEyes customers to interact with their data by asking questions in real time.
This replaced usage of MongoDB, which was not designed for analytics on their burgeoning amount of sensor data. When customers needed to visualize historical data over a period of days, dashboards were taking 15 minutes or longer to load, making it impossible to drill down into different metrics.
Now, with Druid behind their analytics app, ThousandEyes is able to keep up with their rising sensor data and deliver a consistently great and interactive experience to their customers by reducing dashboard latencies by 10x. This enables ThousandEyes customers to visualize their entire network topology across all their network devices to track down device interface issues in seconds and eliminate application-impacting device behaviors.
“To build our industry-leading solutions, we leverage the most advanced technologies, including Imply and Druid, which provides an interactive, highly scalable, and real-time analytics engine, helping us create differentiated offerings.”
Empowering customers and partners with real-time operational and business insights.
More companies are giving their customers and partners insights as part of a value-added service or as a core product offering. These real-time insights are usually focused on the security, performance, billing, or usage of the customer environment. For example, in adtech, it’s critical to deliver instant and full transparency to advertisers on how users engage with their ads.
The crux of delivering meaningful external analytics is the ability to support high concurrency. That’s because the best applications have the most active users and the most engaging experiences. This leads to thousands or more users, each generating multiple queries as they interact with the data. Most traditional off-the-shelf options are built for internal usage but not for extending insights to customers and partners. They fail to scale or become very cost-prohibitive when used at scale.
It’s a challenge to use any database to build a customer-facing analytics application. There’s more on the line than internal use cases because now the analytics are part of the customer experience. Milliseconds of latency make a difference, downtime is costly, and concurrency and expenses can go through the roof. The last thing you want is frustrated customers because their applications are constantly getting hung up.
Only a database designed for real-time analytics can deliver subsecond queries, high concurrency, combined real-time and historical data, and high reliability at scale at a reasonable cost.