Machine learning models are involved in every aspect of our digital economy. They approve or deny credit card transactions. They trade stocks at the speed of the market, moving far faster than humans can. They populate product recommendations in milliseconds, before shoppers are distracted.
But the landscape is changing. Data is increasing in size and speed, making it difficult for older models to keep up. Every second, terabytes of data stream into applications from cybersecurity platforms to wind turbine systems. Conditions evolve so fast that they leave models behind—think hackers switching tactics or cascading failures halting assembly lines.
As a result, organizations need new ways to train, run, and improve machine learning. They need a database optimized for storing, organizing, querying, and visualizing massive amounts of data—increasingly real-time, streaming data. Few analytics databases were designed to work with ever-growing volumes of fast-moving data. Most were built to ingest data solely via batch processing—which cannot match the speed or real-time nature of streaming data technologies like Apache Kafka® or Amazon Kinesis.
Designed for speed, scale, and streaming data, Apache Druid fulfills several critical roles in machine learning environments. Druid can enable fast exploration and analytics on vast quantities of raw data, clean training datasets of outliers that skew model outputs, and store features for model training. Druid is used to monitor the accuracy of machine learning models, retrieving and analyzing ground truths, the “correct answers” to a model’s predictions. Druid also enables rapid, real-time access to machine learning inferences.
Data discovery and exploration
Before training begins, teams have to sift through their training data, analyzing it, understanding its nuances, and discovering key insights. Depending on the use case, this data can come from different sources (such as APIs or stream processors) and arrive in different forms (like CSV files or JSON objects).
Often, this raw data comes in massive amounts. Consider a team designing a cybersecurity model that will assess authentication attempts, rate each on the likelihood of malicious activity, and act accordingly (either denying the login and flagging it for review, or allowing it through). This model will need to be trained on both legitimate logins made from different IP addresses and intrusions using SQL injections, malware, and other tactics.
In this situation, a raw dataset could consist of millions of log entries, each corresponding to a security event. For supervised machine learning, data scientists need to explore these examples to define their training data, with clear examples of suspicious activity, legitimate logins, and cases that blur the line between the two (such as logins made by a real customer using a friend’s computer while on vacation).
By understanding the statistical norms within their dataset, data scientists can determine what will (and will not) be useful in the training phase. They can use Druid to rapidly compute statistics on terabytes to petabytes of data more efficiently than many other databases, calculating aggregates such as the mean, standard deviation, and variance of numeric values within the dataset. For time series data, they may use tools such as line graphs, seasonal decomposition, and autocorrelation plots alongside Druid to uncover trends and associations.
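As a minimal sketch, these summary statistics can be computed with a single query against Druid's SQL HTTP endpoint. The datasource name (`auth_logs`), column name (`response_ms`), and router address below are hypothetical placeholders, and `STDDEV`/`VARIANCE` require the druid-stats core extension to be loaded on the cluster:

```python
import json
from urllib import request

# Hypothetical endpoint: a Druid router listening on localhost:8888.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"

def stats_query(datasource: str, column: str) -> str:
    # AVG is built into Druid SQL; STDDEV and VARIANCE come from the
    # druid-stats core extension.
    return (
        f'SELECT AVG("{column}") AS mean, '
        f'STDDEV("{column}") AS std_dev, '
        f'VARIANCE("{column}") AS variance '
        f'FROM "{datasource}"'
    )

def run_sql(sql: str) -> list:
    """POST a query to Druid's SQL endpoint and return the JSON rows."""
    body = json.dumps({"query": sql}).encode()
    req = request.Request(
        DRUID_SQL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires a running cluster):
# rows = run_sql(stats_query("auth_logs", "response_ms"))
```

Because Druid pushes the aggregation down to the data, the heavy lifting happens in the cluster rather than on the analyst's laptop.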
Data scientists may also find graphics, such as bar charts, histograms, or heatmaps, useful for gaining insights into data distribution. Imply, the company founded by the original creators of Druid, provides Pivot, an intuitive GUI for building and sharing interactive visualizations that enable real-time exploration of massive, complex datasets. With Pivot, actions like filtering, drilling down, or dragging and dropping can be executed in milliseconds.
In fact, Druid’s subsecond query speed is a huge advantage for exploration. It is always frustrating to wait minutes (or longer) for query results. Unlike many other databases, Druid provides millisecond response times independent of the number of users accessing data, the shape of the data, or the rate of queries per second.
Ultimately, the faster the exploration process, the sooner the team can complete data discovery and advance to preparing training data.
Rapid inferencing for real-time decisioning
After models are trained, they are applied to real-world data to recognize patterns and produce predictions, or inferences. However, time-sensitive use cases may be better served by prepared inferences, created ahead of time and stored in a database for fast access.
Druid’s ability to quickly retrieve data enables inference retrieval in milliseconds. For example, an outdoor retailer could use a recommendation engine that runs a model once an hour across all the purchases of the day, determining which products are bought together and are thus associated with each other.
Afterwards, these preset suggestions for various retail items (such as mosquito repellent alongside camping tents or bicycle helmets with bicycling gloves) would be stored together in their database. This ensures that models can immediately find and populate recommendations as customers browse the various product pages.
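A lookup like this can be expressed as one parameterized Druid SQL request. The `product_pairs` datasource and its columns are assumptions for illustration; Druid's SQL API binds the `?` placeholder from the accompanying `parameters` array:

```python
def related_products_payload(product_id: str, limit: int = 5) -> dict:
    """Build a parameterized Druid SQL request that looks up the top
    co-purchased products for one item (schema is hypothetical)."""
    sql = (
        "SELECT co_product_id, SUM(times_bought_together) AS score "
        'FROM "product_pairs" '
        "WHERE product_id = ? "
        "GROUP BY co_product_id "
        "ORDER BY score DESC "
        f"LIMIT {int(limit)}"
    )
    return {
        "query": sql,
        # Druid substitutes the ? placeholder with this typed value.
        "parameters": [{"type": "VARCHAR", "value": product_id}],
    }
```

The resulting dictionary would be serialized to JSON and POSTed to the cluster's `/druid/v2/sql/` endpoint; using a bound parameter rather than string interpolation avoids injection issues in user-facing lookups.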
Another use case is real-time fraud detection, especially for financial institutions. Machine learning models can help review credit card transactions or new bank accounts, rating them based on their likelihood of fraud; any ratings that meet a certain threshold are flagged for further review by human analysts.
In this situation, a model will run fraud prediction inferences on millions of bank accounts on a regular basis (perhaps hourly or daily), before storing them in Druid. Once a new transaction (such as a cash withdrawal) executes, the model can retrieve the relevant inference in milliseconds, enabling it to quickly evaluate the action, rate it on the probability of fraud, and act accordingly. By utilizing these prepared inferences, a model can block suspicious behavior without compromising legitimate activity.
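The retrieval step in this flow amounts to fetching the most recent stored score for one account. A sketch, with a hypothetical `fraud_inferences` datasource (`__time` is Druid's built-in primary timestamp column):

```python
def latest_inference_sql(account_id: str) -> str:
    """Druid SQL for the most recent fraud score of one account.
    Datasource and column names are illustrative, not a real schema."""
    safe_id = account_id.replace("'", "''")  # naive escaping for the sketch
    return (
        "SELECT fraud_score, __time "
        'FROM "fraud_inferences" '
        f"WHERE account_id = '{safe_id}' "
        "ORDER BY __time DESC "
        "LIMIT 1"
    )
```

Because Druid partitions and indexes data by time, a query ordered by `__time` with a tight filter like this is exactly the shape it answers in milliseconds.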
Monitoring model accuracy
Druid is also used to evaluate machine learning results. Verta, which provides infrastructure to facilitate the creation, testing, and iteration of models, chose Druid to power alerting, monitoring, and customer-facing metrics.
For Verta to assess the accuracy of any machine learning model (and determine possible drift), they store all the inputs, outputs, and core tags for identifying data. “This could be millions to billions of predictions per minute,” explains Cory Johannsen, a Senior Software Engineer at Verta. Druid fulfills this use case perfectly, because it was designed to accommodate massive amounts of data cheaply and efficiently.
Verta also had to process ground truths, or the correct answers for a prediction—such as a high likelihood of fraud for a bank transfer. However, as Johannsen points out, ground truth can arrive long after a prediction is produced. Because Druid provides automatic backfill for time-based data, the Verta team can more easily organize their ground truths, even if this information arrives months or years after the fact.
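Once predictions and their late-arriving ground truths are stored side by side, drift shows up as accuracy decaying over time. A minimal, self-contained sketch of that check (the record layout and hourly windows are assumptions, not Verta's actual pipeline):

```python
from collections import defaultdict
from datetime import datetime

def windowed_accuracy(records):
    """Group (timestamp, prediction, ground_truth) records by hour and
    return the fraction of correct predictions in each window."""
    buckets = defaultdict(lambda: [0, 0])  # hour -> [correct, total]
    for ts, prediction, truth in records:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour][0] += int(prediction == truth)
        buckets[hour][1] += 1
    return {hour: correct / total
            for hour, (correct, total) in buckets.items()}

# Example: one mistake in the 9:00 window, a clean 10:00 window.
records = [
    (datetime(2024, 1, 1, 9, 5), "fraud", "fraud"),
    (datetime(2024, 1, 1, 9, 30), "fraud", "legit"),
    (datetime(2024, 1, 1, 10, 2), "legit", "legit"),
]
accuracy_by_hour = windowed_accuracy(records)
```

In practice the per-window counts would come from a Druid `GROUP BY` over the time column rather than a Python loop, with alerting triggered when a window's accuracy drops below a threshold.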
Another advantage was Druid’s extensive support for a wide variety of aggregations and other operations. “It’s hard to overhype this,” Johannsen explains. “Building these aggregators from scratch is error prone—so having a solid, reliable set of pre-built core functions is huge to us.”
With its subsecond queries, massive scalability, and extensive analytics capabilities, Apache Druid can support every aspect of the machine learning pipeline, including data exploration, feature engineering, training, real-time inferencing, and monitoring.
To test Apache Druid in your machine learning environment, sign up for a free trial of Imply Polaris, our fully managed Druid-as-a-service.
To learn more about Druid’s unique architecture and design, read our white paper.