A Buyer’s Guide to OLAP Tools

Dec 29, 2023
William To

Co-author: Darin Briskman

What is an OLAP database?

Online analytical processing (OLAP) tools enable teams and organizations to perform complex analysis on large volumes of data. End users, who can be anyone from business analysts to executives to suppliers to customers, use OLAP products to understand and improve operations, processes, performance, and profitability. 

In contrast to transactional (OLTP) databases, the primary purpose of OLAP is to extract actionable insights, drive change, and enable visibility into all aspects of an organization’s performance—both on the macro and micro levels. While there are exceptions, OLAP databases also typically deal with historical data, rather than the near real-time data of OLTP databases.

As an example, a solar plant operator wants to assess the efficiency of their photovoltaic panels, ensuring that they’re tracking the angle of the sun across the sky and generating the optimal amount of electricity required. To do so, they can go back into one (or more) year’s worth of data, comparing the performance of panels against the weather conditions and solar availability documented at the time. 

With this data, the operator can identify malfunctioning panels or inverters (which convert the direct current generated by panels into usable alternating current), improve their panel tracking, and ultimately, optimize their electricity output.

There are three main types of OLAP databases.

Multidimensional OLAP (MOLAP) databases utilize data cubes (more below) for exploration and analytics. MOLAP databases are ideal for less-technical employees, who may not be familiar with SQL statements or JOINs, to dissect and understand their data. They rely on features such as pre-aggregations and indexing to organize data and ensure fast query responses.

Relational OLAP (ROLAP) databases utilize the tables and relationships of relational databases and the familiarity of Structured Query Language (SQL) for analysis. Due to their architecture, ROLAP databases are both highly scalable and cost effective—by removing MOLAP pre-aggregations, they have a much smaller storage footprint. However, the lack of pre-aggregations also means that queries may take longer to complete.

Hybrid OLAP (HOLAP) databases combine features from ROLAP and MOLAP, storing important data in MOLAP cubes for deep investigation and rapid querying, while keeping the rest in a relational database. This provides the scalability and compact footprint of a relational data architecture alongside the fast retrieval times and user-friendly interface of a MOLAP database.

How do OLAP databases work?

Before an OLAP database can execute any complex operations, it first has to gather data. OLAP products extract data from various sources, such as data lakes, transactional databases, data files (such as CSV or Parquet), streaming services such as Amazon Kinesis or Apache Kafka, and more. Therefore, OLAP platforms must be compatible with a variety of different connectors and other technologies. 

Most OLAP databases follow a variation of the extract-transform-load (ETL) process, where data is collected, transformed into a usable composition for analytics, and finally loaded into the database. Cloud-based data architectures usually use a variation of this process, where data is first extracted, then loaded into the analytics database and transformed inside the database (ELT). 
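The difference between the two orderings can be sketched with Python's built-in sqlite3 module standing in for the analytics database. The table names, sample rows, and the cents-to-dollars transformation are all hypothetical, chosen only to show where the transform step runs.

```python
import sqlite3

# Hypothetical raw records from a source system: (order_id, amount_in_cents)
RAW_ORDERS = [(1, 1250), (2, 399), (3, 10000)]


def run_etl(conn):
    """ETL: transform in application code, then load the finished rows."""
    transformed = [(order_id, cents / 100) for order_id, cents in RAW_ORDERS]
    conn.execute("CREATE TABLE orders_etl (order_id INTEGER, amount_usd REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", transformed)
    return conn.execute(
        "SELECT order_id, amount_usd FROM orders_etl ORDER BY order_id"
    ).fetchall()


def run_elt(conn):
    """ELT: load the raw rows first, then transform with SQL inside the database."""
    conn.execute("CREATE TABLE orders_raw (order_id INTEGER, amount_cents INTEGER)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?)", RAW_ORDERS)
    conn.execute(
        "CREATE TABLE orders_elt AS "
        "SELECT order_id, amount_cents / 100.0 AS amount_usd FROM orders_raw"
    )
    return conn.execute(
        "SELECT order_id, amount_usd FROM orders_elt ORDER BY order_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
etl_rows = run_etl(conn)
elt_rows = run_elt(conn)
```

Both paths end with the same analytics-ready rows. What changes is whether the raw data ever lands in the database, which in the ELT case leaves it available for later, different transformations.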

After data intake is completed, the data has to be transformed into a format or structure (known as a schema) that is suitable for analytics. Transactional systems use schemas optimized to enter data quickly by splitting it into many small pieces, a process called normalization. For analytics, the schema should instead be denormalized, or optimized to answer questions with as few joins as possible.

Normalized data

The same data, denormalized
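As a rough illustration of the two layouts, the sketch below keeps customers, products, and orders in separate keyed structures (normalized) and then flattens them into one wide row per order (denormalized). All names and values are made up for the example.

```python
# Normalized: each fact is stored once and linked by keys (hypothetical tables).
customers = {1: {"name": "Ada", "city": "Austin"}, 2: {"name": "Lin", "city": "Boston"}}
products = {10: {"title": "Tent", "price": 200.0}, 11: {"title": "Stove", "price": 50.0}}
orders = [
    {"order_id": 100, "customer_id": 1, "product_id": 10, "quantity": 1},
    {"order_id": 101, "customer_id": 2, "product_id": 11, "quantity": 3},
    {"order_id": 102, "customer_id": 1, "product_id": 11, "quantity": 2},
]


def denormalize(orders, customers, products):
    """Join the lookup tables into each order so every row is self-contained."""
    wide_rows = []
    for order in orders:
        customer = customers[order["customer_id"]]
        product = products[order["product_id"]]
        wide_rows.append({
            "order_id": order["order_id"],
            "customer": customer["name"],
            "city": customer["city"],
            "product": product["title"],
            "revenue": product["price"] * order["quantity"],
        })
    return wide_rows


wide = denormalize(orders, customers, products)

# An analytical question now needs no lookups: total revenue per city.
revenue_by_city = {}
for row in wide:
    revenue_by_city[row["city"]] = revenue_by_city.get(row["city"], 0.0) + row["revenue"]
```

The wide rows repeat customer and product details, which costs storage but lets an aggregation run over a single table.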

At this stage, data should also be cleaned of “noise,” which includes missing fields, duplicate values, errors, and outliers, each of which can skew the results of any analysis.
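A minimal cleaning pass might look like the following sketch. The sample readings are invented, and the outlier rule (drop anything more than five times the median) is an assumption made for illustration; real pipelines choose thresholds that suit their data.

```python
from statistics import median

# Hypothetical hourly panel readings, including the kinds of noise named above.
readings = [
    {"panel": "A", "hour": 9, "kwh": 4.1},
    {"panel": "A", "hour": 9, "kwh": 4.1},    # exact duplicate
    {"panel": "B", "hour": 9, "kwh": None},   # missing field
    {"panel": "C", "hour": 9, "kwh": 3.9},
    {"panel": "D", "hour": 9, "kwh": 4.3},
    {"panel": "E", "hour": 9, "kwh": 95.0},   # implausible outlier
]


def clean(rows, max_ratio=5.0):
    """Drop incomplete rows, exact duplicates, and values far above the median."""
    complete = [r for r in rows if all(v is not None for v in r.values())]

    seen, unique = set(), []
    for r in complete:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    typical = median(r["kwh"] for r in unique)
    return [r for r in unique if r["kwh"] <= max_ratio * typical]


cleaned = clean(readings)
```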

After data is loaded into an OLAP database and transformed to the right schema, analytics can begin. One popular method of investigating data is through a data cube, which is more of a framework for executing multidimensional analysis than an actual 3D object. 

For example, an analyst at an outdoor retailer needs to create a yearly sales report. To begin, the analyst might select three categories for the data cube’s axes: the highest-selling products across their entire catalog, the most successful stores, and quarterly sales numbers. If necessary, the analyst can also add more dimensions, such as year-over-year change or even buyer demographics. The resulting data cube could look like this:

The analyst is now ready to run the five types of OLAP operations: rollup, which summarizes data at a coarser level of detail; drill down, which disaggregates data into finer detail; slice, which fixes a single dimension at one value to isolate a cross-section of the cube; dice, which selects specific values across multiple dimensions for comparison and contrast; and pivot, which rotates the data cube for a new perspective on data.
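To make these operations concrete, here is a small sketch that models a data cube as a Python dictionary keyed by (product, store, quarter). The sales figures and function names are hypothetical, and a real OLAP engine would implement these operations very differently under the hood.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, store, quarter) -> units sold
cube = {
    ("tent", "denver", "Q1"): 10, ("tent", "denver", "Q2"): 30,
    ("tent", "seattle", "Q1"): 20, ("tent", "seattle", "Q2"): 25,
    ("stove", "denver", "Q1"): 5, ("stove", "denver", "Q2"): 15,
    ("stove", "seattle", "Q1"): 8, ("stove", "seattle", "Q2"): 12,
}
DIMS = ("product", "store", "quarter")


def rollup(cube, keep):
    """Summarize by keeping only the named dimensions and summing over the rest."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def slice_(cube, dim, value):
    """Fix one dimension at a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice(cube, **allowed):
    """Keep a sub-cube: each named dimension is restricted to a set of values."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


def pivot(cube, order):
    """Reorder the dimensions to view the same cells from another angle."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cube.items()}


by_product = rollup(cube, ["product"])                  # rollup: coarse summary
by_product_store = rollup(cube, ["product", "store"])   # drill down: add a dimension back
q1_only = slice_(cube, "quarter", "Q1")
seattle_tents = dice(cube, product={"tent"}, store={"seattle"})
store_first = pivot(cube, ["store", "product", "quarter"])
```

Drill down is shown here as a rollup that keeps one more dimension, since it is the inverse of rollup: the same cube viewed at a finer grain.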

To learn more about data cubes and these operations, please read Imply’s introduction to online analytical processing.

What are some common use cases for OLAP?

Because of its ability to unearth trends and insights, OLAP is a versatile tool that is used across many sectors and industries.

Sales and marketing. Teams use OLAP products to analyze sales data, plot out customer journeys, and improve marketing campaigns. OLAP tools can be used to identify and classify potential customers into segments, create patterns of buyer behavior, and fine tune digital ad targeting.

Finance. OLAP is ideal for financial operations, including budgeting, forecasting, and analysis. Users can determine areas of spend, pinpoint wasteful purchases, estimate future requirements, and explore financial data across dimensions including time and geography.

Inventory and logistics. OLAP can help manage supply chains, tracking deliveries and purchases, automatically replenishing low stock, and determining the most profitable products.

Energy and utility management. Analysts use OLAP to analyze energy consumption, chart out consumer trends, forecast future demand, monitor equipment output, and optimize maintenance schedules. 

Smart asset management. Whether it’s offices or factories, organizations can analyze data from smart buildings to determine times of peak usage, automate climate control by population, reduce energy costs, and predict future foot traffic. 

Site reliability engineering and DevOps. With OLAP, SRE and DevOps teams can comb through reams of data from their digital environment in order to investigate issues, resolve problems before they escalate, restore service outages, and improve operations and processes.

What are the key features of an OLAP tool?

Ultimately, OLAP products are intended to explore large data sets in a flexible, comprehensive manner. Therefore, they will require some combination of the following features.

Multidimensional modeling, or the ability to model data across dimensions and hierarchies, such as time, geography, products, and customers. One example of this is the data cube (mentioned above).

Scalability, to accommodate ever-growing demand and volumes of data. Because OLAP is used to create big picture insights, it has to manage terabytes or petabytes of data generated over long periods of time.

Concurrency, to support highly parallel queries and users. As analytics become more vital to users and organizations, more stakeholders require access to data. At larger organizations, OLAP products may have to contend with hundreds of users running thousands of queries simultaneously.

Support for time series data, which is vital in sectors such as the Internet of Things (IoT). Timestamped data requires specialized features, such as interpolation or gap filling, in order to ensure that there are no missing data values prior to analysis.

Aggregations, such as average, sum, count, and more. These operations are essential for summarizing and reducing the size of data sets, providing statistical insights for reporting and identifying trends, and comparing and contrasting figures to find areas of inefficiency and improvement. 

Flexible, ad hoc queries. When faced with “unknown unknowns,” where teams are unclear as to what exactly they are searching for, they have to explore data in an open-ended manner, without pre-aggregating data. This requires significant resources and, moreover, an easy-to-use interface that can help users of all skill levels drag and drop or point and click to find answers.

Detailed visualizations, such as bar charts, pie graphs, scatter plots, and heatmaps, to present data in an intuitive, interesting format. Ideally, these visualizations would be interactive, enabling users to zoom in and out and take different perspectives on their data. As a bonus, visualizations should be shareable to facilitate collaboration and better disseminate information.

What are the best OLAP tools for business intelligence?

Given the vast number of OLAP products available today, buyers may have difficulty choosing a solution to meet their needs. Here are several options to look at.

Apache Druid

Apache Druid is the database for speed, scale, and streaming data. Designed to support the rapid retrieval times and concurrent traffic of transactional databases, as well as the intensive aggregations and huge datasets of analytical databases, Druid excels at the requirements of real-time analytics. Imply, the company founded by the creators of Apache Druid, also provides an entire product ecosystem built on Druid.

Druid is natively compatible with Apache Kafka and Amazon Kinesis, the two most popular streaming technologies available today. This provides an easy way to ingest real-time data with minimal configuration. Events are also ingested exactly once, ensuring that they are not duplicated, and are available for queries and analysis immediately on arrival.

Druid is also highly scalable, assigning different functions (such as cluster control, query, and storage) to separate node types that can be added or removed as needed. After nodes are scaled up or down, Druid will also automatically retrieve data and workloads from deep storage and rebalance them across the remaining (or added) nodes. 

In addition, Druid can quickly complete queries, even given high query traffic and large user volumes. Queries are executed using the scatter/gather process, being divided up into pieces, sent to the relevant data node for scanning, and finally reassembled by broker nodes—a process that takes milliseconds even on large data sets. To further accelerate the process, Druid also divides data into segments, which can be scanned simultaneously by multiple queries—rather than being locked and scanned by one query at a time.
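The scatter/gather pattern itself can be sketched in a few lines. The example below is a single-process simplification that uses threads in place of data nodes and a merge function in place of a broker; the segment contents and function names are hypothetical and do not reflect Druid's internal code.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical segments: each holds a chunk of rows as (country, clicks)
segments = [
    [("US", 3), ("DE", 1), ("US", 2)],
    [("DE", 4), ("FR", 2)],
    [("US", 1), ("FR", 5), ("DE", 1)],
]


def scan_segment(segment):
    """Scatter step: each worker computes a partial aggregate for one segment."""
    partial = {}
    for country, clicks in segment:
        partial[country] = partial.get(country, 0) + clicks
    return partial


def gather(partials):
    """Gather step: merge the partial aggregates into the final answer."""
    merged = {}
    for partial in partials:
        for country, clicks in partial.items():
            merged[country] = merged.get(country, 0) + clicks
    return merged


# Segments are scanned in parallel, then the partial results are combined.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(scan_segment, segments))

total_clicks = gather(partial_results)
```

Because each partial result is much smaller than the segment it came from, the merge step stays cheap even when the underlying data is large.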

To learn more about how Druid was designed to facilitate fast, subsecond queries under load and at scale, read the Druid architecture whitepaper.

Druid also supports time series data with features like automatic backfill, which will organize late-arriving data into the proper place without human intervention. Druid also includes interpolation and gap filling, which use a variety of methods to fill in missing values and ensure that time series data is ready for analysis.
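For readers unfamiliar with gap filling, the sketch below shows one common method, linear interpolation, in plain Python. It is a generic illustration rather than Druid's implementation, and it assumes a non-empty series whose timestamps are sorted and fall on a regular grid.

```python
def fill_gaps(points, step):
    """Return a gap-free series, linearly interpolating any missing grid points.

    points: non-empty list of (timestamp, value) pairs sorted by timestamp.
    step:   expected spacing between consecutive timestamps.
    """
    filled = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        filled.append((t0, v0))
        missing = (t1 - t0) // step - 1
        for n in range(1, missing + 1):
            t = t0 + n * step
            # Value on the straight line between the two known neighbors.
            filled.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
    filled.append(points[-1])
    return filled


# Readings expected every 60 seconds; the ones at t=120 and t=180 never arrived.
readings = [(0, 10.0), (60, 12.0), (240, 18.0), (300, 17.0)]
series = fill_gaps(readings, step=60)
```

Other fill strategies, such as carrying the last known value forward, follow the same shape and differ only in how the missing value is computed.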

Because Structured Query Language (SQL) is so common, Druid also includes a layer that automatically translates SQL queries into Druid native queries. This layer translates most common SQL expressions, enabling users to work with their language of choice, and removes the need to learn yet another query language.
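As a sketch of what this looks like from a client, the snippet below builds an HTTP request for Druid's SQL endpoint (`/druid/v2/sql`), which accepts a JSON body with a `query` field. The host and port, the `panel_readings` datasource, and its `panel_id` and `kwh` columns are assumptions for illustration; the request is constructed but not sent, so the example runs without a cluster.

```python
import json
import urllib.request

# Assumed address of a local Druid Router; adjust for a real deployment.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"


def build_sql_request(sql, url=DRUID_SQL_URL):
    """Wrap a SQL string in the JSON body that Druid's SQL endpoint expects."""
    payload = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical datasource and columns; __time is Druid's primary timestamp column.
sql = """
SELECT TIME_FLOOR(__time, 'PT1H') AS "hour", panel_id, SUM(kwh) AS total_kwh
FROM panel_readings
GROUP BY 1, 2
ORDER BY 1
"""
request = build_sql_request(sql)

# To run against a live cluster:
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```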

Lastly, Imply also created Pivot, an intuitive GUI for building shareable, interactive visualizations and dashboards. Users can drag and drop, zoom in or out, and investigate data in depth with a few clicks of the mouse—Pivot will automatically handle any required operations (such as SQL queries) on the backend.

Today, Apache Druid is highly rated on G2, the leading peer-to-peer review site for business technologies. In addition, Druid is ranked at 99 on the DB-Engines Ranking—a 129-place improvement from its position last year.

Druid is used by a number of leading organizations in both its open source and paid forms (via Imply). Some notable names include:

Electric adventure vehicle manufacturer Rivian utilizes Druid to power real-time analytics for downstream applications, including remote diagnostics and predictive maintenance.

Streaming media giant Netflix uses Druid to monitor user experience across 300 million devices, four major UIs, and multiple device types (tablet, phone, smart TV, desktop). Netflix then uses this real-time data to consistently deliver a world-class entertainment experience.

Telecommunications leader NTT uses Druid to power their analytics stack, providing ad-hoc exploration to users of all technical abilities. Their environment ingests over 500,000 events per second, stores over 10 terabytes of data in each Druid cluster, and requires significantly fewer resources to ingest, store, and query data in contrast to time-series databases. 

International travel platform Expedia Group uses Druid as the foundation for their internal self-service tool, which empowers users to segment travelers for more precise, effective marketing and advertising. With Druid, they reduced query latency from 24 hours to under five seconds, were able to execute DataSketches at ingestion for fast approximations, and support dynamic criteria across massive datasets. 

Comparison: Apache Druid and Snowflake

Druid and Snowflake differ enough that a side-by-side comparison is not a like-for-like one. While there are overlaps between the two databases, they are used very differently.

Snowflake was created as a cloud data warehouse, which, despite its resemblance to real-time analytics databases like Druid, serves a different purpose. Data warehouses are ideal for regular reporting, at intervals such as daily, weekly, or monthly, usually in areas where factors such as speed or concurrency are less important. This means that Snowflake can execute complex analytical queries on massive datasets, though without subsecond response times.

Snowflake’s compute resources are called virtual warehouses, and are containerized for easy scaling. This provides flexibility in costs (you’ll only pay for what you use) but incurs a performance penalty, as containers require time to spin up. Because of this containerized, partitioned design, Snowflake is a cloud-only product, without options for on-premises or hybrid deployments.

By default, Snowflake supports up to eight concurrent queries in each virtual warehouse, though it can support higher concurrency by deploying more warehouses. However, this means that costs will scale in a linear fashion: 20 warehouses to support 160 parallel queries will bring a 20x increase in expenses.
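The arithmetic behind that estimate is simple enough to sketch. The function names are hypothetical, and the model assumes identically sized warehouses that are each billed the same, which ignores real-world pricing details.

```python
import math

QUERIES_PER_WAREHOUSE = 8  # default concurrency per warehouse cited above


def warehouses_needed(concurrent_queries, per_warehouse=QUERIES_PER_WAREHOUSE):
    """Smallest number of warehouses that can run the given number of queries at once."""
    return math.ceil(concurrent_queries / per_warehouse)


def cost_multiple(concurrent_queries):
    """Cost relative to one warehouse, assuming every warehouse costs the same."""
    return warehouses_needed(concurrent_queries)


needed = warehouses_needed(160)      # 160 / 8 = 20 warehouses
multiple = cost_multiple(160)        # 20 times the cost of a single warehouse
```

Note that the step is discrete: one query beyond a multiple of eight (161, in this example) requires a whole additional warehouse.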

As a data warehouse, Snowflake is also optimized for batch processing of historical data rather than streaming. It also utilizes a relational data model that is compliant with ANSI SQL (the standard that most SQL dialects build on) and includes several APIs and other tools to more easily model data.

Ultimately, if a use case lacks the time pressures of real-time data and analytics, then Snowflake can be a suitable option. These use cases would be more traditional business intelligence and reporting, where deadlines are both longer and more predictable, and both user volumes and query rates are low. Some use cases, such as security and IoT, may not be a fit for Snowflake, given requirements for data to be stored fully or partially in on-premises, physical servers.

However, where Druid excels is performance. Druid is ideal for high-speed, low-latency use cases. If there are many users or applications running multiple queries per second on massive datasets—and they need answers immediately—then Druid is the solution of choice. 

There’s also another alternative: using both Snowflake and Druid in the same environment. In fact, this is the case for many application architectures, which utilize Snowflake for long-term analysis and reporting, and Druid for rapid, real-time use cases. 

To learn more about Apache Druid and Snowflake, read this blog or this comparison chart.

Comparison: Databricks and Apache Druid

Founded by the creators of Apache Spark in 2013, Databricks is a cloud-native analytics database for large volumes of data. As a data lakehouse, Databricks combines the flexibility of data lakes with the structure, performance, and governance of data warehouses. For instance, Databricks stores the raw, unprocessed data in its native format (whether it’s structured, unstructured, or semistructured) like a data lake, while supporting the transactional create, read, update, and delete (CRUD) operations and analytical workloads of a data warehouse.

Databricks enables users to work jointly through interactive notebooks, visualizations, and other shared tools. This provides a high degree of concurrency, as users can simultaneously edit and execute across multiple notebooks, running code, querying data, and testing machine learning all at once. 

Databricks also scales similarly to Druid. Clusters can automatically add or remove nodes to meet demand. Databricks also includes Enhanced Autoscaling, a capability that will allocate resources without impacting pipeline latency.

Because it is based on Spark, Databricks also includes all of Spark’s ancillary features, specifically its libraries and APIs, facilitating operations on massive datasets. Data can be ingested or integrated from various external data sources, such as data lakes, databases, streaming platforms, and more, as Databricks is compatible with a wide range of connectors and other APIs. 

Due to its collaborative features, Databricks is often used for several components of the machine learning process. Its notebooks and libraries are an excellent medium through which data scientists can explore, clean, and prepare data for training, especially in a collaborative manner. Machine learning models can also be developed through a wide range of specialized libraries, and the relevant code written in interactive notebooks. Afterwards, these models can be deployed as REST APIs or batch jobs, monitored for continued accuracy, and corrected for any possible drift that may arise. 

As with Snowflake, determining when to use Druid and when to use Databricks is a matter of requirements. For instance, in areas like AI and machine learning, Databricks is a more versatile solution because it provides the ability to write code in notebooks and ML-specific features, such as automated model selection and hyperparameter tuning. 

Because of its notebooks and concurrency, Databricks is a good match for coordinating work across teams and departments. Databricks also has an advantage in complex extract-transform-load (ETL) pipelines, where data has to be pulled from different sources, processed into a compatible format or structure, and loaded onto a database for storage and analysis. Databricks is also more of a generalist than Druid, as it can execute a broader range of general-purpose tasks such as data warehousing and batch processing.

The architecture of Databricks is designed for high scalability, but not for high performance or high concurrency. It is not suitable when you need subsecond query times or many simultaneous queries, whereas Druid is able to deliver subsecond queries at high concurrency on datasets of any size.

Druid is suitable for part of the ML training process, specifically discovery, rapid inference, and model accuracy monitoring. In this sense, it is more of a specialist—its subsecond response times are invaluable when speed is of the essence, such as for cybersecurity, fraud detection, or instant personalization. 

In fact, anything that requires real-time analytics for fast-moving (usually streaming) data will be better served by Apache Druid, given features like exactly-once ingestion and query on arrival, both crucial for leveraging streaming data at speed. One example is communications-based train control (CBTC), an IoT application that utilizes sensor data to provide visibility into train operations, maintain safe operating conditions, and alert on any anomalies. As a real-time use case involving massive datasets and streaming data, CBTC is an excellent fit for Druid, rather than Databricks.

Druid is also a better match for any use cases that may require time series data, such as for healthcare monitoring or financial analysis. Its built-in tools, such as interpolation and automatic backfill, can help clean time series data of noise, prepare it for analytics, and simplify the developer experience for challenges such as late-arriving data.

Comparison: Oracle and Druid

Oracle was one of the first relational database management systems (RDBMS) that gained widespread adoption and even a spot in the popular imagination. Founded in 1977, immediately on the heels of the pioneering work done by computer scientist Edgar F. Codd on RDBMS, Oracle has since grown to the third-largest software company worldwide. Its primary customers are large enterprises, with products for cloud software, customer relationship management (CRM), and more.

Oracle stores data in tablespaces, storage units that consist of one or more data files (which themselves consist of data stored on disk). It utilizes a relational data model composed of tables with rows and columns, enforcing data integrity and relationships through constraints, keys, and indexes. As such, Oracle uses SQL to query and manage data, and also supports procedural extensions (such as PL/SQL) for more complex operations like triggers or custom functions.

Oracle is capable of supporting a degree of concurrency through multi-version concurrency control (MVCC), a mechanism that enables simultaneous transactions without them interfering with each other. This ensures read consistency and prevents dirty reads, phantom reads, and non-repeatable reads.

In contrast to Druid, Oracle’s primary offering is a transactional database rather than an analytical one. Therefore, it’s designed for Atomicity, Consistency, Isolation, and Durability (ACID), to maintain the reliability and continuity of transactions. While Oracle’s product catalog does include data warehouses and analytics, such as Oracle Analytics Cloud, its core is built around a transactional database. While that database is ideal for running daily business operations, it’s not optimized for the demands of complex analytics on large datasets, and even less so if these analytics must occur in real time.

Perhaps the biggest disadvantage of Oracle is vendor lock-in. Despite Oracle’s extensive product family, some customers may wish to leave and opt for alternatives, which is very difficult to do. For one, Oracle technologies are primarily proprietary; unlike databases like Druid (or indeed, any Apache project), Oracle was never designed to be open source in the first place. In addition, many Oracle systems are tightly integrated with each other, setting up compatibility issues and migration difficulties for any replacement platforms.

Oracle is also infamous for its complex, byzantine licensing and pricing models. Rather than a simple, pay-for-what-you-use plan, Oracle may charge by processor, by user, or some combination of criteria. This leads to confusion, and more importantly, surprising charges that may be difficult to dispute or reduce.

The typical Oracle user is an enterprise with lots of legacy systems, often built by employees who are no longer with the organization. At the same time, the inertia built up by decades of use and expertise in Oracle’s ecosystem makes any transition a difficult task, fraught with new training and onboarding, the potential to break plenty of small features or bits of code, and ultimately, a lot of overhead and employee hours.

In contrast, Druid users tend to be younger companies and organizations with fewer legacy systems. That’s not to say that enterprises don’t use Druid (plenty do), but they likely are less reliant on established vendors like Oracle for their core functions. Skunkworks or tiger teams at large corporations may also be drawn to Druid for its flexible deployment options, particularly if they are building a prototype or a separate application that is disconnected from older software. In this case, they can choose between running their own open source Druid clusters or going with an Imply option, such as transitioning directly into cloud-based Polaris or splitting the difference between on-premises servers and the cloud with Hybrid.

Why choose Apache Druid?

Apache Druid is the database for speed, scale, and streaming data. As such, it was created to address the challenges of ingesting streaming data through software such as Apache Kafka or Amazon Kinesis, and to structure data for subsecond response times under load.

Druid isn’t a transactional database or a generalist analytics database, but it does excel at real-time analytics. As data increases in speed and volume, end users—whether it’s organizations or applications—have to act immediately, which is where Druid comes in. By executing fast, complex analytical queries on large volumes of data for many parallel users, Druid lends itself to challenges such as fraud prevention, industrial safety, healthcare monitoring, instant personalization, and much more. 

Imply also features paid products including Polaris, a database-as-a-service and the easiest way to get started with Druid. Another valuable product is Pivot, an intuitive GUI for building rich, interactive visualizations and dashboards for either external or internal use.

To learn more about Druid, read our architecture guide.
To learn more about real-time analytics, request a free demo of Imply Polaris, the Apache Druid database-as-a-service, or watch this webinar.


An excellent addition to any machine learning environment, Apache Druid® can facilitate analytics, streamline monitoring, and add real-time data to operations and training

Learn More
Aug 11, 2023

Introducing Apache Druid 27.0.0

Apache Druid® is an open-source distributed database designed for real-time analytics at scale. Apache Druid 27.0 contains over 350 commits & 46 contributors. This release's focus is on stability and scaling...

Learn More
Aug 10, 2023

Unleashing Real-Time Analytics in APJ: Introducing Imply Polaris on AWS AP-South-1

Imply, the company founded by the original creators of Apache Druid, has exciting news for developers in India seeking to build real-time analytics applications. Introducing Imply Polaris, a powerful database-as-a-Service...

Learn More
Aug 03, 2023

Embedding Visualizations using React and Express

In this guide, we will walk you through creating a very simple web app that shows a different embedded chart for each user selected from a drop-down. While this example is simple it highlights the possibilities...

Learn More
Jul 25, 2023

Apache Druid: Making 1000+ QPS for Analytics Look Easy

This 2-part blog post explores key technical considerations to support high QPS for analytics and the strengths of Apache Druid

Learn More
Jul 25, 2023

Things to Consider When Scaling Analytics for High QPS

This 2-part blog post explores key technical considerations to support high QPS for analytics and the strengths of Apache Druid

Learn More
Jul 20, 2023

Automate Streaming Data Ingestion with Kafka and Druid

In this blog post, we explore the integration of Kafka and Druid for data stream management and analysis, emphasizing automatic topic detection and ingestion. We delve into the creation of 'Ingestion Spec',...

Learn More
Jul 12, 2023

Schema Auto-Discovery with Apache Druid

This guide explores configuring Apache Druid to receive Kafka streaming messages. To demonstrate Druid's game-changing automatic schema discovery. Using a real-world scenario where data changes are handled...

Learn More
Jul 11, 2023

What’s new in Imply Polaris – Q2 2023

Imply Polaris, our ever-evolving Database-as-a-Service, recently focused on global expansion, enhanced security, and improved data handling and visualization. This fully managed cloud service, based on Apache...

Learn More
Jun 06, 2023

Introducing hands-on developer tutorials for Apache Druid

The objective of this blog is to introduce the new set of interactive tutorials focused on the Druid API fundamentals. These tutorials are available as Jupyter Notebooks and can be downloaded as a Docker container.

Learn More
Jun 01, 2023

Introducing Schema Auto-Discovery in Apache Druid

In this blog article I’ll unpack schema auto-discovery, a new feature now available in Druid 26.0, that enables Druid to automatically discover data fields and data types and update tables to match changing...

Learn More
May 30, 2023

Exploring Unnest in Druid

Druid now has a new function, Unnest. Unnest explodes an array into individual elements. This blog contains design methodology and examples for this new Unnest function both from native and SQL binding perspectives.

Learn More
May 28, 2023

What’s new in Imply Polaris – Our Real-Time Analytics DBaaS

Every week we add new features and capabilities to Imply Polaris. This month, we’ve expanded security capabilities, added new query functionality, and made it easier to monitor your service with your preferred...

Learn More
May 24, 2023

Introducing Apache Druid 26.0

Apache Druid® 26.0, an open-source distributed database for real-time analytics, has seen significant improvements with 411 new commits, a 40% increase from version 25.0. The expanded contributor base of 60...

Learn More
May 22, 2023

ACID and Apache Druid

ACID and Druid, an interesting dive into some of the Druid capabilities in the light of ACID compliance

Learn More
May 21, 2023

How to Build a Sentiment Analysis Application with ChatGPT and Druid

Leveraging ChatGPT for sentiment analysis, when combined with Apache Druid, offers results from large data volumes. This integration is easily achievable, revealing valuable insights and trends for businesses...

Learn More
May 21, 2023

Snowflake and Apache Druid

In this blog, we will compare Snowflake and Druid. It is important to note that reporting data warehouses and real-time analytics databases are different domains. Choosing the right tool for your specific requirements...

Learn More
May 20, 2023

Learn how to achieve sub-second responses with Apache Druid

Learn how to achieve sub-second responses with Apache Druid. This article is an in-depth look at how Druid resolves queries and describes data modeling techniques that improve performance.

Learn More
May 19, 2023

Apache Druid – Recovering Dropped Segments

Apache Druid uses load rules to manage the ageing of segments from one historical tier to another and finally to purge old segments from the cluster. In this article, we’ll show what happens when you make...

Learn More
May 18, 2023

Real-Time Analytics: Building Blocks and Architecture

This blog identifies the key technical considerations for real-time analytics. It answers what is the right data architecture and why. It spotlights the technologies used at Confluent, Reddit, Target and 1000s...

Learn More
May 17, 2023

Transactions Come and Go, but Events are Forever

For decades, analytics has focused on Transactions. While Transactions are still important, the future of analytics is understanding Events.

Learn More
May 16, 2023

What’s new in Imply Polaris – Our Real-Time Analytics DBaaS

This blog explains some of the new features, functionality and connectivity added to Imply Polaris over the last two months. We've expanded ingestion capabilities, simplified operations and increased reliability...

Learn More
May 15, 2023

Elasticsearch and Druid

This blog will help you understand what Elasticsearch and Druid do well and will help you decide whether you need one or both to reach your goals

Learn More
May 14, 2023

Wow, that was easy – Up and running with Apache Druid

The objective of this blog is to provide a step-by-step guide on setting up Druid locally, including the use of SQL ingestion for importing data and executing analytical queries.

Learn More
May 13, 2023

Top 7 Questions about Kafka and Druid

Read on to learn more about common questions and answers about using Kafka with Druid.

Learn More
May 12, 2023

Tales at Scale Podcast Kicks off with the Apache Druid Origin Story

Tales at Scale cracks open the world of analytics projects and shares stories from developers and engineers who are building analytics applications or working within the real-time data space. One of the key...

Learn More
May 11, 2023

Real-time Analytics Database uses partitioning and pruning to achieve its legendary performance

Apache Druid uses partitioning (splitting data) and pruning (selecting subset of data) to achieve its legendary performance. Learn how to use the CLUSTERED BY clause during ingestion for performance and high...

Learn More
May 10, 2023

Easily embed analytics into your own apps with Imply’s DBaaS

This blog explains how developers can leverage Imply Polaris to embed robust visualization options directly into their own applications without them having to build a UI. This is super important because consuming...

Learn More
May 09, 2023

Building an Event Analytics Pipeline with Confluent Cloud and Imply’s real time DBaaS, Polaris

Learn how to set up a pipeline that generates a simulated clickstream event stream and sends it to Confluent Cloud, processes the raw clickstream data using managed ksqlDB in Confluent Cloud, delivers the processed...

Learn More
May 08, 2023

Real time DBaaS comes to Europe

We are excited to announce the availability of Imply Polaris in Europe, specifically in AWS eu-central-1 region based in Frankfurt. Since its launch in March 2022, Imply Polaris, the fully managed Database-as-a-Service...

Learn More
May 07, 2023

Stream big, think bigger—Analyze streaming data at scale in 2023

Imply is predicting the next "big thing" in 2023 will be analyzing streaming data in real time (and Druid is built for just that!)

Learn More
May 07, 2023

Should You Build or Buy Security Analytics for SecOps?

When should you build—or buy—a security analytics platform for your environment? Here are some common considerations—and how Apache Druid is the ideal foundation for any in-house security solution.

Learn More
May 05, 2023

Introducing Apache Druid 25.0

Apache Druid 25.0 contains over 293 updates from over 56 contributors.

Learn More
May 03, 2023

Druid and SQL syntax

This is a technical blog, which summarises the process of extending the Druid's SQL grammar for ingestion and delves into the nitty gritty of Calcite.

Learn More
May 02, 2023

Native support for semi-structured data in Apache Druid

Describes a new feature- ingest complex data as is into Druid- massive improvement in developer productivity

Learn More
May 01, 2023

Real-Time Analytics with Imply Polaris: From Setup to Visualization

Imply Polaris offers reduced operational overhead and elastic scaling for efficient real-time analytics that helps you unlock your data's potential.

Learn More
May 01, 2023

Datanami Award

Apache Druid won Datanami's 2022 Readers’ and Editors’ Choice Awards for Reader's Choice "Best Data and AI Product or Technology: Analytics Database".

Learn More
Apr 30, 2023

Alerting and Security Features in Polaris

Describes new features - alerts and some security features- and how Imply customers can leverage it

Learn More
Apr 29, 2023

Ingestion from Amazon Kinesis and S3 into Imply Polaris

Imply Polaris now supports data ingestion from Amazon Kinesis and Amazon S3

Learn More
Apr 27, 2023

Getting the Most Out of your Data

Ingesting data from one table to another is easy and fast in Imply Polaris!

Learn More
Apr 26, 2023

Combating financial fraud and money laundering at scale with Apache Druid

Learn how Apache Druid enables financial services firms and FinTech companies to get immediate insights from petabytes-plus data volumes for anti-fraud and anti-money laundering compliance.

Learn More
Apr 26, 2023

What’s new in Imply – December 2022

This is a what's new to Imply in Dec 2022. We’ve added two new features to Imply Polaris to make it easier for your end users to take advantage of real-time insights.

Learn More
Apr 25, 2023

What’s New in Imply Polaris – November 2022

This blog provides an overview for the new features, functionality, and connectivity to Imply Polaris for November 2022.

Learn More
Apr 24, 2023

Imply Pivot delivers the final mile for modern analytics applications

This blog is focused on how Imply Pivot delivers the final mile for building an anlaytics app. It showcases two customer examples - Twitch and ironsource.

Learn More
Apr 23, 2023

Why Analytics Need More than a Data Warehouse

For decades, analytics has been defined by the standard reporting and BI workflow, supported by the data warehouse. Now, 1000s of companies are realizing an expansion of analytics beyond reporting, which requires...

Learn More
Apr 21, 2023

Why Open Source Matters for Databases

Apache Druid is at the heart of Imply. We’re an open source business, and that’s why we’re committed to making Druid the best open source database for modern analytics applications

Learn More
Apr 20, 2023

Ingestion from Confluent Cloud and Kafka in Polaris

How to ingest data into Imply Polaris from Confluent Cloud and from Apache Kafka

Learn More
Apr 18, 2023

What Makes a Database Built for Streaming Data?

For an analytics app to handle real-time, streaming sources, it must be built for streaming data. Druid has 3 essential features for stream data.

Learn More
Oct 12, 2022

SQL-based Transformations and JSON Columns in Imply Polaris

You can easily do data transformations and manage JSON data with Imply Polaris, both using SQL.

Learn More
Oct 06, 2022

Approximate Distinct Counts in Imply Polaris

When it comes to modern data analytics applications, speed is of the utmost importance. In this blog we discuss two approximation algorithms which can be used to greatly enhance speed with only a slight reduction...

Learn More
Sep 20, 2022

The next chapter for Imply Polaris: celebrating 250+ accounts, continued innovation

Today we announced the next iteration of Imply Polaris, the fully managed Database-as-a-Service that helps you build modern analytics applications faster, cheaper, and with less effort. Since its launch in...

Learn More
Sep 20, 2022

Introducing Imply’s Total Value Guarantee for Apache Druid

Apache Druid 24.0 contains 450 updates and new features, major performance enhancements, bug fixes, and major documentation improvements

Learn More
Sep 16, 2022

Introducing Apache Druid 24.0

Apache Druid 24.0 contains 450 updates and new features, major performance enhancements, bug fixes, and major documentation improvements

Learn More
Aug 16, 2022

Using Imply Pivot with Druid to Deduplicate Timeseries Data

Imply Pivot offers multi step aggregations, which is valuable for timeseries data where measures are not evenly distributed in time.

Learn More
Jul 21, 2022

A Look Under the Surface at Polaris Security

We have taken a security-first approach in building the easiest real-time database for modern analytics applications.

Learn More
Jul 14, 2022

Upserts and Data Deduplication with Druid

A look at what can be done with Druid for upserts and data deduplication.

Learn More
Jul 01, 2022

What Developers Can Build with Apache Druid

We obviously talk a lot about #ApacheDruid on here. But what are folks actually building with Druid? What is a modern analytics application, exactly? Let's find out

Learn More
Jun 29, 2022

When Streaming Analytics… Isn’t

Nearly all databases are designed for batch processing, which leaves three options for stream analytics.

Learn More
Jun 29, 2022

Apache Druid vs. Snowflake

Elasticity is important, but beware the database that can only save you money when your application is not in use. The best solution will have excellent price-performance under all conditions.

Learn More
Jun 22, 2022

Druid 0.23 – Features And Capabilities For Advanced Scenarios

Many of Druid’s improvements focus on building a solid foundation, including making the system more stable, easier to use, faster to scale, and better integrated with the rest of the data ecosystem. But for...

Learn More
Jun 22, 2022

Introducing Apache Druid 0.23

Apache Druid 0.23.0 contains over 450 updates, including new features, major performance enhancements, bug fixes, and major documentation improvements.

Learn More
Jun 20, 2022

An Opinionated Guide to Component APIs

We have collected a number of guidelines for React component APIs that make components more predictable in terms of behavior and performance.

Learn More
Jun 10, 2022

Druid Architecture & Concepts

In a world full of databases, learn how Apache Druid makes real-time analytics apps a reality in this Whitepaper from Imply

Learn More
May 25, 2022

3 decisions that shaped the Polaris UI

Imply Polaris is a fully managed database-as-a-service for building realtime analytics applications. John is the tech lead for the Polaris UI, known internally as the Unified App. It began with a profound question:...

Learn More
May 19, 2022

How Imply Polaris takes a security-first approach

A primer for developers on security tools and controls available in Imply Polaris

Learn More
May 17, 2022

Imply Raises $100MM in Series D funding

There is a new category within data analytics emerging which is not centered in the world of reports and dashboards (the purview of data analysts and data scientists), but instead centered in the world of applications...

Learn More
May 11, 2022

Imply Named “Cool Database Vendor” by CRN

There can’t be one database good at everything. When it comes to real-time analytics, you need a database built for it.

Learn More
May 11, 2022

Living the Stream

We are in the early stages of a stream revolution, as developers build modern transactional and analytic applications that use real-time data continuously delivered.

Learn More
May 02, 2022

Migrating Data from ClickHouse to Imply Polaris

In this blog, we’ll review the simple steps to export data from ClickHouse in a format that is easy to ingest into Polaris.

Learn More
Apr 06, 2022

Java Keytool, TLS, and Zookeeper Security

Lean the basics of Public Key Infrastructure (PKI) as it relates to Druid and Zookeeper security.

Learn More
Apr 01, 2022

Building high performance logging analytics with Polaris and Logstash

When you think of querying with Apache Druid, you probably imagine queries over massive data sets that run in less than a second. This blog is about some of the things we did as a team to discover the user...

Learn More
Apr 01, 2022

For April 1st: a New Description of Apache Druid from Our Youngest Technical Architect

A simple set of instructions to deploy Apache Druid on minikube using minio for local deep storage on your laptop.

Learn More

Let us help with your analytics apps

Request a Demo