Apache Druid on Google Cloud Platform (GCP) Reference Architecture
Jan 29, 2020
Matt Sarrel
When I approach a new distributed technology, I usually find it helpful to read through a reference architecture and a quickstart or two before I get my hands dirty. It helps me prepare for potential snags such as missing dependencies, permission issues, and provisioning unsuitable instances, so I can hopefully avoid some and minimize the rest.
Muthu Lalapet, a Solutions Architect at Imply, recently wrote a reference architecture for Apache Druid on Google Cloud Platform (GCP) that includes some best practices for leveraging GCP services such as Compute Engine, Cloud Storage and Cloud SQL. The document describes example cluster architectures and their accompanying machine types and configurations. As such, it’s a helpful resource for planning and implementing Druid on GCP.
Apache Druid is a real-time analytics database designed for ultrafast query response on large datasets. Druid can scale to ingest millions of events per second, store trillions of events (petabytes of data), and deliver sub-second query response times at scale. Druid is most commonly used where real-time (streaming) ingestion, fast query performance, and zero downtime are critical. This makes Druid a good choice for operational analytics projects that provide real-time intelligence: delivering information as events occur so that businesses can gain immediate insight. Some of the largest online entities in the world rely on Druid for use cases such as clickstream analytics; application, device, and network performance monitoring; and BI/OLAP. While Druid ingests data from a variety of sources, it is commonly paired with Kafka on GCP for event monitoring, financial analysis, and IoT monitoring.
Druid is cloud-native and runs as server types that each host a group of processes. At a high level, there is the Master server, which coordinates data ingestion and storage; the Data server, which ingests and stores data; and the Query server, which acts as the endpoint for users and client applications. Druid also relies on external metadata storage, deep storage, and Apache ZooKeeper to coordinate its processes.
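As a rough sketch, here is how Druid's processes map onto those server types (process names from Druid's documentation; the exact grouping can vary by deployment):

```
Master server:  Coordinator   (manages data availability on the cluster)
                Overlord      (controls assignment of ingestion tasks)
Data server:    Historical    (stores and serves queryable data)
                MiddleManager (runs ingestion tasks)
Query server:   Broker        (routes queries to Data servers and merges results)
                Router        (optional; routes requests to Brokers,
                               Coordinators, and Overlords)
External:       metadata storage, deep storage, ZooKeeper
```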
There’s a lot of detail (and years of development) underlying this simple explanation, and you can learn all about it when you download the reference architecture.
Google Cloud Platform is a collection of services and resources built and operated by Google in its global data centers. Google's many offerings include IaaS, PaaS, and serverless computing options for data storage and data analytics. You can subscribe to replace part or all of an enterprise IT infrastructure, use an automated machine learning environment, or run open source technology on Google's hardware.
Many GCP customers are drawn to the platform because of its tight integration with open source frameworks, making them easier to learn, develop for, and operate in production. Druid is one such open source technology, and there are a number of very large Druid deployments on GCP.
While there are many services offered on Google's infrastructure, Compute Engine, Cloud Storage, and Cloud SQL are the most important components when it comes to running Druid. Compute Engine provides high-performance virtual machines that can be used to run the Druid components described above (Master, Data, and Query servers). Druid uses Google Cloud Storage as deep storage and Cloud SQL as metadata storage. Cloud SQL lets you run MySQL, PostgreSQL, or SQL Server to house Druid metadata, and it automates the tasks needed to create a highly available metadata environment. As Druid stores segments in deep storage, Google Cloud Storage provides durability and high availability for data as long as the Druid processes can connect to it.
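As a minimal sketch of what wiring this together looks like, the relevant settings in Druid's `common.runtime.properties` might resemble the following (the bucket name, prefix, host, and credentials are placeholders, not values from the reference architecture):

```properties
# Load the extensions for Google Cloud Storage deep storage
# and MySQL metadata storage (Cloud SQL's MySQL option).
druid.extensions.loadList=["druid-google-extensions", "mysql-metadata-storage"]

# Deep storage: a Google Cloud Storage bucket (placeholder names).
druid.storage.type=google
druid.google.bucket=my-druid-deep-storage
druid.google.prefix=druid/segments

# Metadata storage: a Cloud SQL MySQL instance (placeholder host and credentials).
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://10.0.0.5:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=changeme
```

The reference architecture itself covers which instance types to place these processes on; this fragment only illustrates the storage integrations described above.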
The reference architecture provides guidance on the type of instance to choose when getting started, and for those who want to dig deeper, here’s a more general discussion of Druid server sizing and cluster tuning.
Here’s a tip from the reference architecture: use the Standard Storage class for the best availability and performance. How’s that for something helpful to learn before provisioning resources?
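For instance, a deep-storage bucket can be created with the Standard class up front, a provisioning sketch assuming `gsutil` is installed and authenticated (bucket name and region are hypothetical):

```shell
# Create a GCS bucket in the Standard storage class for Druid deep storage.
# Bucket name and region are placeholders.
gsutil mb -c standard -l us-central1 gs://my-druid-deep-storage/

# Verify the default storage class of the new bucket.
gsutil ls -L -b gs://my-druid-deep-storage/
```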