Building high performance logging analytics with Polaris and Logstash
Apr 01, 2022
Jad Naous
The current crop of logging solutions out there is great for unstructured search. However, these solutions fall flat when it comes to analytics across potentially hundreds of high-cardinality columns. This makes it particularly problematic to analyze logs such as HTTP access logs, which may contain millions of distinct IP addresses and billions of URL paths.
At this point, users often reach for systems like Druid. In this blog, I’ll show you how to connect Logstash, a log collection tool typically used as part of the ELK (Elasticsearch + Logstash + Kibana) stack, to push data to Imply Polaris, our Database-as-a-Service offering powered by Druid. As part of the solution, we’ll also use Polaris’s built-in analytics capabilities, instead of Kibana, for visualization and analysis.
Logstash, the L in ELK, is a tool that helps aggregate logs, parse them, and push them to a sink. You can think of it as running one or more processing pipelines where events come in on one end, and structured events get sent out of the other. For the purposes of this tutorial, let’s set up an Apache HTTP server with Logstash configured to read access logs and forward them to Polaris. In a production setup, you’re more likely to use Filebeat and connect it to a more centralized Logstash server that parses and forwards events to Polaris.
We’ll be using Logstash’s file input plugin and HTTP output plugin to read the access logs and push them to Imply.
Set up Polaris
If you haven’t already done so, sign up for a Polaris account at https://imply.io/polaris-signup and log in. You can proceed with the other pieces of this tutorial while your environment is spinning up and come back here when ready.
Create an API client
Now that you’ve logged in, create an “API client” to identify your application with Polaris. Polaris uses OAuth2 to authenticate API calls. Simply navigate to the User management console and then click API Clients:
There, create a client. Because Logstash’s HTTP output plugin doesn’t support OAuth out of the box, we’re going to create a long-lived access token to use in this example. Set the Access Token Lifespan to a duration long enough for this tutorial, say 1 day. Go to the Tokens tab and download the token. Open the downloaded JSON file and copy the token value. You’ll need this token later for your Logstash configuration.
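If you prefer to grab the token from the command line, a jq one-liner like the one below works. The file name and the access_token field name here are assumptions for illustration, so check the structure of the file you actually downloaded:
$ jq -r '.access_token' polaris-token.json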
Be careful with this access token. Anyone in possession of it can call any API as you. A better approach is to create a user with more limited access (a more targeted set of permissions) that can only send data. An even better approach is to write some code using Logstash’s Ruby filter plugin to acquire an access token as needed, but that’s beyond the scope of this blog. See the Polaris Developer guide for more details.
Create a table
Next, in the Data section in the left navigation, click Tables > Create table. Give your table a name. I called mine “accesslogs”.
In the table detail page, click Edit Schema. Add the following columns:
host_name as string to store the name of the host where the log message is read
http_version as string to store the HTTP version on the request
http_request_referrer as string to store the referrer, if present
http_request_method as string to store the HTTP verb used on the request, e.g., GET or POST
url as string to store the request URL
source_address as string to store the client address
user_agent as string to store the User-Agent header if present
http_response_status_code as string to store the status code. I chose string as the type because Druid does not index numeric fields (yet!)
http_response_body_bytes as long to store the size of the response body
After you save the schema, the UI displays the API endpoint where you can push individual events. Copy that endpoint URL and save it! You’ll use it later to push logs to the table.
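If you want to sanity-check the endpoint and token before wiring up Logstash, you can push a single hand-written event with cURL. This is just a minimal sketch: the EVENTS_ENDPOINT and ACCESS_TOKEN placeholders and the sample field values are assumptions you should replace with your own.
$ curl -X POST 'EVENTS_ENDPOINT' \
    -H 'Authorization: Bearer ACCESS_TOKEN' \
    -H 'Content-Type: application/json' \
    -d '{"__time": "2022-04-01T00:00:00Z", "host_name": "test-host", "http_request_method": "GET", "url": "/", "source_address": "127.0.0.1", "http_response_status_code": "200", "http_response_body_bytes": 512}'
If the request succeeds, the test event should show up on the table detail page shortly afterwards.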
Set up Apache HTTP Server
If you don’t already have a working web server with logs you’d like to use, you can follow this section to get a server working on Ubuntu. For this tutorial, I installed Ubuntu 20.04.4 LTS into a VM. Then I installed the Apache server using:
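# Install Apache from the Ubuntu package repositories (standard commands; adjust for your setup)
sudo apt update
sudo apt install -y apache2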
If you already have a working server, you might just want to use the access logs from that server. This tutorial will use the default access log path on Ubuntu at /var/log/apache2/access.log.
If you haven’t already, download and extract Logstash. Then create a file called my-polaris-pipeline.conf inside Logstash’s config directory with the content below. Replace the EVENTS_ENDPOINT and ACCESS_TOKEN placeholders with the URL of your table’s events endpoint and your access token, respectively:
input {
  # This section tells logstash to read the access log file
  file {
    mode => "tail"
    path => ["/var/log/apache2/access.log"]
    start_position => "beginning"
  }
}
# This section parses the access log
filter {
  # Use the grok filter to parse the apache access log here
  grok {
    match => {
      "message" => "%{COMBINEDAPACHELOG}"
    }
  }
  # We need to parse the timestamp coming from the access log correctly. This tells logstash how
  # to read the time from the access log
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
  # And we need to flatten the fields that we're reading from the log to push them
  mutate {
    rename => {
      "[http][version]" => "[http_version]"
      "[http][request][method]" => "[http_request_method]"
      "[http][request][referrer]" => "[http_request_referrer]"
      "[http][response][status_code]" => "[http_response_status_code]"
      "[http][response][body][bytes]" => "[http_response_body_bytes]"
      "[url][original]" => "[url]"
      "[source][address]" => "[source_address]"
      "[host][name]" => "[host_name]"
      "[user_agent][original]" => "[user_agent]"
    }
    # Polaris requires a __time field. We store the parsed event timestamp in __time
    copy => {
      "@timestamp" => "__time"
    }
    # We remove everything else that we don't need
    remove_field => [ "http", "event", "process", "source", "host" ]
  }
}
# This section tells Logstash where to send the data
output {
  # Use the http output plugin
  http {
    # Set the URL to your table's event endpoint here by replacing EVENTS_ENDPOINT
    url => "EVENTS_ENDPOINT"
    http_method => "post"
    format => "json"
    # This is where we'll use the access token. Replace ACCESS_TOKEN below with your access token from the downloaded file.
    headers => {
      "Authorization" => "Bearer ACCESS_TOKEN"
    }
  }
}
From the logstash directory, start Logstash as follows:
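# Run Logstash with the pipeline file created above (this assumes you saved it under the config directory)
bin/logstash -f config/my-polaris-pipeline.conf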
Navigate to your Polaris environment and open your table detail page to see your log data flowing!
Use cURL to run HTTP requests against your web server and generate some more Apache access log entries. For example:
$ curl http://localhost
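A few more requests with different paths will give you more variety to explore later. The paths below are arbitrary examples; the second one returns a 404 because the file doesn’t exist:
$ curl http://localhost/index.html
$ curl http://localhost/does-not-exist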
Explore data with Polaris analytics
Let’s create a data cube with Polaris to visualize our data:
Click Data cubes under Analytics in the left navigation
Click New data cube on the top right.
Select your table from the drop-down and click Next: Create data cube.
You don’t need to modify the data cube settings. Just click Save.
Now you can access your new data cube to explore your data:
This blog won’t go into too much detail about visualizations. For that, I recommend reading through the documentation for Pivot, the Imply Enterprise data visualization tool. For now, let’s try a couple of things.
First, let’s break down the records by status code. Drag and drop “Http Response Status Code” from the left panel under Dimensions to the Show bar:
Interesting. I didn’t even try to generate a 408 response. I wonder what that is. Click on the 408 response and select Filter:
Now let’s view that record directly. Click the visualization selector, currently Table on the top right, and choose the Record view:
Cool! You can now use Polaris’s built-in analytics capabilities to slice and dice the Logstash data and build dashboards.
In this blog I’ve shown you how to set up a basic Logstash configuration that parses Apache HTTP Server access logs and pushes them to Polaris, and how to analyze the results with Pivot.