Apr 1, 2022

Building high performance logging analytics with Polaris and Logstash

The current crop of logging solutions is great for unstructured search. However, these solutions fall flat when it comes to analytics on potentially hundreds of columns, many of them with high cardinality. This makes it particularly hard to analyze logs such as HTTP access logs, which may contain millions of distinct IP addresses and billions of URL paths.

At this point, users often reach for systems like Druid. In this blog, I’ll show you how to use Logstash, a tool for collecting logs that is typically used as part of the ELK (Elasticsearch + Logstash + Kibana) stack, to push data to Imply Polaris, our database-as-a-service offering powered by Druid. As part of the solution, we’ll also use Polaris’s built-in analytics capabilities, instead of Kibana, for visualizations and analysis.

Logstash, the L in ELK, is a tool that helps aggregate logs, parse them, and push them to a sink. You can think of it as running one or more processing pipelines where events come in on one end, and structured events get sent out of the other. For the purposes of this tutorial, let’s set up an Apache HTTP server with Logstash configured to read access logs and forward them to Polaris. In a production setup, you’re more likely to use Filebeat and connect it to a more centralized Logstash server that parses and forwards events to Polaris.

We’ll be using Logstash’s file input plugin and HTTP output plugin to read the access logs and push them to Imply.

Set up Polaris

If you haven’t already done so, sign up for a Polaris account at https://imply.io/polaris-signup and log in. You can proceed with the other pieces of this tutorial while your environment is spinning up and come back here when ready.

Create an API client

Now that you’ve logged in, create an “API client” to identify your application with Polaris. Polaris uses OAuth2 to authenticate API calls. Navigate to the User management console and then click API Clients:

There, create a client. Because Logstash’s HTTP output plugin doesn’t support OAuth out of the box, we’re going to create a long-lived access token for this example. Set the Access Token Lifespan to cover the duration of this tutorial, say 1 day. Go to the Tokens tab and download the token. Open the downloaded JSON file and copy the token. You’ll need it later for your Logstash configuration.

Be careful with this access token. Anyone in possession of it can call any API as you. A better approach is to create a user with a more targeted set of permissions that can only send data. An even better approach is to write some code using Logstash’s Ruby filter plugin to acquire an access token as needed, but that’s beyond the scope of this blog. See the Polaris Developer guide for more details.

Create a table

Next, in the Data section in the left navigation, click Tables > Create table. Give your table a name. I called mine “accesslogs”.

In the table detail page, click Edit Schema. Add the following columns:
  • host_name as string to store the name of the host where the log message is read
  • http_version as string to store the HTTP version on the request
  • http_request_referrer as string to store the referrer, if present
  • http_request_method as string to store the verb used on the request, e.g. POST or GET
  • url as string to store the request url
  • source_address as string to store the client address
  • user_agent as string to store the User-Agent header if present
  • http_response_status_code as string to store the status code. I chose string as type because Druid does not index numeric fields (yet!)
  • http_response_body_bytes as long to store the size of the response body
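To make the schema concrete, here is a minimal Python sketch of a single event shaped like the columns above. The sample values are invented for illustration, `__time` is the timestamp field that Polaris requires, and the `missing_columns` helper is hypothetical (it is not part of any Polaris API). It only checks that an event carries every column before you send it anywhere.

```python
import json

# Column names from the schema above, plus "__time", the timestamp field Polaris requires.
EXPECTED_COLUMNS = {
    "__time",
    "host_name",
    "http_version",
    "http_request_referrer",
    "http_request_method",
    "url",
    "source_address",
    "user_agent",
    "http_response_status_code",
    "http_response_body_bytes",
}


def missing_columns(event: dict) -> set:
    """Return the schema columns that are absent from the event (hypothetical helper)."""
    return EXPECTED_COLUMNS - set(event)


# Invented sample values, for illustration only.
sample_event = {
    "__time": "2022-04-01T12:34:56Z",
    "host_name": "web-1",
    "http_version": "1.1",
    "http_request_referrer": None,
    "http_request_method": "GET",
    "url": "/index.html",
    "source_address": "203.0.113.7",
    "user_agent": "curl/7.68.0",
    "http_response_status_code": "200",  # stored as a string, per the schema
    "http_response_body_bytes": 10918,   # stored as a long, per the schema
}

print(json.dumps(sample_event, indent=2))
print("missing:", sorted(missing_columns(sample_event)))
```

Note that the status code is a string while the body size is a number, mirroring the column types chosen above.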

After you save the schema, the UI displays the API endpoint where you can push individual events. Copy that endpoint URL and save it! You’ll use it later to push logs to the table.
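If you’d like a quick sanity check before wiring up Logstash, the sketch below builds the same kind of request that the Logstash pipeline in this tutorial sends: an HTTP POST with a JSON body and a Bearer token in the Authorization header. The function names are my own, and I’m assuming the endpoint accepts one JSON event per request, as the pipeline configuration implies. Treat it as a rough sketch rather than the official way to call the API.

```python
import json
import urllib.request


def build_event_request(endpoint: str, token: str, event: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one JSON event."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Same header the Logstash http output sets in this tutorial.
            "Authorization": f"Bearer {token}",
        },
    )


def push_event(endpoint: str, token: str, event: dict) -> int:
    """Send one event and return the HTTP status code of the response."""
    request = build_event_request(endpoint, token, event)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `push_event` with your table’s endpoint URL, your access token, and an event that includes `__time` should return a success status if everything is set up correctly. `urlopen` raises `urllib.error.HTTPError` on 4xx and 5xx responses, which is handy for spotting an expired or mistyped token.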

Set up Apache HTTP Server

If you don’t already have a working web server with logs you’d like to use, you can follow this section to get a server working on Ubuntu. For this tutorial, I installed Ubuntu 20.04.4 LTS into a VM. Then I installed the Apache server using:

$ sudo apt install apache2
$ sudo systemctl start apache2

If you already have a working server, you might just want to use the access logs from that server. This tutorial will use the default access log path on Ubuntu at /var/log/apache2/access.log.

Set up Logstash

You can download Logstash from https://elastic.co/downloads/logstash or you can use the following command:

$ wget https://artifacts.elastic.co/downloads/logstash/logstash-8.1.1-linux-x86_64.tar.gz

Untar the package:

$ tar -zxf logstash-8.1.1-linux-x86_64.tar.gz

And now the good stuff.

Navigate to the logstash-8.1.1 directory.

Create a file called my-polaris-pipeline.conf inside the config directory with the content below. Replace the EVENTS_ENDPOINT and ACCESS_TOKEN placeholders with the URL of your table’s endpoint and your access token, respectively:

input {
  # This section tells Logstash to read the access log file
  file {
    mode => "tail"
    path => ["/var/log/apache2/access.log"]
    start_position => "beginning"
  }
}

# This section parses the access log
filter {
  # Use the grok filter to parse the Apache access log here
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }

  # We need to parse the timestamp coming from the access log correctly.
  # This tells Logstash how to read the time from the access log
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }

  # And we need to flatten the fields that we're reading from the log to push them
  mutate {
    rename => {
      "[http][version]" => "[http_version]"
      "[http][request][method]" => "[http_request_method]"
      "[http][request][referrer]" => "[http_request_referrer]"
      "[http][response][status_code]" => "[http_response_status_code]"
      "[http][response][body][bytes]" => "[http_response_body_bytes]"
      "[url][original]" => "[url]"
      "[source][address]" => "[source_address]"
      "[host][name]" => "[host_name]"
      "[user_agent][original]" => "[user_agent]"
    }

    # Polaris requires a __time field. We store the parsed event timestamp in __time
    copy => { "@timestamp" => "__time" }

    # We remove everything else that we don't need
    remove_field => [ "http", "event", "process", "source", "host" ]
  }
}

# This section tells Logstash where to send the data
output {
  # Use the http output plugin
  http {
    # Set the URL to your table's event endpoint here by replacing EVENTS_ENDPOINT
    url => "EVENTS_ENDPOINT"
    http_method => "post"
    format => "json"

    # This is where we'll use the access token. Replace ACCESS_TOKEN below
    # with your access token from the downloaded file.
    headers => {
      "Authorization" => "Bearer ACCESS_TOKEN"
    }
  }
}
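If the grok, date, and mutate steps feel a bit opaque, here is a rough Python approximation of what they do to one line of the access log. This is not how Logstash implements it: the regular expression is my own simplified stand-in for the COMBINEDAPACHELOG pattern, and it will not match malformed or empty request lines that the real pattern handles. The output field names follow the flattened names used in the pipeline.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Simplified stand-in for grok's COMBINEDAPACHELOG pattern (an approximation).
COMBINED_LOG = re.compile(
    r'(?P<source_address>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<http_request_method>\S+) (?P<url>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<http_response_status_code>\d{3}) (?P<http_response_body_bytes>\d+|-) '
    r'"(?P<http_request_referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_access_log_line(line: str, host_name: str) -> Optional[dict]:
    """Turn one combined-format log line into a flat event, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    fields = match.groupdict()

    # Equivalent of the date filter: parse "dd/MMM/yyyy:HH:mm:ss Z" and normalize to UTC.
    parsed = datetime.strptime(fields.pop("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
    fields["__time"] = parsed.astimezone(timezone.utc).isoformat()

    fields["host_name"] = host_name

    # Apache writes "-" for an absent referrer or an empty response body.
    size = fields["http_response_body_bytes"]
    fields["http_response_body_bytes"] = None if size == "-" else int(size)
    if fields["http_request_referrer"] == "-":
        fields["http_request_referrer"] = None
    return fields


line = '203.0.113.7 - - [01/Apr/2022:14:34:56 +0200] "GET /index.html HTTP/1.1" 200 10918 "-" "curl/7.68.0"'
print(parse_access_log_line(line, host_name="web-1"))
```

The sample line and host name are made up. The point is simply that each log line becomes one flat record with a `__time` value, which is the shape the table expects.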

From the logstash directory, start Logstash as follows:

$ bin/logstash -f config/my-polaris-pipeline.conf --config.reload.automatic

Navigate to your Polaris environment and open your table detail to see your log data flowing!

Use cURL to run HTTP requests to generate some Apache access log entries. For example:

$ curl http://localhost
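A single request makes for a dull chart, so you may want a bit more variety in the log. The following Python sketch requests a handful of paths, including one that should not exist, so that the access log picks up more than one status code. The paths are hypothetical and the helper names are my own; adjust the base URL if your server isn’t on localhost.

```python
import urllib.error
import urllib.request


def request_urls(base_url: str, paths: list) -> list:
    """Join a base URL with each path, avoiding a doubled slash."""
    return [base_url.rstrip("/") + path for path in paths]


def fetch_status(url: str) -> int:
    """Request a URL and return its HTTP status code, including 4xx and 5xx codes."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on error statuses; the code is still what we want to log.
        return error.code


def generate_traffic(base_url: str, paths: list) -> dict:
    """Hit every URL once and return a mapping of URL to status code."""
    return {url: fetch_status(url) for url in request_urls(base_url, paths)}
```

For example, `generate_traffic("http://localhost", ["/", "/index.html", "/does-not-exist"])` should add a few entries to the access log, with the last path most likely producing a 404 on a default Apache install.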

Explore data with Polaris analytics

Let’s create a data cube with Polaris to visualize our data:

  1. Click Data cubes under Analytics in the left navigation.
  2. Click New data cube on the top right.
  3. Select your table from the drop-down and click Next: Create data cube.
  4. You don’t need to modify the data cube settings. Just click Save.

Now you can access your new data cube to explore your data:

This blog won’t go into too much detail about visualizations. For that, I recommend reading through the documentation for Pivot, the Imply Enterprise data visualization tool. For now, let’s try a couple of things.

First, let’s break down the records by status code. Drag and drop “Http Response Status Code” from the left panel under Dimensions to the Show bar:

Interesting. I didn’t even try to generate a 408 response. I wonder what that is. Click on the 408 response and select Filter:

Now let’s view that record directly. Click the visualization selector, currently Table on the top right, and choose the Record view:

Cool! You can now use Polaris’s built-in analytics capabilities to slice and dice the Logstash data and build dashboards.

In this blog I’ve shown you how to do a basic configuration of Logstash to parse Apache HTTP Server access logs, push them to Polaris, and then analyze them with Pivot.

Learn more at https://docs.imply.io/polaris. Start your free trial at https://imply.io/polaris-signup.