How Druid Facilitates Real-Time Analytics for Mass Transit

Sep 21, 2023
William To

In a world where both urban populations and average temperatures are on the rise, mass transit has become more important than ever. Public transportation, in the form of subways, trains, ferries, and buses, can help reduce car usage, cutting down on congestion and pollution. Ultimately, mass transit should provide a more livable, walkable experience for residents and commuters alike.

In order to optimize the performance of existing mass transit systems (and lay the groundwork for new ones), authorities need accurate, real-time analytics. Examining and consolidating transit data provides valuable insights into usage patterns, revenue streams, and service effectiveness, enabling authorities to adapt schedules, assess route efficiency, and adjust staffing and fares.

Alongside performance monitoring, IoT devices for mass transit can also be used for predictive maintenance, signal control, and passenger communication. Leveraged effectively, IoT analytics can enhance the reliability, safety, and efficiency of public transportation systems, benefiting drivers, maintenance workers, and passengers alike.

A visualization of how IoT devices could be utilized for bus routes.

However, challenges remain. Data needs to be continuously available, especially for safety-critical functions like communication-based train control (CBTC) or emergency notifications. Outages will cripple transit, leave passengers out of the loop, and impact safety, rider trust, and agency reputations for the worse.

Further, when crises arise, multi-disciplinary teams (of engineers, analysts, and others) have to rapidly explore data in a versatile manner in order to discover the underlying cause of an issue (such as a signal breakdown) and respond accordingly. Speed at scale is vital, as even medium-sized mass transit systems may have tens of buses or subways making their way through hundreds of stops, with each vehicle generating many events each second.

Why Druid for public transportation analytics?

Scalability. Druid can elastically scale to keep pace with demand, accommodating peak rush hours and holiday travel as well as quiet, off-peak times (such as weekend mornings).

Availability. Mass transit rarely stops—some systems run lines twenty-four hours a day, seven days a week. Druid ensures that data will also be available, as any data interruptions can lead to travel delays or worse.

High performance under load. From applications retrieving subway arrival times to analysts slicing and dicing data to identify and resolve issues, Druid can execute queries in milliseconds—regardless of the rate of queries per second or the number of users.

Customer example: An analytics provider for mass transit

This company offers an intelligent analytics platform for over 100 public transit networks, serving cities like Miami, Oakland, and Philadelphia. Their solution provides transit planners with insights on performance and historical trends, ingesting and analyzing data from 25,000 vehicles and millions of passengers. Users can calculate the average number of early, on-time, and late arrivals or departures.

Previously, this company built their product on PostgreSQL. At first, data (in JSON format) was loaded into Amazon S3 buckets in batches, and were then run through a Python process to uncover GPS data. Afterwards, the data was ingested into PostgreSQL for storage and queries, and finally visualized with Superset.

Before and after architecture diagrams.

However, as the company’s user base grew, they found that PostgreSQL was both expensive (averaging almost $200,000 annually) and unworkable. PostgreSQL could not scale elastically, retrieve query results rapidly, or provide high availability. In addition, PostgreSQL was limited to 120 TB of data—and it was impossible to scale beyond this hard cap.

Unfortunately, this ceiling also made it impossible for this company’s sales teams to put together feature walkthroughs for potential customers. Because customers tended to sign multiyear contracts for services, they needed to see and experience the platform. This limitation ended up slowing sales and ultimately, the company’s growth.

PostgreSQL was also unstable and unreliable, creating downtime for applications. Without any means to monitor PostgreSQL, the company could not fix problems before they escalated (or even preempt issues); sometimes, they only learned about problems from customers.

As a result, the company switched to Imply’s distribution of Druid, using a single cluster to scale and serve nearly 500 mass transit organizations. This led to a sixfold decrease in operating costs from PostgreSQL—now, the team’s expenses were only about $30,000 a year.

The company also utilizes Imply Clarity for performance monitoring. Despite Druid’s high reliability, featuring its own deep storage layer for both scaling and continuous backup, this company preferred to have the ability to visualize the health and performance of their Druid clusters. Druid also facilitated several internal use cases, specifically proofs of concept (POC) demonstrations for sales teams, and allowed other company stakeholders to better address customer requirements. 

Another key benefit of choosing Druid was future proofing. This company plans to transition from batch to streaming data, delivered via Amazon Kinesis. Druid provides a built-in Kinesis indexing service that can read events in native Kinesis formats, such as shard and sequence numbers, while providing exactly-once ingestion for both data consistency and reliability. Today, customers like Lyft, Ibotta, and Twitch incorporate both Kinesis and Druid in their analytics stack. 

Other blogs you might find interesting

No records found...
Nov 14, 2024

Recap: Druid Summit 2024 – A Vibrant Community Shaping the Future of Data Analytics

In today’s fast-paced world, organizations rely on real-time analytics to make critical decisions. With millions of events streaming in per second, having an intuitive, high-speed data exploration tool to...

Learn More
Oct 29, 2024

Pivot by Imply: A High-Speed Data Exploration UI for Druid

In today’s fast-paced world, organizations rely on real-time analytics to make critical decisions. With millions of events streaming in per second, having an intuitive, high-speed data exploration tool to...

Learn More
Oct 22, 2024

Introducing Apache Druid® 31.0

We are excited to announce the release of Apache Druid 31.0. This release contains over 525 commits from 45 contributors.

Learn More

Let us help with your analytics apps

Request a Demo