Jun 29, 2022
When Streaming Analytics… Isn’t
Streams are the new normal. When you talk or text on a mobile phone, it’s transmitted as a data stream. Everything on the Internet is transmitted as a data stream. But not all data is yet in streams.
The mainframe era was built around the idea of batch processing, and there is a lot of batch still with us. As the name suggests, batch pulls chunks of data together and processes them as a group. Have you ever wondered why a check deposited at a bank after 2pm doesn’t show up in your account until the next day? It’s a relic when bank mainframes could only process a single batch per day, which would begin at 2pm.
Many transactions have moved to microbatches, with batch processing running every few seconds or even faster, but the needs of Internet scale applications need every event to have both immediate processing and reliable delivery, data streams are becoming the default technology to move data. Apache Kafka™ has emerged as the preferred event store and stream processor, challenged by Amazon Kinesis and Azure Event Hub.
Whenever there are transactions, there is a need for analytics! With a few exceptions, nearly all databases are designed for batch processing, which leaves three options for stream analytics.
One option is to treat everything as database tables. You “land” the streams into a file, then ingest the file as a batch. This is the technique used by batch analytics databases such as Snowflake, Amazon Redshift, Teradata Vantage, and Google BigQuery. There’s a good article on this technique by Avi Paul on Medium.
This works, if you’re willing to install and configure connectors and live with the latency of waiting for the land-batch-ingest cycle. Often, there is also a challenge of ensuring that all events end up in the database exactly once, as noted in this Github discussion of Snowflake’s Kafka connector.
Batch processing is great for reporting, especially long-running heavyweight queries. It isn’t stream analytics – it lacks rapid ingestion and exactly-once semantics.
Another option is to treat everything as a stream. Confluent offers ksqlDB to analyze Kafka topics and Amazon offers Kinesis Analytics to, unsurprisingly, analyze Kinesis streams. These tools allow queries directly against streams, using SQL to treat each stream as a table and each header as a column.
Lots of developers love ksqlDB! It’s easy to use, and is probably the easiest way to quickly use SQL to extract information from Kafka. It’s also an easy way to enrich and enhance data in Kafka, at a much lower price than using tools based on Apache Spark.
While stream processing is great for simple analytics and for enriching stream data, it isn’t stream analytics. Kafka and Kinesis are powerful tools, but they aren’t databases, and lack data checkpointing, interactive data conversations, and the ability to easily compare real-time streams with historical data.
For example, Natural Intelligence is the creator of xMatch, technology that matches high-intent buyers with leading brands, providing over 45 million customer referrals each year. To power xMatch, data streams from many partners are ingested, enriched and enhanced, then compared to a large database of historical data. Over 60 processes are involved in ingesting, enriching, and making the data ready for analysis.
The team at Natural Intelligence tried to do this with stream processing … but it didn’t work at their scale. Like others who need stream analytics, they require an ability to combine batch and stream, along with high concurrency and subsecond query response.
Fortunately, there is a third option: real-time analytics databases. Starting around 2010, separate teams around the world developed new databases, designed to combine intensive stream data with historical batch data, while enabling subsecond performance for queries of large data sets. The first of these new databases to become available for public use is Apache® Druid, which at its initial release was able to ingest a billion events in under a minute and query them in under a second – and now provides subsecond response on multi-Petabyte stream + batch data sets.
Druid provides stream analytics and batch analytics together, while enabling high-concurrency interactive conversations with data at scale. It’s leading the next generation of modern analytics applications, using data to drive actions.
Natural Intelligence is just one example, now able to combine many real-time streams and Petabyte-scale historical batch data into a single Druid database, which they query with subsecond response and drive meaningful, high-performing traffic at scale.
Streams are the new normal. Stream analytics is the next normal. If you are a data professional, you need to understand stream analytics.
If you are building a modern analytics app, or you just want to learn more about stream analytics, take a look at this whitepaper on Druid Architecture and Concepts. If you want to try it for yourself, you can get a free 30-day trial of Druid-as-a-Service from Imply Polaris.
© 2022 Imply. All rights reserved. Imply and the Imply logo, are trademarks of Imply Data, Inc. in the U.S. and/or other countries. Apache Kafka, Apache Druid, Druid and the Druid logo are either registered trademarks or trademarks of the Apache Software Foundation in the USA and/or other countries. All other marks and logos are the property of their respective owners.