Confluent, Kafka, Druid, and Flink: The Future of Streaming Data with Kai Waehner

Aug 08, 2023
Reena Leone

Apache Kafka® is a streaming platform that can handle large-scale, real-time data streams reliably. It’s used for real-time data pipelines, event sourcing, log aggregation, stream processing, and building analytics applications. Apache Druid® is a database designed to provide fast, interactive, and scalable analytics on time-series and event-based data, empowering organizations to derive insights, monitor real-time metrics, and build analytics applications. Naturally, these two things just go together and are often both key parts of a company’s data architecture. Confluent is one of those companies. On this episode, Kai Waehner, Field CTO at Confluent, walks us through how they use Kafka and Druid together, explains where Apache Flink fits into the mix, and shares insights and trends from the world of data streaming.

Dive into the world of data streaming and analytics applications with Kai Waehner, Field CTO at Confluent. We discuss the pivotal role of Apache Kafka and Apache Druid in revolutionizing the industry. As a leader in event streaming solutions, Confluent leverages Kafka as the foundation of its platform, augmenting it with data governance, security, and support services. As a Druid user and an Imply customer and partner, Confluent uses Druid for both external and internal applications, including its observability platform. Built on Druid, that platform provides essential usage data for cloud billing and automates control plane workflows. Externally, Confluent uses Apache Druid to build robust monitoring dashboards and APIs that let customers track their infrastructure efficiently.

In addition, we explore the synergy of Kafka, Druid, and Apache Flink in real-time analytics, with Kafka as the data hub, Flink for stream processing, and Druid excelling in time series data analysis. This episode also breaks down the paradigm shift towards real-time data processing and the growing importance of a cloud-first strategy in business logic prioritization and infrastructure management, ultimately highlighting the paramount importance of event streaming platforms and real-time analytics in driving innovation across industries.

Listen to this episode to learn more about:

  • The use of Apache Druid and Apache Kafka in Confluent’s data streaming platform.
  • The various applications of Apache Druid within Confluent, including monitoring dashboards, data lineage, and infrastructure monitoring.
  • The benefits of choosing Apache Druid for handling large-scale data growth and achieving real-time observability at scale.
  • Real-world use cases of Kafka and Druid together, such as condition monitoring and predictive maintenance in manufacturing and network monitoring in the telecommunications industry.
  • The complementary nature of Apache Flink and Apache Druid in real-time analytics use cases, with Flink being strong in stream processing and preprocessing data, while Druid excels in analyzing time series data.


About the Guest

Kai Waehner is the Field CTO at Confluent. His current work focuses on the open source project Apache Kafka to build mission-critical, scalable event streaming infrastructures for tech giants, modern internet startups and traditional enterprises.

He is a technology evangelist who cultivates key customer and partner relationships, a frequent keynote speaker at events, and a prolific thought leadership author. Kai is a trusted advisor to both existing and potential customers and partners.

His main area of expertise lies within the fields of Big Data, Advanced Analytics, Machine Learning, Deep Learning, Integration, Microservices, BPM, Cloud / Hybrid Architectures, Internet of Things, Industrial IoT, Blockchain, Augmented Reality, and Programming Languages such as Java, Scala, Groovy, Go, Python and R.
Kai regularly writes about new technologies in articles and conference talks, both on his personal blog and for Confluent.

Transcript

[00:00:00.330] – Reena Leone

Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. And today on the show we are talking about Confluent, a complete event streaming platform and fully managed Apache Kafka service. They were also founded by the same folks who created Kafka. And in case you didn’t know, Apache Kafka is a streaming platform that can handle large-scale, real-time data streams reliably. It’s used for things like real-time data pipelines, event sourcing, log aggregation, stream processing, and building analytics applications. And if you’ve listened to this show before, you know that Apache Druid is a database designed to provide fast, interactive, and scalable analytics on time series and event-based data. And it lets organizations derive insights, monitor real-time metrics, and build analytics applications. So naturally, these two things go together and are often key parts of a company’s architecture. Confluent is one of those companies that uses Druid and Kafka together. To walk us through how they use Druid and Kafka and share insights and trends from the world of data streaming,

[00:01:13.710] – Reena Leone

I’m joined by Kai Waehner, field CTO at Confluent. Kai, welcome to the show.

[00:01:19.030] – Kai Waehner

Hey Reena, great to be here.

[00:01:20.830] – Reena Leone

So I like to start off every show with a little bit about my guests and who they are. And you have had quite a career as a technology evangelist. Can you give me a little bit of a synopsis of your journey and how you got to where you are today?

[00:01:33.260] – Kai Waehner

Yeah, absolutely. So after university, I started as an independent consultant. I already loved taking a look at many different technologies at the same time and working with many different customers. And this really continued over the last, I guess it’s now 15 years or so. And then after the consulting, I changed to work for a software vendor, and then into all the different roles with different tasks like presales, consulting, and marketing. And that’s still what I’m doing today. The main focus is customer facing. So I talk to a lot of customers, but also to partners and to researchers. And this, in combination with also doing blog posts and podcasts and presentations, is really a good mix of doing a lot of different things. So I’m not the deepest engineer who knows technologies like Druid or Kafka very deeply at the API level, but I know a broad spectrum of the use cases and architectures, and this is what I’m discussing with customers or with partners. This is my daily business and I still enjoy it a lot.

[00:02:29.900] – Reena Leone

That is so cool. I feel like I have kind of just started down that path. So I’m really excited to talk to you today. So let’s kind of dive in. Confluent was founded by the creators of Apache Kafka. So obviously that’s a key component of your data architecture. But let’s quickly explain what Confluent’s offerings are and kind of what the business model is.

[00:02:51.510] – Kai Waehner

Yes, so this is really important, and this is also where people often don’t understand the details. Right? So first of all, yes, Apache Kafka is the foundation of our platform, and Confluent was founded by the inventors of Apache Kafka. They got venture capital and then, a few years later, now we are a public company. But we provide in the end much more than just Kafka. So we do data streaming, and that is Kafka at its core for messaging and storage and data integration. But there are so many things around that: most of our customers need answers around data governance, security, mission-critical support, best practices, and so much more. So it’s really a much broader spectrum than just talking about Kafka; it’s the entire data streaming spectrum, end to end, to build the project successfully. And with that, our business model in the end is that we have, on a very high level, two products. We have Confluent Cloud, which is our serverless offering, where it’s fully managed and fully supported by us. And we have Confluent Platform, which is in the end the self-managed offering, which you can deploy everywhere: in your own cloud VPC, or on premise, or even at the edge, like in a retail store or in a factory.

[00:04:01.750] – Kai Waehner

And on top of that, you can buy mission-critical support and consulting to help you with the projects. This is, I think, really important to explain in the beginning: it’s so much more than just a hosted Kafka service. And in the end, our business model is to be the best company in data streaming. And I think we’re doing a pretty good job here being the leaders.

[00:04:22.060] – Reena Leone

I would definitely say you’re doing a pretty good job. I mean, I saw footage from Current last year, and Current’s coming up again this year. You do a lot of events and you have a very strong community around Kafka and what Confluent is doing. But you’re here today because Confluent also uses Druid for external applications, and I think internal as well. So can we talk a little bit about what Confluent is doing with Kafka and Druid?

[00:04:49.040] – Kai Waehner

Yes, absolutely. And that’s right. We are using Druid as a time series database for several different applications. So let’s first start with the public or external APIs and products where we use it. It’s not like the end user or customer directly sees a Druid API; we use it under the hood to build the APIs and products. Let me explain with a few examples. So, on the one side, again, we have Confluent Cloud, which is our fully managed offering of data streaming with Kafka, with connectors, with security, and so on. On the customer side, however, the customer also needs monitoring dashboards. The customer needs metrics APIs so that they can really monitor what’s going on, because it’s great that we provide the service offering, but the customer still needs to know what’s going on so that they can double-check that everything is working well. And these are some of the products where the customer can directly integrate these APIs into their own products. This could be something like another cloud service like Datadog, or this could be a self-built dashboard. But we provide these kinds of APIs that you can use to monitor the infrastructure that the customer uses.

[00:05:58.870] – Kai Waehner

So this is one point. Another point is really where the customer uses products. And this is, for example, Confluent Stream Lineage. In the end, this is a product for data lineage where you have end-to-end visibility into how the data is flowing through the system. So, for example, from one Kafka producer going through the system, maybe also replicated to another region, and then consumed by one or more consumers, and guaranteeing that this data is flowing the right way. And we provide Stream Lineage as an API, because we do everything API first, but also with great UIs and visualizations that you can use as an operations or monitoring team. So this is another product we sell, in the end, as part of our Data Governance suite, where the core foundation is free and then you can buy more advanced products on top of that. And then there’s a third external product which I want to mention, which is Confluent Health+. This is also where we have a joint success story which is publicly available, where you can read more details. In contrast to the other examples, which are part of our serverless cloud offering, Health+ in the end is used to monitor Confluent Platform.

[00:07:09.920] – Kai Waehner

So again, Confluent Platform is the data streaming infrastructure the customer deploys in their own VPC, or on premise, or at the edge. But we still help them with alerting and notifications, so that we help them monitor their own infrastructure for data streaming, because we have all the expertise and best practices. So Health+ runs in our cloud offering, using Apache Druid under the hood, to monitor the on-premise deployments of Confluent Platform. And so these are a few different applications where Apache Druid is used under the hood to provide external APIs and products to our customers.
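
To make the “Druid under the hood” pattern concrete, here is a minimal sketch of a customer-facing metrics lookup backed by a Druid SQL query. The datasource and column names (cluster_metrics, bytes_in) are hypothetical; the /druid/v2/sql HTTP endpoint and parameterized queries are standard Druid features.

```python
import requests

# Where the Druid router listens; an assumption for this sketch.
DRUID_SQL = "http://localhost:8888/druid/v2/sql"

def cluster_throughput(cluster_id: str, hours: int = 24) -> list:
    """Per-minute ingress bytes for one customer cluster (hypothetical schema)."""
    query = f"""
        SELECT TIME_FLOOR(__time, 'PT1M') AS minute,
               SUM(bytes_in) AS ingress_bytes
        FROM cluster_metrics                 -- hypothetical datasource
        WHERE cluster_id = ?
          AND __time > CURRENT_TIMESTAMP - INTERVAL '{hours}' HOUR
        GROUP BY 1
        ORDER BY 1
    """
    resp = requests.post(DRUID_SQL, json={
        "query": query,
        # Druid SQL supports dynamic parameters for the '?' placeholder.
        "parameters": [{"type": "VARCHAR", "value": cluster_id}],
    })
    resp.raise_for_status()
    return resp.json()

# Example: feed a dashboard widget or a public metrics API response.
# print(cluster_throughput("lkc-abc123"))   # hypothetical cluster ID
```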

[00:07:47.530] – Reena Leone

So I’m seeing kind of a key theme here around observability and operational visibility. And I know internally, I believe Confluent built an internal-facing observability application on Druid. Can you talk me through that one a little bit?

[00:08:03.320] – Kai Waehner

Yeah, absolutely. So the customer-facing applications are one thing, but on the other side, of course, we need to monitor our own infrastructure. And actually, as you mentioned, a key theme is observability, and I think this is definitely one of the sweet spots of Apache Druid and why it’s so successful. So inside Confluent, we use Druid to provide usage data for cloud billing and to perform ad hoc diagnostic queries. This is obviously based on big data sets, because we operate a lot of clusters for our customers, and the insights from Druid are also used internally within automated control plane workflows, like Kafka cluster shrink and expansion. This is really important because, again, our service is not just hosted Kafka; it’s really an automated service where all these things are executed with DevOps and these kinds of principles. And this is only possible if you have the observability and monitoring in place for big data sets in real time, and you can take action on that, which is also normally scripted, because for thousands of clusters you cannot do that manually. But this is really how we also use Apache Druid under the hood.

[00:09:11.570] – Kai Waehner

And maybe for the show notes, we can also add a blog post where some of our engineers went into much more detail how this implementation looks from an architecture and IT perspective.
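
Since Kai points to the engineering blog post for the real architecture, here is only a toy sketch of the decision loop he describes: query Druid for recent utilization, then let a script, not a human, decide whether a cluster should shrink or expand. The datasource, column, thresholds, and action names are all made up for illustration.

```python
import requests

DRUID_SQL = "http://localhost:8888/druid/v2/sql"  # assumed router address

def avg_cpu_last_hour(cluster_id: str) -> float:
    """Average CPU utilization over the last hour (hypothetical schema)."""
    rows = requests.post(DRUID_SQL, json={
        "query": """
            SELECT AVG(cpu_utilization) AS avg_cpu
            FROM cluster_metrics          -- hypothetical datasource
            WHERE cluster_id = ?
              AND __time > CURRENT_TIMESTAMP - INTERVAL '1' HOUR
        """,
        "parameters": [{"type": "VARCHAR", "value": cluster_id}],
    }).json()
    return rows[0]["avg_cpu"]

def reconcile(cluster_id: str) -> str:
    """Scripted decision instead of a human one; thresholds are illustrative."""
    cpu = avg_cpu_last_hour(cluster_id)
    if cpu > 0.80:
        return "expand"   # e.g. trigger a workflow that adds capacity
    if cpu < 0.20:
        return "shrink"   # e.g. trigger a workflow that removes capacity
    return "no-op"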

[00:09:21.190] – Reena Leone

Yes, that blog post is fantastic; they really dive into how Druid is being implemented at a much more technical level. I mean, we kind of talked about it a little bit through these examples of being able to handle large-scale data and multiple clusters. But what made Confluent choose Druid? Were there any other databases that were evaluated? Were there challenges that you were going through that made you look for a new database?

[00:09:50.290] – Kai Waehner

Yeah, absolutely. I mean, there are always good reasons why you choose such a technology. And by the way, this is also covered in the blog post we just mentioned and will link to; that’s where it’s explained in much more detail. But on a high level, in the end, the story looks like it so often does. When you’re a startup like Confluent was several years ago, when we started building our cloud product, well, the first thing is you choose a NoSQL database. And a NoSQL database is great for many different problems, and in the beginning, maybe also for analyzing time series data. But with our growing customer base and the scale of the data, we had challenges there: we could not keep up with the data growth and solve these problems, for really having the observability in real time, even at scale. And this is, in the end, why the engineering team had to reevaluate. With that, they took a look at different databases and platforms, and in the end the evaluation concluded that Apache Druid is the right solution for that. And as you can hear today, and also see in these other blog posts and case studies, I think it was the right decision, because it still works very well now, even at our much, much bigger scale today, compared to a few years ago when we started with Druid.

[00:10:59.880] – Reena Leone

I don’t know what came first, but I’m sure that native Kafka integration didn’t hurt.

[00:11:05.110] – Kai Waehner

Yeah, that’s right.
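
For readers curious what that native integration looks like: Druid runs a Kafka ingestion supervisor that consumes a topic continuously, so events become queryable seconds after they are produced. Below is a rough sketch of submitting one; the topic, datasource, and column names are made up, and a real spec would need tuning for production.

```python
import requests

# Rough shape of a Druid Kafka supervisor spec; field values are assumptions.
spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "cluster_metrics",   # hypothetical datasource
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["cluster_id", "metric_name"]},
            "granularitySpec": {
                "segmentGranularity": "hour",
                "queryGranularity": "minute",
            },
        },
        "ioConfig": {
            "topic": "telemetry.metrics",      # hypothetical topic
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# The router proxies the Overlord's supervisor API.
resp = requests.post("http://localhost:8888/druid/indexer/v1/supervisor", json=spec)
resp.raise_for_status()
print(resp.json())   # returns the supervisor ID once it starts
```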

[00:11:06.470] – Reena Leone

So does the way that you deal with all this data help drive product development for Confluent? Because you did mention that you’re dealing with very large scales of data coming in.

[00:11:17.320] – Kai Waehner

Yeah, exactly. And so there are always these different perspectives. By now, we are a big company, right? We have many different engineering and application teams. On the one side, we talked about the different products we provide for our customers. But on the other side, also for internal product development, you always need to know what’s going on. And this is really the difference from what’s often called cloud washing, where you just deploy a few servers in the cloud, run them, and then make a few decisions, like adding a broker at some point in time. Really, you want to do this more in a cloud-native way, which means it’s elastic and you can also handle a multi-tenant infrastructure. And this is really critical, not just for the customer, but also for your own business model, to have some margins so that you can also earn money over time. And this is only possible if you know what’s going on in your infrastructure. So for the product development of our internal infrastructure, it’s key that we know what’s going on end to end and have observability.

[00:12:20.670] – Kai Waehner

And this is, in the end, why we built a data-driven pipeline, and here is where Apache Druid is a key piece of it, so that we don’t need to make human-driven decisions. Because again, in the beginning, when you start a new cloud service, this might work well for the first ten customers. But now, where we have thousands of customers, you need to automate this. And this is exactly where Druid helps.

[00:12:42.090] – Reena Leone

Would you say that high cardinality is one of your top priorities for a database? I see that come up a lot with Druid as well.

[00:12:49.460] – Kai Waehner

Yeah, absolutely. I think this is also one of those questions of whether a normal, traditional SQL or NoSQL database works well, or whether you should choose a database with sweet spots like that. And it’s the same for us. Of course, on the other side, we don’t use Apache Druid for every application we build, right? We choose it for the right applications, and that makes total sense. With that, the answer is always: it depends on the use case. For some use cases you have high cardinality; for others you don’t. And I can definitely say that for several workloads, when you do things like log analytics for end-to-end observability, you have medium or high cardinality, where a traditional, let’s say, Oracle or MySQL database is maybe not the best option, because either it doesn’t scale that well or it simply doesn’t provide the right features for analytics. So for these kinds of use cases, Apache Druid definitely has some sweet spots for us.
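
To make the high-cardinality point concrete, this is the shape of query where a database like Druid earns its keep: grouping raw log events while counting a dimension with millions of distinct values. APPROX_COUNT_DISTINCT is a real, sketch-backed Druid SQL function; the datasource and columns here are invented.

```python
import requests

query = """
SELECT TIME_FLOOR(__time, 'PT5M') AS bucket,
       service,
       APPROX_COUNT_DISTINCT(trace_id) AS unique_traces,  -- fast even at high cardinality
       COUNT(*) AS events
FROM service_logs                                         -- hypothetical datasource
WHERE __time > CURRENT_TIMESTAMP - INTERVAL '6' HOUR
GROUP BY 1, 2
ORDER BY events DESC
LIMIT 20
"""

rows = requests.post("http://localhost:8888/druid/v2/sql",
                     json={"query": query}).json()
for r in rows:
    print(r["bucket"], r["service"], r["unique_traces"], r["events"])
```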

[00:13:46.890] – Reena Leone

So in last year’s 2022 Druid Summit keynote, Matt Armstrong, also from Confluent, said something really interesting, and as a thought leader in the space, I kind of wanted to get your take on it. He said “Druid has an unconstrained future,” which I feel is pretty high praise for any technology. Do you agree?

[00:14:08.380] – Kai Waehner

Yeah, I mean, it also seems that our engineering team really loves what they’re doing with the technology, right? From my perspective, I’m a little bit higher level; I’m more customer facing and not so much on the engineering side. But I think the trends are clear everywhere: in our internal development and organization, but also customer facing, in what our customers build. And if we think about a few trends in the market, like cloud adoption, by now it’s really insane. Even in very traditional industries like insurance or financial services, so many of our customers have a cloud-first strategy. This doesn’t mean that they migrate away from their mainframe tomorrow, but new applications are very often built in the cloud. And in parallel to that, because they need to innovate and have elastic scale, they automate things; they use DevOps and similar principles. And with all this in mind, from a technology side, the customer expectations are also changing, because we have digitalization and everybody has smartphones and location-based services. And these are, I think, the reasons why technologies like Druid have a great future, because there is more and more demand for end-to-end observability in real time.

[00:15:23.530] – Kai Waehner

I mean, this is also part of our story with data in motion, with using data streaming for building applications, and in the same way for the monitoring, because you can only provide a great customer experience if you know what’s going on. And this is why I think more and more of our customers, like us as a software-as-a-service company, go more and more into analytics with real-time observability. So yeah, I fully agree. And also, the data is growing everywhere, right? And with that, a traditional solution doesn’t work anymore. So people are adopting technologies like Druid more and more.

[00:15:57.680] – Reena Leone

That’s what I’ve been saying on this show. Even if you’re not dealing with petabytes of data on that scale now, you probably will be in the near future. Data is only going to continue to grow. But actually, that’s a pretty good segue into my next question for you, because you are an expert in big data and analytics and as you mentioned, you get to be out there talking to folks about time series analytics and streaming data. Now, we kind of talked about how Confluent is using Druid but in combination with Kafka. But are there any other real world use cases that you’ve seen with folks using Kafka and Druid together?

[00:16:37.300] – Kai Waehner

Yeah, absolutely. And first of all, the interesting thing is we see this across different industries. In general, people often talk about the Internet of Things and sensor or IoT data, which is obvious for time series analytics because, in the end, it’s interfaces that continuously generate data. And this is of course perfect for analyzing the data. We will talk about this a little bit later, but this is where the sweet spot, for many use cases, is a time series database and not the stream processor. And therefore we see, for example, many customers in manufacturing that implement a solution for condition monitoring and predictive maintenance, with data streaming as the data hub for integrating and preprocessing the data, but then ingesting it into a Druid database where you can do the real-time analytics on these data sets to find issues and to monitor the smart factory, for example. So we see that a lot. And just to give you a completely different example in the telco space: actually, one of the most famous Kafka and Druid stories I know, and one I see in the field a lot, is from Swisscom in Switzerland. They talk a lot about it in public.

[00:17:52.780] – Kai Waehner

And maybe we can also add another link to the show notes, where, in the end, they built all their network monitoring with Kafka and Druid. Why do they do that? Well, because on the one side it’s high volumes of data, and on the other side they need to handle that in real time to do incident management and root cause analysis, more or less in real time. And these are great examples where it only works if you process the data with a scalable platform which is also capable of processing in real time. I think these are the sweet spots of Kafka and of Druid together, and this is why so many people are adopting them for these kinds of use cases.
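
As a sketch of the producer side of that condition-monitoring pattern: machines publish readings to a Kafka topic, which a Druid supervisor (like the one sketched earlier) can then ingest for real-time queries. The topic name and payload shape are assumptions; confluent-kafka is Confluent’s real Python client.

```python
import json
import random
import time

from confluent_kafka import Producer  # Confluent's Python Kafka client

producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    reading = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "machine_id": "press-17",                        # hypothetical machine
        "temperature_c": round(random.gauss(80, 5), 2),  # simulated sensor values
        "vibration_mm_s": round(random.gauss(2.5, 0.4), 2),
    }
    producer.produce("factory.sensor.readings", json.dumps(reading).encode())
    producer.poll(0)   # serve delivery callbacks without blocking
    time.sleep(1.0)
```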

[00:18:29.820] – Reena Leone

We were talking about IoT and manufacturing a couple of episodes ago, and how it’s not just a nice-to-have; it can actually be a life-or-death, really important thing when you’re dealing with safety and protocols and getting that real-time data so you understand if something goes wrong. It’s not just when we deal with, say, software, or maybe some kind of hacker attack. These can be people’s lives. This is really important stuff.

[00:18:59.090] – Kai Waehner

Yeah, absolutely.

[00:19:00.450] – Reena Leone

I’d actually like to shift gears and talk a little bit about another technology that kind of rounds out the trifecta of streaming data, which is Apache Flink. Talking to folks about their data architecture, it’s Kafka, Druid, and Flink; that kind of seems to be the way to go for streaming. Do you think that Flink and Druid complement each other in real-time analytics use cases?

[00:19:25.510] – Kai Waehner

Oh yeah, absolutely. So, first of all, indeed, what we see is that Apache Flink is really growing these days, like Kafka grew four years ago. We see this from several different statistics from the community and from the open source world, and also from the adoption in our customer base. So in a few years, we expect a similar growth for Flink to what Kafka has today. And this is also why, at Confluent, we strategically invested in Flink in our cloud service. We acquired a company called Immerok a few months ago, and with that, in the future we’ll have the same expertise and cloud products for Flink tomorrow as we have for Kafka today. And with that in mind, I really want to explain why this is definitely complementary and not competitive. However, before that, I really want to start the discussion with Kafka here, because most people use Kafka, in the end, as a data hub, whatever you call it from a technical perspective: messaging platform, ingestion layer, streaming platform, whatever. But the real sweet spot of Kafka is that you truly decouple systems. No matter if it’s big data or transactional data, each application, both on the producer and consumer side, can use its own API, its own technology, or its own software as a service.

[00:20:43.310] – Kai Waehner

This is why it’s very, very often the foundation of a microservice architecture or a data mesh. And with this in mind, you can choose per application or use case what other technologies you combine with it. So on the one side, you can do stream processing. And stream processing can be done with Kafka Streams or with Apache Flink, where Apache Flink has a lot of adoption for many use cases. The sweet spot here is that, in the end, you continuously process data, very often with stateful stream processing. For example, you create a sliding window and then, say, continuously monitor the events from a specific interface. And this is where Flink is very strong at preprocessing, with many different capabilities. You can do that for more technical applications like streaming ETL, for example, even as a preprocessor for other applications like a time series database, or you can also build real business applications with it, like a payment application or a fraud prevention application. So this is stream processing: data in motion, while the data is flowing through the streaming platform. On the other side, with Apache Druid, which is obviously a time series database, the sweet spot, as we discussed earlier in the session, is analyzing the time series data for monitoring, for observability, and similar scenarios.

[00:22:09.010] – Kai Waehner

And with that, Druid has very different capabilities for analyzing data. Therefore, in most cases, and this is really how I always recommend it, don’t start with the technology, right? You should start with the business problem and then evaluate. We are not Druid experts; we’re not selling Druid from a Confluent perspective. And this is why, sometimes, the situation is that a customer has a problem and we say: hey, you can still continue using Kafka as a data hub, but for this specific problem, you shouldn’t even try to implement it with Kafka Streams or with Flink, because this is where Apache Druid, with time series analytics, is the right technology. And this is why we then also have the fully managed integration, for example with Imply, so that you can get the data out of the box into your systems and use the right technology for the problem. And this is how we see these different applications as complementary to each other.
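
To illustrate the division of labor Kai describes, here is a hedged PyFlink sketch of the “Flink preprocesses, Druid analyzes” pattern: a sliding (HOP) window continuously aggregates raw sensor events from one Kafka topic into another, which Druid would then ingest. Topic and column names are assumptions, and the Flink Kafka SQL connector jar must be on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: raw sensor events from Kafka, with an event-time watermark.
t_env.execute_sql("""
    CREATE TABLE readings (
        machine_id STRING,
        temperature_c DOUBLE,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'factory.sensor.readings',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Sink: preprocessed aggregates, written back to Kafka for Druid to ingest.
t_env.execute_sql("""
    CREATE TABLE readings_agg (
        machine_id STRING,
        window_end TIMESTAMP(3),
        avg_temp DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'factory.sensor.agg',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Sliding window: 5-minute windows advancing every minute.
t_env.execute_sql("""
    INSERT INTO readings_agg
    SELECT machine_id, window_end, AVG(temperature_c) AS avg_temp
    FROM TABLE(
        HOP(TABLE readings, DESCRIPTOR(ts), INTERVAL '1' MINUTE, INTERVAL '5' MINUTE)
    )
    GROUP BY machine_id, window_start, window_end
""").wait()
```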

[00:23:04.670] – Reena Leone

You bring up a really good point. And I was going through your blog posts; you have a fantastic website, which I will also link to, with some really fascinating articles. And you said that data streaming is not a race, and that event-driven architectures and technologies like Kafka and Flink require a mind shift in architecting, developing, deploying, and monitoring applications. And I agree with that statement, because I feel like a lot of Druid users are maybe using more batch data than streaming data at the moment; they haven’t really reframed around that just yet, at least to start. And streaming is like a slice of the larger data pie, and it’s obviously growing, but not quite there yet. What do we need to get there? What does that shift into streaming data look like?

[00:23:57.950] – Kai Waehner

Yeah, first of all, it really depends on the use case and the company, right? But for most companies today, it’s a huge paradigm shift, because people think about databases like Oracle or MySQL where they store data at rest, or maybe in a data warehouse or a data lake with bigger data sets, and then they use processing capabilities like a SQL interface or a REST API to query data. In contrast to that, with data streaming, you really process data in motion, no matter where it’s coming from and where it’s going. And this is a completely different way of thinking about how to use data. And this is hard if you have not learned it from the beginning. And it’s even more challenging because not everything is, or will be, real time. As you said, even in the future, for some reports you can still use a business intelligence tool, right? So it’s a really hard problem. And therefore, again, to emphasize why people use Kafka: it’s not just for real-time data, and therefore it’s not just a message queue; it’s really also for providing data consistency across real-time and non-real-time layers. And this is where the shift really helps in solving this problem.

[00:25:05.120] – Kai Waehner

Because even if our customers still have data coming from some more legacy systems that are not streaming data, you can still get it through the event streaming platform into the analytics platform. And so this is what we see a lot. At Confluent, we call this the maturity model we see at our customers, where we have five phases, from a pilot project to, in the end state, a central nervous system where everything is going through Kafka. But this is really a step-by-step approach, where you choose when to make this kind of shift for some of the applications in most of the enterprise architectures. For some use cases, you already get the data into a real-time application, for example with Apache Druid, because you have the business requirement. In other cases, you can still do batch or request-response. You have the freedom of choice because Kafka decouples the systems. This is really why we see more and more adoption. But it takes time. And the great news is the technologies are ready, right? Apache Druid is scalable, like Kafka and like Flink. So it’s a journey. We help the customers, but they need to decide, at their own pace, the right problems for it.

[00:26:14.060] – Kai Waehner

So it’s not the right thing to say we use Apache Flink or Druid for every problem we have. No, you take a look at the business problem first, and then you choose Druid or Flink where it can help solve the problem.

[00:26:25.600] – Reena Leone

We’re talking about the future and shifts and how the industry works. But as someone who is out there talking to people and looking at trends in the data space: we’re halfway through 2023 already, I don’t know where this year went, and I want to do maybe a little mid-year check-in with you on streaming data trends. First of all, what predictions have you seen come to fruition?

[00:26:51.820] – Kai Waehner

Yeah, that’s a great question. And it’s funny, because I always write my predictions for the next year at the end of the year, so half a year ago I predicted, or I guess expected, a few things. And the two that really came true already are: number one, data governance is one of the most important things. That’s obviously also why, if you take a look at our products, we invest so heavily into products like stream lineage, which is built on Druid under the hood, as we talked about, because customers need end-to-end visibility into their data flows, no matter what technologies and applications they use under the hood. And that’s really crucial. In combination with that, while people talked a lot about independent services using microservices, today we all talk about data mesh for building data products and focusing on business problems, which is great. But still, with the data mesh in mind, end-to-end observability is one of the biggest problems in the enterprise architecture. And on the other side, and this is related to that, the different business units still choose their own technology to solve a problem, for different reasons, and there is not a single technology that can solve every problem.

[00:28:01.630] – Kai Waehner

We have discussed this with Flink versus Druid, for example. And therefore, one prediction from some people that I think is not coming true is that the so-called lakehouse takes over. Because a lakehouse, in the end, is the combination of a data warehouse and a data lake, and then some vendors tell you: we build your lakehouse and you can do everything with our technology. And I’m not a fan of that, because I prefer this kind of domain-driven design, where you choose the right technology for a problem, and then you can share data with others, but the others can use another technology. So it’s really more a data mesh, where the business units choose the right technology and then share data with Kafka or a similar technology. I think this is really where we are going, because this is also how you stay flexible and can get to market with a new product easily: you choose the best product, and you don’t buy one product or cloud service for every problem you have.

[00:28:58.940] – Reena Leone

I mean, not to repeatedly use this one literary reference, but as we say with Druid, there’s not one database to rule them all. It’s kind of the same principle. Okay, so, continuing with trends: have any trends in streaming surprised you this year?

[00:29:16.760] – Kai Waehner

Yeah, I would say there are two things. On the one side, I was really surprised when I actually saw the stats about Flink adoption in the last quarter. I mean, we made the decision to strategically invest in Flink some time ago, even before the acquisition of Immerok, of course, but we saw a huge adoption. And still, Flink is not the best preprocessing solution for every problem, right? But for many use cases, Flink has a lot of sweet spots, especially if you can leverage it as a fully managed cloud service so that you don’t have to operate it, which is really hard for Flink and a problem for many people. So that’s the one surprise with streaming: while there are really tens of different stream processing frameworks and cloud services on the market, by now I see the trend that Flink is taking over here for many, many use cases. Not for everything, but for many. And the second, which is a little bit more generic trend that really surprised me in the last, let’s say, twelve to 24 months, is that almost all of our customers now have a cloud-first strategy.

[00:30:21.100] – Kai Waehner

And this is really across almost all industries. Even the traditional companies, like banks and insurance companies, still operate mainframes, and some invest strategically into them. But for new use cases, most of them build at least parts of the applications in the cloud, because they want to focus on business logic and not on operating the infrastructure. And this is really the most surprising trend: not that the cloud is getting more and more successful, but that even these kinds of regulated markets are moving more and more into the cloud. By now, we have customers that are 100% cloud, even in regulated markets, like if they build a payment app, for example. And this is really exciting to me. It’s still a long journey for many companies that exist today, because it’s a hybrid story, but that’s fine, right? But this adoption of the cloud-first strategy is really, in my opinion, insane. And it’s a great thing, because we will see much more innovation this way.

[00:31:14.810] – Reena Leone

That’s awesome. I felt like there was a point, maybe in more recent history, where there was almost a shift back toward on-prem, but I’m glad that has kind of gone away now and that we’re going more cloud first, to your point, because it’s fantastic for innovation.

[00:31:33.070] – Kai Waehner

Yeah, absolutely. I mean, to be fair, you’re right. I think not everything will go to the cloud. And there are three reasons why people also deploy on prem or at the edge with data streaming: it’s for security, for latency, and for cost reasons that you at least preprocess at the edge. So these hybrid options are as normal in the future as multi-cloud scenarios. We are ready for all of that, and then the customer can decide for their use cases.

[00:31:56.360] – Reena Leone

Awesome. Well, Kai, thank you so much for joining me today. If you want to know more about Confluent and what they do, please visit https://www.confluent.io/. If you want to know more about Apache Druid and how it integrates with Kafka, please visit druid.apache.org. And if you’d like to know more about Imply, please visit imply.io. Until next time, keep it real.
