Welcome to our webinar, Introduction to Imply Cloud: Managed Apache Druid in your AWS. So now I'd like to introduce our presenter. Vadim Ogievetsky is Co-Founder and Chief Product Officer here at Imply, which was started in 2015. Prior to working at Imply, he was the user interface lead at Metamarkets, where he helped to coauthor Druid; that company was acquired by Snapchat. Before that he was working at Stanford, where he specialized in data visualization and contributed to the Protovis and D3.js projects.
So with that I'll hand the baton over to Vadim, who will walk you through Druid, Imply, Imply Cloud and show you a pretty cool demo.
Thank you. Thank you so much for the introduction and hello everyone. Thank you for joining. So in this webinar I will give a brief introduction to Druid and then we'll talk about Imply Cloud, which is our managed service in AWS that lets you spin up, maintain, and manage your Druid cluster with very little hassle.
So, first of all, who am I? I didn't realize that Rick was going to give me such a great introduction, but in short, my name is Vadim. I'm a Co-Founder and Chief Product Officer here and I've been working with big data and visualization stuff for over 10 years.
So what is Druid? If you go to the druid.io website, you'll see that Apache Druid (currently incubating) is a high-performance real-time analytics database. I just want to break down this tagline because I think it explains what Druid does pretty well.
So, high performance. What that means is really just low query latency and massive multi-tenancy. You issue a query and it comes back straightaway with very low latency. We call it sub-second: we try to get queries back to you in under a second. You can have a lot of people querying Druid at the same time, supporting use cases where a single application is issuing many queries and also many people are querying the cluster at the same time, with no performance degradation.
Real time, which just means that Druid works with streaming data. This is kind of what the world is moving to now, all your data lives in streams. You have your Kafka streams, Kinesis or what have you, and you can ingest that directly into Druid and that data becomes instantly available. So you can query it as soon as it comes in.
Analytics. Druid is a database that specializes in counting, ranking, and group by queries. Druid is an analytics database, and if you're thinking about this in the SQL world, Druid is really, really good at doing group bys and performing aggregations very quickly, and including all the data up to the second in those aggregations.
Lastly, Druid is a database. So it's not a cache, it's not some query processor or a streaming analytics processor that stores just a small sliver of your data. Druid actually stores a copy of your historical data and of data that just happened. It allows you to query over all of that data in a single query and supports a pretty normal SQL interface for actually querying it.
So the problem that we were trying to solve when we built Druid is that we were really trying to power UIs like this. In this particular case, I have these GIFs that demonstrate Pivot, which is Imply's UI that we built on top of Druid, and you'll see it some more because it's part of what you get inside the Imply Cloud offering.
But fundamentally, when people deploy Druid, they're trying to power some experience like this, where you can take a big data stream or just a big collection of data and surface it to users in a very quick and intuitive way, so that they can just have a dialogue with the data.
So every time you drag something in the UI, it keeps querying under the hood, and pretty much every product that people build with Druid looks like some variation of this. A lot of people use the Imply UI that is pictured here and some people choose to build their own UI or something like that. But fundamentally, the goal is to power an experience like this.
The key features of Druid that make this possible start with Druid being column oriented. Pretty much any kind of analytics database these days is column oriented. It just means that you can have many columns, but Druid will only read the columns that are involved in the query to answer it. So if you make a query like, give me the top, let's say, countries by revenue, it will only read the country and revenue columns; you might have hundreds of other columns and it won't even consider them.
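To make the "top countries by revenue" example concrete, here is a minimal sketch of a column-oriented read. It is not Druid's storage format; the table layout, the column names, and the `top_by_sum` helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical table stored column by column (each column is its own list).
columns = {
    "country": ["US", "FR", "US", "DE", "FR"],
    "revenue": [10.0, 4.0, 6.0, 3.0, 5.0],
    "browser": ["chrome", "safari", "firefox", "chrome", "chrome"],
    "city": ["NYC", "Paris", "SF", "Berlin", "Lyon"],
}

def top_by_sum(table, dimension, metric, limit):
    """Group by one column and sum another, touching only those two columns."""
    dim_values = table[dimension]    # the only dimension column read
    metric_values = table[metric]    # the only metric column read
    totals = defaultdict(float)
    for dim_value, metric_value in zip(dim_values, metric_values):
        totals[dim_value] += metric_value
    # Rank groups by their summed metric, largest first.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]

top_countries = top_by_sum(columns, "country", "revenue", 2)
```

The `browser` and `city` columns are never touched by the query, which is the point of the columnar layout.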
High concurrency is the idea that Druid can support very high workloads from multiple users and from applications that are making multiple queries kind of all at the same time under one roof.
Druid is very scalable. So, scalable to hundreds of servers and millions of messages per second. There are Druid clusters out there with over a thousand servers, and Druid is very horizontally scalable. You'll see later why that is from the architecture, but essentially if you have a cluster and it's working well for you, and then you decide you want to put twice as much of a workload on it but maintain exactly the same performance, you can just have a cluster that's twice as big and that would handle that doubled workload.
Druid indexes all dimensions by default. So you stream data in and Druid will automatically build indexes on all the dimensions. So for any kind of dimension that you have, you can filter on it and you can group on it. Anything that you put into Druid instantly becomes analyzable and groupable.
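As a rough illustration of why indexing every dimension makes filtering cheap, here is a small sketch of a per-dimension inverted index. Druid's real indexes are compressed bitmaps; plain Python sets stand in for them here, and the rows, dimension names, and helper functions are hypothetical.

```python
def build_index(rows, dimensions):
    """For each dimension, map every value to the set of row ids containing it."""
    index = {dim: {} for dim in dimensions}
    for row_id, row in enumerate(rows):
        for dim in dimensions:
            index[dim].setdefault(row[dim], set()).add(row_id)
    return index

def filter_rows(index, **conditions):
    """Return the sorted row ids matching every dimension=value condition."""
    matching = None
    for dim, value in conditions.items():
        row_ids = index[dim].get(value, set())
        # Intersect the per-dimension results instead of scanning the rows.
        matching = row_ids if matching is None else matching & row_ids
    return sorted(matching) if matching is not None else []

rows = [
    {"country": "US", "device": "mobile"},
    {"country": "FR", "device": "desktop"},
    {"country": "US", "device": "desktop"},
    {"country": "US", "device": "mobile"},
]
index = build_index(rows, ["country", "device"])
us_mobile = filter_rows(index, country="US", device="mobile")
```

Because every dimension gets an index at ingestion time, any combination of filters can be answered this way without deciding up front which columns to index.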
Druid supports a pretty standard SQL interface that is exposed in our UIs, and you can basically query Druid with SQL. You can also use the JDBC connector that ships with Druid if you have an application that is powered by JDBC or that can accept a JDBC input.
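Besides JDBC, Druid also accepts SQL over HTTP as a JSON POST to `/druid/v2/sql`. The sketch below only builds such a request; the host and port, the `wikipedia` table, and the `channel` column are assumptions for illustration, and the helper name is made up.

```python
import json
import urllib.request

def build_sql_request(base_url, sql):
    """Build (but do not send) an HTTP request for Druid's SQL endpoint."""
    payload = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/v2/sql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical cluster address and data source.
request = build_sql_request(
    "http://localhost:8888",
    "SELECT channel, COUNT(*) AS edits "
    "FROM wikipedia GROUP BY channel ORDER BY edits DESC LIMIT 5",
)

# To run it against a live cluster, something like this should work:
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```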
Then, real-time sub-second queries. The main thing is that Druid is optimized for queries that return in less than a second. A lot of architectural decisions in Druid are really made to power this idea. So this means that everything has to be hot and live. We can't have anything that is slow as part of the query path. If you issue a query to Druid and the cluster is properly tuned, then within a second that query will return to you.
Lastly, Druid can blend real-time and historical data in one place. So you can store years of historical data in Druid. You can tier it and have maybe data that's older stored on cheaper hardware, and data that's less old, maybe data from a month ago, stored on premium hardware, and then real-time data that's currently streaming in will be stored in its own place, and the whole thing will be queryable in a single query. You just say what time range you want, including all of that, and Druid will perform that query.
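The tiering idea can be sketched as a simple age-based rule lookup. This is not Druid's rule syntax; the tier names, the age cutoffs, and the `pick_tier` helper are invented purely to illustrate the concept of routing older data to cheaper hardware.

```python
from datetime import datetime, timedelta

# Hypothetical tier rules, checked in order: (maximum data age, tier name).
TIER_RULES = [
    (timedelta(days=30), "premium"),
    (timedelta(days=365 * 3), "cheap"),
]

def pick_tier(data_time, now, rules=TIER_RULES):
    """Return the first tier whose age limit covers the data, else None."""
    age = now - data_time
    for max_age, tier in rules:
        if age <= max_age:
            return tier
    return None  # older than every rule: not kept on any tier

now = datetime(2019, 6, 1)
recent_tier = pick_tier(datetime(2019, 5, 20), now)  # 12 days old
old_tier = pick_tier(datetime(2018, 1, 1), now)      # roughly 17 months old
```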
The use cases where we see this kind of technology applied start with clickstreams and user behavior. So if you host a website, if you host an app or something like that, then you want to see how people are interacting with it. You collect the clicks from your app, and it's basically kind of like a Google Analytics-style product, but one that companies build themselves and that provides a very high level of enrichment. So the key strengths here are being able to segment the data however you want and also to enrich it with whatever attributes make sense for you.
Digital advertising. That's actually the birth cradle of Druid. Druid was originally born to handle the high volumes of data that are present in digital advertising and ad server logs. You tend to have very high volume of data there and Druid can scale to that and also the data will come in to Druid in realtime so you can monitor the performance of your campaigns and see like if anything is wrong with the network right now.
Application performance monitoring, or APM. Basically, if you have an application running on a server, maybe you're deploying a large number of servers, then it's very nice to be able to monitor things like: are response times within thresholds, or, if they aren't, which microservice is causing a slowdown?
One of the things I will show briefly in the demo is a product called Clarity, which is our monitoring service for Druid. That's basically an APM solution for Druid that we built and that we provide for you to help manage a Druid cluster.
Network flows. So if you have a data center and you have a lot of machines in it and they're communicating with each other, then looking at the flows happening in the network can let you know if there's some bottleneck or some attack happening. Again, Druid is very popular in this space because you can arbitrarily say, "Okay, well I want to look at it by source IP, destination IP, source port, and destination port," and split and segment the data however you want. You can also monitor the data in real time, of course, and again, aggregate a huge amount of data in one place.
Then, lastly IOT. So sensor readings on smart meters or any large complex system that works and has a bunch of sensors that are emitting data, that is a very good use case for Druid because you have a lot of this data being generated and again, you want to be able to see exactly what's going on right now, but also compare it to historical trends. So to see like, well, is the sensor reading from now like different from a sensor reading from some time ago?
In all of these cases, one of the key things that we see people doing is enriching the data that they're putting in. So having some enrichment service that adds more columns, some more information about the user or, in the case of network flows, the type of traffic that's happening, and using that.
So just looking real quick, if you go to druid.io/druid-powered, you'll see a bunch of companies that are using Druid. There are a bunch of companies that power their analytics with Druid in production today, big companies. There's Netflix, which uses it for user analytics, the clickstream analytics of all the people interacting with Netflix properties. There's Lyft, which uses it for analyzing the routing algorithms and also the ride sharing experience, and many, many other companies that use Druid for all these cases I'm talking about.
If you go to druid.io/druid-powered, you'll see little blurbs from all of them about what it actually is that they're doing. You can go in there and play the game of classify the use case.
All right. So that was Druid, and Druid is a really good project. You should go to druid.io, check it out, and read the powered-by page. I want to just mention real quick what Imply does and where we come into this.
So Imply is a company built around Druid. The co-founders of Imply were the developers of Druid, and Imply still drives the majority of Druid development today. We're very much active in promoting the project and making sure that it's great and that it meets people's needs. What we do at Imply is we try to provide a complete solution around Druid, so that if Druid and the analytics that it offers are what you're looking for, you can pick up Imply off the shelf and plug it in to do a lot of useful stuff.
Specifically, we provide visualization in the form of the Pivot app that I was showing in the GIFs, and you'll see it in a little bit in the demo. Then we provide tools to make sure you're running securely, so you don't have to worry about all the security aspects, and you can meet all the compliance checklists that you have in your organization if you're deploying Druid, as well as enforce user-level, role-based security on your data.
Most importantly, and this is what this webinar is all about, we help with management and operations. What we really want to do is make running a Druid cluster as easy and as simple as possible. I think today you'll see an example of what I believe is probably the simplest possible way to start a Druid cluster. You'll see that it takes only one or two clicks. What we want to empower you to do is deploy Druid clusters, clone clusters for testing, roll out new versions or configuration changes with no downtime, and monitor and alert on the performance of your Druid clusters so you can maintain your SLAs to your customers.
So I want to talk a little bit about architecture and architecture of Druid and also architecture of Imply Cloud. I think this will explain a lot about how stuff gets deployed.
So this is a standard Imply architecture. This is what you would set up if you went to our site, went to the get started page and downloaded our tarball. Fundamentally, you have three types of servers that come with Imply. You have the query server, the master server, and the data server.
The data server is the server that kind of does all the work. It does all the heavy lifting. Each one is basically its own database. It works with some deep storage layer. Deep storage is where your data, which Druid packages into things called segments, is stored, and the data servers, when they come up, pull the data that they need from deep storage and then serve it accordingly.
What you have here is the query servers. They serve the app, in this case. So this is the entire architecture. We bundle Pivot on the query servers and you can log in and start analyzing and exploring your data straight away from there. The query servers take queries and farm them out to the data servers. Then the data servers, again, as I said, receive data from streaming or batch ingestion, build segments, and push them down to deep storage, and other data servers load them according to however replication is configured.
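The fan-out described here, where a query server farms a query out to data servers and combines what comes back, can be sketched roughly as below. This is a toy in-process model, not Druid code: the segment contents, the `page` and `count` fields, and both function names are hypothetical.

```python
from collections import Counter

def data_server_partial(segments, dimension):
    """Each data server aggregates only the segments it has loaded."""
    partial = Counter()
    for segment in segments:
        for row in segment:
            partial[row[dimension]] += row["count"]
    return partial

def query_server_merge(partials):
    """The query server merges the partial results into one answer."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return dict(merged)

# Two hypothetical data servers, each holding different segments.
server_a_segments = [[{"page": "A", "count": 2}, {"page": "B", "count": 1}]]
server_b_segments = [[{"page": "A", "count": 3}], [{"page": "C", "count": 4}]]

partials = [
    data_server_partial(server_a_segments, "page"),
    data_server_partial(server_b_segments, "page"),
]
result = query_server_merge(partials)
```

Since each data server works only on its own segments, adding servers spreads the same work over more machines, which fits the horizontal scaling described earlier in the talk.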
Then on the side here you have master servers. You have three of them; I drew three just for high availability. They form a quorum and they sit off to the side, and they just coordinate and make sure that the cluster is adequately balanced. If a new data server comes up, they tell it what to load. They also tell the query servers which data server has what information. They do that with the aid of a metadata store, in which they store metadata about the cluster.
So this is what you would get if you just went to our site, downloaded a tarball, and maybe tried the quick start. In the quick start, all of these servers are running locally within your computer, but obviously in a clustered setup you'd have them as different machines.
Now, the way Imply Cloud works is that... We call it bring your own VPC. Imply Cloud is currently available only within AWS. So you set up an AWS VPC and you link it to the Imply management VPC. What you then get is an interface to our console, which I'll be demoing very shortly right here. Inside of this interface you can then spin up clusters.
When you spin up a cluster, it will be spun up in your VPC and it will have basically the same architecture as what I was just showing for the on-prem thing. It's going to run completely on your servers, and this is very important to us. You own the data, you own the servers, and if you configure the right instance permission roles, you can SSH into them and play around with them however you want.
Obviously, if you make some changes, our management service might potentially stomp over those changes. So we also provide an interface for you to configure what changes you want to make from our UI. For example, you can load your own custom extensions. A strength of Druid is that you can extend it however you want, and we let you configure which extensions you want to load from your own S3 buckets.
For deep storage we will use S3 and we'll spin up an RDS for you as the metadata store and then we'll make sure that these clusters are optimally connected. Not only that, but when we spin up this cluster, we make sure that, for example, all of these links are TLS encrypted and the deep storage is encrypted and the RDS is encrypted. So we don't just spin up a random cluster, we spin up what I would call the gold standard cluster. It's basically, all the bells and whistles that you need.
I've talked about this for quite a bit and I think it's about time to show a demo of how all of this connects together and what this actually looks like. All right. I'm going to go to cloud.imply.io. This is Imply's demo account inside of our own cloud. So this is me logging in as a cloud customer and this is what you would see if you were a cloud customer.
I have a lot of clusters configured here and a lot of them are in a stopped state. This is actually very useful. I can stop a cluster, save all the configuration for it, have nothing on my bill because nothing is actually running, but at any point be able to recreate that cluster with exactly the same configuration, and it will remount the same deep storage, meaning that it will reload all the segments that it was serving.
I have two clusters that are running right now, and I'm going to go ahead and create a new cluster. Spinning up a new cluster would be the first thing I would do when I come into this for the first time. I would have nothing here; this would just say, "Please, go and spin up a new cluster." I go here and all I have to do is really provide a cluster name, so, cloud demo, and then select a version from one of the available versions; I'll pick the latest. Then all I have to do is really just pick what instances I want.
In this case, because I'm going to spin up this cluster but probably not really do much with it, and I will shut it down shortly after, I'm going to choose pretty small instances. I'm not even going to make it highly available, which means that this cluster won't support rolling updates. But that's okay because I will show a rolling update on a different cluster.
After I've selected all my instances, I could just go ahead and do create cluster. Before I do that, I want to just show off the advanced [inaudible 00:23:22], because one of the things we're really passionate about is that you are running this infrastructure. You have full control of it, so I could provide a key pair name if I wanted to SSH into it. If I wanted to load some custom files into the cluster for some reason, for my extensions, that would work. I could futz around with the encryption and I can even override the configurations of each individual node directly.
Obviously, based on the instance types, the ideal configurations will be pushed out to the nodes. But if there's something I want to try out, maybe some experimental feature, maybe some extension that you're providing with some of its own configuration, you can override it here.
I'm going to go ahead and create this cluster now. So I'm going to confirm this. The request was submitted. This cluster, cloud demo, is now starting, and it's just going to spin here for a while and provision all the resources. Within 10 or 15 minutes, this cluster will be up and running, and along the way you will be able to see what's actually going on. So, there are some CloudFormation requests that were submitted and slowly it will kind of work through that. It will build a plan and then it will show me what the plan is.
Once that cluster is spun up, I can start using it. This is by far the easiest way to spin up a Druid cluster that I know of. I mean, I showed off a few more things, but if I wanted to really be aggressive about it, I could've just started the cluster without configuring anything and I would have gotten there faster. So I always say it's quite a few clicks less than ordering Grubhub. So take that, Grubhub. And yeah, let's see.
Another thing that you can do pretty easily while the cluster is spinning up: I have a cluster here. It's running; it's the nightly cluster. It's running on a version that is slightly out of date. We don't use this cluster very much, but I'm going to go in and manage it. So I can go here, I can go to the setup screen, and I can see, okay, this is a pretty small cluster; it's actually not a highly available cluster. So I will not be able to perform a rolling update on it. But I will still be able to upgrade this version. I'm going to apply changes, and in this case, because this cluster isn't highly available because [inaudible 00:26:13] the configuration I chose, it's going to say, well, [inaudible 00:26:18] will be upgraded to the new version, but it will be restarting in the process.
Similarly, I have another cluster here. It's the stable cluster and it is running the latest Imply version. I can go in here and manage this cluster, and I could maybe downgrade it. If I was to downgrade this cluster, I would be able to apply this without any service interruptions because it is configured to run in a highly available way. Specifically, it has three master servers, so one of them can be taken offline at any point.
I'll just quickly show off the other parts of this interface right here. So one important part that you... Well, let's discard these changes. You can actually see the specific servers that are running, and by the way, as I mentioned, one of the things that is very useful to do with Druid, especially when you're managing lots of historical data, is to have multiple tiers. Right here I can say I want three tiers and configure different machines and different numbers of machines for different tiers, and then later go in and reassign which segments of which timeline, or which data sources, are stored on which tier, and monitor to make sure that that assignment is performed correctly. So I'm going to discard these changes.
Another thing that you get here, you get this with Imply, you get our interactive exploratory app built in and ready to go. You can access it by just clicking open here. Open is opening this cluster and letting you play around with the data visualization app. So right here I could... This cluster is running and I can drag some dimensions in and perform very immediate interactive queries. I can start building dashboards and I can share this out to all of my users.
Now, this has been automatically provisioned and configured to talk to this cluster. Obviously, you might also want to be able to interface with this cluster by yourself, and for that we provide an API, which you can access through the API screen that I was showing before.
Another thing that you can do here is you can actually load some new data. So we have a data loading flow. If I point it at some data, in this case data on an HTTPS server, I can sample it, get a sample of my data, and then get help building a schema on top of that. Druid has a high-level schema that it uses. It needs to know what your time column is so it can partition the data correctly. This will help me configure that. It will also tell me if I have roll-up or not. I could set up automatic compaction, review the configuration for this, and then start loading this data straight away.
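Roll-up, mentioned in the data loading flow, roughly means pre-aggregating rows that share the same truncated timestamp and dimension values at ingestion time. Here is a minimal sketch of that idea; the hourly granularity, the `channel` dimension, the `added` metric, and the `roll_up` helper are assumptions chosen for illustration.

```python
from datetime import datetime

def roll_up(events, dimensions):
    """Collapse events that share an hour and dimension values into one row."""
    rolled = {}
    for event in events:
        # Truncate the time column to the hour (an assumed granularity).
        hour = event["time"].replace(minute=0, second=0, microsecond=0)
        key = (hour,) + tuple(event[dim] for dim in dimensions)
        row = rolled.setdefault(key, {"count": 0, "added": 0})
        row["count"] += 1                # how many raw events were merged
        row["added"] += event["added"]   # summed metric
    return rolled

events = [
    {"time": datetime(2019, 1, 1, 10, 5), "channel": "#en", "added": 10},
    {"time": datetime(2019, 1, 1, 10, 40), "channel": "#en", "added": 5},
    {"time": datetime(2019, 1, 1, 11, 2), "channel": "#en", "added": 7},
]
rolled = roll_up(events, ["channel"])
```

Three raw events become two stored rows here; the trade-off is that individual events inside an hour can no longer be told apart.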
So this is the idea: with Imply Cloud you can get started spinning up a Druid cluster, wait a bit for it to spin up, have everything provisioned perfectly with all of the security that you could want by default, out of the box, and then start loading data and playing with it, all within the span of a few minutes.
Going back to the cloud interface, I'm going to manage this cluster here. I can also access the underlying Druid console of this cluster. So again, if I want to dive deep and look at the segments within this cluster, or the tasks, like the one I just submitted, or the data servers, everything is available for you here. You can configure the dynamic properties of the cluster and the lookups; everything is wired in, and access to this interface is controlled by a set of permissions.
So in your environment, you will set up a set of roles with a set of permissions of who can access what, whether people can manage the cluster. In this case I have a super access permission so I can kind of see and manage everything.
Last, but not least, we also automatically hook this cluster up to Clarity, which is our monitoring solution. So here I can basically go in and see, "Okay, what queries are happening on this cluster right now?" I see that I've been querying the Wikipedia data source, and in this case the queries are pretty trivial because this cluster doesn't get a lot of usage, but it's definitely useful to be able to monitor and diagnose the performance of a cluster.
I guess, lastly... or we can go back and see how that cluster I was creating is doing. So I'm going to look at this cluster and I see that it's still creating. It has a few more messages here, and the biggest thing it's going to wait for is the creation of the RDS instance. Then once that is configured, that cluster will be fully usable. So probably by the time we get to the end of the question and answer section of this webinar, this will be up and running.
Most importantly, even while this cluster is updating, I can still monitor it, because Clarity is something that we host. We collect the metrics for this cluster and we let you access them as much as you want. So if I monitor this cluster, I have the information for this cluster. I see my Wikipedia traffic here and I can see what queries have been made, how the ingestion was behaving, and what the server performance was, as well as any exceptions that have been logged.
Meanwhile, this cluster that is currently doing a rolling update is showing me there is an update in progress. I can see and kind of have an audit trail of all the changes that have been happening. So in this case I have asked to upgrade from 291 to 297, and at the same time I can see that I have an update in progress, and I can see the set of steps it's going to take. So I understand what it's doing.
So it's trying to be as transparent as possible about your operations. At the end of the day, you host the information, it's stored in your VPC, and you have full access and full control over it. All of what I'm showing you here is just lots of nice-to-have tools to make your DevOps life easy and to let you focus on what's really important, which is loading your data and making sure that you can surface the data and actually help your digital business with the data you're collecting, not futzing around setting up encryption on a Druid cluster; have that handled by the service.
So I think at this point I will pause the demo and maybe we'll see if there's any questions.
So the first one was, excuse me, is Imply free to use or is it paid? That question actually came in even before we started the webinar. There's always, of course, the dollars-and-cents question.
Yeah, absolutely. So Druid itself is free and open source. Go grab it today from the Apache download site and you can use it. The Imply distribution, if you are running on-prem, is free to use. The Pivot application, the extra application, is trial limited, but otherwise one of the things that we ship there is a slightly Imply-flavored distribution of Druid, which I didn't go into much, and that's all free to use.
Then Imply Cloud is a paid service. If you'd like to use it, you should reach out and contact our sales team, and you can run a POC and take it from there.
The second question is a more technical one: is auto scaling possible with the Imply cluster management?
So if by auto scaling you mean automatically scaling to meet demand, that's currently not supported in Imply Cloud. It's something that is on our roadmap. But what is possible and very easy is scaling out a cluster. So if I go here and I want to increase the number of instances, I can just increase it, apply changes and... Let's do it. Let's scale this cluster out a little bit. This will be a very quick migration because it will do a diff, see that it only needs to set up one server, and then it will just do it.
It won't do it automatically based on load, but with tools like Clarity and our capacity monitoring, you'll presumably want to keep an eye on your scale, and if you know you have a use case coming on board that will, let's say, triple the volume of data you're ingesting, you'll want to scale ahead of that. Or alternatively, for example, if you're just trying out a cluster, maybe you're just playing around with something and you say, "Well, I'm going home for the weekend. I don't want to have these machines running overnight." You can stop that cluster yourself; it's just one button click. Then, as I said, all the state will be saved. So if you later want to recreate that exact cluster, you can do that with one click.
I'm not sure if there's enough context in this next question, but I'll ask it anyway. What happens if we reduce the instance numbers? Does that make sense to you?
Yeah, I think the question is, what happens if you have a cluster running, and maybe it's running at a certain capacity, and then you just go in and for funsies reduce the instance number. Well, very simply, it depends. Basically, if you reduce the instance numbers by a little bit, then everything will be handled smoothly, because the segments from those instances will be dropped and rebalanced onto other instances to maintain the replication factor. Then you will have other instances that are kind of picking up that slack and taking over that workload. So they will be more full.
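The rebalancing described here can be sketched as follows: when an instance goes away, every segment it held is re-replicated onto the least-loaded remaining instance until the replication factor is met again. This is a simplified stand-in for what the master servers do, not Druid's actual balancing algorithm; the server and segment names and the `remove_server` helper are hypothetical.

```python
def remove_server(assignments, servers, removed, replication):
    """Drop one server and restore the replication factor on the rest."""
    remaining = [server for server in servers if server != removed]
    new_assignments = {
        segment: set(holders) - {removed}
        for segment, holders in assignments.items()
    }
    # Count how many segments each remaining server currently holds.
    load = {server: 0 for server in remaining}
    for holders in new_assignments.values():
        for server in holders:
            load[server] += 1
    for segment in sorted(new_assignments):
        holders = new_assignments[segment]
        while len(holders) < replication:
            candidates = [s for s in remaining if s not in holders]
            if not candidates:
                break  # not enough servers left to reach the target
            # Prefer the least-loaded server; break ties by name.
            target = min(candidates, key=lambda s: (load[s], s))
            holders.add(target)
            load[target] += 1
    return new_assignments

servers = ["d1", "d2", "d3"]
assignments = {
    "seg1": {"d1", "d2"},
    "seg2": {"d2", "d3"},
    "seg3": {"d1", "d3"},
}
after = remove_server(assignments, servers, removed="d3", replication=2)
```

In this toy example the two remaining servers each go from holding two segments to holding three, which is the "more full" effect described in the answer.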
So, I guess, this indicator here will rise. Separately, those instances will also be serving more queries, and if they were already at a high capacity, you might get queries waiting for longer. That's a metric that you can surface through Clarity. One of the very interesting metrics to look at is, by instance, what is the query wait time on that instance. Which is basically like, how long are the lines, how long are queries just standing in line waiting to be processed.
Ideally, you want that query wait time to be almost negligible, a couple of milliseconds. So basically a query comes in and then it gets picked up by an execution context straight away. But if your servers are overloaded with queries, then you will not get that. Then you'll get lines that are increasing.
Lastly, another place where you could fall behind is ingestion. So again, it really depends on what your capacity is, but each server has a certain capacity for how many messages it can process, and if, due to your decrease in instances or maybe an increase in your data volume, your rate of messages is more than your provisioned servers can cope with, they will start to fall behind. If you're using a real-time system, like Kafka or Kinesis, then we have a metric called the Kafka lag.
You want to monitor that in Clarity, again, and see that your Kafka lag isn't increasing. It's just the size of the offset between the last offset in the stream and the last offset that was processed. You want to make sure that that is a constant number, that it doesn't really go up and it's pretty small. Instead, if you see it as a graph that's going up over time, you know that maybe you're under-provisioned on the ingestion side.
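The metric described here is simple arithmetic: the latest offset in the stream minus the last offset processed, watched as a trend rather than a single reading. A minimal sketch follows, with made-up sample values; the function names are hypothetical and the "strictly increasing" test is just one simple way to spot a bad trend.

```python
def consumer_lag(latest_offset, processed_offset):
    """Lag is how far the last processed offset trails the end of the stream."""
    return max(latest_offset - processed_offset, 0)

def is_falling_behind(lag_samples):
    """Flag lag that grows with every sample; one reading proves nothing."""
    if len(lag_samples) < 2:
        return False
    return all(
        later > earlier
        for earlier, later in zip(lag_samples, lag_samples[1:])
    )

# Hypothetical lag readings taken at regular intervals.
catching_up = [90_000, 40_000, 5_000, 200]  # large at first, but shrinking
steady = [120, 118, 121, 119]               # small and roughly constant
growing = [100, 250, 400, 900]              # climbing every interval
```

Only the `growing` series would be a sign of an under-provisioned ingestion side; a large but shrinking lag just means ingestion recently started and is catching up.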
That really shows how useful Clarity is for diagnosing these problems, because it's not a snapshot that you care about. You don't just care about what your Kafka lag is right now, because that could be small or large for many reasons. For example, if you started ingesting just now, your Kafka lag would be really large because you haven't caught up to the whole stream yet, but everything is fine. It's just chugging away and doing it. It's only something to be concerned about if you see the Kafka lag increasing over time.
This is basically the cracks of Druid. You have fundamentally three things happening in any Druid cluster. You have queries that need to be served, you have data that needs to be ingested and then there's segments that need to be loaded and put on servers. So there needs to be enough capacity to serve them.
Helping you with all three of those things is part of what Imply offers. We give you both the tools and the expertise to help you figure out what the sizing is based on those three variables.
So that's a very long-winded answer to the question of, well, if you decrease instances, lots of things can happen. Those are the three places where one of them will basically start breaking, depending on your setup and your usage, or nothing will break if you don't decrease it that much.
Okay. This question is coming in different forms, and it sounds like an exciting question. What if I want to run the instance in Azure or GCP... these are two questions, or on my own on-prem instances?
Yes. So, what an unexpected question that I did not expect to be answering. Surprise, I have an answer. So what I've been demoing here is the Imply Cloud AWS-based thing. This service right here is integrated into AWS. You saw that when I spun up a cluster, it was talking to CloudFormation, and when it's configuring the storage, it's configuring S3-based storage, et cetera.
We are now also about to roll out a new product, which we're testing with our customers right now. We are very excited to get more people interested in participating in our test program here. That is the on-prem manager. On-prem just means that it's 90% of the functionality of what I'm demoing here. The only key difference is that instead of going out and provisioning machines for you, like we can do by spinning up instances, you will actually have to provision machines and have them register with this management service, but otherwise it will act the same. You will get the same interface, you'll be able to do the same rolling updates, and you'll be able to configure the same security. The vision is to basically match this UI as much as possible.
The use case for that is if you want to run it on just bare metal, maybe directly on your own servers, or inside of a private cloud, or in any cloud that we don't support right now, or even in AWS but you don't want to have a VPC linked to our VPC. By the way, about that, if that's something that concerns you, I should note that you can turn off a flag so that, other than the management stuff, we can't log in to help and support you in any way. Then you're basically running an on-prem distribution. But if you want it not linked to anything, you just want to have your own firm in your own areas, then the on-prem manager would be for you.
Separately, we're also working on bringing Imply Cloud, this interface, to other clouds, and more on that to follow.
Yes. Stay tuned on that front. A fairly pointed feature question. Does the Clarity UI also support dashboards like Pivot does?
The Clarity UI is... funnily enough, from a technical point of view, one giant dashboard. There is currently no support for creating your own arbitrary dashboards, but that's something that we're planning on rolling out. It's top of the Clarity roadmap.
Great. How do you handle, quote unquote, late segments? Basically you segment data per hour, but that data contains events that do not belong to that hour segment and are maybe from the hour before. Is it possible to handle this sort of data?
Absolutely. So late data is an inescapable part of how data works. But what actually will happen... so maybe I'll go into... Well, I'll open this here, I'll go to the managed data just so I have a visual for the... no, hold on. I'm going to the manager, and there.
So, okay. Usually we create segments at a certain segment granularity, like either per hour or, let's say, by day or something like that. So let's say you are creating segments for every hour. We don't have to create segments just for the latest hour. In fact, if you're ingesting some data and you have some data for the previous hour, that will just... Let's say you have one or two rows for the previous hour. What will happen is that you'll get a tiny segment built for that past hour, and that hour could have been a week ago or whenever.
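To make the late-data case concrete, here is a minimal Python sketch of bucketing events by hour. It is only an illustration of the idea, not Druid's ingestion code, and the helper names `hour_bucket` and `rows_per_hour` are made up for the example.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable


def hour_bucket(ts: datetime) -> datetime:
    """Truncate an event timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def rows_per_hour(event_times: Iterable[datetime]) -> Counter:
    """Count rows per hourly bucket, based on event time, not arrival time."""
    return Counter(hour_bucket(t) for t in event_times)


# Three events for the 10:00 hour, plus one late event from 09:58
# that arrives in the same batch.
batch = [
    datetime(2019, 5, 1, 10, 5),
    datetime(2019, 5, 1, 10, 30),
    datetime(2019, 5, 1, 10, 59),
    datetime(2019, 5, 1, 9, 58),
]
counts = rows_per_hour(batch)
```

The late event ends up in its own bucket for the 09:00 hour, which corresponds to the tiny extra segment for a past hour described above.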
So that is something that naturally might happen. One of the things that we offer, if you go to our data view and pick one of these data sources, just kind of at random, is that you can configure what's called automatic compaction.
So the whole idea of automatic compaction is exactly to help with something like that. It's really the Druid cluster equivalent of defragmenting your computer, if you remember doing that back in the day. Basically, due to certain circumstances, one of them being late data, another one being ingesting from something like Kafka with lots of partitions, you can get lots of small segments being created. That's less than ideal for performance because there's a little bit of overhead to reading every segment. So you don't want to have a segment with just a couple of rows in it. Actually, ideally, your segments would be between 400 and 800 megabytes.
What I just did is configure automatic compaction for this data source, which really just means that there is going to be a job that periodically looks at this data source and sees if there are any segments for the same hour that together are less than 400 megabytes. Then it will pull them out, merge them into one segment, and publish a new segment to override the previous ones.
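The selection rule described here can be sketched roughly as follows. This is a simplified Python model, not Druid's actual compaction task: the `Segment` class, the function names, and the assumption that a merged segment's size is the sum of its inputs are all invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List

TARGET_BYTES = 400 * 1024 * 1024  # the 400 MB figure mentioned in the talk


@dataclass(frozen=True)
class Segment:
    hour: str        # time chunk the segment covers, e.g. "2019-05-01T09"
    size_bytes: int


def compaction_candidates(
    segments: Iterable[Segment], target_bytes: int = TARGET_BYTES
) -> Dict[str, List[Segment]]:
    """Find hours with several segments that together are under the target."""
    by_hour: Dict[str, List[Segment]] = defaultdict(list)
    for seg in segments:
        by_hour[seg.hour].append(seg)
    return {
        hour: segs
        for hour, segs in by_hour.items()
        if len(segs) > 1 and sum(s.size_bytes for s in segs) < target_bytes
    }


def compact(segs: List[Segment]) -> Segment:
    """Merge one hour's small segments (size approximated as the sum)."""
    return Segment(hour=segs[0].hour, size_bytes=sum(s.size_bytes for s in segs))
```

An hour that already holds a single segment, or whose segments together already reach the target, is left alone. Only the fragmented hours are rewritten.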
That is a really great feature. This was recently added to Druid. By recently, I think it was two releases ago. Before that, you could still set up compaction. It was just something you had to do manually, and you had to figure out a cron job or some Oozie process or whatever. But now, Druid will just handle it for you. So I just did two clicks to set up automatic compaction, and now all those segments will be compacted, and I can be safe in knowing that any late data or any other circumstances that caused many small segments to be created will be automatically taken care of.
Right. Does the Imply Cloud launch spot instances or on demand?
We currently do not do that.
Awesome. Then, last question... and it's a couple of questions about joins. A little bit about Druid, and one is fairly specific: I guess there's a Presto plugin, a Druid plugin for joins, on GitHub. I don't know if that's something that we're looking at supporting or working on?
So... one of the really sweet features of Druid is that it has a lot of extensibility, because the project from day one had a very powerful extensibility framework that basically allows you to make a JAR that overrides any part of Druid. As a result, there are a bunch of extensions for Druid. Some of them we made, some of them are made by the community and we support them, and then there are lots of extensions that the community just provides and you can use them.
In the advanced configuration here, you can actually configure which extensions are loaded. These are from the Imply blessed extension list of different interesting things that you could enable, like a bloom filter extension, if you wanted to play around with bloom filters and have that basically as a new query feature.
Any kind of extension that we don't support here, we still absolutely allow you to use. Again, I must stress that this is not a black box. You are running the hardware, and if you want to get down and write a cool Druid extension, or maybe you already have one, maybe you have a cool aggregator or something that adds a query feature, you can absolutely load it here and say, "Okay, add the custom extension, provide a name and a path to where it lives," and then it will be pulled in from S3 or from a URL, put on all the servers, and they will be configured according to how you want.
Then in general, for example, the Pivot application allows you to hook into the query at whatever level you want. So if you want to configure a custom aggregator that it doesn't know anything about, and you just want to tell it, "Hey, just put this in your Druid query and send it over," that is absolutely possible.
So we really want to make this something that can be easily taken apart. As I said, there are levels of involvement. Medium is playing with some of our extensions. Some have warning signs on them because they're community extensions that we don't explicitly approve, but you can absolutely use them. Then obviously anything that you write yourself, if you just want to have a fun hack day putting together an extension, you go through here.
Cool. I lied, there's one more question, and it's this. Is it possible to run two different Druids for our online and offline loads which read the same S3 and real-time data?
Yes. So that's absolutely possible and configurable through this interface. If I go back to my architecture diagram that I'll find somewhere... Where did it go? My slides. My lovely, lovely slides. Okay.
You can always paint a word picture.
Yeah. Okay. Well, so basically, the question is really about having a mixed-type use case. You have maybe some app-facing queries that you want to happen really quickly, and you also have some slower report queries or something that runs overnight, and you want to make sure that they don't interfere with each other.
So one of the things that we have in Imply is the query server. You can set up two different query servers and basically use them for completely different things. You can also totally duplicate the Druid cluster and have the same data being ingested by both clusters and have it shared…
So basically, any part of the stack can really be shared. Not literally every configuration would be configurable through Imply Cloud, but most of the ones that make sense will be, and we'll be absolutely happy to help and support that and figure out how to make you very successful with your use case. But in general, the answer is yes.
Well, a vibrant discussion and a great presentation, Vadim. Thank you. Thanks to everybody. I know we're a little past the top of the hour, so some of you are starting to drop for back-to-back meetings, I imagine. We very much appreciate your attendance. Again, look in the email that you used to register for the recording and slides coming in the next 24 to 48 hours. Thanks again and we'll see you on this channel soon. Bye now.
Thank you very much.