Scaling with Speed: How Atlassian’s Confluence Big Data Platform Team Delivers Customer-Facing Insights with Apache Druid with Gautam Jethwani and Kasirajan Selladurai Selvakumari
In the world of tech, Atlassian needs no introduction. They’re a leader when it comes to collaboration, development, and issue tracking software tailored for teams. One of their flagship products, Confluence, is a team workspace cloud application, fostering collaboration, knowledge sharing, and more.
Meet the Confluence Big Data Platform Team, the driving force behind the user-facing analytics features within Confluence, delivering invaluable insights into user behavior. On this episode, they discuss their experiences using Apache Druid for Confluence to address performance issues and achieve sub-10 second query times. They highlight the benefits of Druid’s column-oriented storage, time-based partitioning, pre-aggregation at ingestion time, and approximation algorithms.
Atlassian also uses Druid as the core of their behavioral event system. Druid enables faster real-time access and handles billions of events daily without performance issues.
But that’s not all! Listen to this episode to learn more about:
- How the Confluence Big Data Platform Team successfully migrated from Postgres to Druid
- How they leveraged Druid’s native query types to create platform APIs for other teams to use the data, offloading computational load from Confluence itself
- How Atlassian relies on Druid for its performance, scalability, and the ability to handle large volumes of data in real-time
- Best practices and tips from the Confluence Big Data Platform Team
- Data Products at Scale: How Atlassian’s Big Data Platform Team Delivers Insights and More with Apache Druid
- Druid Summit 2022: Delivering insights to Confluence Users at Enterprise Scale
- Atlassian Switches from PostgreSQL to Druid for Customer Analytics
About the Guest
Gautam Jethwani is a software engineer working on the Confluence Analytics team at Atlassian, responsible for architecting new backend features and maintaining and expanding data infrastructure. His technological passions include data, augmented reality, 3D graphics and music technology. Outside of work, he is an avid rock climber and volunteers for Meals on Wheels. Gautam holds a bachelor’s degree and master’s degree in computer science from the University of Southern California.
Kasirajan Selladurai Selvakumari is a Senior Software Engineer on the Confluence Analytics team at Atlassian, leading efforts in building the big data platform for Confluence. He is passionate about building data systems and solving distributed system problems at scale. Outside of work, he is interested in traveling and hiking.
[00:00:00.490] – Reena Leone
Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data, and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level.
[00:00:19.300] – Reena Leone
So if you work in tech at all, or even if you don’t, you’ve probably heard of Atlassian. Atlassian provides collaboration, development and issue tracking software for all kinds of teams. One of their most popular products is Confluence, which is an industry-leading team workspace cloud application used for collaboration, knowledge sharing, and more. And I actually use that myself pretty much every day. The Confluence Big Data team is responsible for building and maintaining the user-facing analytics features within Confluence, bringing user behavior insights to the end user, and, spoiler: they’re using Druid and Imply to do so. Joining me today to take me through the project are Gautam Jethwani, Software Engineer at Atlassian, and Kasi Selladurai Selvakumari, Senior Software Engineer at Atlassian. Gautam, Kasi, welcome to the show.
[00:01:08.120] – Gautam Jethwani
Hi, Reena. Thank you.
[00:01:09.740] – Kasi Selladurai Selvakumari
Hello, Reena, nice to be here.
[00:01:11.780] – Reena Leone
So I like to kick off every episode with a little bit about you and your journey and how you got to where you are today. So can you both introduce yourselves a little bit and tell me about your background and how you got to where you are today?
[00:01:23.040] – Kasi Selladurai Selvakumari
Sure, I can go first. Hello, I’m Kasi. I’m a senior software engineer at Atlassian. I’ve been at Atlassian for about three years now. I have a master’s in computer science and I live in the Bay Area. Apart from work, I have a one-year-old, so a lot of my time these days is spent with her and playing with her.
[00:01:45.520] – Reena Leone
Aww, they’re so cute when they’re that age.
[00:01:47.840] – Kasi Selladurai Selvakumari
True, yeah, she’s very cute.
[00:01:50.130] – Gautam Jethwani
Cool. Yeah. I’m Gautam. I’m a software engineer at Atlassian. I’ve been with Atlassian for about two years full time, and I interned with them in 2020. I also have a master’s in computer science from USC, and I live in the Bay Area. When I’m not coding or software engineering, I am looking for the best places to do outdoor sport climbing in the Bay Area.
[00:02:12.320] – Reena Leone
Oooh, that sounds exciting.
[00:02:14.130] – Gautam Jethwani
Can be. Not as much as Druid.
[00:02:15.940] – Reena Leone
Well, okay. All right. You don’t have to ham it up. This is only the beginning of the show. You know, like I said in the intro, a lot of people are familiar with Atlassian. If they haven’t used Jira or Confluence before, do they even work in tech? Actually, since you’re on the Confluence team, I want to focus on that. But on the off chance that someone hasn’t used it, can you tell me what Confluence is?
[00:02:42.720] – Gautam Jethwani
Yeah, I can take this one. So Confluence is basically Atlassian’s wiki and open collaboration tool. It allows companies to host their knowledge base in a secure and reliable way and enables teams to collaborate in a more effective way in today’s remote era. So if you’re at work and you need to gain knowledge about anything, Confluence is the place you look.
[00:03:03.720] – Reena Leone
And so can you tell me a little bit about your team and what you’re working on for Confluence?
[00:03:10.150] – Gautam Jethwani
Yeah, for sure. So we are on the Confluence big data platform team. Our mission is to supercharge Confluence with insights that accelerate and transform teamwork.
[00:03:21.550] – Kasi Selladurai Selvakumari
Also, in recent times we have expanded our domain to power other Confluence features, such as recently viewed pages and spaces, as well as homepage feed experiences like the popular and following feeds.
[00:03:33.030] – Reena Leone
Very cool. So you mentioned Druid up front. Can you tell me a little bit about why you were looking for Druid or a little bit about what your tech stack was like prior to adopting it?
[00:03:48.300] – Gautam Jethwani
Yeah, so before using Druid we were actually using Postgres as our main database and our queries were running very slow. We were basically storing billions of analytics events in a Postgres database. Granted, we did have one schema per customer or per tenant, but even then our performance still got very slow. For our biggest tenants, it would go up to 60 seconds for some queries as our data set grew. So obviously something needed to change, and we switched to Druid. It has enabled us to grow and onboard new use cases that we would never have been able to support before, with sub-10 second query times for even our biggest queries.
[00:04:31.360] – Reena Leone
Wow, sub-10 seconds did you say?
[00:04:35.270] – Gautam Jethwani
Sub-10 seconds for our biggest ones. The ones that used to take more than 60 seconds. Most of our queries run in like subsecond latency.
[00:04:42.160] – Reena Leone
Cool. Okay, so latency was part of the reason that prompted the search for a new database?
[00:04:48.930] – Gautam Jethwani
Oh yeah, most definitely a lot of the reason, but there are a few more that Kasi can go over.
[00:04:55.700] – Kasi Selladurai Selvakumari
Yeah. So as Gautam mentioned, we were using Postgres and we had quite a few challenges with that. The first one was high maintenance cost, so we used to have one schema per tenant. So for example, if there are N tenants then we would have N schemas. So any migrations and changes to database-level things became really hard. Another overhead was we had to maintain some of the aggregations in code instead of the database doing it for us in a performant way. A key thing with Postgres performance was vacuums. Think of vacuum as garbage collection in Postgres, and this was really ineffective for us as they didn’t help much in query performance and we had to maintain a separate task to run this periodically.
[00:05:38.970] – Kasi Selladurai Selvakumari
The second biggest thing was slow response time. I think Gautam talked about it a little bit, so the queries were really slow, especially for our enterprise customers. To give you some data points, our P99 for site analytics queries was above 6 seconds for our enterprise customers. As you can see, this was a big bottleneck for us, as that’s not a great user experience. The third biggest thing was it was really expensive to deliver new smart experiences with the way we had structured things.
[00:06:13.190] – Kasi Selladurai Selvakumari
We had to make a lot of data model changes for even delivering a simple feature.
[00:06:17.650] – Gautam Jethwani
So what it really boiled down to at the end of the day is that we’re an analytics team and we needed an analytics database. With Druid, we were able to support low-latency data ingestion and fast queries, accommodate growing data sets, and scale for future use cases, all benefits of scale that we didn’t get with Postgres.
[00:06:35.320] – Reena Leone
When did you guys first hear about Druid?
[00:06:37.640] – Kasi Selladurai Selvakumari
So our team did a spike on Druid around 2020 ish. This was before Gautam or I joined the team.
[00:06:44.710] – Reena Leone
Okay, so you kind of like, inherited it. Did you know about Druid prior to joining this team?
[00:06:50.130] – Gautam Jethwani
Actually, when I was interning in 2020, we were still in Postgres, but I was hearing whispers of it in the team… Druid, Druid. And actually during that time, I was tasked with cleaning up my Postgres database. And then one year later, I rejoined full time. They were already on Druid, so I was like, what did I clean up for?
[00:07:12.030] – Reena Leone
So let’s go into a little more detail. Why did Druid work for some of these use cases that we’ve just been talking about?
[00:07:19.690] – Gautam Jethwani
Yeah, for sure. So Druid gave us, like I said, a lot of things that Postgres didn’t give us. First things first is column-oriented storage. It loaded the exact columns needed for a particular query, which is a great optimization that Postgres didn’t give us. Postgres would be loading in entire rows at a time that we didn’t need. Time-based partitioning. Obviously, we all know that Druid partitions data by time into segments, meaning we only access the data that matches the time range in our query, which is great because most of our queries are actually time-range based. There’s also pre-aggregation at ingestion time. We can have automatic summarization of data at ingestion time, allowing us to boost performance, but also save on storage and remove a lot of aggregation code from our code base, which again is another huge code cleanup that we were able to do. Approximation algorithms. It supports algorithms to quickly compute expensive aggregates by trading off accuracy at both ingestion and query time. And ingestion is a huge one. It can ingest millions of events within seconds, supporting both real-time and batch ingestion. The batch ingestion allowed us to backfill so we can easily onboard new data for new use cases.
[00:08:33.390] – Gautam Jethwani
And the native real-time ingestion allows us to offload the ingestion from our main service, which again, just took a bunch of load off of our main Docker container. And it was great. It really helped our performance and code maintainability.
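To make the pre-aggregation idea concrete, here is a minimal sketch in plain Python of what rolling up events at ingestion time means. It is an illustration of the concept only, not Druid's implementation: the `rollup` function, the `page` dimension, and the hourly granularity are all assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events, dimensions):
    """Collapse raw events into one row per (hour, dimension values).

    Each output row carries a `count` of the raw events it summarizes,
    which is the kind of pre-aggregated metric a rollup would store.
    """
    counts = defaultdict(int)
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate the timestamp to the hour (the assumed query granularity).
        hour = ts.replace(minute=0, second=0, microsecond=0)
        key = (hour.isoformat(),) + tuple(event[d] for d in dimensions)
        counts[key] += 1
    return [
        {"timestamp": key[0], **dict(zip(dimensions, key[1:])), "count": n}
        for key, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    raw = [
        {"timestamp": "2023-08-01T10:05:00", "page": "home"},
        {"timestamp": "2023-08-01T10:45:30", "page": "home"},
        {"timestamp": "2023-08-01T11:02:00", "page": "home"},
    ]
    for row in rollup(raw, ["page"]):
        print(row)
```

Three raw events become two stored rows here, which is where the storage savings and the removal of application-side aggregation code come from.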
[00:08:47.200] – Reena Leone
So we have established you were using Postgres before. Can you walk me through how you switched over to Druid?
[00:08:53.490] – Kasi Selladurai Selvakumari
Sure. The first thing was we spun up the Druid cluster, ingested real-time data, and backfilled the past data. So we got to a state where we had the same data flowing into both systems. Now we wanted to validate the data returned by the queries. So in order to do that, we decided to start operating Druid in production in shadow mode. So whenever a query was made to Postgres, we emitted the equivalent query to Druid asynchronously. We then automatically compared results and reported the diffs to our observability system. So this pretty much enabled us to roll this feature out without any customer impact, as we were able to correct all discrepancies before fully rolling out to production and using Druid. So once 100% of the traffic started going to Druid, we decommissioned our Postgres cluster, and yeah, we threw a party. That was a great moment. We didn’t have any elephant in the room.
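The shadow-mode pattern Kasi describes can be sketched in a few lines of Python. This is a hedged illustration, not Atlassian's code: `make_shadowed_query`, `primary`, `shadow`, and `report_diff` are hypothetical names standing in for the Postgres query path, the equivalent Druid query, and the observability hook.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shadowed_query(primary, shadow, report_diff, executor):
    """Wrap `primary` so every call is also replayed against `shadow`.

    The caller only ever receives the primary result. The shadow call runs
    on the executor, so it adds no latency and cannot fail the request.
    """

    def compare(query, primary_result):
        try:
            shadow_result = shadow(query)
        except Exception as exc:  # a shadow failure is reported, never raised
            report_diff(query, primary_result, exc)
            return
        if shadow_result != primary_result:
            report_diff(query, primary_result, shadow_result)

    def run(query):
        result = primary(query)
        executor.submit(compare, query, result)  # fire and forget
        return result

    return run


if __name__ == "__main__":
    diffs = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        run = make_shadowed_query(
            primary=lambda q: {"views": 10},  # stand-in for the old system
            shadow=lambda q: {"views": 9},    # stand-in for the new system
            report_diff=lambda q, a, b: diffs.append((q, a, b)),
            executor=pool,
        )
        print(run("views for page 42"))
    print(diffs)
```

Because the user-facing answer still comes from the old system, discrepancies can be fixed before any traffic is switched over.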
[00:09:51.120] – Reena Leone
Awesome, I love to hear that. And thank you for explaining shadow mode to me because that would have been a follow up question.
[00:09:58.190] – Kasi Selladurai Selvakumari
Yeah, glad that could help.
[00:10:00.800] – Reena Leone
Okay, so you guys were presenters at Druid Summit last year in 2022, and you mentioned in your presentation the ability to onboard platform APIs. Can you tell me a little bit more about that?
[00:10:14.090] – Gautam Jethwani
Yeah, so platform APIs are just APIs that are meant to serve a singular purpose for any use case, not just powering UI experiences. So it allows other teams to come in and use our data for their own use cases. Any team can come in and be like, hey, we need this data. And we’ll be like, great, we’ll whip up an API for you real quick. So this allowed us to take some of the computational load away from Confluence itself because they could just use our APIs for different queries and experiences that they were already powering. And Druid’s native query types were really powerful here because they allowed us to very quickly create these APIs as most of these use cases fit within the buckets of the different native query types. So it was basically just plug and play for most of our APIs. And the code is again, very easy, very readable, very maintainable.
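As an illustration of how a use case can map onto a native query type, the sketch below builds a Druid native `topN` query body in Python. The datasource name `confluence_events`, the `siteId` and `pageId` dimensions, and the `count` metric are assumptions made for the example, not Atlassian's actual schema. The body would typically be POSTed as JSON to a Druid broker or router at `/druid/v2`.

```python
import json


def top_pages_query(site_id, start, end, limit):
    """Build a native topN query: most-viewed pages for one site."""
    return {
        "queryType": "topN",
        "dataSource": "confluence_events",  # hypothetical datasource name
        "intervals": [f"{start}/{end}"],    # ISO 8601 interval, start/end
        "granularity": "all",               # one result bucket for the range
        "dimension": "pageId",              # hypothetical dimension
        "metric": "views",                  # rank by the aggregator below
        "threshold": limit,                 # how many top rows to return
        "filter": {
            "type": "selector",
            "dimension": "siteId",          # hypothetical tenant dimension
            "value": site_id,
        },
        "aggregations": [
            # Sums a pre-aggregated `count` column (assumed to exist).
            {"type": "longSum", "name": "views", "fieldName": "count"}
        ],
    }


if __name__ == "__main__":
    query = top_pages_query("site-123", "2023-08-01", "2023-08-08", 5)
    print(json.dumps(query, indent=2))
```

A platform API for "top pages" then becomes little more than a thin wrapper that fills in these parameters and returns the result rows.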
[00:11:09.750] – Reena Leone
Very cool. Okay, so what does your architecture look like now? Can you kind of give me a brief this-is-an-audio-podcast overview?
[00:11:17.190] – Kasi Selladurai Selvakumari
Well, I’ll try to do justice to that. So Druid helps us with faster real-time access, and it’s the heart of our behavioral event system. At Atlassian, we have a central event bus platform. Multiple services can write to and consume from the event bus in a pub-sub fashion. Our team, for example, consumes from Atlassian’s event bus using multiple Kinesis streams and then writes the data to Druid.
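For readers who have not set up streaming ingestion, the sketch below shows roughly what a Druid Kinesis supervisor spec looks like, built as a Python dictionary. Every concrete value (stream name, datasource, dimensions, granularities) is an assumption for illustration and not Atlassian's configuration. It also assumes the Kinesis indexing service extension is loaded on the cluster.

```python
import json


def kinesis_supervisor_spec(stream, datasource, dimensions):
    """Build a minimal Kinesis supervisor spec for streaming ingestion."""
    return {
        "type": "kinesis",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                # With rollup enabled, rows are pre-aggregated into counts.
                "metricsSpec": [{"type": "count", "name": "count"}],
                "granularitySpec": {
                    "segmentGranularity": "day",  # time-based partitioning
                    "queryGranularity": "hour",   # rollup bucket size
                    "rollup": True,
                },
            },
            "ioConfig": {
                "stream": stream,                 # Kinesis stream to read
                "inputFormat": {"type": "json"},
                "useEarliestSequenceNumber": True,
            },
            "tuningConfig": {"type": "kinesis"},
        },
    }


if __name__ == "__main__":
    spec = kinesis_supervisor_spec(
        stream="example-events-stream",   # hypothetical stream name
        datasource="confluence_events",   # hypothetical datasource name
        dimensions=["siteId", "pageId", "action"],
    )
    print(json.dumps(spec, indent=2))
```

Once a spec like this is submitted to the cluster, Druid itself reads from the stream, which is what takes the ingestion work off the application service.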
[00:11:43.310] – Reena Leone
OK cool cool cool. What does your data volume look like by any chance? Like, how many events are you dealing with on a daily basis?
[00:11:50.400] – Gautam Jethwani
On a weekday when we get most of our traffic, it could probably be like a billion events that we ingest and Druid doesn’t even break a sweat.
[00:12:00.130] – Reena Leone
Okay, so how is the performance? Give me a ballpark of what you’re dealing with right now.
[00:12:05.540] – Gautam Jethwani
Well, it’s definitely way faster than Postgres. It would be kind of disappointing if it wasn’t, if we were still getting more than 60-second request times.
[00:12:12.650] – Reena Leone
Oh God, yeah, that would definitely be a problem.
[00:12:15.080] – Gautam Jethwani
Just a little bit. But Druid returns mostly in just a couple of seconds. It’s very lightning fast.
[00:12:22.030] – Reena Leone
And you just said for ingestion you’re using Kinesis?
[00:12:25.170] – Kasi Selladurai Selvakumari
Yeah, Kinesis, actually. So we are an AWS house and with our event platform support, Kinesis is something that we can easily use to ingest data into Druid.
[00:12:35.570] – Reena Leone
Speaking of Druid, we’ve had a lot of releases, like 26 and 27 are both out at this point of this recording. Is there anything that you’re excited about in the upcoming roadmap or anything on your wish list that you would like to see added to Druid to make your lives a little easier?
[00:12:55.180] – Kasi Selladurai Selvakumari
Yes, we do. So we are really excited about the multi-stage query engine and the potential it has to do backfills better. We haven’t had a chance to use it yet, but I’m pretty sure that it will come in handy the next time we want to change some data. I have a couple of items on my wish list. The first one is updates in Druid. While Druid is really good at ingesting large volumes of data, when it comes to updating a particular attribute in a data source, it becomes really hard. I wish there was a better way to do it without having to reingest everything. The second one would be auto scaling in Druid. You know, a few times in the past we had to scale our Druid cluster based on the traffic patterns. I wish Druid automatically scaled for us. Yeah, those are the two wish list items that I had.
[00:13:54.920] – Gautam Jethwani
Kasi’s already sent these to Santa. He’s hoping he was a good developer.
[00:13:58.340] – Reena Leone
Well you know, I think the roadmap is actually published now too on Druid’s GitHub. So we can see if any of those things are on the roadmap or when they’re coming. So we were just talking about that actually the other day.
[00:14:12.930] – Reena Leone
Okay, so do you have any advice for people who have maybe been struggling with similar challenges that you were faced with? Maybe they’re also still on Postgres and not seeing the query performance that they’re looking for. What advice could you give them or how could they get started?
[00:14:30.300] – Gautam Jethwani
Oh yeah, loads. So it’s all a learning game. Druid is an incredibly powerful but also very complex piece of technology. I always tell new hires to our team that Druid is a beast that is not to be trifled with. And it’s the very first thing they should learn. So my advice would be to learn and understand the technology fully before diving in. You can never learn too much, never understand too much. Learn about the different query types, the architecture, the node types, how data is stored, ingested, and queried, debugging, anything and everything you can. To this day, I’m still learning myself, and I work on Druid every day. But most importantly, have fun with it. It’s super cool to see your data ingested, processed, queried, and just queries returning results in a couple of seconds, scanning through billions and billions of rows of data.
[00:15:26.970] – Reena Leone
I mean, not to like plug, plug, plug, but Imply can also help you. If you are new to Druid, we have some resources available to help you get started. And then there’s always the fantastic open source community. Head over to the Apache Druid Slack channel if you have questions. Everyone there is super nice and very very helpful. I think that’s going to do it for my questions for you guys today, unless there’s anything else you want to cover.
[00:15:56.450] – Gautam Jethwani
No, I think that’s it. You know, we talk about Druid a lot, and this pretty much covers it.
[00:16:03.860] – Reena Leone
Awesome. Well, thank you, Gautam. Thank you, Kasi, for joining me today.
[00:16:07.780] – Kasi Selladurai Selvakumari
Thanks for having us. Yeah.
[00:16:09.540] – Gautam Jethwani
Thank you, Reena.
[00:16:10.750] – Reena Leone
All right. To learn more about Confluence or anything that Atlassian is up to, please visit atlassian.com. To learn more about Druid, please visit druid.apache.org. And to hear more about what we’re working on at Imply, please visit imply.io. Until next time, keep it real.