Driving Innovation with Open Standards: How Voltron Data is Shaping the Data Ecosystem with Apache Arrow and Ibis with Josh Patterson

Jul 27, 2023
Reena Leone
 

Today’s show is all about the world of big data and open source projects, and we’ve got a real gem to share with you—Voltron Data!  They’re on a mission to revolutionize the data analytics industry through open source standards. To unleash the untapped potential in data, Voltron Data uses cutting-edge tech and provides top-notch support services, with a special focus on Apache Arrow. This open-source framework lets you process data in both flat and hierarchical formats, all packed into a super-efficient columnar memory setup. And that’s not all! Meet Ibis—an amazing framework that gives data analysts, scientists, and engineers the power to access their data with a user-friendly and engine-agnostic Python library. Excited to learn more? We’ve got Josh Patterson, the CEO of Voltron Data, here to give us all the details.

Voltron Data focuses on developing open source standards for data analytics. They specialize in comprehensive support services centered around Apache Arrow, which enables efficient processing of both flat and hierarchical data in a columnar memory format. They have also developed Ibis, a portable Python API that provides a stable and consistent interface for data analytics and machine learning across different computational engines.

Josh Patterson, CEO of Voltron Data, shares his background and journey that led him to co-found the company. He discusses the challenges faced in the cybersecurity space, particularly in moving data between systems and frameworks, and how Apache Arrow provided a solution to this problem. We dive into the importance of standards and modular systems in driving innovation and connectivity within the data analytics industry. We also talk through Ibis, which allows users to write code efficiently and run it on different backends, including Druid, accelerating productivity and simplifying data manipulation across various systems. 

Listen to this episode to learn more about:

  • How Voltron Data built a Druid backend in 4 hours with Ibis
  • How Ibis can be used for benchmarking queries and evaluating the performance of different data warehouses, enabling users to quickly test new backends and assess their suitability for analytics
  • How using Ibis and Arrow together enables seamless integration and data pipeline acceleration across various systems and languages
  • And how important open standards are in simplifying data connectivity and reducing complexity for enterprises that utilize multiple technologies in their data analytics processes


About the Author

Joshua Patterson is the co-founder and CEO of Voltron Data – a global startup establishing a new way to design and build composable data systems with open standards. Prior to Voltron Data, Josh led software engineering at NVIDIA where he created the RAPIDS ecosystem. Josh worked with leading experts across the public and private sectors and academia to build a next-generation cyber defense platform at Accenture.  He also served as a White House Presidential Innovation Fellow, focusing on high-profile technology initiatives in the federal government.

Transcript

[00:00:00.490] – Reena Leone

Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I'm your host, Reena from Imply, and I'm here to bring you stories from developers doing cool things with Apache Druid, real-time data, and analytics, but way beyond your basic BI. I'm talking about analytics applications that are taking data and insights to a whole new level. And today on the show, we are diving into the big data ecosystem and open source with Voltron Data, one of CRN's hottest big data startups of 2022. Voltron Data is committed to developing open source standards for data and unlocking the untapped potential in the data analytics industry. They specialize in cutting-edge tech and comprehensive support services centered around Apache Arrow, an open source framework that provides a language-independent solution for developing data analytics applications. You can see how this all fits right into Tales at Scale. So a little bit more about Apache Arrow: it enables the processing of both flat and hierarchical data in a highly efficient columnar memory format. But that's not all. There's also Ibis, a framework that data analysts, scientists, and engineers use to access their data through a convenient, engine-agnostic Python library.

[00:01:12.640] – Reena Leone

To give you the full scoop, I am joined today by Josh Patterson, CEO of Voltron Data. Josh, welcome to the show.

[00:01:20.300] – Josh Patterson

Thank you for having me. That introduction was amazing. I’ve never smiled so hard, so thank you.

[00:01:27.550] – Reena Leone

You gave me a lot to work with because Voltron does so many cool things, and in my research I was like, I got to cut this down. I could do like a whole spiel. That’s probably one of the longest intros I’ve done because you do so many cool things, but you as a person have done so many cool things. So let’s start there. Tell me a little bit about yourself, about your background and how you got to where you are today.

[00:01:49.490] – Josh Patterson

Yeah, this could be the whole episode. My background has been a little bit of a winding journey, so I'm going to try to be as succinct as possible. I started off with a master's in economics, and I was running a commercial construction company. The financial collapse of 2008, 2009 happened, and while other people were running from finance, I went directly into the heart of it. I started at Freddie Mac as an economist, really digging into the housing crisis. What happened? What could we have done differently? What did affordable lending look like? What was its place in the ecosystem? While there, we had access to a lot of data, and so we were building systems and writing a lot of code to analyze data in numerous different ways. Macros led on to more macros. We were doing a lot of things in SAS, and the SAS macros were really big. Then we started parallelizing SAS jobs by writing jobs that would launch other jobs, and by the time I left the financial services space, I realized I was doing a lot of systems engineering: writing a lot of jobs to break down really large-scale problems and make better use of the systems' resources.

[00:03:10.320] – Josh Patterson

A lot of times when you launch a job, you get so many cores. So if you would launch lots of mini jobs, you'd get more cores on the cluster. Exploiting the resource management of large financial institutions became kind of my thing, along with helping people build these types of systems. So I took that knowledge of ad hoc distributed computing, started researching all the trends in the HDFS space, and joined Accenture Labs, their big data lab out in San Jose, where I launched the data visualization curriculum for Accenture. We did a lot of things with D3 and state-of-the-art data visualization early on, pairing it with big data, and really followed Spark from its early beginnings, back when Shark was around, along with a lot of the open source projects of the time as they emerged, whether it was Druid or Presto or Impala or Kudu. We really spent a lot of time kicking the tires on a lot of really interesting big data projects, integrating them to showcase the art of the possible and where things were going. I married my wife, moved back east, and then took all that knowledge and applied it to cybersecurity.

[00:04:20.060] – Josh Patterson

And so we said, okay, some systems were really good for real time, some were good for this, some were good for that. And so we started building these really massive heterogeneous clusters to do cyber defense, mixing in graph analytics, OLAP, and better data management. What we quickly realized is that moving data between systems and frameworks was really hard and expensive. The serialization and deserialization costs were eating up a lot of the cluster's computation, so the cluster was expanding, not for the compute, but for all the data movement between systems. Right around that same time, Wes McKinney and Jacques Nadeau were getting a lot of support for Apache Arrow. And we were like, this would solve this really nasty problem we were having in cybersecurity, because not only were we having this heterogeneous system problem, we had also started dabbling in GPUs. We were using NVIDIA GPUs for graph analytics. Doing PageRank on a fairly large Spark cluster would take 20 to 30 times longer than doing that same PageRank algorithm on a single GPU server. And so there was this company, Blazegraph, that eventually got acquired by AWS.

[00:05:30.150] – Josh Patterson

And the team has been really significant in the pioneering of Neptune, and they've done some just amazing work in graph. They were showcasing the art of the possible for graph analytics on GPUs, and this was well before nvGRAPH and cuGraph out of RAPIDS. We were just thrilled by the performance. I mean, going from doing PageRank in hours to doing it in seconds was phenomenal. But moving data was, again, the pain point. It was just so hard to move data from all these systems: moving it onto the GPU, getting it into the right format, copying it across PCI Express. It's all data movement, which was really just slowing everything down. And so we were really just rooting for Arrow to take hold in the industry. And we joined NVIDIA. We, being a lot of my researchers from the cybersecurity lab at Accenture Labs, joined NVIDIA all around the same time, and we started building out GPU acceleration for data science. Not just data frames, but machine learning, graph analytics, data visualization. I've always been passionate about data visualization, and that was again challenging, because with all these different applications on the GPU, while we were taking things and making them 20 to 100 times faster, moving data between these applications was a bottleneck.

[00:06:47.990] – Josh Patterson

So we'd have to move all the data off the GPU, just to change the format, and move it back onto the GPU. And again, we just kept realizing, time and time again, that making things fast was not the bottleneck. It was actually making things connective, making things work well together, making things share data formats without having to serialize and deserialize the data across applications. So when we really wanted to build out RAPIDS, one of the things that we expressed to NVIDIA leadership was that we had to build it on open standards, we had to build it on Apache Arrow, because we needed a way to fully utilize the GPUs without moving data back and forth. And all the benefits of Arrow for languages would also work very well for hardware. So we built out RAPIDS, and it was fully based on Arrow. We connected all these things. We started getting these really great end-to-end performance numbers. We had data readers, data frame manipulation, as I said, ML, graph, geospatial, and it was really exciting. But for the longest time we were very much at the lowest level: building kernels, plumbing them into systems, getting people to adopt these things.

[00:07:59.280] – Josh Patterson

And after about five years at NVIDIA, I really wanted to bring this to market in a more succinct fashion. We realized there was this constant pain that people had as builders of these systems. A lot of times you build them to solve a pain point for today, but you don't think about the problems of tomorrow, or what else is going on around you. I mean, it's hard enough to build a great startup, it's hard enough to build a great open source project. You've got to think about the other open source projects, how you can interact with them, integrate with them, share data with them. And standards are really important. Building modular and composable systems allows people to innovate without having to replace. That's really the core of Voltron Data. And so Darren Haas, one of my co-founders, who actually started the company with me, was at Siri before Siri was acquired by Apple, and he's had just an amazing career across GE, Apple, AWS… We started Voltron Data, and then we merged with or acquired a bunch of different companies: Ursa Computing, Wes McKinney's company, and BlazingSQL.

[00:09:10.150] – Josh Patterson

We hired a bunch of amazing talent out of NVIDIA and AWS and Apple, and just a bunch of other companies. And we really formed this amazing company of really gifted engineers. And our co-founder team is just phenomenal to work with. It really thinks about not just building something new, but what it would look like to push standards, to push modularity and composability, and to help people adopt these things so the ecosystem can be more connected, so building new things is easier and new data products ship faster. That's what we do at Voltron Data. We really want to help people design and develop data systems, not systems that are future-proof, but systems that are more able to adapt to the future because of the standards that they adopt today.

[00:10:03.750] – Reena Leone

I mean, one of those things as big as your relationship with Apache Arrow is, it was actually Ibis that kind of got us in contact and talking. I’ve read that it’s the new front door for data analytics and machine learning, which sounds super impressive. Can you tell me a little bit more about Ibis?

[00:10:22.320] – Josh Patterson

Absolutely. So Ibis is a portable Python API. When you think about compute frameworks, you have how people interact with them, the API layer, and then you have the engine. Ibis separates the two. What Ibis does is give you a stable, consistent API that we can bring to other computational engines, whether that's Druid or BigQuery or Snowflake or Impala, dozens of different backends. Users can write code and then run that code on these different backends, whether it's a SQL engine or something like Dask or PySpark. It's quite flexible, and what it's meant to do is accelerate people's time to productivity. A lot of times people have SQL, and the world is quickly adopting Python; it's becoming one of the most popular languages, if not the most popular, especially in data science. And so Ibis is just a really great entryway for people to write code efficiently, run it locally, whether that's with pandas or DuckDB, which is skyrocketing in popularity, or Postgres, or a lot of these other local systems, prototype, and then scale it. A lot of times people would move data out of a system if they wanted to use Python: they would query a subset of their data and then say, all right, I'm going to bring this data locally and then use pandas or something else.
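
As a rough sketch of what that looks like in practice (the table and column names below are invented for illustration), the same Ibis expression can run unchanged on any supported backend; here it executes on Ibis's default local engine, DuckDB:

    import ibis

    # A small in-memory table; against a real warehouse you'd instead do
    # something like con = ibis.postgres.connect(...) and t = con.table("penguins")
    t = ibis.memtable({
        "species": ["adelie", "gentoo", "adelie"],
        "bill_length_mm": [38.8, 47.5, 39.2],
    })

    # One portable expression: group, aggregate, sort
    expr = (
        t.group_by("species")
        .aggregate(avg_bill=t.bill_length_mm.mean())
        .order_by("species")
    )

    print(expr.execute())  # runs on the default local backend (DuckDB)

Pointing the same expression at a different engine is a matter of swapping the connection, not rewriting the query.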

[00:11:58.790] – Josh Patterson

And with Ibis, I can actually express my logic and let that backend system use all of its amazing horsepower and processing capabilities to do it right where the data is, and that's just amazing. It also simplifies a lot of data silo problems. People have numerous different data silo systems, and they might want to unify how they talk to them. They don't want their developers to always be jumping through different APIs, SQL dialects, what have you. Ibis gives them a standardized way to express their data manipulation and run it across various different systems. And we see a lot of hedge funds doing that and talking about how Ibis really simplifies and accelerates their analysts' time to driving business value.

[00:12:50.700] – Reena Leone

Well, you mentioned acceleration, and part of how Ibis got on my radar is that there's an Ibis backend built on Druid that shipped in 4 hours, because it came in as a request on GitHub to the Ibis team. 4 hours to build a Druid backend is very fast.

[00:13:08.430] – Josh Patterson

Well, it's a compliment to Druid. They had a lot of amazing hooks. Their CI/CD was great. And so the team, with a combination of SQLAlchemy and other things, basically got it up and running, as we said, in 4 hours. Most backends take a little bit longer than 4 hours. So, first off, thank you to everyone who contributes to Druid. Y'all have done a phenomenal job. Your Python bindings are great, the SQLAlchemy support too, and a lot of things made this a lot easier and faster, because it's just a really great open source project. But I think the other really cool thing is the adaptability of the Ibis community. Someone's like, hey, I really love Ibis, I love what y'all are doing with these other SQL engines. We use Druid. Could you build a backend? And that's part of what we do at Voltron Data. We work with customers, and there are customers who have very specific needs, and we want to make sure that they have production support, enterprise support, SLAs, and other things. But we also want to make sure that we are supporting the community. And so when people are asking for new backends or other things, we want to make sure that we're delighting our users, both customers and open source users.

[00:14:34.480] – Josh Patterson

And so it was really exciting to see that request and then get it done and see a new user group kind of emerge.

[00:14:41.040] – Reena Leone

Can you tell me a little bit about how they got it done? I mean, I know Druid already had some built-in functionality that allowed it to go quickly, but can we dive into that a little bit?

[00:14:50.090] – Josh Patterson

Yeah, so first off, there's a really great blog about it, and they've gone through it step by step, covering all the different things that they did. But one of the keys was, again, SQLAlchemy and the Python bindings. Those two things really simplified a lot of the work that the team had to do. The Druid documentation for the Python bindings allowed them to quickly get Ibis up and running in a way that could connect to Druid. And then what Ibis does is essentially compile SQL. For the Python backends, like Dask and PySpark, it has a slightly different approach, but for a SQL backend, it's essentially generating SQL. There's nothing hidden; it's not doing some proprietary secret thing. It's an open source project, and it's compiling SQL. Once that SQL is generated, it then runs that SQL on that backend. Recently, we've started to do the same thing with Apache Flink. Normally we talk about distributed systems; Flink is more of a real-time system. Working with the great folks at Claypot, we're showing how we can use Ibis to generate SQL that can hit the Flink backend.
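
To make that compilation step concrete, here's a minimal sketch (the table schema is invented, and no connection is needed) of asking Ibis for the SQL it would generate, rather than executing anything:

    import ibis

    # An "unbound" table: just a name and a schema, no backend attached
    t = ibis.table({"channel": "string", "delta": "int64"}, name="wikipedia")
    expr = t.group_by("channel").aggregate(total_delta=t.delta.sum())

    # Ibis compiles the expression to SQL; a backend would then run it.
    # Recent Ibis versions also accept a dialect argument, e.g.
    # ibis.to_sql(expr, dialect="duckdb"), to target a specific engine.
    print(ibis.to_sql(expr))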

[00:16:12.640] – Josh Patterson

And so 4 hours is really quick. Most of that time was CI/CD, just making sure that we were passing all of the Druid tests. But once Ibis generated SQL, it used SQLAlchemy to target the actual Druid SQL dialect. Everything else basically just worked. And I mean, I think it sounds kind of simple, but it really was that simple in some ways. A lot of times the work is in corner cases, not having great mappings of things, upstreaming some changes, or people not supporting the right Python versions. As Ibis continues to update, we typically follow... not typically, we always follow the latest supported versions of Python. So if someone's on, let's say, Python 3.5 when the community is at 3.10 or 3.11, that's a challenge. Druid was very up to date on all these things, and so it just made it a lot easier to basically get it running, test it out, pass all the tests, and stamp it: okay, this is ready to go.

[00:17:25.920] – Reena Leone

I'm changing our headline to "Druid: it just worked."

[00:17:31.250] – Josh Patterson

It would be a great headline, because it just works. And that's really a testament to all the work that the Druid community has done on their CI/CD and building bridges to other ecosystems. So yeah, I think it'd be a great tagline.

[00:17:49.120] – Reena Leone

Another thing I saw you use Ibis for was actually benchmarking queries when you're looking for, say, a new data warehouse. Coming from the database space, that was very interesting to me. One example I saw was that you could use it when you're investigating whether the speed gains from running your analytics on an OLAP backend are worth the engineering effort it takes to export the data from, say, an OLTP backend. Can you tell me a little bit about using Ibis in this way?

[00:18:18.940] – Josh Patterson

Sure. If you don't want to think about all the nuances between different SQL dialects, and you just want a straightforward way to express some code, especially if your code is already in Python, Ibis is really great at this. With Ibis you can say, all right, I want to do this data manipulation. You can load that data into various different backends, and then Ibis will generate SQL for those backends through different types of compilation methods. You can basically use this as a way to quickly test out new backends and see their performance. We like to encourage people to think about things as LEGO blocks. Every time you want to test out a new system, if you have to remove the LEGO blocks of your API, put on some new API LEGO blocks, and make sure they fit properly, it just slows down your time to evaluation. And so having this completely standardized way helps, because as much as people love SQL, there are SQL dialects with slight differences. Ibis allows people not to have to worry about generating really complex SQL, and it also makes sure the same code can run across numerous backends.
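
Here's a rough sketch of that evaluation pattern; the connection details and table name are placeholders, with DuckDB standing in for whichever backends you're actually comparing:

    import time
    import ibis

    def time_query(con):
        # The same portable expression, whichever backend con points at
        t = con.table("orders")
        expr = t.group_by("region").aggregate(revenue=t.amount.sum())
        start = time.perf_counter()
        expr.execute()
        return time.perf_counter() - start

    duckdb_con = ibis.duckdb.connect("local.ddb")
    print(f"duckdb: {time_query(duckdb_con):.3f}s")
    # Repeat with the other candidates, e.g.:
    # pg_con = ibis.postgres.connect(host="...", user="...", database="...")
    # print(f"postgres: {time_query(pg_con):.3f}s")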

[00:19:30.250] – Josh Patterson

So it just really accelerates your time to evaluation. And we don't like to scare people about the complexity of things, but there are times when three lines of Python can generate 20 to 50 lines of SQL, and that's not uncommon. There are certain things that you can express very succinctly in Python that are a little bit more verbose if you want to do them in SQL, and that's why a lot of data analysts and data scientists enjoy Python. So when you think about the small, nuanced differences between backends, and you're thinking about complex queries that are 50 to 100 lines long, it would take a long time to iterate through those lines of code and make all the modifications to get them to run on every different backend. Ibis just takes care of that for you. These types of standards really allow acceleration of different things, in this case, acceleration of evaluation. And it's similar to Arrow: if you have a pipeline of data and you're Arrow in, Arrow out all through your pipeline, then if you one day want to remove a system and replace it with a new system, and that system also can generate Arrow, you don't have to change the ETL process between those two points.

[00:20:57.670] – Josh Patterson

You can just plug in that new system and give it some instructions so it can generate output that continues your pipeline. And so with this combination of Ibis and Arrow, if I have a backend that supports Arrow and a backend that supports Ibis, you get this really cool feature where I can take the same logic, send it to that backend, and that backend can then just send Arrow data directly to the next source. Think about wanting to upgrade from DuckDB to a distributed system because the workload is too large for my local system and I really need more computation. If both systems can produce Arrow, like DuckDB can, I can immediately just send that Ibis code to the new system, and now that thing is generating the SQL, running it, and sending Arrow to the next point in your pipeline. Everything just kind of works. That's the benefit of these modular, composable designs, and what's really exciting about evaluating things this way. It really allows people to prototype locally and deploy a lot faster, and it just reduces a lot of friction in people's big data journey.
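
As a small, hedged sketch of that Ibis-plus-Arrow handoff (the data is made up, and the default local DuckDB backend stands in for whatever engine sits at that point in the pipeline):

    import ibis

    t = ibis.memtable({"user": ["a", "b", "a"], "latency_ms": [120, 340, 95]})
    expr = t.group_by("user").aggregate(avg_ms=t.latency_ms.mean())

    # Arrow out: the result is a pyarrow.Table that the next system in the
    # pipeline can consume directly, with no serialize/deserialize step
    arrow_table = expr.to_pyarrow()
    print(arrow_table)

Swapping DuckDB for a distributed engine that also speaks Arrow would change the connection line, not the shape of the pipeline.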

[00:22:09.890] – Reena Leone

Well, you mentioned Arrow, and that’s actually what I wanted to talk about next. Well, first of all, congrats on 67 million monthly downloads last year. That is incredible for an open source project. But I wanted to dive into it a little bit more because I queued up Arrow in the beginning. But when we were chatting, you said that you wish that Druid developers utilized Arrow a little bit more. So let’s kind of dive into it. Can you tell me a little bit more about Arrow and then where you think it fits within Druid?

[00:22:40.750] – Josh Patterson

So I know this is a podcast about Druid, and I’m going to talk about a few other data systems.

[00:22:46.010] – Reena Leone

No, it’s all one big happy community family of data analytics.

[00:22:51.510] – Josh Patterson

Absolutely. And I think one of the great examples is really DuckDB. DuckDB has this great relationship with Arrow, and you can do these really great things by extending DuckDB, because I can use Ibis with it, I can use Arrow, and it can connect to a lot of other things. Arrow, I think, is now north of 70 million monthly downloads and growing. It allows this integration, this pipelining of data systems, to be a lot easier, and it does this across languages and hardware as well. So if I want to send things to GPUs, if I want to send things to Go or R, these things get easier with Arrow. One example of this is that recently Snowflake adopted not only Arrow and nanoarrow, which is a more lightweight version of Arrow for connectivity, but also ADBC, Arrow Database Connectivity, and it allows them to power applications across Go, R, Rust, Python, and other languages extremely easily. We've been working with the team at Snowflake on adopting ADBC, and it was really just a continuation of what they've already been doing with Arrow. And they had a really awesome way of describing it.
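
For the curious, here's roughly what ADBC looks like from Python, using the SQLite driver as a stand-in for an analytical backend (this assumes the adbc-driver-sqlite and pyarrow packages; the query is trivial on purpose):

    import adbc_driver_sqlite.dbapi

    with adbc_driver_sqlite.dbapi.connect() as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT 42 AS answer")
            # Results come back as a pyarrow.Table: columnar end to end,
            # instead of the row-by-row ferrying of classic JDBC/ODBC
            table = cur.fetch_arrow_table()

    print(table)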

[00:24:22.660] – Josh Patterson

And so they basically said they added ADBC support for cross-language API support. And they did that with us, Voltron Data, because we help enterprises design and build composable data systems with open standards like ADBC, Arrow, Ibis, Substrait, and more. They have a lot of users across the Java, Go, R, and Python ecosystems, and it's really hard to support all those users equally and give them all these really great benefits without standards. Standards really allow them to do this ubiquitously, across all their different users. And a while back, Streamlit adopted Arrow, and they were like, we deleted 10,000 lines of code, we improved our data movement performance by 15x, and we gained a bunch of new functionality. And that's typically the really awesome thing about things like Arrow and these composable, modular standards: you don't have to reinvent the wheel. By adopting these things, you get all this new functionality, you get all these new ways of connecting to different sources, and you reduce complexity. And JDBC and ODBC definitely have their place in the data analytics stack.

[00:25:40.290] – Josh Patterson

But with ADBC, Arrow Database Connectivity, it really simplifies how we connect to analytical applications. We're just excited about all the companies and open source projects that are adopting ADBC and Arrow today, and we just want to continue to have more and more people adopt it. Because again, you're right, it's one happy data analytics ecosystem. And the more these systems can talk to each other, the better, for me personally, because then we can start building end-user applications like cybersecurity systems.

[00:26:13.330] – Reena Leone

And when you talk about enterprises, they're always using several different technologies, right? There's no one database to rule them all, no one analytics application. Which means the connectivity of systems is even more important. That's why when I say it's one big happy family, we're all part of this data analytics community. If you're talking to a Fortune 500 company, they probably have several of these systems in place if they're utilizing open source frameworks.

[00:26:38.870] – Josh Patterson

Absolutely. And this is why, again, for Ibis, we're working across local systems, distributed systems, and now real-time systems with Flink. We see Arrow adoption across GPUs... you know, it's how tabular datasets are handled in Ray, in Spark. So many people are adopting Arrow, whether it's for ML, ETL, or data analytics. It's really touching so many different parts of the data ecosystem and the data pipeline because it allows this pipelining of systems to be a lot easier. And so we want more people to adopt it. And as I was saying, cybersecurity is hard. It's hard enough. Attackers are getting better every day, and with all these new tools that attackers have, it's getting cheaper and cheaper to do more complex and destructive attacks on enterprises. So enterprises have to continue to get better at cybersecurity. And one way to get better is by doing things more efficiently so you can do more things. These standards help people do more with less, essentially.

[00:27:48.320] – Reena Leone

We've been talking a lot about open source standards, and one thing that is very core to your organization is open source. And I know we've been diving into the technical aspects and how things are set up. As a leader of the company, this is part of your culture. And I love asking this question of leaders and founders in particular: why are open source and open standards so important to you?

[00:28:10.370] – Josh Patterson

I think open standards are becoming more important than open source. They're both important, but they're so important to me because they allow people to innovate faster. Ultimately, when new technologies come in, if you have to rip and replace everything and start over, that's a lot of sunk cost, whereas if you can augment systems and transition to new systems faster and easier, people will welcome more innovation. And technology is not slowing down. In fact, it's accelerating; the pace of AI innovation is staggering. NLP two years ago versus natural language processing today is a night and day difference. And we need a way to adopt machine learning, deep learning, generative AI, all these new things. But we can't just leapfrog a generation of technology and expect all these things to work. You have to bridge these systems together. Spending five years at NVIDIA, one of the things we started seeing from essentially the largest organizations was that they ended up spending more and more time on data processing, ETL, and data management than on machine learning, which is a little bit mind-boggling. A decade ago, if you were doing state-of-the-art machine learning or deep learning, that was 90% of what you did.

[00:29:42.970] – Josh Patterson

That was 90% of the time of your systems. That was 90% of the joules, the flops, the systems' computational time. According to Meta, over the last three years that has shifted to almost 60% of their time being ETL, data management, and data preprocessing. And they think that problem is going to get 13 times worse in the next three years. So there's clearly innovation happening on making things faster, but pipelining is getting harder. These standards allow us to pipeline these systems a lot better. So if I can go from a pipeline of just doing data analytics to understand a problem, to predictive modeling with your more traditional models, then to something like GBDTs (gradient-boosted decision trees) or random forests to add more fidelity or a more nuanced model, and then on to machine learning and deep learning, if the inputs to all these different modeling frameworks are the same, that transition happens more gracefully. But if I have to go and redo my ETL pipeline to go from scikit-learn to XGBoost to PyTorch, that's very cumbersome.

[00:31:00.520] – Josh Patterson

It's difficult. Then I have to ask, do I really want to do this project or not? Do I really want to adopt this new innovation? Is that model lift really worth it? And of course, as you adopt new technologies, your ETL refines, and you might want to compose your systems a little bit differently through that machine learning process. The faster you can do those things, the better your model is going to be; the faster you can iterate across your preprocessing, the more things you can try out. And one aspect of that speed and performance is data connectivity: being able to pipeline these systems together without paying those serialization and deserialization costs, and being able to use things like Ibis to simplify how you talk to these different systems. That provides some acceleration of its own. So as companies are trying to build and maintain these very complex systems, adopting open standards really allows them to move into the future faster, because there's no future-proofing a space changing this fast. We're always going to be adding new systems, refining workloads, rebalancing which systems do what, and these standards allow that to happen more gracefully.
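
To make that concrete, here's an illustrative sketch (synthetic data, and it assumes scikit-learn and XGBoost are installed) of one prepared table feeding two different modeling frameworks without reworking the ETL step in between:

    import numpy as np
    import pandas as pd
    import xgboost as xgb
    from sklearn.ensemble import RandomForestClassifier

    # One preprocessed table serves every framework downstream
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "x1": rng.random(200),
        "x2": rng.random(200),
        "y": rng.integers(0, 2, 200),
    })
    X, y = df[["x1", "x2"]], df["y"]

    # Random forest and gradient-boosted trees, identical inputs
    RandomForestClassifier(n_estimators=50).fit(X, y)
    xgb.train({"objective": "binary:logistic"},
              xgb.DMatrix(X, label=y), num_boost_round=20)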

[00:32:18.270] – Reena Leone

Yeah, because you want to spend your time building analytics applications. You want to spend more time investing in machine learning, to your point, than just making sure that everything connects and flows properly.

[00:32:30.100] – Josh Patterson

Absolutely.

[00:32:30.960] – Reena Leone

That’s not the fun stuff. That’s just the stuff that you need to do.

[00:32:34.400] – Josh Patterson

Right. And rewriting code because you want to move systems, or transcribing code across languages, is just not fun. It's not what data scientists want to do. It's not what MLOps engineers want to do. And so the more we can make their jobs easier, the more they can do the things that drive business value, not the things that keep the lights on and provide maintenance.

[00:32:59.870] – Reena Leone

Yeah, exactly. I mean, that’s kind of like one of the things I like to highlight on this show, is people doing cool things with analytics. And I feel like this technology is enabling people to do cool stuff.

[00:33:11.120] – Josh Patterson

Absolutely.

[00:33:12.270] – Reena Leone

Awesome. Well, Josh, thank you so much for joining us today. This has been absolutely fascinating. I wish I had an open source bingo card for all the technologies that are connected and that you guys are working with. Super amazing stuff that you got here.

[00:33:29.330] – Josh Patterson

Thank you so much. I really enjoyed it. And if there's anything else we can do to help the Druid community, open up a GitHub issue on any of the projects, or just shoot us an email at Voltron Data. We're always excited to build more bridges across the data analytics ecosystem.

[00:33:48.730] – Reena Leone

Yes. Let's build the ecosystem. Fantastic. If you'd like to learn more about Apache Arrow, Ibis, or anything we talked about today, please visit voltrondata.com. If you want to learn a little bit more about Apache Druid, please visit druid.apache.org. And if you want to learn about what we're doing here at Imply, please visit imply.io. Until next time, keep it real.
