Decoding Emotions: Leveraging ChatGPT and Apache Druid for Sentiment Analysis
Whether you’re a data engineer, data scientist, technology enthusiast, or just a person on the Internet, you’ve heard about ChatGPT. But did you know that you can use Apache Druid and ChatGPT in tandem to achieve awesome things? Learn how these two cutting-edge technologies, which are revolutionizing the world of real-time analytics and natural language processing, go together.
Did you know you can combine ChatGPT with Apache Druid for sentiment analysis? Sentiment analysis typically involves handling substantial amounts of data, encompassing social media posts, reviews, and customer feedback. By leveraging Apache Druid, this data can be efficiently processed, aggregated, and analyzed at scale, unveiling deeper insights and identifying patterns and trends. These capabilities are particularly valuable for businesses seeking to swiftly respond to shifts in customer sentiment, enabling timely actions and informed decision-making.
Rick Jacobs, a Senior Technical Evangelist at Imply, discusses how he used a combination of Apache Druid and ChatGPT for advanced data analytics. He explains how ChatGPT’s ability to analyze tweets and determine sentiment and user types pairs well with the speed of Druid for real-time analysis, making it ideal for monitoring tweets and detecting immediate events or threats. Rick will also give us a sneak peek at his next project, which involves using Druid and AI to analyze churn and send retention messages based on the sentiment of product reviews. This episode showcases the practical applications and benefits of Druid’s speed and its seamless compatibility with other data sources, such as Apache Kafka.
Listen to the episode to learn:
- How to use ChatGPT and Apache Druid for sentiment analysis
- Practical, real-world use cases for real-time sentiment analysis
- Why it’s important for AI systems to align with human intentions and values
Learn more
- How to Build a Sentiment Analysis Application with ChatGPT and Druid
- Wow, that was easy – Up and running with Apache Druid
- Documentation: Ingestion in Apache Druid
About the guest
Rick Jacobs is a Senior Technical Product Marketing Manager at Imply. His varied background includes experience at IBM, Cloudera, and Couchbase. He has over 20 years of technology experience garnered from serving in development, consulting, data science, sales engineering, and other roles. He holds several academic degrees including an MS in Computational Science from George Mason University. When not working on technology, Rick is trying to learn Spanish and pursuing his dream of becoming a beach bum.
Transcript
[00:00:00.410] – Reena Leone
Welcome to Tales at Scale, a podcast that cracks open the world of analytics projects. I’m your host, Reena from Imply, and I’m here to bring you stories from developers doing cool things with Apache Druid, real-time data and analytics, but way beyond your basic BI. I’m talking about analytics applications that are taking data and insights to a whole new level. And unless you were in cryo-sleep on a mission in deep space, you’ve probably heard about OpenAI’s Chat GPT. Generative pre-trained transformers like Chat GPT, for example, are artificial intelligence models that use deep learning techniques to generate human-like text. And honestly, they’re getting good, like scary good at it. GPT models are based on a deep neural network architecture called the Transformer, which is known for its ability to handle sequential data effectively. So what does this all have to do with data analytics and/or Apache Druid? Well, you can combine a trained natural language processing model with Apache Druid for sentiment analysis. And this isn’t just hype. I’m joined by someone who did that very thing, Rick Jacobs, Senior Technical Evangelist here at Imply. Rick, welcome to the show.
[00:01:04.710] – Rick Jacobs
Glad to be here. Thanks for having me.
[00:01:06.560] – Reena Leone
So I like to always kick off with asking my guests a little bit about themselves and how they got to where they are today. So can you tell me a little bit about your journey?
[00:01:15.930] – Rick Jacobs
Sure. So I’ve always been curious about technology; as a kid, I was a curious kid. I did get the opportunity to do some programming back in high school, so I’m dating myself a little bit, but back then I think it was an Apple II, or it might have been a Macintosh, but that’s what we were using. And I was coding in a language called BASIC. And as the name suggests, it was a very basic language.
[00:01:40.190] – Reena Leone
You have worked as like a data scientist and a data engineer. Can you tell me a little bit more about that?
[00:01:45.220] – Rick Jacobs
Yeah, so I started out in development. I got a Master’s in Computational Science and started doing some development work, and that was pretty cool. And I moved from there into systems engineering, again, more on the development side. And then I became an SE, so a sales engineer, which means I’m an engineer, but I’m working with the sales teams to try to generate revenue. That’s a revenue-generating function. Then I moved over to marketing. So now I’m a technical marketing manager. I do things like blogs, which I think is what we’re going to discuss today, and those types of activities.
[00:02:21.780] – Reena Leone
But I feel like you’re so much more than that because you are a tinkerer, right? You are in there, you are figuring out different ways to do things, how stuff works, you’re doing demos. And actually, the reason we wanted to do this episode is because you have been playing around with Chat GPT and Apache Druid, which is what this show is all about. Before we get into that, when did you first start messing around with Apache Druid? Were you familiar with it before you were at Imply, or is this a new thing for you?
[00:02:54.120] – Rick Jacobs
So I had heard of Druid previously, but Druid is a high-performance analytical database, and in my past duties, I didn’t necessarily have to utilize Druid. So I heard of the Apache Druid project. I did look at it sparingly, but I hadn’t started really tinkering with it until very early this year, so, like, in the beginning of this year.
[00:03:14.900] – Reena Leone
And AI has been a hot topic, by the way. I feel like I say hot topic on this show all the time, but here’s another hot topic. And with the introduction of Chat GPT, it’s never been more top of mind for folks, not just in our tech community, but kind of everywhere. You’ve been exploring how Druid can work with Chat GPT, and that might not be obvious to some folks. So can you talk to me a little bit about how you got started working with Druid and Chat GPT together?
[00:03:44.970] – Rick Jacobs
Sure. So I did do some data science work back in my development days. I did quite a bit of that, utilizing various platforms, some open source, some not. So with Chat GPT becoming available and hearing so much about it, what I thought of doing is utilizing it within the Druid environment, to try to create some applications that use Chat GPT as the AI model and then Druid as the back-end database.
[00:04:12.230] – Reena Leone
So can you give me some practical use cases of what you came up with?
[00:04:16.340] – Rick Jacobs
Yeah, tons of use cases. So the one I came up with was more social media analysis. So we’re using Chat GPT to analyze tweets. So what we do is we connect to Twitter, get some tweets back from Twitter based on certain criteria. So I think the criteria I used for this one was actually Chat GPT. So I was looking for tweets on Chat GPT, took those tweets and then used Chat GPT’s sentiment analysis functions to determine the type of user that would send that tweet, the sentiment of the tweet, et cetera. Then I utilized that knowledge from within Druid.
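For readers following along, here is a minimal sketch of the kind of classification step Rick describes, assuming the `openai` Python package and an API key in the environment. The prompt, model name, and label format are illustrative stand-ins rather than the exact code from his project; his blog post in the Learn more section above walks through the real snippets.

```python
# Sketch: ask ChatGPT to label a tweet's sentiment and likely author type.
# Assumes the `openai` package (v1+) and OPENAI_API_KEY in the environment;
# the prompt and model below are illustrative, not Rick's exact setup.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def classify_tweet(tweet_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You label tweets. Reply with JSON containing "
                        "'sentiment' (positive/negative/neutral) and 'user_type'."},
            {"role": "user", "content": tweet_text},
        ],
    )
    return response.choices[0].message.content

print(classify_tweet("ChatGPT just wrote my unit tests for me. Wild."))
```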
[00:04:52.710] – Reena Leone
That’s kind of meta, using Chat GPT to do sentiment analysis on itself basically!
[00:04:58.700] – Rick Jacobs
Yeah, it was pretty interesting. I thought that would be cute. And it does seem to have a certain bias towards itself. I didn’t get into a whole lot of that, but as I was going through the data, it seemed to like itself quite a bit.
[00:05:12.820] – Reena Leone
I don’t know if that’s, like, cool or scary. Okay, so the data set that you used was tweets from Twitter, correct?
[00:05:20.570] – Rick Jacobs
Yes, ma’am. So I connected to Twitter, got all the recent tweets about Chat GPT, and then sent those tweets to Chat GPT and asked it a range of questions regarding those tweets.
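As a rough illustration of that first step, here is a sketch of pulling recent tweets from the Twitter v2 recent-search endpoint, assuming the `tweepy` package and a valid bearer token. The query string and requested fields are assumptions, not Rick's exact configuration.

```python
# Sketch: pull recent tweets mentioning ChatGPT with the Twitter v2 API.
# Assumes the `tweepy` package and a bearer token; query and fields are illustrative.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

response = client.search_recent_tweets(
    query="ChatGPT -is:retweet lang:en",
    tweet_fields=["created_at", "author_id"],
    max_results=100,
)

tweets = [{"id": t.id, "text": t.text, "created_at": str(t.created_at)}
          for t in (response.data or [])]
print(f"Fetched {len(tweets)} tweets")
```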
[00:05:30.530] – Reena Leone
What did you use to ingest that data into Druid?
[00:05:33.890] – Rick Jacobs
So Druid’s got several ingestion options, several ways to ingest data into Druid. So I used an ingestion spec. So you can create an ingestion spec in Druid and then execute that ingestion spec. So I did that, but I did that automatically from code. So create the spec and then ingest the spec from code.
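Here is a minimal sketch of what submitting a native batch ingestion spec from code can look like, posting it to Druid's task endpoint. The datasource name, columns, inline sample row, and router URL are assumptions for illustration; the spec Rick generates from code will differ.

```python
# Sketch: build a native batch ingestion spec and submit it to Druid's
# task endpoint. Datasource name, columns, and router URL are assumptions.
import requests

DRUID_TASK_ENDPOINT = "http://localhost:8888/druid/indexer/v1/task"

rows = '{"timestamp": "2023-05-01T00:00:00Z", "text": "Love ChatGPT", "sentiment": "positive"}\n'

ingestion_spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "inline", "data": rows},
            "inputFormat": {"type": "json"},
        },
        "dataSchema": {
            "dataSource": "chatgpt_tweets",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["text", "sentiment"]},
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}

resp = requests.post(DRUID_TASK_ENDPOINT, json=ingestion_spec)
resp.raise_for_status()
print("Submitted task:", resp.json()["task"])
```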
[00:05:53.270] – Reena Leone
And how did you plot the distribution?
[00:05:56.060] – Rick Jacobs
Plotting the distribution was fairly easy. I used a library called Matplotlib. It’s a very popular data visualization library, so it’s pretty well documented. It can produce different types of graphs. For this one, I just did a simple pie chart.
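A quick sketch of that plotting step, assuming the sentiment labels have already been collected; the example labels and counts here are placeholders.

```python
# Sketch: plot the sentiment distribution as a pie chart with Matplotlib.
# The sentiment labels below are made-up placeholders.
from collections import Counter
import matplotlib.pyplot as plt

sentiments = ["positive", "positive", "neutral", "negative", "positive"]
counts = Counter(sentiments)

plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.1f%%")
plt.title("Tweet sentiment distribution")
plt.show()
```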
[00:06:13.520] – Reena Leone
It’s kind of funny when you read things out loud like that or talk about them out loud, because code snippets and database libraries are meant to be seen on a screen. And for those who are listening, Rick actually has a fantastic blog post about this very thing that has code snippets and everything, so you can see how he did it. But I want to go back a little bit and talk about why you chose Druid, aside from our own bias towards the technology, for this scenario. Why is Druid a good choice to do sentiment analysis with Chat GPT?
[00:06:50.380] – Rick Jacobs
Yeah, so the major edge that Druid has is its speed. So subsecond speed for just about any query. And that’s really why I utilized Druid for this particular situation: I’m doing analysis, Druid is an analytical database, and I need it to be fast. So in the real world, if you’re deploying something like this, trying to think how to put this nicely, if you’re an agency, whether that be government or private, and you’re trying to monitor tweets, you need the results of your analysis quickly. You can’t wait till tomorrow to get the analysis because, let’s say you’re the NYPD, you’re trying to look out for bad guys. Knowing that the bad guy is going to strike tomorrow doesn’t help when he strikes today, right? So if someone tweets something about it, you need to catch that immediately, which is one of the use cases for something like this, by the way. So in that situation, time is of the essence. Time is very important. So, to answer your question directly, I used Druid mainly because of the speed.
[00:07:47.390] – Reena Leone
Actually, that is kind of an incredible example. A lot of times we’re using ad tech and IoT, kind of less dire examples, but that could be incredibly important. Sticking with Druid, what makes Druid a standout in real-time environments? You mentioned speed. How does it integrate with other data sources or databases and streaming technologies like Kafka or Kinesis?
[00:08:19.320] – Rick Jacobs
Yeah, that’s important too. So, again, it’s important to be able to interact with other streaming services, and it’s also important to be able to utilize the benefits of speed, like I mentioned before. Druid has a seamless interaction with Kafka. So you don’t have to do any connection. You can just use the Druid UI, specify the topic you’re trying to ingest from, and that’s pretty much all you need to get Kafka data ingested into Druid. That’s obviously a big benefit. In my situation, because I was ingesting tweets, I didn’t ingest the tweets directly through Kafka. I could have done that, but I wanted to show how you could ingest batch data into Druid. So I collected the tweets and ingested them as a batch, versus ingesting them as a stream directly from a service like Kafka.
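For reference, what the Druid console generates for a Kafka source is a supervisor spec, which can also be submitted from code. Below is a hedged sketch of that; the broker address, topic name, and schema are assumptions, not part of Rick's project.

```python
# Sketch: the kind of Kafka supervisor spec the Druid console generates,
# submitted here from code. Broker, topic, and schema are assumptions.
import requests

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "topic": "tweets",
            "inputFormat": {"type": "json"},
        },
        "dataSchema": {
            "dataSource": "chatgpt_tweets_stream",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["text", "sentiment"]},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

resp = requests.post("http://localhost:8888/druid/indexer/v1/supervisor",
                     json=supervisor_spec)
resp.raise_for_status()
print("Supervisor started:", resp.json())
```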
[00:09:08.250] – Reena Leone
Okay, so you have options. The best of both worlds is another thing we talk about a lot on the show: Druid is great for batch but also for streaming. So whatever way you’re dealing with data is fine.
[00:09:22.690] – Rick Jacobs
Yeah, I just wanted to add the reason to save the data. Generally, what you do in situations like this, again, let’s use the NYPD: if you’re trying to keep tabs on what people are saying on Twitter, for example, you may save that data down as a CSV file because other analysts are going to use it in that format. So you have other analysts using the CSV format to do what they do. And then in a situation like this, somebody like me who is more technical is using it in CSV format, but uploading it to a database so that I can have that persistent data, which means I can check it against other tweets that come in, say, tomorrow. So I can check today’s tweets against tomorrow’s tweets and test, for example, if the rhetoric is getting more violent, let’s say
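Once the tweets persist in Druid, that day-over-day comparison can be a simple query against Druid's SQL endpoint. A sketch, reusing the hypothetical datasource and column names from the earlier snippets:

```python
# Sketch: track how sentiment shifts day over day once tweets persist in Druid.
# The datasource and columns match the assumptions in the earlier snippets.
import requests

sql = """
SELECT TIME_FLOOR(__time, 'P1D') AS "day",
       COUNT(*) FILTER (WHERE sentiment = 'negative') AS negative_tweets,
       COUNT(*) AS total_tweets
FROM chatgpt_tweets
GROUP BY 1
ORDER BY 1
"""

resp = requests.post("http://localhost:8888/druid/v2/sql", json={"query": sql})
resp.raise_for_status()
for row in resp.json():
    print(row)
```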
[00:10:07.570] – Reena Leone
I was going to say inflammatory.
[00:10:09.240] – Rick Jacobs
That’s a great word. You can compare them; it’s called regression analysis in the big data world. So you can do some regression analysis in terms of how these tweets are changing and how they’re becoming more inflammatory, as you said. I hate to keep using this NYPD example, but I think it’s one that most people understand.
[00:10:31.960] – Reena Leone
Yeah, I mean, especially in today’s world. I wish most people didn’t understand that, but that’s where we are. Okay, so are there any other projects you’re working on with Druid and GPTs, or any other AI-related projects, right now?
[00:10:52.580] – Rick Jacobs
Yeah, so one that I’m working on is pretty interesting. It’s similar to the previous one in terms of needing the responses quickly. So again, I’m using Druid for that. I’m probably going to use Chat GPT as the model, as the model that we ask our questions, but I can replace Chat GPT with another AI model if I feel the need. But this one is doing churn analysis. So basically you enter a review of a product, let’s say. So think of Amazon. You’re reviewing a product and you enter your review, and the model senses whether that review is positive or negative. And based on the level of negativity, let’s say, it sends you a retention message. So it’ll send you an email to try to retain you if the model determines that you are a churn risk. So if it determines that you might leave, we automatically send you an email to try to retain you, a 20% offer or something like that. Again, speed is very important. You want this person to hit enter on sending that review, and then, before they can blink, the email is in their inbox.
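Here is a hedged sketch of that review-to-retention flow. The model call, the "negative" threshold, and the send_email helper are illustrative placeholders, not Rick's implementation.

```python
# Sketch of the churn-analysis flow described above: score a review with
# ChatGPT and, if it reads as a churn risk, fire off a retention offer.
# The model call, threshold, and send_email helper are illustrative stand-ins.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def review_sentiment(review: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Classify this product review as positive, neutral, "
                        "or negative. Reply with one word."},
            {"role": "user", "content": review},
        ],
    )
    return resp.choices[0].message.content.strip().lower()

def send_email(address: str, body: str) -> None:
    # Placeholder: wire up SMTP or an email service here.
    print(f"Sending to {address}: {body}")

def handle_review(customer_email: str, review: str) -> None:
    if review_sentiment(review) == "negative":
        send_email(customer_email,
                   "Sorry to hear that! Here's 20% off your next order.")

handle_review("customer@example.com", "The product broke after two days.")
```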
[00:12:00.270] – Rick Jacobs
That’s kind of what I’m working on currently.
[00:12:02.870] – Reena Leone
I could have used that in my previous life when I did social commerce for Sony and managed their review platform. That would have made my life so much easier.
[00:12:10.750] – Rick Jacobs
Yeah, lots of use cases for this one also. I mean, this particular one I’m working on is churn analysis and then retention, but there are medical use cases where you want the doctor to have his information extremely readily available. So again, a patient comes in, the patient has certain illnesses, certain characteristics that might suggest a particular recommended course of treatment, prescription, et cetera. The doctor can enter that information into his computer and then he gets an immediate recommendation. That’s something that’s being worked on. I remember one of the projects that I worked on in the past was something very similar to that. I’m not sure how we’ve progressed technically in that particular use case, but it’s another one where speed is very important and it’s important to be able to analyze that data very quickly.
[00:13:04.650] – Reena Leone
You bring up a good point when we’re talking about safety and security, right, and also health care. These are both very important examples. As AI systems become more and more a part of our everyday life, how important is it to make sure that they are aligned with the best human intentions and values?
[00:13:26.910] – Rick Jacobs
Yes, that’s a question I’ve seen come up and mused on quite a bit. From my perspective, there is a danger of AI becoming self-aware and starting to make decisions that are in its best interest. But I think the main danger is having these technologies fall into bad hands, right? So you have somebody with some programming skills, might be the kid next door in his basement, and he’s using an AI model. By the way, there are AI models available for free. So one of them is GPT4All. And then there are things like Chat GPT that you have to pay for depending on how much you use it. But the point is, it’s not extremely difficult to get your hands on an AI model. So a kid can get his hands on an AI model and start developing applications that he thinks are funny but we might think are dangerous.
[00:14:16.790] – Reena Leone
It can even be unintentional, right? So in my previous life, I talked a lot about ethical and responsible AI, and one of the key things was eliminating bias. So someone may be using a data set that’s highly biased, but not realize it, because the machine only knows what it’s given, right? And so it’s just going to run off of the data that it’s presented. We see it sometimes. You even mentioned Chat GPT kind of favoring itself a little bit, but it kind of depends on what you’re feeding it in the first place or where it’s pulling from. And I don’t know if AI is smart enough to tell what’s real information and what’s fake information and where biases may be. So that’s another concern.
[00:15:01.930] – Rick Jacobs
Yeah, that’s actually a very big concern, something we spent a lot of time on in college, so I was doing this stuff in college too. Back then, AI models weren’t where they are now, right? They were just starting to learn stuff and becoming genetic-type networks. But bias is certainly an issue, because if the AI learns on bad data, as you mentioned before, it learns the bias that’s within the data set. How we addressed it back in those days is we kind of hard-coded around it. So if we noticed that this bias was happening, we would adjust the results that the AI was getting to manage for bias. So we might weigh certain parameters less, for example, because we know those parameters have some bias implicit in them. I am not sure if that’s what they do with models like Chat GPT, but that’s how we tended to handle that type of stuff back years ago. I don’t want to say how many years, but years ago.
[00:15:57.800] – Reena Leone
No, you don’t have to say how many years, but at least people have been sort of working on that problem. But I do feel like AI is moving so fast that we haven’t really solved for that just yet.
[00:16:11.610] – Rick Jacobs
It’s a known problem. So, again, I’d have to go check and see how Chat GPT handles it. But it’s a known problem. I’m sure it’s something they found a way to manage.
[00:16:21.660] – Reena Leone
One thing that you could do if you want to work with Apache Druid and Chat GPT together is create your own data set, right? If you just want to tinker with it and figure out how it works ahead of time. I know that Helmar Becker, who has been on the show, creates his own data set of flight data. So that might be a way, if you want to avoid that issue, to just play around with the two technologies.
[00:16:44.950] – Rick Jacobs
Yeah, I mean, creating your own data set is good, but the real issue with these models is the models are already trained. Chat GPT is already trained, so it’s trained on whatever data set it’s trained on. And you made a good point where hopefully that data set that it’s trained on is not biased. So creating your own data is helpful because you have an idea of what’s within the set that you created. But the whole idea is to train it initially on data that’s not biased in the first place.
[00:17:16.300] – Reena Leone
Are there any other use cases where you can use Chat GPT and Druid?
[00:17:20.280] – Rick Jacobs
Tons. So we talked about the social media one, we talked about the NYPD one. We talked about a data set full of patient information gathered by doctors, where the doctors are looking for a recommendation. There’s also brand monitoring, similar to the churn one I mentioned earlier, where you’re trying to monitor what people are saying about your brand. So Twitter could be a good way of doing that. You’re monitoring what people are saying on Twitter about Druid, for example, and then based on what they’re saying on Twitter about Druid, you might make certain decisions. So the use cases are almost endless. There are a lot of situations where you want additional information that models like Chat GPT can provide.
[00:18:03.360] – Reena Leone
For sure, and I feel like the speed at which you need to deal with data is only going to increase. And that’s actually one of the benefits of Druid, as you mentioned: it is super fast and it can handle, well, we always talk about, like, petabytes of data. You don’t have to have petabytes of data, but if you do have petabytes of data, it can handle that for you, right? Well, Rick, I feel like we covered a lot of examples here on Chat GPT and Apache Druid, and hopefully we don’t see AI become self-aware anytime soon, for the sake of humanity. But thank you so much for joining me today. This has been amazing. And as a reminder, Rick has an awesome blog post with code snippets if you want to check that out, if you’re interested in doing something similar with sentiment analysis.
[00:18:56.740] – Reena Leone
And if you would like to learn more about anything we talked about on the show today, including Apache Druid, please visit druid.apache.org or imply.io.
[00:19:08.030] – Reena Leone
Rick, thanks again for joining me.
[00:19:10.360] – Rick Jacobs
Thanks for having me. It was a pleasure.
[00:19:12.040] – Reena Leone
Until next time, folks. Keep it real.
[00:19:14.440] – Rick Jacobs
Ciao.