
Technologies Used:
R Programming Language
Role: Technical Writer & Statistical Programmer
Timeline: March 2024 - May 2024
A 15-page paper evaluating Google's Gemini Pro model on sentiment classification tasks, using statistical programming in R.
After reading Steve Rathje and colleagues’ paper “GPT is an effective tool for multilingual psychological text analysis,” our group came together to pursue a similar project and contribute further research on whether LLMs other than ChatGPT could match its performance on sentiment analysis. I first searched publicly available datasets that could benchmark the AI against human-labeled ground truth and found the Appen dataset, in which crowdworkers classified tweets into 13 sentiment categories. With this 13-way classification task defined, I began by laying out the research methodology.
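Below is a minimal sketch of how a single tweet could be sent to Gemini Pro from R, assuming the httr2 package and an API key stored in the GEMINI_API_KEY environment variable; the classify_tweet helper and the prompt wording are hypothetical illustrations, not our exact pipeline.

    library(httr2)

    # Hypothetical helper: label one tweet with Gemini Pro 1.0 via the
    # public generateContent REST endpoint.
    classify_tweet <- function(tweet, labels,
                               api_key = Sys.getenv("GEMINI_API_KEY")) {
      prompt <- paste0(
        "Classify the sentiment of the following tweet. ",
        "Respond with exactly one of: ",
        paste(labels, collapse = ", "), ".\nTweet: ", tweet
      )
      resp <- request("https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent") |>
        req_url_query(key = api_key) |>
        req_body_json(list(
          contents = list(list(parts = list(list(text = prompt))))
        )) |>
        req_perform() |>
        resp_body_json()
      # The generated text sits in the first candidate's first part.
      trimws(resp$candidates[[1]]$content$parts[[1]]$text)
    }

    # Usage: pass the dataset's 13 sentiment labels, e.g. (assuming a
    # tweets data frame with a sentiment column):
    # classify_tweet("Ugh, Monday again...", labels = unique(tweets$sentiment))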
As a team of three, we each contributed at least one statistical analysis to the evaluation.
After running our statistical analyses, we were shocked to find an overall accuracy of only 26.19%, barely above the No Information Rate (the accuracy obtained by always predicting the dataset’s most common sentiment). Although the difference was statistically significant (p < 0.001), Gemini exceeded the NIR baseline by so little that it offers minimal practical value over simply predicting the most common sentiment for every tweet. Examining the misclassifications revealed that the model systematically confuses similar emotions, such as sadness, worry, and anger. Overall, Gemini Pro 1.0 performed far worse than GPT-4, which reached 60-80%+ accuracy in comparable studies.
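For context, the headline figures above (accuracy, the No Information Rate, and the p-value for accuracy exceeding the NIR) are the quantities that caret’s confusionMatrix() reports; the sketch below shows one way to compute them in R, with gemini_labels and human_labels standing in for our prediction and ground-truth vectors.

    library(caret)

    # Assumed inputs: gemini_labels (model predictions) and human_labels
    # (crowdworker ground truth), character vectors over the 13 sentiments.
    sentiment_levels <- sort(unique(human_labels))
    pred  <- factor(gemini_labels, levels = sentiment_levels)
    truth <- factor(human_labels,  levels = sentiment_levels)

    cm <- confusionMatrix(data = pred, reference = truth)

    cm$overall["Accuracy"]        # overall agreement with human labels
    cm$overall["AccuracyNull"]    # No Information Rate (always guess modal class)
    cm$overall["AccuracyPValue"]  # one-sided binomial test of Accuracy > NIR
    cm$table                      # 13 x 13 matrix exposing sadness/worry/anger mix-ups

Sharing factor levels between the two vectors ensures every sentiment appears in the 13 x 13 table even if Gemini never predicts it.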