Democratizing Text Analysis? A Critical Evaluation of Google's Gemini Pro for Sentiment Classification

GitHub Repository
sentiment-analysis.ai project preview screenshot

Project Overview

Technologies Used:


R Programming Language

Role: Technical Writer & Statistical Programmer

Timeline: March 2024 - May 2024

A 15-page paper evaluating Google's Gemini Pro model on sentiment classification tasks through statistical programming in R.

Design & Development Process

After reading Steve Rathje and colleagues' paper "GPT is an effective tool for multilingual psychological text analysis," our group decided to pursue a similar project and contribute further research on whether LLMs other than ChatGPT could match its performance on sentiment analysis. I began by searching publicly available datasets for human-labeled ground truth to benchmark the model against, and found the Appen dataset, in which crowdworkers classified tweets into 13 sentiment categories. With the dataset chosen, I defined the research methodology:

  • Systematic sampling (every 10th tweet)
  • Gemini 1.0 Pro API via R
  • Temperature minimized for consistency
  • Threshold: minimum 400 queries set
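
The sampling step above can be sketched in base R. The data frame and column names below are illustrative, not the actual Appen schema:

```r
# Illustrative stand-in for the Appen tweet dataset
# (column names and labels are made up for this sketch).
tweets <- data.frame(
  text      = paste("tweet", 1:4000),
  sentiment = sample(c("sadness", "worry", "anger", "neutral"),
                     4000, replace = TRUE)
)

# Systematic sampling: take every 10th tweet, starting from row 1.
sampled <- tweets[seq(1, nrow(tweets), by = 10), ]

nrow(sampled)  # 400 rows, meeting the minimum-query threshold
```

Each sampled tweet would then be sent to the Gemini 1.0 Pro API with the temperature parameter minimized, so repeated runs return near-deterministic labels.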

As a team of three, we each contributed at least one statistical analysis, covering:

  • Confusion matrix for accuracy measurement
  • Chi-squared test for statistical significance
  • Cohen’s Kappa for adjusted accuracy given imbalanced data
  • R packages: dplyr, tidyr, ggplot2, caret
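
The headline metrics from these analyses (accuracy, No Information Rate, and Cohen's kappa) can be reproduced in base R; a minimal sketch on toy label vectors, which are made up for illustration and are not our data:

```r
# Toy ground-truth and predicted labels (illustrative only).
truth <- c("sadness", "worry", "anger", "worry", "worry", "sadness", "anger", "worry")
pred  <- c("worry",   "worry", "anger", "sadness", "worry", "sadness", "worry", "worry")

cm <- table(Predicted = pred, Actual = truth)   # confusion matrix

accuracy <- sum(diag(cm)) / sum(cm)             # overall accuracy
nir      <- max(table(truth)) / length(truth)   # No Information Rate baseline

# Cohen's kappa: observed agreement adjusted for chance agreement,
# which matters when the sentiment classes are imbalanced.
po    <- accuracy
pe    <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2
kappa <- (po - pe) / (1 - pe)
```

In practice, caret's `confusionMatrix()` reports all three in one call; computing them by hand here just makes the definitions explicit.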

Results & Learning

After running our statistical analyses, we were surprised to find an overall accuracy of only 26.19%, barely above the No Information Rate (NIR) baseline, i.e., the accuracy obtained by always predicting the dataset's most common sentiment. Although this result was statistically distinguishable from random guessing (p < 0.001), the margin over the NIR was so small that the model offers minimal practical value beyond simply predicting the most common sentiment for every tweet. The error patterns suggest the model systematically confuses similar emotions, such as sadness, worry, and anger. Overall, Gemini 1.0 Pro fell far short of GPT-4's 60-80%+ accuracy in comparable studies.
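
For context on the significance comparison: the accuracy-versus-NIR check that caret reports can be run directly as a one-sided exact binomial test. The counts below are illustrative stand-ins, not our exact query numbers:

```r
# One-sided binomial test: is the observed accuracy greater than the NIR?
# All values below are hypothetical, chosen only to illustrate the test.
n_correct <- 110   # hypothetical number of correctly classified tweets
n_total   <- 420   # hypothetical total number of queries
nir       <- 0.21  # hypothetical No Information Rate

result <- binom.test(n_correct, n_total, p = nir, alternative = "greater")
result$p.value  # small p-value: accuracy is statistically above the NIR
```

A significant p-value here only rules out chance-level performance; as our results show, it says nothing about whether the margin over the baseline is practically useful.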