This project explores text classification with two approaches: traditional machine learning models and large language models accessed through the Google Gemini API. It aims to classify tweets as either relating to a natural disaster (e.g., earthquake, flood, hurricane) or as non-disaster-related content.
- Data: The project uses the "Disaster Tweet Corpus 2020," a dataset of human-labeled tweets covering various disaster types, with an equal split of disaster-related and non-disaster-related tweets.
- Preprocessing: Tweets are extensively cleaned and normalized, including removing URLs, special characters, and stopwords, followed by tokenization and lemmatization.
- Baseline Evaluation: A Naive Bayes classifier with a Bag-of-Words (BoW) approach is implemented as the baseline, achieving a macro-averaged F1-score of 0.90.
- Improved Models:
  - Naive Bayes with TF-IDF: Incorporates TF-IDF vectorization but slightly underperforms compared to the baseline.
  - Support Vector Classifier (SVC): Achieves significant improvement using TF-IDF preprocessing.
  - Hyperparameter-Tuned SVC: Further optimization using grid search yields the best performance with a macro-averaged F1-score of 0.98.
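The cleaning steps and the Bag-of-Words Naive Bayes baseline described above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the stopword list, the `clean_tweet` and `BowNaiveBayes` names, and the six sample tweets are all made up for the example, and lemmatization is left out because it needs an external library such as NLTK or spaCy.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "at", "on", "for"}


def clean_tweet(text):
    """Lowercase, strip URLs and non-letters, split on whitespace, drop stopwords."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


class BowNaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:  # skip words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up sample tweets, only to exercise the code.
train = [
    ("Massive earthquake hits the city https://t.co/abc", "disaster"),
    ("Flood waters rising, homes evacuated!", "disaster"),
    ("Hurricane warning issued for the coast", "disaster"),
    ("Loving the new coffee place in town", "non-disaster"),
    ("Great movie night with friends", "non-disaster"),
    ("My phone battery is terrible", "non-disaster"),
]
docs = [clean_tweet(text) for text, _ in train]
labels = [label for _, label in train]
model = BowNaiveBayes().fit(docs, labels)
prediction = model.predict(clean_tweet("Earthquake and flood warning tonight"))
print(prediction)
```

Swapping the raw counts for TF-IDF weights, or the classifier for a grid-searched SVC, changes only the feature and model steps; the cleaning function stays the same.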
- Baseline and Advanced Models: Establishes a baseline with Naive Bayes and improves performance with SVC and hyperparameter tuning.
- Preprocessing: Demonstrates the impact of preprocessing techniques like TF-IDF and lemmatization on model performance.
- Evaluation: Uses confusion matrices and classification reports to assess accuracy, precision, recall, and F1-scores.
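To make the macro-averaged F1-score concrete, here is a small worked sketch that derives it from a confusion matrix. The function names and the six labels are illustrative assumptions; the notebook itself may well rely on a library's classification report instead.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs: a confusion matrix in sparse form."""
    return Counter(zip(y_true, y_pred))


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen in either list."""
    cm = confusion_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = cm[(label, label)]
        fp = sum(cm[(other, label)] for other in labels if other != label)
        fn = sum(cm[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Made-up labels: four disaster tweets, two non-disaster, one disaster tweet missed.
y_true = ["disaster", "disaster", "disaster", "disaster", "non-disaster", "non-disaster"]
y_pred = ["disaster", "disaster", "disaster", "non-disaster", "non-disaster", "non-disaster"]

for label, (p, r, f1) in per_class_scores(y_true, y_pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Because the macro average weights every class equally, it can differ from plain accuracy whenever the classes are unevenly sized or unevenly predicted.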
- Data: The project uses the "Disaster Tweet Corpus 2020," a dataset of human-labeled tweets covering various disaster types.
- Preprocessing: Tweets are cleaned and normalized for input into the model.
- Baseline Evaluation: The Gemini API's pre-trained model is evaluated using zero-shot prompting and refined system instructions.
- Fine-Tuning: A custom model is tuned using parameter-efficient fine-tuning (PEFT) to improve classification accuracy and reduce token usage.
- Evaluation: The tuned model is tested on a subset of the dataset, comparing accuracy and token efficiency with the baseline.
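The comparison of accuracy and token usage can be sketched as a small evaluation loop. This example makes several assumptions: the label set and system instruction are invented for illustration, and `generate` is a hypothetical callable standing in for whatever Gemini API request the notebook makes, so the sketch runs offline with a fake model.

```python
# Assumed label set and system instruction; the notebook's actual ones may differ.
LABELS = ("earthquake", "flood", "hurricane", "non-disaster")
SYSTEM_INSTRUCTION = (
    "You are a tweet classifier. Reply with exactly one label: "
    "earthquake, flood, hurricane, or non-disaster."
)


def parse_label(raw_reply):
    """Map a free-form model reply onto a known label, or None if none matches."""
    reply = raw_reply.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return None


def evaluate(generate, examples):
    """Return (accuracy, total_tokens) over (tweet, expected_label) pairs.

    `generate(system_instruction, tweet)` is a hypothetical callable that returns
    (reply_text, tokens_used); in practice it would wrap the real API request.
    """
    correct = 0
    total_tokens = 0
    for tweet, expected in examples:
        reply, tokens = generate(SYSTEM_INSTRUCTION, tweet)
        total_tokens += tokens
        if parse_label(reply) == expected:
            correct += 1
    return correct / len(examples), total_tokens


def fake_generate(system_instruction, tweet):
    """Offline stand-in for the model: keyword lookup plus a whitespace token count."""
    tokens = len((system_instruction + " " + tweet).split())
    text = tweet.lower()
    for label in ("earthquake", "flood", "hurricane"):
        if label in text:
            return f"Label: {label}", tokens
    return "non-disaster", tokens


# Made-up examples, only to exercise the loop.
examples = [
    ("Earthquake shakes downtown", "earthquake"),
    ("Streets under water after the flood", "flood"),
    ("Best pizza I have had all year", "non-disaster"),
    ("Storm surge as the hurricane lands", "hurricane"),
]
accuracy, tokens = evaluate(fake_generate, examples)
print(f"accuracy={accuracy:.2f}, tokens={tokens}")
```

Running the same loop once with the baseline model and once with the tuned model gives the two accuracy and token figures to compare. A tuned model that no longer needs a long system instruction would show its savings in the token total.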
- Fine-Tuning: Demonstrates how to fine-tune a Gemini model for text classification.
- Prompt Engineering: Explores techniques to improve model responses using system instructions.
- Efficiency: Highlights token savings and cost-effectiveness of tuned models.
- Clone the repository:

      git clone https://github.com/your-username/google_gen_ai.git
      cd google_gen_ai

- Install dependencies:

      pip install -r requirements.txt

- Set up your environment:
  - Create a `.env` file in the project root.
  - Add your Google Gemini API key:

        GEMINI_API_KEY=your-api-key-here

- Run the notebook `llm_classifier.ipynb` to preprocess data, fine-tune the model, and evaluate results.
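For reference, one way a script could pick up `GEMINI_API_KEY` from the `.env` file is sketched below using only the standard library. The helper names are hypothetical, and the notebook may load the key differently (for example with a package such as python-dotenv).

```python
import os


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ (existing values win)."""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            os.environ.setdefault(key.strip(), value.strip().strip("'\""))


def get_api_key(path=".env"):
    """Return the Gemini API key, failing loudly if it is missing."""
    load_env_file(path)
    key = os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env or the environment")
    return key
```

Failing early with a clear message avoids a confusing authentication error later in the notebook.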
- Baseline Accuracy: Evaluated using zero-shot prompting.
- Tuned Model Accuracy: Improved classification accuracy with reduced token usage.
- Experiment with additional hyperparameter tuning.
- Explore alternative preprocessing techniques.
- Evaluate the model on larger datasets.
This project uses the Google Gemini API, which is subject to Google's Terms of Service. Use of the API and associated services is governed by those terms, and not by this repository's license.