Disaster Tweet Classification with Traditional ML and LLMs

This project explores text classification with traditional machine learning models and with large language models (LLMs) accessed through the Google Gemini API. It aims to classify tweets into categories such as natural disasters (e.g., earthquake, flood, hurricane) or non-disaster-related content.

Project Overview (Traditional ML)

Data: The project uses the "Disaster Tweet Corpus 2020," a dataset of human-labeled tweets covering various disaster types, with an equal split of disaster-related and non-disaster-related tweets.
Preprocessing: Tweets are extensively cleaned and normalized, including removing URLs, special characters, and stopwords, followed by tokenization and lemmatization.
Baseline Evaluation: A Naive Bayes classifier with a Bag-of-Words (BoW) approach is implemented as the baseline, achieving a macro-averaged F1-score of 0.90.
Improved Models:
- Naive Bayes with TF-IDF: Replaces the BoW counts with TF-IDF features but slightly underperforms the baseline.
- Support Vector Classifier (SVC): Achieves a significant improvement over the baseline using TF-IDF features.
- Hyperparameter-Tuned SVC: Further optimization using grid search yields the best performance with a macro-averaged F1-score of 0.98.
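
The models listed above can be sketched with scikit-learn as follows. This is a minimal illustration, not the notebook's actual code: the `clean_tweet` helper, the toy tweets, the parameter grid, and the use of scikit-learn itself are assumptions, and lemmatization is left out because it needs an extra NLP library.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs and non-letter characters (hypothetical helper)."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Toy placeholder data, NOT taken from the Disaster Tweet Corpus 2020.
# 1 = disaster-related, 0 = not disaster-related.
tweets = [
    "Earthquake hits the city, buildings collapsed http://t.co/x",
    "Flood waters rising fast near the river!!!",
    "Hurricane warning issued for the coast",
    "Wildfire spreading, evacuation ordered",
    "I love this new pizza place",
    "Watching a movie with friends tonight",
    "My exam went great today :)",
    "New phone arrived, so happy",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build(vectorizer_cls, classifier):
    """Chain a text vectorizer (with cleaning and stopword removal) and a classifier."""
    vectorizer = vectorizer_cls(preprocessor=clean_tweet, stop_words="english")
    return Pipeline([("vec", vectorizer), ("clf", classifier)])


# Baseline: Naive Bayes on Bag-of-Words counts.
baseline = build(CountVectorizer, MultinomialNB()).fit(tweets, labels)

# Variant: Naive Bayes on TF-IDF features.
nb_tfidf = build(TfidfVectorizer, MultinomialNB()).fit(tweets, labels)

# SVC on TF-IDF features, tuned by grid search on macro-averaged F1.
# The grid values below are assumptions chosen for illustration.
search = GridSearchCV(
    build(TfidfVectorizer, SVC()),
    param_grid={"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
    scoring="f1_macro",
    cv=2,
)
search.fit(tweets, labels)
print(search.best_params_)
```

On real data the grid, the number of folds, and the train/test split would need to be chosen to match the dataset size.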

Key Features

Baseline and Advanced Models: Establishes a baseline with Naive Bayes and improves performance with SVC and hyperparameter tuning.
Preprocessing: Demonstrates the impact of preprocessing and feature-extraction techniques such as lemmatization and TF-IDF on model performance.
Evaluation: Uses confusion matrices and classification reports to assess accuracy, precision, recall, and F1-scores.
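
For readers unfamiliar with the macro-averaged F1-score used above, the following standard-library sketch shows how it is derived from confusion counts. The function names and the example labels are illustrative assumptions; the notebook itself may rely on a library's classification report instead.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    counts = confusion_counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = counts[(c, c)]
        fp = sum(n for (t, p), n in counts.items() if p == c and t != c)
        fn = sum(n for (t, p), n in counts.items() if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Illustrative labels only, not results from this project.
y_true = ["disaster", "disaster", "disaster", "other", "other", "other"]
y_pred = ["disaster", "disaster", "other", "other", "other", "disaster"]

print(confusion_counts(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 3))
```

Because every class contributes equally to the mean, macro-averaging does not let a large class mask poor performance on a small one.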

Project Overview (LLM)

Data: The project uses the "Disaster Tweet Corpus 2020," a dataset of human-labeled tweets covering various disaster types.
Preprocessing: Tweets are cleaned and normalized for input into the model.
Baseline Evaluation: The Gemini API's pre-trained model is evaluated using zero-shot prompting and refined system instructions.
Fine-Tuning: A custom model is tuned using parameter-efficient fine-tuning (PEFT) to improve classification accuracy and reduce token usage.
Evaluation: The tuned model is tested on a subset of the dataset, comparing accuracy and token efficiency with the baseline.
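
The zero-shot baseline described above can be sketched as a system instruction, a prompt template, and a small parser for the reply. This is an assumption-laden illustration: the label set, the instruction wording, and every function name are hypothetical, and the actual Gemini call is replaced by a stand-in `generate` callable because the SDK version used by the notebook is not stated here.

```python
from typing import Callable

# Hypothetical label set; the notebook's labels may differ.
LABELS = ("disaster", "not_disaster")

# Hypothetical system instruction that constrains the reply to a single label.
SYSTEM_INSTRUCTION = (
    "You are a tweet classifier. Reply with exactly one label from: "
    + ", ".join(LABELS)
    + ". Do not add any other text."
)


def build_prompt(tweet: str) -> str:
    """Wrap a tweet in a zero-shot classification prompt (no examples given)."""
    return f"Classify the following tweet.\n\nTweet: {tweet}\nLabel:"


def parse_label(response_text: str) -> str:
    """Map a raw model reply onto a known label, or 'unknown' if it does not match."""
    cleaned = response_text.strip().lower().strip(".\"'` ")
    return cleaned if cleaned in LABELS else "unknown"


def classify(tweet: str, generate: Callable[[str, str], str]) -> str:
    """Classify one tweet; `generate(system_instruction, prompt)` returns reply text."""
    return parse_label(generate(SYSTEM_INSTRUCTION, build_prompt(tweet)))


def fake_generate(system_instruction: str, prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return " Disaster.\n"


print(classify("Flood waters rising near the river", fake_generate))
```

Keeping the model call behind a plain callable makes it easy to swap the pre-trained model for the tuned one and compare both on the same tweets.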

Key Features

Fine-Tuning: Demonstrates how to fine-tune a Gemini model for text classification.
Prompt Engineering: Explores techniques to improve model responses using system instructions.
Efficiency: Highlights token savings and cost-effectiveness of tuned models.
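
One way to compare a baseline run against a tuned-model run on both accuracy and token usage is sketched below. The `Prediction` record, the `summarize` helper, and all numbers are placeholders invented for illustration; they are not measurements from this project, and the token counts are assumed to come from whatever usage information the API returns.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Prediction:
    expected: str       # human-assigned label
    predicted: str      # label returned by the model
    output_tokens: int  # tokens in the model's reply, as reported by the API


def summarize(run):
    """Return accuracy and mean output tokens for a list of Prediction records."""
    return {
        "accuracy": mean(1.0 if p.expected == p.predicted else 0.0 for p in run),
        "mean_output_tokens": mean(p.output_tokens for p in run),
    }


# Placeholder records for illustration only; NOT results from this project.
baseline_run = [
    Prediction("disaster", "disaster", 14),
    Prediction("other", "disaster", 18),
]
tuned_run = [
    Prediction("disaster", "disaster", 1),
    Prediction("other", "other", 1),
]

for name, run in (("baseline", baseline_run), ("tuned", tuned_run)):
    print(name, summarize(run))
```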

How to Run (LLM classifier notebook)

Clone the repository:

```
git clone https://github.com/your-username/google_gen_ai.git
cd google_gen_ai
```

Install dependencies:
```
pip install -r requirements.txt
```
Set up your environment:
- Create a .env file in the project root.
- Add your Google Gemini API key:
```
GEMINI_API_KEY=your-api-key-here
```
Run the notebook llm_classifier.ipynb to preprocess data, fine-tune the model, and evaluate results.
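
As a rough sketch of how the key in the .env file can be made available to Python code, the following standard-library loader reads `KEY=VALUE` lines into the environment. The `load_env_file` helper is hypothetical; the notebook may use a dedicated package for this instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Variables that are already set in the environment are left unchanged.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip("\"'"))


load_env_file()
api_key = os.environ.get("GEMINI_API_KEY")
if not api_key:
    print("GEMINI_API_KEY is not set; add it to .env before running the notebook.")
```

Keeping the key in .env, which should stay out of version control, avoids hard-coding it in the notebook.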

Results

Baseline Accuracy: Evaluated using zero-shot prompting.
Tuned Model Accuracy: Improved classification accuracy with reduced token usage.

Future Work

Experiment with additional hyperparameter tuning.
Explore alternative preprocessing techniques.
Evaluate the model on larger datasets.

Disclaimer

This project uses the Google Gemini API, which is subject to Google's Terms of Service. Use of the API and associated services is governed by those terms, and not by this repository's license.
