This project provides a fully reproducible pipeline for text-based emotion detection using BERT and modern NLP preprocessing. It includes:
- Dataset merging
- Advanced preprocessing (including negation handling)
- Automatic negation-based data augmentation
- Model training and interactive/manual inference (CLI or web)
- Unified data preparation script: Combines, preprocesses, and augments emotion datasets.
- Flexible CSV input: Accepts and merges standard emotion datasets (e.g. train.csv, test.csv, val.csv).
- Preprocessing: Lemmatization, proper stopword handling, explicit negation marking.
- Augmentation: Adds robust negated phrase samples for each emotion class to boost performance on tricky text.
- BERT workflow: Scripted, interactive, and web app interfaces for batch or real-time emotion detection.
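As an illustration of what "explicit negation marking" can look like, here is a minimal sketch that prefixes every token after a negation cue with `NOT_` until the next punctuation mark. The cue list, the `NOT_` prefix, and the function name `mark_negation` are assumptions for illustration; the actual logic in prepare_data.py may differ.

```python
import re

# Hypothetical cue list; prepare_data.py may use a different one.
NEGATION_CUES = {"not", "no", "never", "cannot", "n't"}
PUNCTUATION = set(".,;:!?")

# Splits contractions such as "don't" into "do" + "n't", keeps words and punctuation.
TOKEN_RE = re.compile(r"n't|[a-z]+(?=n't)|[a-z]+|[.,;:!?]")


def mark_negation(text: str) -> str:
    """Prefix tokens inside a negation scope with NOT_.

    A scope opens at a negation cue and closes at the next punctuation mark.
    """
    out = []
    negated = False
    for tok in TOKEN_RE.findall(text.lower()):
        if tok in PUNCTUATION:
            negated = False          # punctuation ends the negation scope
            out.append(tok)
        elif tok in NEGATION_CUES:
            negated = True           # cue itself is kept unmarked
            out.append(tok)
        elif negated:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
    return " ".join(out)


if __name__ == "__main__":
    print(mark_negation("I am not happy today, but I am fine."))
```

Marking the scope this way lets the model see "happy" and "NOT_happy" as distinct tokens, which is one common way to keep negation from being lost during stopword removal.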
├── data/
│ ├── train.csv
│ ├── test.csv
│ ├── val.csv
│ ├── final_data.csv # Optional: raw/merged input
│ └── final_data_aug.csv # Output: run after prepare_data.py
├── prepare_data.py # Data pipeline: load, preprocess, augment
├── bert_emotion.py # Model training & CLI inference
├── requirements.txt
├── README.md
└── bert_emotion_model/
    └── ... (saved model after training)
pip install -r requirements.txt
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
Run the unified script to preprocess and augment data:
python prepare_data.py
- Output: data/final_data_aug.csv, used for model training.
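The augmentation step could be sketched roughly as follows: for each emotion label present in the data, append a few negated phrases that imply that label. The template phrases, label names, and the function `augment_with_negation` are hypothetical and only illustrate the idea; they are not taken from prepare_data.py.

```python
import random

# Hypothetical templates: label -> negated phrases that should map to that label.
NEGATED_TEMPLATES = {
    "joy": ["i am not sad anymore", "i am not unhappy at all"],
    "sadness": ["i am not happy", "i do not feel good today"],
    "fear": ["i do not feel safe", "i am not calm about this"],
}


def augment_with_negation(rows, templates, per_label=2, seed=0):
    """Return rows plus up to `per_label` negated samples per label.

    rows is a list of (text, label) tuples. Only labels that already
    occur in rows are augmented; labels without templates are skipped.
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    extra = []
    for label in sorted({label for _, label in rows}):
        phrases = templates.get(label, [])
        k = min(per_label, len(phrases))
        extra.extend((phrase, label) for phrase in rng.sample(phrases, k))
    return list(rows) + extra


if __name__ == "__main__":
    data = [("i feel great", "joy"), ("i feel awful", "sadness")]
    for text, label in augment_with_negation(data, NEGATED_TEMPLATES):
        print(label, "|", text)
```

Seeding the random generator is a design choice made here so that repeated runs produce the same augmented file.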
Train the BERT emotion classifier interactively:
python bert_emotion.py
- Choose 1 for model training.
- The trained model and tokenizer are saved in bert_emotion_model/.
python bert_emotion.py
- Choose 2 for interactive detection.
- Type sentences and press Enter for results.
- If you have app.py (Streamlit), run:
streamlit run app.py
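Whichever interface is used, a sequence classification model returns one raw score (logit) per emotion class, and the final step is to turn those scores into a label and a confidence. The sketch below shows that step in plain Python. The label order in `LABELS` and the helper `top_emotion` are assumptions; the real order is whatever the trained model was saved with.

```python
import math

# Hypothetical label order; use the mapping stored with your trained model.
LABELS = ["anger", "fear", "joy", "love", "sadness", "surprise"]


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_emotion(logits, labels=LABELS):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


if __name__ == "__main__":
    label, confidence = top_emotion([0.1, 0.2, 3.0, 0.0, -1.0, 0.5])
    print(f"{label} ({confidence:.2f})")
```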
pandas
nltk
torch
transformers
scikit-learn
accelerate
Add streamlit if you want to demo the model as a web app.
- Add or tune emotion labels in prepare_data.py as needed.
- Full GPU support if PyTorch with CUDA is installed.
- For other languages/datasets, adapt preprocessing and label mapping.
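When adapting to a new dataset, two small pieces usually need attention: the mapping between label names and integer ids, and the choice of device. The following is a sketch under assumptions: `build_label_maps` and `pick_device` are hypothetical helper names, not functions from this repository. The only PyTorch call used is `torch.cuda.is_available()`, and the code falls back to CPU if PyTorch is not installed.

```python
def build_label_maps(labels):
    """Build label->id and id->label dicts from an iterable of label names.

    Labels are sorted so the ids are stable across runs.
    """
    unique = sorted(set(labels))
    label2id = {name: idx for idx, name in enumerate(unique)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


def pick_device():
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available, so nothing to move to a GPU
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    label2id, id2label = build_label_maps(["joy", "anger", "joy", "fear"])
    print(label2id)
    print("device:", pick_device())
```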
MIT License (or specify your license)
Questions? PRs and issues welcome!