This repository contains the code to perform multi-class emotion classification and topic modeling on stock-related tweets using various NLP techniques.
Link to the StockEmotion Dataset.
We investigate:
- Emotion classification using TF-IDF, Word2Vec, and contextual embeddings (BERTweet, Distil-RoBERTa)
- Topic modeling with LDA and BERTopic
- Integration of emoji features and sentiment lexicons (VADER, NRC, Bing Liu)
For full details, see the project report.
| Notebook | Description |
|---|---|
| `dataset_exploration.ipynb` | Exploratory Data Analysis (EDA) |
| `TM_preproc_FE_classification.ipynb` | Preprocessing, feature extraction, training of classifiers |
| `word2vec_multi.ipynb` | Word2Vec training and downstream classification |
| `BERT_embeddings_and_classifier.ipynb` | Embedding generation (BERTweet, Distil-RoBERTa) + classifier |
| `topic_modeling_LDA.ipynb` | LDA-based topic modeling |
| `topic_modeling_BERTopic.ipynb` | BERTopic-based topic modeling |
Note: Pre-trained models (e.g., Word2Vec) and large files are excluded from the repo.
This project uses Python 3.12. It's recommended to create a virtual environment to manage dependencies.
Unix/macOS:

    python3.12 -m venv venv
    source venv/bin/activate

Windows:

    py -3.12 -m venv venv
    venv\Scripts\activate

Start by upgrading pip:

    pip install --upgrade pip

Install required packages manually:

    pip install bertopic gensim nltk spacy emoji vaderSentiment

Additional packages may be required based on notebook usage (e.g., scikit-learn, matplotlib, pandas, umap-learn).
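Because the package list above is installed manually, it can help to check which modules are importable before opening a notebook. The following is a minimal sketch using only the standard library. The `REQUIRED` and `OPTIONAL` lists and the `missing_modules` helper are illustrative and not part of this repository. Note that some import names differ from the pip names: scikit-learn imports as `sklearn` and umap-learn as `umap`.

```python
from importlib.util import find_spec

# Top-level import names (not pip names) for the packages listed above.
REQUIRED = ["bertopic", "gensim", "nltk", "spacy", "emoji", "vaderSentiment"]
# Packages the notebooks may additionally need.
OPTIONAL = ["sklearn", "matplotlib", "pandas", "umap"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


print("Missing required modules:", missing_modules(REQUIRED))
print("Missing optional modules:", missing_modules(OPTIONAL))
```

An empty list means every module in that group was found. The check only locates modules; it does not import them or verify their versions.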
After installing:

    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')
    import spacy
    spacy.cli.download("en_core_web_sm")

- Lexicons: Bing Liu and NRC Emotion Lexicons (update paths accordingly).
- Processed CSV: Required before running `topic_modeling_BERTopic.ipynb`.
- Pre-trained Word2Vec: Must be trained separately or loaded from local storage.
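Since these resources are not shipped with the repo, a quick preflight check can report what is still missing before a notebook fails midway. This is only a sketch: every path in `REQUIRED_FILES` is a hypothetical placeholder, and the `find_missing` helper is not part of the notebooks. Replace the paths with the ones your notebooks actually use.

```python
from pathlib import Path

# Hypothetical locations, for illustration only; the notebooks define the real paths.
REQUIRED_FILES = {
    "Bing Liu lexicon": Path("lexicons/bing_liu"),
    "NRC Emotion Lexicon": Path("lexicons/nrc_emotion_lexicon.txt"),
    "Processed CSV": Path("data/processed_tweets.csv"),
    "Pre-trained Word2Vec": Path("models/word2vec.model"),
}


def find_missing(files):
    """Return {description: path} for every entry that does not exist on disk."""
    return {name: path for name, path in files.items() if not Path(path).exists()}


for name, path in find_missing(REQUIRED_FILES).items():
    print(f"Missing {name}: expected at {path}")
```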
- Best classifier: XGBoost with Bigram TF-IDF (macro-F1 ≈ 0.32)
- Topic modeling coherence: ~0.3, indicating challenges due to tweet brevity/noise
- Emoji features proved critical in both classification and topic modeling
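For readers unfamiliar with the metric reported above, macro-F1 is the unweighted mean of the per-class F1 scores, so a rare emotion class counts as much as a frequent one. The sketch below computes it in plain Python to make the definition concrete. The `macro_f1` function and the toy labels are illustrative and are not taken from the dataset or the notebooks, which may compute the metric with a library instead.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up labels.
y_true = ["joy", "joy", "fear", "anger"]
y_pred = ["joy", "fear", "fear", "joy"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.3889
```

In the toy example the per-class F1 scores are 0.5 (joy), 0.667 (fear), and 0.0 (anger), which shows how a single class that is never predicted correctly pulls the average down.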
- Robin Smith
- Sergio Verga
Università degli Studi di Milano-Bicocca
For methodology and evaluation, see `TM_report_Smith_Verga.pdf`.

