NOVA IMS — MSc in Data Science & Advanced Analytics · Text Mining 2025
Predict market sentiment (bearish=0, bullish=1, neutral=2) from tweets about stocks. We benchmark classic ML baselines (Count/TF-IDF + linear models) and a transformer encoder, selecting DistilBERT as the final model (best Macro-F1).
- Train: tweets with `text` and `label` (0/1/2) · Test: tweets with `text` only.
- Deliverables: `tm_tests_19.ipynb`, `tm_final_19.ipynb`, `pred_19.csv`, and a ≤10-page PDF report.
(See the official handout for details and deadline.)
- EDA: label distribution, token length, top words/bigrams after normalization.
- Preprocessing: regex cleaning, lowercasing, stopword removal, lemmatization/stemming.
- Features: TF-IDF (unigrams–trigrams), transformer embeddings (DistilBERT).
- Models: LinearSVC / Logistic Regression baselines; DistilBERT classifier (final).
- Evaluation: Accuracy, Precision, Recall, Macro-F1 on validation.
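The baseline steps above (regex cleaning, TF-IDF n-grams, LinearSVC, Macro-F1 on a validation split) can be sketched as a single scikit-learn pipeline. This is an illustrative sketch on toy data, not the submitted configuration; the cleaning rules and hyperparameters shown are assumptions:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def clean_tweet(text: str) -> str:
    """Regex cleaning: strip URLs and user handles, keep letters and '$' (cashtags), lowercase."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"[^a-zA-Z$ ]", " ", text)
    return text.lower().strip()

# Toy data standing in for the real corpus (labels: 0=bearish, 1=bullish, 2=neutral)
tweets = ["$AAPL to the moon!", "Selling everything, crash incoming", "Market flat today"] * 10
labels = [1, 0, 2] * 10

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=clean_tweet, ngram_range=(1, 3), min_df=1)),
    ("clf", LinearSVC(C=1.0)),
])

X_tr, X_va, y_tr, y_va = train_test_split(
    tweets, labels, test_size=0.2, random_state=42, stratify=labels
)
pipe.fit(X_tr, y_tr)
print("Validation Macro-F1:", f1_score(y_va, pipe.predict(X_va), average="macro"))
```

Keeping preprocessing inside the `Pipeline` ensures the same cleaning is applied at fit and predict time, which avoids train/validation leakage of vectorizer vocabulary.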
Validation Macro-F1:
- DistilBERT (Transformer): 0.82 ← final model
- TF-IDF (1–3) + LinearSVC: 0.73
- TF-IDF (1–2) + LinearSVC: 0.73
(See `reports/All_Models_MacroF1_Table.docx` for the full table.)
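The final model is a DistilBERT encoder with a 3-way classification head. A minimal sketch of the setup with Hugging Face `transformers` (the checkpoint name and inference-only forward pass are illustrative; the actual fine-tuning hyperparameters are in the report):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "distilbert-base-uncased"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
# num_labels=3 → bearish(0) / bullish(1) / neutral(2); head is randomly initialized until fine-tuned
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

batch = tokenizer(
    ["$TSLA earnings beat expectations"],
    padding=True, truncation=True, max_length=64, return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits  # shape: (batch_size, 3)
pred = logits.argmax(dim=-1).item()  # class index in {0, 1, 2}
```

In practice the head is fine-tuned on the labeled training tweets (e.g. with `Trainer` or a plain PyTorch loop) before predictions are taken.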
Open `notebooks/tm_final_19.ipynb` and run all cells to produce `submission/pred_19.csv`.
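The final notebook cell writes the test-set predictions to the submission file. A hypothetical sketch of that step; the `id`/`label` column layout here is an assumption, and the required format is defined in the handout:

```python
import pandas as pd

# Stand-in model outputs for three test tweets (0=bearish, 1=bullish, 2=neutral)
preds = [1, 0, 2]

# Hypothetical submission layout: one row per test tweet, written without the index
pd.DataFrame({"id": range(len(preds)), "label": preds}).to_csv("pred_19.csv", index=False)
```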
notebooks/ — notebooks (tm_tests_19.ipynb, tm_final_19.ipynb)
reports/ — Text Mining Report.pdf, All_Models_MacroF1_Table.docx, Project Handout TM 2025 v2.pdf
submission/ — pred_19.csv, optional metrics
André Oliveira · Diogo Andrade · Francisco Pontes