A machine learning project focused on detecting toxic vs non-toxic tweets using custom NLP models.
The goal of this project is not only classification performance, but also understanding model behavior under severe class imbalance and making data-driven decisions with appropriate evaluation metrics.
Online toxicity detection is a challenging NLP task due to:
- informal language
- sarcasm and context dependence
- severe class imbalance
This project evaluates multiple baseline models and emphasizes minority-class performance rather than misleading aggregate accuracy.
- Total tweets: 31,962 real-world Twitter posts
- Non-toxic (0): 29,720 (~93%)
- Toxic (1): 2,242 (~7%)
The dataset is highly imbalanced, making accuracy an unreliable metric for model selection.
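To illustrate why, a trivial baseline that labels every tweet as non-toxic already reaches roughly 93% accuracy while catching zero toxic tweets. A minimal sketch (labels constructed here from the class counts above, purely for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels matching the dataset counts: 0 = non-toxic, 1 = toxic
y = np.array([0] * 29_720 + [1] * 2_242)

# A "model" that always predicts the majority (non-toxic) class
y_pred = np.zeros_like(y)

print(f"Accuracy:     {accuracy_score(y, y_pred):.3f}")                  # ~0.930
print(f"Toxic recall: {recall_score(y, y_pred, pos_label=1):.3f}")       # 0.000
```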
The following classical machine learning models were trained and compared:
- Logistic Regression
- Naive Bayes
- Support Vector Machine (SVM)
All models were trained using sparse text representations (TF-IDF).
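A minimal sketch of this setup, assuming the tweets and labels have already been loaded as `texts` (list of strings) and `labels` (list of 0/1 targets); variable names and hyperparameters are illustrative, not taken from the repository:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# texts: list[str] of tweets, labels: list[int] of 0/1 targets (assumed loaded)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Sparse TF-IDF representation of the tweets
vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(class_weight="balanced"),
}

for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    print(name, model.score(X_test_tfidf, y_test))
```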
Due to class imbalance, evaluation focused on:
- Confusion matrices
- Class-wise precision, recall, and F1-score
- Minority-class (toxic) recall and F1
- Threshold tuning using predicted probabilities (Logistic Regression)
Accuracy was reported but not used for model selection.
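A sketch of the per-class evaluation, continuing the illustrative names from the training snippet above:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = models["Logistic Regression"].predict(X_test_tfidf)

# Rows = true class, columns = predicted class
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall, and F1; the toxic row is the one that matters here
print(classification_report(y_test, y_pred, target_names=["non-toxic", "toxic"]))
```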
Logistic Regression and SVM performed best overall due to:
- usable probability outputs (Logistic Regression) and margin-based decision scores (SVM) for ranking predictions
- stable behavior on sparse, high-dimensional text features
- flexibility in threshold tuning under class imbalance (sketched below)
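A sketch of the threshold-tuning step, again reusing the illustrative names from the snippets above; the default 0.5 cutoff is replaced by a swept threshold that trades precision for toxic-class recall:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Probability of the toxic class (column 1 of predict_proba)
probs = models["Logistic Regression"].predict_proba(X_test_tfidf)[:, 1]

for threshold in np.arange(0.1, 0.9, 0.1):
    y_pred = (probs >= threshold).astype(int)
    print(
        f"threshold={threshold:.1f}  "
        f"precision={precision_score(y_test, y_pred, zero_division=0):.3f}  "
        f"recall={recall_score(y_test, y_pred):.3f}  "
        f"f1={f1_score(y_test, y_pred, zero_division=0):.3f}"
    )
```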
Naive Bayes underperformed due to its conditional independence assumption, which is poorly suited to contextual toxicity detection.
- Accuracy may be misleading for imbalanced classification tasks
- Simple, well-tuned models can perform competitively for NLP classification tasks
- Model selection should be driven by problem-specific costs, not single metrics
- Python
- scikit-learn
- NumPy
- Pandas
- matplotlib
- seaborn
To better understand linguistic patterns in toxic content, spaCy NER was applied to tweets labeled as toxic.
Entity extraction revealed that the most frequent entity types in toxic tweets were:
- PERSON
- ORG
- CARDINAL
- MONEY
This suggests that toxic language frequently targets individuals and organizations, and often references numerical or monetary contexts, which may correlate with harassment, threats, or disputes.
NER analysis was used for exploratory insight and interpretability, not for classification features.
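A sketch of this exploratory step, assuming the toxic-labeled tweets are available as a list of strings `toxic_tweets` (an illustrative name):

```python
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

# Count entity labels (PERSON, ORG, CARDINAL, ...) across toxic-labeled tweets
entity_counts = Counter()
for doc in nlp.pipe(toxic_tweets, batch_size=256):
    entity_counts.update(ent.label_ for ent in doc.ents)

print(entity_counts.most_common(10))
```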