Concise, production-minded pipeline for classifying disaster-related tweets.
Kaggle notebook: TweetNLP Pipeline — EDA, BERTweet & LightGBM ensemble
Detect whether a tweet describes a real-world disaster event (binary classification). The model was developed for the Kaggle competition "nlp-getting-started" and aims for robust generalization from cross-validated training to the competition test split.
Files
- `train.csv` — training set
- `test.csv` — test set
- `sample_submission.csv` — sample file in the submission format
Columns
- `id` — unique identifier for each tweet
- `text` — tweet content
- `location` — reported location (may be blank)
- `keyword` — extracted keyword (may be blank)
- `target` — (train only) `1` if the tweet describes a disaster, else `0`
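For orientation, a minimal load-and-inspect snippet. The `../input` path assumes a Kaggle notebook environment; adjust it if running locally.

```python
import pandas as pd

# Paths assume the standard Kaggle layout for this competition; adjust as needed.
train = pd.read_csv("../input/nlp-getting-started/train.csv")
test = pd.read_csv("../input/nlp-getting-started/test.csv")

print(train.shape, test.shape)
print(train["target"].value_counts(normalize=True))  # rough class balance check
```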
Minimal tweet-aware cleaning → BERTweet fine-tuning with Stratified K-Fold (OOF probs) → TF–IDF + engineered features → LightGBM (stacking with OOF) → Weighted ensemble (default 0.7 BERTweet / 0.3 LGBM).
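The Stratified K-Fold OOF step can be sketched as below. This is a minimal, hedged version: a TF–IDF + logistic-regression pipeline stands in for the per-fold BERTweet fine-tuning (the fold logic is identical), and the function name and defaults (`oof_probabilities`, 5 folds, seed 42) are illustrative rather than taken from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

def oof_probabilities(texts, labels, n_splits=5, seed=42):
    """Out-of-fold class-1 probabilities: every row is predicted by a model
    trained on the other folds, so the stacking feature is leakage-free."""
    texts, labels = np.asarray(texts), np.asarray(labels)
    oof = np.zeros(len(texts))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in skf.split(texts, labels):
        # Stand-in fold model; in the notebook each fold fine-tunes BERTweet instead.
        fold_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        fold_model.fit(texts[train_idx], labels[train_idx])
        oof[valid_idx] = fold_model.predict_proba(texts[valid_idx])[:, 1]
    return oof
```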
- Kaggle private LB: 0.84462 — Rank: 16
- Base model: `vinai/bertweet-base` (fine-tuned for sequence classification)
- Tokenizer: BERTweet tokenizer (normalization enabled)
- Stacking: OOF probabilities from BERTweet appended to the TF–IDF + engineered feature matrix
- Meta model: LightGBM (gradient-boosted trees)
- Ensemble: Weighted average of BERTweet and LightGBM probabilities (see the stacking/blending sketch after this list)
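A hedged sketch of the stacking and blending steps. All input names (`X_tfidf`, `feats`, `oof_bertweet`, `y`, and their test-side counterparts) and the LightGBM hyperparameters are assumptions for illustration; only the 0.7/0.3 default weighting comes from the pipeline summary above.

```python
import numpy as np
import lightgbm as lgb
from scipy.sparse import csr_matrix, hstack

def stack_and_blend(X_tfidf, feats, oof_bertweet, y,
                    X_tfidf_test, feats_test, test_bertweet,
                    w_bertweet=0.7):
    """Assumed inputs: sparse TF-IDF matrices, dense engineered-feature arrays,
    BERTweet probabilities (OOF for train, averaged fold predictions for test)."""
    # Append the BERTweet probabilities as one extra column of the meta feature matrix.
    X_meta = hstack([X_tfidf, csr_matrix(feats),
                     csr_matrix(oof_bertweet.reshape(-1, 1))]).tocsr()
    X_meta_test = hstack([X_tfidf_test, csr_matrix(feats_test),
                          csr_matrix(test_bertweet.reshape(-1, 1))]).tocsr()

    # Meta model: gradient-boosted trees over lexical + stacked features.
    meta = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
    meta.fit(X_meta, y)
    lgbm_prob = meta.predict_proba(X_meta_test)[:, 1]

    # Weighted ensemble: default 0.7 BERTweet / 0.3 LightGBM.
    blended = w_bertweet * test_bertweet + (1.0 - w_bertweet) * lgbm_prob
    return (blended >= 0.5).astype(int), blended
```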
The goal is a reproducible, competitive pipeline that balances contextual modeling with interpretable features:
- Preserve Twitter artifacts (hashtags, mentions, emojis) during cleaning (see the cleaning sketch after this list)
- Use Stratified K-Fold to produce leakage-free stacking features (OOF)
- Combine deep contextual signals (BERTweet) with lexical/statistical features (TF–IDF, counts)
- Optimize ensemble weighting to control the precision/recall trade-off
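As a rough illustration of the first bullet, a hedged cleaning sketch: only HTML unescaping and whitespace collapsing are applied here (the notebook's exact rules may differ), and handle/URL normalization is delegated to the BERTweet tokenizer's `normalization=True` option.

```python
import html
import re
from transformers import AutoTokenizer

def clean_tweet(text: str) -> str:
    """Light-touch cleaning: unescape HTML entities and collapse whitespace,
    deliberately keeping hashtags, @mentions and emojis for BERTweet."""
    text = html.unescape(text)                 # "&amp;" -> "&", etc.
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

# The slow BERTweet tokenizer performs Twitter-specific normalization
# (user handles -> @USER, URLs -> HTTPURL) when normalization=True.
tokenizer = AutoTokenizer.from_pretrained(
    "vinai/bertweet-base", normalization=True, use_fast=False
)

enc = tokenizer(clean_tweet("Forest fire near La Ronge Sask. Canada &amp; spreading"),
                truncation=True, max_length=128)
```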
NLP | transformers | BERTweet | LightGBM | stacking | feature-engineering | Kaggle | text-classification | data-science | machine-learning | ensemble-learning | nlp-preprocessing | text-mining | pytorch | huggingface | model-stacking