This project implements a comprehensive sentiment analysis pipeline for Twitter data using the Sentiment140 dataset. The objective is to classify tweets as positive or negative using both classic machine learning and deep learning (CNN-LSTM) models.
Data Loading
- Download Sentiment140 and load into a pandas DataFrame
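The loading step could look roughly like the sketch below. It assumes the standard Sentiment140 training CSV, which has no header row, is Latin-1 encoded, and labels negative tweets 0 and positive tweets 4. The function name `load_sentiment140` and the remapping of 4 to 1 are illustrative choices, not necessarily what this project does.

```python
import pandas as pd

# Column order of the headerless Sentiment140 CSV.
SENTIMENT140_COLUMNS = ["target", "id", "date", "flag", "user", "text"]


def load_sentiment140(path_or_buffer):
    """Load a Sentiment140 CSV (file path or binary buffer) into a DataFrame."""
    df = pd.read_csv(
        path_or_buffer,
        encoding="latin-1",  # the raw file is not valid UTF-8
        header=None,         # the file has no header row
        names=SENTIMENT140_COLUMNS,
    )
    # Assumption: binary labels are wanted, so map 4 (positive) to 1 and keep 0 (negative).
    df["target"] = (df["target"] == 4).astype(int)
    return df
```

Calling `load_sentiment140("training.1600000.processed.noemoticon.csv")` would then return one row per tweet, assuming the file keeps its usual download name.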
Preprocessing
- Drop unused columns
- Demojize tweets
- Expand contractions and slang
- Clean text (lowercase, remove punctuation, etc.)
- Tokenize, remove stopwords, lemmatize
- Spelling correction
- Hashtag normalization
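A few of the steps above (cleaning, contraction and slang expansion, hashtag normalization, stopword removal) can be sketched with the standard library alone. The lookup tables here are tiny, made-up examples, and `preprocess` and `normalize_hashtag` are hypothetical names. Demojizing, lemmatization, and spelling correction are left out because they depend on third-party packages that this README does not name.

```python
import re

# Tiny illustrative lookup tables; a real pipeline would use much larger ones.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}
SLANG = {"u": "you", "gr8": "great", "lol": "laughing out loud"}
STOPWORDS = {"a", "an", "the", "is", "it", "to", "and", "of"}


def normalize_hashtag(match):
    """Turn '#GreatDay' into 'great day' by splitting on capital letters and digits."""
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", match.group(1))
    return " ".join(words).lower()


def preprocess(text):
    """Return a list of cleaned tokens for one tweet."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)                   # drop user mentions
    text = re.sub(r"#(\w+)", normalize_hashtag, text)   # split hashtags before lowercasing
    text = text.lower().replace("\u2019", "'")          # unify curly apostrophes
    tokens = []
    # Keeping only letters, digits, and apostrophes removes all other punctuation.
    for raw in re.findall(r"[a-z0-9']+", text):
        expanded = CONTRACTIONS.get(raw, SLANG.get(raw, raw))
        for word in expanded.split():
            word = word.replace("'", "")
            if word and word not in STOPWORDS:
                tokens.append(word)
    return tokens
```

Hashtags are split before lowercasing because the capital letters are the only hint about where the word boundaries are.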
Vectorization & Embedding
- TF-IDF for classic ML
- Tokenizer + padded sequences for LSTM/CNN-LSTM
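The two representations can be illustrated as follows. TF-IDF uses scikit-learn's `TfidfVectorizer` with default settings, which is an assumption about the configuration. The padded-sequence half is a plain-Python stand-in that mimics what a Keras-style tokenizer followed by pre-padding produces; it is not the Keras API itself, and `build_word_index` and `texts_to_padded` are hypothetical helpers.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["i love this phone", "i hate this weather", "love love love"]

# TF-IDF: one sparse row per document, one column per vocabulary term.
# The default token pattern ignores one-letter tokens such as "i".
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)


def build_word_index(texts):
    """Map each word to an integer id, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def texts_to_padded(texts, word_index, maxlen):
    """Encode texts as id lists, drop unknown words, then left-pad or truncate to maxlen."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        ids = ids[-maxlen:]                           # keep the last maxlen ids
        rows.append([0] * (maxlen - len(ids)) + ids)  # pad on the left with zeros
    return rows


word_index = build_word_index(docs)
padded = texts_to_padded(docs, word_index, maxlen=6)
```

Every row of `padded` has the same length, which is what lets the sequences be stacked into a single array for the neural model.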
Model Training
- Train classic ML models for baseline
- Build and train a CNN-LSTM model with Keras
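One plausible shape for the CNN-LSTM is sketched below with the Keras API bundled in TensorFlow. The layer sizes, kernel size, optimizer, and the name `build_cnn_lstm` are all assumptions made for illustration; the README does not state the actual architecture or hyperparameters.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn_lstm(vocab_size, maxlen, embed_dim=64):
    """Embedding -> Conv1D -> MaxPooling1D -> LSTM -> sigmoid output."""
    model = keras.Sequential([
        keras.Input(shape=(maxlen,)),                    # padded sequences of token ids
        layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        layers.Conv1D(filters=64, kernel_size=5, activation="relu"),  # local n-gram features
        layers.MaxPooling1D(pool_size=2),                # halve the sequence length
        layers.LSTM(64),                                 # summarize the pooled sequence
        layers.Dense(1, activation="sigmoid"),           # probability of the positive class
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Training would then be a call to `model.fit` on the padded sequences and their 0/1 labels. The convolution extracts short local patterns, and the LSTM reads the pooled feature sequence in order.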
Evaluation
- Achieves 85% training accuracy and 80% test accuracy (CNN-LSTM)
- Evaluate with classification metrics
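As an illustration of evaluating with classification metrics, the sketch below fits a small TF-IDF baseline and reports accuracy plus per-class precision, recall, and F1. Logistic regression is an assumed choice of classic model, the six training sentences are made-up toy data, and `evaluate` is a hypothetical helper. The numbers it produces say nothing about the accuracy figures reported for this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline

# Toy data for illustration only (1 = positive, 0 = negative).
train_texts = ["love this", "great day", "so happy", "hate this", "awful day", "so sad"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["happy day", "sad day"]
test_labels = [1, 0]

# Baseline: TF-IDF features fed into a linear classifier.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(train_texts, train_labels)


def evaluate(model, texts, labels):
    """Return accuracy and a per-class precision/recall/F1 report."""
    predictions = model.predict(texts)
    report = classification_report(
        labels,
        predictions,
        labels=[0, 1],
        target_names=["negative", "positive"],
        zero_division=0,  # avoid warnings when a class is never predicted
    )
    return accuracy_score(labels, predictions), report


train_accuracy, _ = evaluate(baseline, train_texts, train_labels)
test_accuracy, test_report = evaluate(baseline, test_texts, test_labels)
print(test_report)
```

Comparing `train_accuracy` with `test_accuracy` is the same check that the train and test figures above summarize: a large gap suggests overfitting.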
Live Prediction
- Load trained model and tokenizer
- Preprocess and predict sentiment for user-typed sentences
Usage
- Clone the repository and install dependencies
- Run preprocessing scripts to generate vectorized data
- Train the model using provided training scripts
- Use the live prediction script to classify new tweets
Results
| Item | Value |
|---|---|
| Train Accuracy | 85% |
| Test Accuracy | 80% |
| Model | CNN-LSTM |
| Dataset | Sentiment140 |
You can input new sentences and get instant sentiment predictions using the trained model.
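A minimal sketch of that prediction step is shown below. It assumes the model was saved in a format `keras.models.load_model` can read and that the tokenizer was pickled; both are guesses about how the artifacts are stored. The names `load_artifacts` and `predict_sentiment`, the `encode` callback, and the 0.5 threshold are hypothetical.

```python
import pickle


def load_artifacts(model_path, tokenizer_path):
    """Load a saved Keras model and a pickled tokenizer (storage format is an assumption)."""
    from tensorflow import keras  # imported lazily so the helper below works without TensorFlow

    model = keras.models.load_model(model_path)
    with open(tokenizer_path, "rb") as fh:
        tokenizer = pickle.load(fh)
    return model, tokenizer


def predict_sentiment(text, model, encode, threshold=0.5):
    """Return (label, probability) for one sentence.

    `encode` must turn a list of raw strings into the padded id matrix the
    model was trained on, applying the same preprocessing as in training.
    """
    probability = float(model.predict(encode([text]))[0][0])
    label = "positive" if probability >= threshold else "negative"
    return label, probability
```

Wrapping `predict_sentiment` in a loop around `input()` gives an interactive prompt. The important constraint is that `encode` reuses the training-time preprocessing and tokenizer, otherwise the ids will not match the embedding the model learned.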