A machine learning project that implements a Naive Bayes classifier from scratch to perform sentiment analysis on IMDB movie reviews. The classifier achieves 86.06% accuracy on the test set.
This project demonstrates how to build a sentiment analysis classifier using the Naive Bayes algorithm without relying on pre-built machine learning libraries. The implementation includes text preprocessing, feature extraction, model training, and evaluation.
- Complete Naive Bayes implementation from scratch
- Text preprocessing pipeline (tokenization, stemming, stopword removal)
- Frequency-based feature extraction
- Laplace smoothing for handling unseen words
- Model evaluation with accuracy metrics
- Custom text prediction functionality
The project uses the IMDB dataset for sentiment analysis in CSV format from Kaggle:
- Training set: 40,000 movie reviews
- Test set: 5,000 movie reviews
- Labels: 0 (negative) and 1 (positive)
- Balance: 50% positive, 50% negative reviews
Dataset source: IMDB Dataset Sentiment Analysis in CSV Format
Python 3.12.2 or higher
numpy
pandas
nltk
kagglehub
The project requires the NLTK stopwords corpus, which is automatically downloaded when running the notebook.
- Clone the repository:
git clone <repository-url>
cd NB
- Install required packages:
pip install numpy pandas nltk kagglehub
- Download NLTK stopwords (if not automatically downloaded):
import nltk
nltk.download('stopwords')
- Set up Kaggle credentials for dataset download (if needed):
- Follow instructions at Kaggle API to set up authentication
- Open Naive_bayes.ipynb in Jupyter Notebook or JupyterLab
- Run all cells sequentially
- The notebook will:
- Download the IMDB dataset from Kaggle
- Preprocess the text data
- Train the Naive Bayes classifier
- Evaluate on the test set
- Display accuracy results
process_text(text): Processes raw text by removing URLs, hashtags, punctuation, and stopwords, then applying stemming.
count_words(result, text, ys): Builds a frequency dictionary mapping (word, sentiment) pairs to their occurrence counts.
train_naive_bayes(freqs, train_x, train_y): Trains the Naive Bayes classifier and returns the logprior and loglikelihood parameters.
naive_bayes_predict(text, logprior, loglikelihood): Predicts sentiment for a given text string.
test_naive_bayes(test_x, test_y, logprior, loglikelihood): Evaluates the classifier on test data and returns accuracy.
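To make the frequency dictionary concrete, here is a minimal sketch of how a function like count_words could build it. It is not the notebook's code: the `simple_tokenize` helper and the `tokenize` parameter are assumptions standing in for process_text, and the two reviews are made up.

```python
def simple_tokenize(text):
    """Stand-in for process_text (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def count_words(result, texts, ys, tokenize=simple_tokenize):
    """Add (word, label) occurrence counts from texts into result and return it."""
    for text, y in zip(texts, ys):
        for word in tokenize(text):
            pair = (word, y)
            # Start from 0 the first time a (word, label) pair is seen.
            result[pair] = result.get(pair, 0) + 1
    return result


# Two made-up reviews: one positive (1), one negative (0).
freqs = count_words({}, ["great great film", "dull film"], [1, 0])
print(freqs)
```

Keying on the (word, label) pair keeps a separate count of each word for each class, which is what the training step needs.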
- URL Removal: Removes hyperlinks using regex
- Hashtag Handling: Removes hashtag symbols while preserving words
- Tokenization: Splits text into individual words using TweetTokenizer
- Punctuation Removal: Strips punctuation from tokens
- Alphabetic Filtering: Keeps only alphabetic words
- Stopword Removal: Removes common English stopwords
- Stemming: Reduces words to their root form using Porter Stemmer
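The steps above can be sketched with the standard library alone. This is an illustrative approximation, not the notebook's process_text. It assumes whitespace splitting in place of TweetTokenizer, a tiny hand-written stopword set in place of the NLTK corpus, and an optional `stem` callable that does nothing by default.

```python
import re
import string

# Tiny illustrative stopword set (assumption); the project uses NLTK's English corpus.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to", "at"}


def process_text_sketch(text, stem=lambda word: word):
    """Apply the listed cleaning steps in order and return a list of tokens."""
    text = re.sub(r"https?://\S+", "", text)   # URL removal
    text = text.replace("#", "")               # drop '#', keep the word
    tokens = text.lower().split()              # naive whitespace tokenization
    cleaned = []
    for tok in tokens:
        tok = tok.strip(string.punctuation)    # strip punctuation at token edges
        if not tok.isalpha():                  # alphabetic filtering
            continue
        if tok in STOPWORDS:                   # stopword removal
            continue
        cleaned.append(stem(tok))              # stemming (identity unless stem is given)
    return cleaned


tokens = process_text_sketch("This movie was #amazing!!! See https://example.com at 9pm.")
print(tokens)  # ['movie', 'amazing', 'see']
```

With NLTK installed, passing `stem=nltk.stem.PorterStemmer().stem` would add Porter stemming to the last step.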
The classifier uses:
- Log Prior: Log ratio of positive to negative document probabilities
- Log Likelihood: Log ratio of word probabilities given each sentiment class
- Laplace Smoothing: Adds 1 to word counts to handle unseen words
- Prediction: Sums logprior and loglikelihoods for all words in a document
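The four components above fit together roughly as in the following sketch. The function names, the toy frequency dictionary, and the choice to give words unseen at prediction time a contribution of 0 are all assumptions for illustration. The notebook's train_naive_bayes and naive_bayes_predict may differ in detail.

```python
import math


def train_sketch(freqs, labels):
    """Return (logprior, loglikelihood) from (word, label) counts and document labels."""
    vocab = {word for word, _ in freqs}
    vocab_size = len(vocab)
    # Total word occurrences per class.
    n_pos = sum(count for (_, y), count in freqs.items() if y == 1)
    n_neg = sum(count for (_, y), count in freqs.items() if y == 0)
    # Document counts per class give the log prior.
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = len(labels) - d_pos
    logprior = math.log(d_pos) - math.log(d_neg)
    loglikelihood = {}
    for word in vocab:
        # Laplace smoothing: add 1 to each count and vocab_size to each total.
        p_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + vocab_size)
        p_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + vocab_size)
        loglikelihood[word] = math.log(p_pos / p_neg)
    return logprior, loglikelihood


def predict_sketch(tokens, logprior, loglikelihood):
    """Sum the log prior and the log likelihood ratio of every known token."""
    return logprior + sum(loglikelihood.get(tok, 0.0) for tok in tokens)


# Made-up counts for two toy documents, one per class.
toy_freqs = {("great", 1): 2, ("film", 1): 1, ("dull", 0): 1, ("film", 0): 1}
toy_labels = [1, 0]
logprior, loglikelihood = train_sketch(toy_freqs, toy_labels)
pos_score = predict_sketch(["great", "film"], logprior, loglikelihood)
neg_score = predict_sketch(["dull", "film"], logprior, loglikelihood)
print(pos_score > 0, neg_score < 0)  # True True
```

A positive score favors the positive class and a negative score the negative class. With balanced classes the log prior is 0, so the word terms decide the outcome.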
- Test Accuracy: 86.06%
- Vocabulary Size: 89,511 unique words
- Training Samples: 40,000 reviews
- Test Samples: 5,000 reviews
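The accuracy figure is the share of test reviews whose predicted label matches the true label. A minimal sketch of that computation follows, assuming a score above 0 means positive. The function name and the scores are invented for illustration and are not results from the notebook.

```python
def accuracy_from_scores(scores, labels):
    """Fraction of examples where (score > 0) agrees with the 0/1 label."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    correct = sum(int(score > 0) == label for score, label in zip(scores, labels))
    return correct / len(labels)


# Invented scores and labels; the third example is misclassified.
toy_scores = [2.3, -0.7, 0.4, -1.9]
toy_labels = [1, 0, 0, 0]
toy_accuracy = accuracy_from_scores(toy_scores, toy_labels)
print(toy_accuracy)  # 0.75
```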
NB/
├── Naive_bayes.ipynb # Main notebook with implementation
└── README.md # Project documentation
# Train the model
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
# Predict sentiment for custom text
my_review = "I loved this movie! It was amazing."
score = naive_bayes_predict(my_review, logprior, loglikelihood)
sentiment = "Positive" if score > 0 else "Negative"
print(f"Predicted sentiment: {sentiment}")
The model achieves 86.06% accuracy on the test set, demonstrating effective sentiment classification using the Naive Bayes approach. The implementation successfully handles:
- Text preprocessing and normalization
- Feature extraction from text data
- Probabilistic classification
- Handling of unseen words through smoothing
- Training: O(V × D) where V is vocabulary size and D is number of documents
- Prediction: O(W) where W is the number of words in the input text
- Use of log probabilities to prevent numerical underflow
- Laplace smoothing to handle zero probabilities
- Stemming to reduce vocabulary size and improve generalization
- Stopword removal to focus on meaningful words
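The first design decision is easy to demonstrate. Multiplying many small probabilities underflows to zero in floating point, while summing their logarithms stays finite. The probability value and the count below are arbitrary illustrative choices.

```python
import math

# 400 word probabilities of 0.01 each; the true product is 1e-800.
probs = [0.01] * 400

product = 1.0
for p in probs:
    product *= p  # eventually falls below the smallest representable float

log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # a finite negative number, about -1842.07
```

Once the product reaches 0.0, both classes look equally impossible, whereas the log-space sums can still be compared.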
- Simple bag-of-words approach doesn't capture word order or context
- No handling of negations (e.g., "not good")
- Limited to binary classification (positive/negative)
- Performance depends heavily on preprocessing choices
- Implement n-gram features to capture word sequences
- Add negation handling
- Experiment with different smoothing techniques
- Extend to multi-class sentiment classification
- Add feature importance visualization
- Implement cross-validation for hyperparameter tuning
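As a rough idea of what the first two improvements could look like, the sketch below builds bigram features and marks the word that follows a negation. Neither function exists in the project. The names, the `NOT_` prefix, and the negation word list are assumptions.

```python
# Hypothetical negation word list (assumption).
NEGATIONS = {"not", "no", "never"}


def mark_negation(tokens):
    """Drop each negation word and prefix the token right after it with 'NOT_'."""
    marked = []
    negate = False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True
            continue
        marked.append("NOT_" + tok if negate else tok)
        negate = False
    return marked


def bigrams(tokens):
    """Join each pair of adjacent tokens into a single feature."""
    return [f"{first}_{second}" for first, second in zip(tokens, tokens[1:])]


example = ["not", "good", "movie"]
print(mark_negation(example))  # ['NOT_good', 'movie']
print(bigrams(example))        # ['not_good', 'good_movie']
```

Either approach would require negation words to survive preprocessing. NLTK's English stopword list includes "not", so the stopword step would need an exception for these words.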
This project is provided as-is for educational purposes.
- IMDB dataset provided by Kaggle
- NLTK library for natural language processing tools
- Kagglehub for dataset access