IMDB Sentiment Analysis with Naive Bayes

A machine learning project that implements a Naive Bayes classifier from scratch for sentiment analysis of IMDB movie reviews. The classifier reaches 86.06% accuracy on the 5,000-review test set.

Overview

This project demonstrates how to build a sentiment analysis classifier using the Naive Bayes algorithm without relying on pre-built machine learning libraries. The implementation includes text preprocessing, feature extraction, model training, and evaluation.

Features

Complete Naive Bayes implementation from scratch
Text preprocessing pipeline (tokenization, stemming, stopword removal)
Frequency-based feature extraction
Laplace smoothing for handling unseen words
Model evaluation with accuracy metrics
Custom text prediction functionality

Dataset

The project uses the IMDB dataset for sentiment analysis in CSV format from Kaggle:

Training set: 40,000 movie reviews
Test set: 5,000 movie reviews
Labels: 0 (negative) and 1 (positive)
Balance: 50% positive, 50% negative reviews

Dataset source: IMDB Dataset Sentiment Analysis in CSV Format

Requirements

Python Version

Python 3.12.2 or higher

Dependencies

numpy
pandas
nltk
kagglehub

NLTK Data

The project requires the NLTK stopwords corpus, which is automatically downloaded when running the notebook.

Installation

Clone the repository:

git clone <repository-url>
cd NB

Install required packages:

pip install numpy pandas nltk kagglehub

Download NLTK stopwords (if not automatically downloaded):

import nltk
nltk.download('stopwords')

Set up Kaggle credentials for dataset download (if needed):
- Follow instructions at Kaggle API to set up authentication

Usage

Running the Notebook

Open Naive_bayes.ipynb in Jupyter Notebook or JupyterLab
Run all cells sequentially
The notebook will:
- Download the IMDB dataset from Kaggle
- Preprocess the text data
- Train the Naive Bayes classifier
- Evaluate on the test set
- Display accuracy results

Key Functions

Text Processing

process_text(text)

Cleans raw text by removing URLs, hashtag symbols (the words themselves are kept), punctuation, and stopwords, then stems the remaining tokens.

Word Frequency Counting

count_words(result, text, ys)

Builds a frequency dictionary mapping (word, sentiment) pairs to their occurrence counts.
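A minimal sketch of how such a frequency dictionary could be built. The name count_words_sketch is hypothetical, and the sketch assumes each document has already been tokenized, whereas the notebook's count_words may call process_text itself.

```python
def count_words_sketch(result, token_lists, ys):
    """Accumulate (word, label) -> count into `result` and return it.

    Assumption: `token_lists` holds already-preprocessed token lists,
    one per document, aligned with the 0/1 labels in `ys`.
    """
    for tokens, y in zip(token_lists, ys):
        for word in tokens:
            pair = (word, y)
            # Start at 0 for a pair we have not seen yet, then add one.
            result[pair] = result.get(pair, 0) + 1
    return result


# Toy data: three tiny "reviews" with their labels (1 = positive, 0 = negative).
docs = [["love", "movi"], ["hate", "movi"], ["love", "act"]]
labels = [1, 0, 1]
freqs = count_words_sketch({}, docs, labels)
print(freqs)
```

Keying on the (word, label) pair keeps the counts for both classes in one dictionary, so a missing key simply means a count of zero for that class.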

Training

train_naive_bayes(freqs, train_x, train_y)

Trains the Naive Bayes classifier and returns logprior and loglikelihood parameters.
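The training step can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: train_naive_bayes_sketch is a hypothetical name, it omits the train_x argument because the counts in freqs are all it needs, and it assumes both classes appear in train_y.

```python
import math


def train_naive_bayes_sketch(freqs, train_y):
    """Return (logprior, loglikelihood) from (word, label) counts.

    Assumption: labels are 0/1 and both classes occur at least once.
    """
    vocab = {word for word, _ in freqs}
    V = len(vocab)

    # Total word occurrences per class.
    n_pos = sum(c for (_, y), c in freqs.items() if y == 1)
    n_neg = sum(c for (_, y), c in freqs.items() if y == 0)

    # Document counts per class give the log prior.
    d_pos = sum(1 for y in train_y if y == 1)
    d_neg = len(train_y) - d_pos
    logprior = math.log(d_pos) - math.log(d_neg)

    loglikelihood = {}
    for word in vocab:
        # Laplace smoothing: +1 on the count, +V on the class total.
        p_w_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + V)
        p_w_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + V)
        loglikelihood[word] = math.log(p_w_pos / p_w_neg)
    return logprior, loglikelihood


# Toy counts: "love" only in positive text, "hate" only in negative text.
toy_freqs = {("love", 1): 2, ("movi", 1): 1, ("hate", 0): 2, ("movi", 0): 1}
toy_labels = [1, 0]
logprior, loglikelihood = train_naive_bayes_sketch(toy_freqs, toy_labels)
print(logprior, loglikelihood)
```

With balanced classes the log prior is 0, so the decision rests entirely on the word-level log likelihoods.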

Prediction

naive_bayes_predict(text, logprior, loglikelihood)

Returns a sentiment score for a given text string; a score above 0 indicates positive sentiment.
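As a rough sketch of the scoring step: the hypothetical naive_bayes_predict_sketch below takes a token list rather than a raw string, and it assumes that words missing from the vocabulary are simply skipped. The notebook may handle both points differently.

```python
def naive_bayes_predict_sketch(tokens, logprior, loglikelihood):
    """Sum the log prior and the log likelihood of every known token."""
    score = logprior
    for word in tokens:
        # Assumption: out-of-vocabulary words contribute nothing.
        score += loglikelihood.get(word, 0.0)
    return score


# Hand-made parameters, for illustration only (not trained values).
toy_loglikelihood = {"love": 1.1, "hate": -1.1, "movi": 0.0}
score = naive_bayes_predict_sketch(["love", "movi", "popcorn"], 0.0, toy_loglikelihood)
print("Positive" if score > 0 else "Negative")
```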

Testing

test_naive_bayes(test_x, test_y, logprior, loglikelihood)

Evaluates the classifier on test data and returns accuracy.
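The accuracy computation itself is small enough to sketch directly. The hypothetical accuracy_sketch below starts from precomputed scores so that it stays self-contained, and it assumes a score above 0 maps to label 1.

```python
def accuracy_sketch(scores, labels):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if s > 0 else 0 for s in scores]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up scores and labels, for illustration only.
toy_scores = [2.3, -0.7, 0.4, -1.9]
toy_labels = [1, 0, 0, 0]
acc = accuracy_sketch(toy_scores, toy_labels)
print(f"accuracy = {acc:.2f}")
```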

Methodology

Text Preprocessing Pipeline

URL Removal: Removes hyperlinks using regex
Hashtag Handling: Removes hashtag symbols while preserving words
Tokenization: Splits text into individual words using TweetTokenizer
Punctuation Removal: Strips punctuation from tokens
Alphabetic Filtering: Keeps only alphabetic words
Stopword Removal: Removes common English stopwords
Stemming: Reduces words to their root form using Porter Stemmer
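The steps above can be sketched without any dependencies. Everything named here is a stand-in: process_text_sketch and TOY_STOPWORDS are hypothetical, whitespace splitting replaces TweetTokenizer, lowercasing is an assumption, and the stem argument defaults to a no-op where the notebook would use the Porter stemmer.

```python
import re
import string

# Tiny illustrative stopword set; the notebook uses NLTK's English stopwords.
TOY_STOPWORDS = {"i", "this", "it", "was", "the", "a", "and", "at"}


def process_text_sketch(text, stopwords=TOY_STOPWORDS, stem=lambda w: w):
    """Return a list of cleaned tokens for one raw review."""
    text = re.sub(r"https?://\S+", "", text)  # 1. remove URLs
    text = text.replace("#", "")              # 2. drop '#', keep the word
    tokens = text.lower().split()             # 3. tokenize (stand-in for TweetTokenizer)
    cleaned = []
    for tok in tokens:
        tok = tok.strip(string.punctuation)   # 4. strip surrounding punctuation
        if not tok.isalpha():                 # 5. keep alphabetic tokens only
            continue
        if tok in stopwords:                  # 6. remove stopwords
            continue
        cleaned.append(stem(tok))             # 7. stem (identity by default)
    return cleaned


tokens = process_text_sketch(
    "I loved this #movie! It was amazing: https://example.com 10/10"
)
print(tokens)
```

Passing a real stemmer is a one-argument change, for example stem=PorterStemmer().stem from nltk.stem.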

Naive Bayes Algorithm

The classifier uses:

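Log Prior: Log of the ratio of positive to negative document probabilities in the training set
Log Likelihood: For each vocabulary word, the log of the ratio of its smoothed probability under the positive class to its smoothed probability under the negative class
Laplace Smoothing: Adds 1 to every word count (and the vocabulary size to each class total) so that a word absent from one class never receives zero probability
Prediction: Sums the log prior and the log likelihoods of all words in a document; a score above 0 is classified as positive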
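Written out, the four quantities above take the following standard form. The symbols are introduced here for illustration and are not taken from the notebook: D_pos and D_neg are the numbers of positive and negative training documents, freq(w, c) is the count of word w in class c, N_c is the total word count in class c, and V is the vocabulary size.

```latex
\begin{aligned}
\text{logprior} &= \log\frac{D_{pos}}{D_{neg}} \\
P(w \mid c) &= \frac{\mathrm{freq}(w, c) + 1}{N_c + V}, \qquad c \in \{pos, neg\} \\
\text{loglikelihood}(w) &= \log\frac{P(w \mid pos)}{P(w \mid neg)} \\
\text{score}(d) &= \text{logprior} + \sum_{w \in d} \text{loglikelihood}(w)
\end{aligned}
```

A document d is labeled positive when score(d) > 0 and negative otherwise.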

Model Performance

Test Accuracy: 86.06%
Vocabulary Size: 89,511 unique words
Training Samples: 40,000 reviews
Test Samples: 5,000 reviews

Project Structure

NB/
├── Naive_bayes.ipynb    # Main notebook with implementation
└── README.md            # Project documentation

Example Usage

# Train the model
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)

# Predict sentiment for custom text
my_review = "I loved this movie! It was amazing."
score = naive_bayes_predict(my_review, logprior, loglikelihood)
sentiment = "Positive" if score > 0 else "Negative"
print(f"Predicted sentiment: {sentiment}")

Results

The model achieves 86.06% accuracy on the test set, demonstrating effective sentiment classification using the Naive Bayes approach. The implementation successfully handles:

Text preprocessing and normalization
Feature extraction from text data
Probabilistic classification
Handling of unseen words through smoothing

Technical Details

Algorithm Complexity

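Training: O(N + V), where N is the total number of tokens across the training documents and V is the vocabulary size (one pass to count tokens, one pass over the vocabulary to compute log likelihoods)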
Prediction: O(W) where W is the number of words in the input text

Key Design Decisions

Use of log probabilities to prevent numerical underflow
Laplace smoothing to handle zero probabilities
Stemming to reduce vocabulary size and improve generalization
Stopword removal to focus on meaningful words
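The first decision is easy to see numerically. The sketch below uses made-up probabilities: multiplying 100 of them underflows to exactly zero in double precision, while the sum of their logs stays finite.

```python
import math

# 100 made-up per-word probabilities, each 1e-5.
probs = [1e-5] * 100

# Naive product: the true value 1e-500 is far below the smallest double.
product = 1.0
for p in probs:
    product *= p

# Log-space sum: the same quantity, represented safely.
log_sum = sum(math.log(p) for p in probs)

print(product)   # underflows to 0.0
print(log_sum)   # a finite negative number
```

Once the product reaches zero, the two classes can no longer be compared, which is why the classifier sums logs instead.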

Limitations

Simple bag-of-words approach doesn't capture word order or context
No handling of negations (e.g., "not good")
Limited to binary classification (positive/negative)
Performance depends heavily on preprocessing choices

Future Improvements

Implement n-gram features to capture word sequences
Add negation handling
Experiment with different smoothing techniques
Extend to multi-class sentiment classification
Add feature importance visualization
Implement cross-validation for hyperparameter tuning
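As an illustration of the first item, bigram features could be appended to the existing unigram tokens along these lines. The name add_bigrams_sketch and the underscore-joined format are assumptions, not part of the current implementation.

```python
def add_bigrams_sketch(tokens):
    """Return the unigrams followed by underscore-joined adjacent pairs."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


features = add_bigrams_sketch(["not", "good", "movi"])
print(features)
```

A feature such as not_good would let the classifier separate "not good" from "good". Since NLTK's English stopword list includes "not", the stopword step would have to keep it for this to work.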

License

This project is provided as-is for educational purposes.

Acknowledgments

IMDB dataset provided by Kaggle
NLTK library for natural language processing tools
Kagglehub for dataset access
