IMDB Sentiment Analysis with Naive Bayes

A machine learning project that implements a Naive Bayes classifier from scratch to perform sentiment analysis on IMDB movie reviews. The classifier achieves 86.06% accuracy on the 5,000-review test set.

Overview

This project demonstrates how to build a sentiment analysis classifier using the Naive Bayes algorithm without relying on pre-built machine learning libraries. The implementation includes text preprocessing, feature extraction, model training, and evaluation.

Features

  • Complete Naive Bayes implementation from scratch
  • Text preprocessing pipeline (tokenization, stemming, stopword removal)
  • Frequency-based feature extraction
  • Laplace smoothing for handling unseen words
  • Model evaluation with accuracy metrics
  • Custom text prediction functionality

Dataset

The project uses the "IMDB Dataset Sentiment Analysis in CSV Format" dataset from Kaggle:

  • Training set: 40,000 movie reviews
  • Test set: 5,000 movie reviews
  • Labels: 0 (negative) and 1 (positive)
  • Balance: 50% positive, 50% negative reviews

Dataset source: IMDB Dataset Sentiment Analysis in CSV Format

Requirements

Python Version

Python 3.12.2 or higher

Dependencies

numpy
pandas
nltk
kagglehub

NLTK Data

The project requires the NLTK stopwords corpus, which is automatically downloaded when running the notebook.

Installation

  1. Clone the repository:
git clone <repository-url>
cd NB
  2. Install required packages:
pip install numpy pandas nltk kagglehub
  3. Download NLTK stopwords (if not automatically downloaded):
import nltk
nltk.download('stopwords')
  4. Set up Kaggle credentials for dataset download (if needed):
    • Follow instructions at Kaggle API to set up authentication

Usage

Running the Notebook

  1. Open Naive_bayes.ipynb in Jupyter Notebook or JupyterLab
  2. Run all cells sequentially
  3. The notebook will:
    • Download the IMDB dataset from Kaggle
    • Preprocess the text data
    • Train the Naive Bayes classifier
    • Evaluate on the test set
    • Display accuracy results

Key Functions

Text Processing

process_text(text)

Cleans raw text by removing URLs, hashtag symbols, punctuation, and stopwords, then stems the remaining words.

Word Frequency Counting

count_words(result, text, ys)

Builds a frequency dictionary mapping (word, sentiment) pairs to their occurrence counts.
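
The counting step can be sketched as follows. This is a minimal illustration, not the notebook's code: the name `count_words_sketch` and the `tokenize` parameter are hypothetical, and whitespace splitting stands in for the real preprocessing so the example runs on its own.

```python
def count_words_sketch(result, texts, ys, tokenize=str.split):
    """Accumulate (word, label) -> count into `result` and return it.

    `tokenize` is a stand-in (assumption) for the project's process_text.
    """
    for text, y in zip(texts, ys):
        for word in tokenize(text):
            pair = (word, y)
            # Start at 0 for unseen pairs, then increment.
            result[pair] = result.get(pair, 0) + 1
    return result


# Toy data: label 1 = positive, 0 = negative.
freqs = count_words_sketch({}, ["good movie", "bad movie", "good good"], [1, 0, 1])
print(freqs)
```

Keying on the (word, sentiment) pair keeps both class counts in a single dictionary, so a missing pair simply means a count of zero.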

Training

train_naive_bayes(freqs, train_x, train_y)

Trains the Naive Bayes classifier and returns logprior and loglikelihood parameters.
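
A possible shape for the training step is sketched below, assuming labels are exactly 0 and 1 and that `freqs` maps (word, label) pairs to counts. The function name `train_naive_bayes_sketch` is hypothetical and the notebook's actual implementation may differ in detail.

```python
import math


def train_naive_bayes_sketch(freqs, train_x, train_y):
    """Return (logprior, loglikelihood) from (word, label) counts.

    `train_x` is unused here; it is kept only to mirror the documented signature.
    """
    vocab = {word for word, _ in freqs}
    v = len(vocab)

    # Total word occurrences per class.
    n_pos = sum(count for (_, y), count in freqs.items() if y == 1)
    n_neg = sum(count for (_, y), count in freqs.items() if y == 0)

    # Document counts per class (assumes every label is 0 or 1).
    d_pos = sum(1 for y in train_y if y == 1)
    d_neg = len(train_y) - d_pos
    logprior = math.log(d_pos) - math.log(d_neg)

    loglikelihood = {}
    for word in vocab:
        # Laplace smoothing: add 1 to each count, add V to each denominator.
        p_w_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + v)
        p_w_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + v)
        loglikelihood[word] = math.log(p_w_pos / p_w_neg)
    return logprior, loglikelihood


# Toy counts, not the IMDB data.
toy_freqs = {("good", 1): 3, ("movie", 1): 1, ("bad", 0): 1, ("movie", 0): 1}
toy_logprior, toy_loglikelihood = train_naive_bayes_sketch(toy_freqs, None, [1, 0, 1])
print(toy_logprior, toy_loglikelihood)
```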

Prediction

naive_bayes_predict(text, logprior, loglikelihood)

Predicts sentiment for a given text string.
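
As a rough sketch of the scoring rule, the prediction is the log prior plus the log likelihood of each token, with unknown words contributing nothing. The name `naive_bayes_predict_sketch`, the `tokenize` parameter, and the toy weights are assumptions for illustration; the real function would run the project's preprocessing first.

```python
def naive_bayes_predict_sketch(text, logprior, loglikelihood, tokenize=str.split):
    """Return the log-odds score; a score above 0 is read as positive."""
    score = logprior
    for word in tokenize(text.lower()):
        # Words outside the training vocabulary add 0 to the score.
        score += loglikelihood.get(word, 0.0)
    return score


# Hypothetical parameters, chosen by hand rather than learned.
demo_loglikelihood = {"good": 1.0, "bad": -1.5}
score_mixed = naive_bayes_predict_sketch("good good bad", 0.0, demo_loglikelihood)
score_unknown = naive_bayes_predict_sketch("bad unknown", 0.0, demo_loglikelihood)
print(score_mixed, score_unknown)
```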

Testing

test_naive_bayes(test_x, test_y, logprior, loglikelihood)

Evaluates the classifier on test data and returns accuracy.
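
Accuracy here is the fraction of test reviews whose predicted label matches the true label. The sketch below assumes the score-above-zero decision rule; `evaluate_accuracy_sketch` and `predict_score` are hypothetical names standing in for `test_naive_bayes` and `naive_bayes_predict`, and the weights are toy values.

```python
def predict_score(text, logprior, loglikelihood):
    """Stand-in scorer: log prior plus per-word log likelihoods."""
    return logprior + sum(loglikelihood.get(w, 0.0) for w in text.lower().split())


def evaluate_accuracy_sketch(test_x, test_y, logprior, loglikelihood):
    """Return the share of examples where the predicted label equals the true one."""
    correct = 0
    for text, y in zip(test_x, test_y):
        y_hat = 1 if predict_score(text, logprior, loglikelihood) > 0 else 0
        correct += int(y_hat == y)
    return correct / len(test_y)


# Toy evaluation: the third review is misclassified on purpose.
eval_loglikelihood = {"good": 1.0, "bad": -1.5}
accuracy = evaluate_accuracy_sketch(
    ["good film", "bad film", "good bad"], [1, 0, 1], 0.0, eval_loglikelihood
)
print(accuracy)
```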

Methodology

Text Preprocessing Pipeline

  1. URL Removal: Removes hyperlinks using regex
  2. Hashtag Handling: Removes hashtag symbols while preserving words
  3. Tokenization: Splits text into individual words using TweetTokenizer
  4. Punctuation Removal: Strips punctuation from tokens
  5. Alphabetic Filtering: Keeps only alphabetic words
  6. Stopword Removal: Removes common English stopwords
  7. Stemming: Reduces words to their root form using Porter Stemmer
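
The steps above can be sketched with the standard library alone. This is a simplified stand-in, not the notebook's `process_text`: a regex replaces TweetTokenizer, a small hand-written set replaces the NLTK stopword corpus, and the `stem` parameter defaults to a no-op where the project uses the Porter Stemmer. Lowercasing is also an assumption.

```python
import re

# Tiny illustrative stopword set (assumption); the project uses NLTK's corpus.
TOY_STOPWORDS = {"a", "an", "and", "i", "is", "it", "of", "the", "this", "to", "was"}


def process_text_sketch(text, stopwords=TOY_STOPWORDS, stem=lambda w: w):
    """Return cleaned tokens following the seven pipeline steps in simplified form."""
    text = re.sub(r"https?://\S+", "", text)       # 1. remove URLs
    text = text.replace("#", "")                   # 2. drop '#', keep the word
    tokens = re.findall(r"[a-z]+", text.lower())   # 3-5. tokenize, strip punctuation,
                                                   #      keep alphabetic runs only
    kept = [t for t in tokens if t not in stopwords]  # 6. remove stopwords
    return [stem(t) for t in kept]                 # 7. stem (no-op by default)


tokens = process_text_sketch("I loved this movie! See https://example.com #great")
print(tokens)
```

To get closer to the described pipeline, `nltk.stem.PorterStemmer().stem` can be passed as `stem` and `set(nltk.corpus.stopwords.words('english'))` as `stopwords`, once NLTK and its stopword data are installed.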

Naive Bayes Algorithm

The classifier uses:

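  • Log Prior: Log ratio of the number of positive to negative training documents
  • Log Likelihood: For each word, the log ratio of its smoothed probability under the positive class to its smoothed probability under the negative class
  • Laplace Smoothing: Adds 1 to every word count so that a word absent from one class never gets zero probability
  • Prediction: Sums the logprior and the loglikelihoods of all words in a document; a score above 0 is classified as positive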
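
Written out, the quantities above take the following standard form. The symbols are introduced here for illustration and are assumptions, not names from the notebook: D_pos and D_neg are the numbers of positive and negative training documents, freq(w, c) is the count of word w in class c, N_c is the total word count in class c, and V is the vocabulary size.

```latex
\begin{align}
\text{logprior} &= \log \frac{D_{pos}}{D_{neg}} \\
P(w \mid c) &= \frac{\mathrm{freq}(w, c) + 1}{N_c + V}, \qquad c \in \{pos, neg\} \\
\lambda(w) &= \log \frac{P(w \mid pos)}{P(w \mid neg)} \\
\text{score}(d) &= \text{logprior} + \sum_{w \in d} \lambda(w)
\end{align}
```

A document d is labeled positive when score(d) > 0. With the balanced dataset described earlier, the logprior is close to 0, so the word terms decide the outcome.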

Model Performance

  • Test Accuracy: 86.06%
  • Vocabulary Size: 89,511 unique words
  • Training Samples: 40,000 reviews
  • Test Samples: 5,000 reviews

Project Structure

NB/
├── Naive_bayes.ipynb    # Main notebook with implementation
└── README.md            # Project documentation

Example Usage

# Train the model
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)

# Predict sentiment for custom text
my_review = "I loved this movie! It was amazing."
score = naive_bayes_predict(my_review, logprior, loglikelihood)
sentiment = "Positive" if score > 0 else "Negative"
print(f"Predicted sentiment: {sentiment}")

Results

The model achieves 86.06% accuracy on the test set, demonstrating effective sentiment classification using the Naive Bayes approach. The implementation successfully handles:

  • Text preprocessing and normalization
  • Feature extraction from text data
  • Probabilistic classification
  • Handling of unseen words through smoothing

Technical Details

Algorithm Complexity

  • Training: O(V × D) where V is vocabulary size and D is number of documents
  • Prediction: O(W) where W is the number of words in the input text

Key Design Decisions

  • Use of log probabilities to prevent numerical underflow
  • Laplace smoothing to handle zero probabilities
  • Stemming to reduce vocabulary size and improve generalization
  • Stopword removal to focus on meaningful words
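
The first decision is easy to demonstrate. In the sketch below, which uses made-up probabilities rather than values from the model, multiplying 100 small word probabilities underflows to zero in double precision, while the sum of their logarithms stays a usable number.

```python
import math

# 100 hypothetical word probabilities of 1e-5 each.
probs = [1e-5] * 100

# The true product is 1e-500, far below the smallest positive double,
# so the result collapses to 0.0.
product = math.prod(probs)

# Summing logs keeps the same information in a representable range.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```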

Limitations

  • Simple bag-of-words approach doesn't capture word order or context
  • No handling of negations (e.g., "not good")
  • Limited to binary classification (positive/negative)
  • Performance depends heavily on preprocessing choices
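
The first two limitations can be seen in a small sketch with hand-picked weights (assumptions, not learned values). Because the score is a sum over tokens, reordering the words changes nothing, and a negation word with no weight, or one removed as a stopword, leaves "not good" scoring the same as "good".

```python
def bag_of_words_score(tokens, logprior, loglikelihood):
    """Sum of per-token log likelihoods; token order has no effect."""
    return logprior + sum(loglikelihood.get(t, 0.0) for t in tokens)


# Hypothetical weights; "not" has no entry, as if dropped or unseen.
limit_loglikelihood = {"good": 1.2, "bad": -1.4}

score_not_good = bag_of_words_score(["not", "good"], 0.0, limit_loglikelihood)
score_good_not = bag_of_words_score(["good", "not"], 0.0, limit_loglikelihood)
score_good = bag_of_words_score(["good"], 0.0, limit_loglikelihood)

print(score_not_good, score_good_not, score_good)
```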

Future Improvements

  • Implement n-gram features to capture word sequences
  • Add negation handling
  • Experiment with different smoothing techniques
  • Extend to multi-class sentiment classification
  • Add feature importance visualization
  • Implement cross-validation for hyperparameter tuning
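
As one illustration of the n-gram idea, the sketch below extracts bigrams from a token list. The helper `ngrams_sketch` is hypothetical and not part of the project; the resulting tuples could be counted in the same way as single words, which would let a pair such as ("not", "good") receive its own weight.

```python
def ngrams_sketch(tokens, n=2):
    """Return all contiguous n-token tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams_sketch(["not", "good", "movie"], n=2)
print(bigrams)
```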

License

This project is provided as-is for educational purposes.

Acknowledgments

  • IMDB dataset hosted on Kaggle
  • NLTK library for natural language processing tools
  • Kagglehub for dataset access