
Sentiment Analysis with LSTM

This project implements a Sentiment Analysis model using LSTM (Long Short-Term Memory) networks on the IMDB movie reviews dataset.

📌 Project Details

  • Objective: Classify movie reviews as positive or negative.
  • Tech Stack: Python, TensorFlow/Keras, Pandas, NumPy, Matplotlib, Scikit-learn.
  • Approach:
    • Preprocess the text data.
    • Tokenize and pad sequences.
    • Train an LSTM neural network to learn sentiment patterns.
    • Evaluate the model performance.

📊 Dataset

  • Name: IMDB Dataset of 50K Movie Reviews.
  • Source: Kaggle
  • Description:
    • 50,000 reviews labeled as positive or negative.
    • Balanced dataset (25,000 positive / 25,000 negative).
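Loading the CSV and mapping the labels can be sketched as follows (shown on a tiny stand-in DataFrame; the Kaggle file has the same two columns, `review` and `sentiment`, with 50,000 rows):

```python
import pandas as pd

# Tiny stand-in for IMDB Dataset.csv; the real file has the same
# columns ("review", "sentiment") but 50,000 rows.
df = pd.DataFrame({
    "review": ["A wonderful, moving film!", "Terrible plot and boring acting."],
    "sentiment": ["positive", "negative"],
})

# Map labels: positive -> 1, negative -> 0
df["label"] = df["sentiment"].map({"positive": 1, "negative": 0})
```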

📈 Training Results

The model was trained for 5 epochs with the following performance:

  • Epoch 1: Accuracy = 51.48%, Loss = 0.6938, Val Accuracy = 54.51%, Val Loss = 0.6889
  • Epoch 2: Accuracy = 56.55%, Loss = 0.6736, Val Accuracy = 53.46%, Val Loss = 0.7004
  • Epoch 3: Accuracy = 56.17%, Loss = 0.6803, Val Accuracy = 58.44%, Val Loss = 0.6674
  • Epoch 4: Accuracy = 59.64%, Loss = 0.6568, Val Accuracy = 60.69%, Val Loss = 0.6646
  • Epoch 5: Accuracy = 72.97%, Loss = 0.5580, Val Accuracy = 82.66%, Val Loss = 0.4459

Final Test Accuracy: 82.66%
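The curves above can be plotted directly from the logged values (a sketch using Matplotlib, which is already in the tech stack; the output filename `accuracy.png` is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also works headless
import matplotlib.pyplot as plt

# Accuracy values copied from the training log above
epochs = [1, 2, 3, 4, 5]
acc = [0.5148, 0.5655, 0.5617, 0.5964, 0.7297]
val_acc = [0.5451, 0.5346, 0.5844, 0.6069, 0.8266]

plt.plot(epochs, acc, marker="o", label="training accuracy")
plt.plot(epochs, val_acc, marker="o", label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.savefig("accuracy.png")
```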

βš™οΈ Workflow

  1. Data Loading: Import the dataset into a Pandas DataFrame.
  2. Preprocessing:
    • Remove HTML tags and special characters.
    • Convert text to lowercase.
    • Map labels: positive → 1, negative → 0.
  3. Tokenization & Padding:
    • Convert words into integer sequences using the Keras Tokenizer.
    • Pad sequences so every review has the same length (200 tokens).
  4. Model Building:
    • Use Embedding + LSTM + Dense layers.
  5. Training:
    • Optimizer: Adam
    • Loss: Binary Crossentropy
    • Epochs: 5
    • Batch Size: 128
  6. Evaluation:
    • Measure accuracy on test set.
    • Plot training vs validation accuracy.
  7. Prediction: Test the model with custom reviews.
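Steps 2–3 can be sketched without TensorFlow to show the transformations involved (the function names are illustrative; `pad_sequence` mimics the pre-padding/pre-truncation defaults of Keras `pad_sequences`):

```python
import re

def clean_review(text: str) -> str:
    """Step 2: strip HTML tags and special characters, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags like <br />
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip().lower()

def pad_sequence(seq, maxlen=200):
    """Step 3: pad/truncate one integer sequence to a fixed length.

    Mimics the Keras pad_sequences defaults: zeros are prepended,
    and overly long sequences are truncated from the front.
    """
    seq = list(seq)[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq
```

For example, `clean_review("Loved it!<br />A MUST-see.")` yields `"loved it a must see"`, and `pad_sequence([5, 9], maxlen=4)` returns `[0, 0, 5, 9]`.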

πŸ—οΈ Model Architecture

  • Embedding Layer: Input dim = 10,000, Output dim = 64, Input length = 200
  • LSTM Layer: 128 units
  • Dropout Layer: 0.5
  • Dense Layer: 64 units, ReLU activation
  • Dropout Layer: 0.3
  • Output Layer: 1 unit, Sigmoid activation
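As a sanity check on the layer sizes above, the trainable-parameter counts can be derived by hand (assuming the embedding feeds the LSTM directly, as in the workflow; dropout layers add no parameters):

```python
vocab_size, emb_dim, lstm_units, dense_units = 10_000, 64, 128, 64

embedding = vocab_size * emb_dim                 # one 64-dim vector per token id
# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm = 4 * ((emb_dim + lstm_units) * lstm_units + lstm_units)
dense = lstm_units * dense_units + dense_units   # weights + biases
output = dense_units * 1 + 1                     # single sigmoid unit

total = embedding + lstm + dense + output        # 747,137 trainable parameters
```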

▶️ How to Run

  1. Clone or download this repository.
  2. Download the dataset and place IMDB Dataset.csv in the project folder.
  3. Install dependencies:
    pip install -r requirements.txt
    
  4. Run the script:
    python main.py
    

🚀 Future Improvements

  • Use pre-trained embeddings like GloVe.
  • Try BiLSTM/GRU architectures.
  • Experiment with transformers (BERT) for better accuracy.
  • Deploy the model as a web app using Flask or Streamlit.