For a fuller walkthrough of the project, see the Google Colab notebook uploaded in this repository. It contains detailed explanations and execution steps for the workflow.
This repository contains a sentiment analysis project that uses Natural Language Processing (NLP) and a Naive Bayes classifier to classify restaurant reviews as positive or negative.
- The dataset consists of restaurant reviews stored in a TSV file.
- Text preprocessing is performed to clean and prepare the data.
- A Bag of Words (BoW) model is used to convert text data into numerical format.
- A Naive Bayes classifier is trained on the dataset to perform sentiment classification.
- Model evaluation is done using a confusion matrix and accuracy score.
- Python
- Pandas
- NumPy
- Matplotlib
- NLTK (Natural Language Toolkit)
- Scikit-learn
Make sure Python is installed. Setting up a virtual environment is optional but recommended.
- Clone this repository:
git clone https://github.com/yourusername/restaurant-review-nlp.git
cd restaurant-review-nlp
- Install dependencies:
pip install -r requirements.txt
- Download the necessary NLTK stopwords:
import nltk
nltk.download('stopwords')
Run the script to preprocess the dataset, train the Naive Bayes model, and evaluate performance:
python sentiment_analysis.py
The dataset used is `Restaurant_Reviews.tsv`, which contains:
- A column `Review` with customer reviews.
- A column `Liked` (1 for positive, 0 for negative sentiment).
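As a minimal sketch of how a file with this layout could be read, the snippet below parses a tiny in-memory sample with pandas. The sample rows and the `load_reviews` helper are made up for illustration, and `quoting=csv.QUOTE_NONE` is an assumption that keeps quotation marks inside reviews from being treated as field delimiters; the actual script may load the file differently.

```python
import csv
import io

import pandas as pd

# Hypothetical sample rows mimicking the two-column layout of Restaurant_Reviews.tsv.
SAMPLE_TSV = (
    "Review\tLiked\n"
    "Wow... Loved this place.\t1\n"
    "Crust is not good.\t0\n"
    "The \"special\" was bland.\t0\n"
)


def load_reviews(source):
    """Read a tab-separated reviews file (path or file-like object) into a DataFrame."""
    # QUOTE_NONE treats quotation marks as ordinary characters.
    return pd.read_csv(source, delimiter="\t", quoting=csv.QUOTE_NONE)


reviews = load_reviews(io.StringIO(SAMPLE_TSV))
print(reviews.shape)
print(reviews["Liked"].value_counts().to_dict())
```

For the real dataset, the equivalent call would be `load_reviews("Restaurant_Reviews.tsv")`.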
- Load Dataset: Read the `Restaurant_Reviews.tsv` file.
- Text Cleaning & Preprocessing:
  - Remove special characters, convert text to lowercase.
  - Remove stopwords (except negations like "not").
  - Apply stemming using `PorterStemmer`.
- Feature Extraction:
  - Use `CountVectorizer` to create a Bag of Words model.
  - Convert text into a numerical matrix representation.
- Train-Test Split:
  - 80% training, 20% testing.
- Train Model:
  - Train a Multinomial Naive Bayes classifier.
- Evaluate Model:
  - Predict test data.
  - Compute accuracy score and confusion matrix.
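The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the repository's actual script: the toy reviews are invented, a small hand-written stopword set stands in for NLTK's English list so the example runs without downloads, and `clean_review` and `random_state=0` are hypothetical choices.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Assumption: a tiny stand-in for NLTK's English stopword list.
# "not" is deliberately absent, so negations survive cleaning.
STOPWORDS = {"the", "a", "an", "is", "was", "this", "it", "and", "i", "to", "of"}

stemmer = PorterStemmer()


def clean_review(text):
    """Keep letters only, lowercase, drop stopwords, and stem each word."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stemmer.stem(w) for w in words if w not in STOPWORDS)


# Hypothetical toy data; the real input is Restaurant_Reviews.tsv.
reviews = [
    "Loved this place!", "The food was great.", "Great service and tasty food.",
    "I loved the tasty pasta.", "Amazing food, great staff.",
    "Not good at all.", "The food was bad.", "Terrible service, not tasty.",
    "Bad pasta and terrible staff.", "It was not great.",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

corpus = [clean_review(r) for r in reviews]

# Bag of Words: one column per distinct token, one row per review.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()

# 80% training, 20% testing; random_state is an arbitrary choice for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.20, random_state=0
)

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(corpus[5])
print(X.shape, len(y_test))
```

With only ten toy reviews the predictions carry no statistical weight; the point is the order of operations, in particular that the vocabulary is built from the cleaned text and that "not" is still present when the classifier sees it.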
The script prints:
- Confusion matrix for training and test datasets.
- Accuracy score of the classifier.
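To show how these two metrics relate, here is a small sketch using made-up labels and predictions (not results from this project). For labels 0 and 1, scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, and accuracy equals the diagonal sum divided by the total count.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

# Rows are true labels, columns are predicted labels:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(cm)        # [[3 1]
                 #  [1 3]]
print(accuracy)  # 0.75

# Accuracy is the share of predictions that fall on the diagonal.
manual_accuracy = cm.trace() / cm.sum()
```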
Feel free to fork this repository, submit issues, and contribute improvements!
This project is open-source and available under the MIT License.