Fake Review Detector (NLP + Streamlit)

A machine learning project that detects fake vs real product reviews using TF-IDF vectorization, Logistic Regression, and behavioral text features such as exclamation count, sentiment, and repeated promotional phrases.
It also includes a sleek Streamlit app for interactive real-time predictions.

Features

Text cleaning and normalization pipeline
Hybrid feature extraction:
- TF-IDF (1–2 grams)
- Numeric sentiment & behavioral features
Interpretable Logistic Regression model
Evaluation metrics: Confusion Matrix, ROC, and PR curves
Interactive Streamlit app with adjustable decision threshold
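
The hybrid setup above can be sketched with scikit-learn as follows. This is a minimal illustration, not the code in src/train.py: the `numeric_features` helper, the two features it computes, the `build_pipeline` name, and the toy reviews are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def numeric_features(texts):
    """Hypothetical behavioral features: exclamation count and ALL-CAPS word count."""
    rows = []
    for text in texts:
        words = text.split()
        rows.append([
            text.count("!"),
            sum(1 for w in words if len(w) > 1 and w.isupper()),
        ])
    return np.array(rows, dtype=float)


def build_pipeline():
    # Concatenate sparse TF-IDF columns (unigrams + bigrams) with the dense numeric columns.
    features = FeatureUnion([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
        ("numeric", FunctionTransformer(numeric_features)),
    ])
    return Pipeline([
        ("features", features),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Made-up examples for illustration only (1 = fake, 0 = real).
texts = [
    "BEST PRODUCT EVER!!! Buy now!!!",
    "Amazing!!! Best product ever, five stars!!!",
    "Battery lasts about two days; the strap feels a bit stiff.",
    "Arrived on time. The zipper broke after a month of use.",
]
labels = [1, 1, 0, 0]

pipeline = build_pipeline()
pipeline.fit(texts, labels)
fake_probability = pipeline.predict_proba(["Best product ever!!!"])[0, 1]
print(f"P(fake) = {fake_probability:.2f}")
```

Keeping both feature branches inside one Pipeline means a single fitted object can be saved and reused for prediction, which is presumably the role of outputs/pipeline.joblib.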

Folder Structure

fake-review-detector/
├── app/
│   └── streamlit_app.py
├── src/
│   ├── clean_text.py
│   ├── features.py
│   ├── train.py
│   └── predict.py
├── data/
│   └── reviews_sample.csv
├── outputs/
│   ├── pipeline.joblib
│   ├── confusion_matrix.png
│   ├── roc_curve.png
│   └── pr_curve.png
├── requirements.txt
└── README.md

How It Works

Data Input: CSV containing text and label columns.
Preprocessing: removes URLs, HTML, and punctuation, and lowercases the text.
Feature Engineering:
- Sentiment score
- Exclamation & ALL-CAPS detection
- Fake-review clichés (e.g., “best product ever”)
Modeling: Logistic Regression trained on combined features.
Prediction: Threshold-tunable classification for FAKE vs REAL.
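
The preprocessing and feature-engineering steps above could look roughly like this, using only the standard library. The regular expressions, the `CLICHES` list, and the function names are assumptions for illustration; the actual logic in src/clean_text.py and src/features.py may differ. The sentiment score is left out because it would need a lexicon or an extra library.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
PUNCT_RE = re.compile(r"[^\w\s]")
# Hypothetical cliché list; the project's real phrase list is not shown in this README.
CLICHES = ("best product ever", "highly recommend", "must buy")


def clean_text(text):
    """Strip HTML, URLs, and punctuation, collapse whitespace, then lowercase."""
    text = html.unescape(text)
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def behavioral_features(raw_text):
    """Compute features on the raw text, since cleaning would erase '!' and casing."""
    words = raw_text.split()
    lowered = raw_text.lower()
    return {
        "exclamation_count": raw_text.count("!"),
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "cliche_hits": sum(lowered.count(phrase) for phrase in CLICHES),
    }


review = "BEST product ever!!! <br> See http://example.com/deal NOW"
print(clean_text(review))           # best product ever see now
print(behavioral_features(review))
```

Note that the behavioral features are taken from the uncleaned text: exclamation marks and capitalization are exactly what the cleaning step removes.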

Streamlit Interface

Below is a preview of the web app UI built with Streamlit:

[Screenshot: Fake Review Detector app, 2025-10-30]

Highlights

Paste or type any review text.
Adjust decision threshold for sensitivity.
Get immediate prediction with fake probability.
Built-in tips to help interpret the model.
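
To make the threshold behavior concrete, here is a small sketch of how a fake probability might be turned into a label. The `classify` function and the `>=` comparison are assumptions; the app may apply the threshold differently.

```python
def classify(fake_probability, threshold=0.5):
    """Return "FAKE" when the probability reaches the threshold, else "REAL"."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return "FAKE" if fake_probability >= threshold else "REAL"


# The same review (P(fake) = 0.62) flips label as the threshold rises.
for threshold in (0.3, 0.5, 0.7):
    print(threshold, classify(0.62, threshold))
```

A lower threshold flags more reviews as fake (higher recall, more false alarms); a higher one flags fewer (higher precision, more misses).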

Model Evaluation

Confusion Matrix: outputs/confusion_matrix.png

Precision-Recall Curve: outputs/pr_curve.png

ROC Curve: outputs/roc_curve.png

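On the bundled sample data (balanced, synthetic), the model achieves AUC ≈ 1.00 and AP ≈ 1.00. Scores this close to perfect reflect how cleanly the synthetic classes separate and should not be read as expected performance on real-world reviews.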
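
For reference, the metrics named in this section can be computed with scikit-learn as sketched below. The labels and scores are hand-made values chosen for illustration, not results from this project.

```python
from sklearn.metrics import average_precision_score, confusion_matrix, roc_auc_score

# Hand-made example: 1 = fake, 0 = real; y_score is the predicted P(fake).
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

# ROC AUC and average precision are computed from the raw scores.
auc = roc_auc_score(y_true, y_score)
ap = average_precision_score(y_true, y_score)

# The confusion matrix needs hard labels, so a threshold is applied first.
threshold = 0.5
y_pred = [int(score >= threshold) for score in y_score]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"AUC = {auc:.3f}, AP = {ap:.3f}")
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

AUC and AP summarize ranking quality across all thresholds, while the confusion matrix describes one specific threshold, which is why it changes when the threshold is adjusted.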

Setup & Usage

python -m venv .venv
# Activate
.venv\Scripts\activate  # (Windows)
# source .venv/bin/activate  # (macOS/Linux)

pip install -r requirements.txt

# Train the model
python src/train.py --csv data/reviews_sample.csv --outdir outputs

# Predict a single review
python src/predict.py --pipeline outputs/pipeline.joblib --text "I got this for free, best product ever!!!"

# Launch the app
streamlit run app/streamlit_app.py

Insights

In the sample data, excessive punctuation, emotional exaggeration, and ALL-CAPS usage correlate strongly with fake reviews.
Real reviews tend to use a neutral tone and include product-specific feedback.
Combining linguistic and behavioral features improves reliability over text-only models.

Future Improvements

Integrate a larger, real-world labeled dataset.
Replace TF-IDF with contextual embeddings (BERT/SentenceTransformer).
Deploy via Streamlit Cloud or Hugging Face Spaces.
Add explainability (SHAP/LIME) for feature-level insights.
