A machine learning project that classifies product reviews as fake or real using TF-IDF vectorization, Logistic Regression, and behavioral text features such as exclamation count, sentiment, and repeated promotional phrases.
It also includes a Streamlit app for interactive, real-time predictions.
- Text cleaning and normalization pipeline
- Hybrid feature extraction:
  - TF-IDF (1–2 grams)
  - Numeric sentiment & behavioral features
- Interpretable Logistic Regression model
- Evaluation metrics: Confusion Matrix, ROC, and PR curves
- Interactive Streamlit app with adjustable decision threshold
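
The hybrid feature extraction listed above could be wired together roughly as follows. This is a minimal sketch using scikit-learn, not the code in `src/features.py` or `src/train.py`: the `behavioral_features` helper, the two features it computes, the toy reviews, and the label convention (1 = fake, 0 = real) are all assumptions made for illustration, and the sentiment score is left out.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def behavioral_features(texts):
    """Hypothetical numeric features: exclamation count and ALL-CAPS word count."""
    rows = []
    for text in texts:
        rows.append([
            text.count("!"),
            len(re.findall(r"\b[A-Z]{2,}\b", text)),
        ])
    return np.array(rows, dtype=float)


# TF-IDF on 1-2 grams and the scaled numeric features are stacked side by side.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
    ("behavior", make_pipeline(FunctionTransformer(behavioral_features),
                               StandardScaler())),
])

pipeline = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy data for illustration only (assumed convention: 1 = fake, 0 = real).
texts = [
    "BEST product ever!!! Buy now!!!",
    "AMAZING!!! I got this for free, must buy!!!",
    "Best purchase EVER!!! Five stars!!!",
    "The zipper broke after two weeks of daily use.",
    "Battery lasts about six hours, a bit less than advertised.",
    "Fits well, but the colour is darker than in the photos.",
]
labels = [1, 1, 1, 0, 0, 0]

pipeline.fit(texts, labels)

# Column 1 of predict_proba is the probability of class 1 (fake).
proba = pipeline.predict_proba([
    "WOW!!! Best product ever!!!",
    "The strap is comfortable but the buckle feels cheap.",
])[:, 1]
print(proba)
```

Keeping both feature branches inside one `Pipeline` means the fitted object can be saved and reloaded as a single artifact, which would fit the single `outputs/pipeline.joblib` file in the project layout.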
fake-review-detector/
├── app/
│ └── streamlit_app.py
├── src/
│ ├── clean_text.py
│ ├── features.py
│ ├── train.py
│ └── predict.py
├── data/
│ └── reviews_sample.csv
├── outputs/
│ ├── pipeline.joblib
│ ├── confusion_matrix.png
│ ├── roc_curve.png
│ └── pr_curve.png
├── requirements.txt
└── README.md
- Data Input: CSV containing `text` and `label` columns.
- Preprocessing: URL, punctuation, and HTML removal + lowercasing.
- Feature Engineering:
  - Sentiment score
  - Exclamation & ALL-CAPS detection
  - Fake-review clichés (e.g., “best product ever”)
- Modeling: Logistic Regression trained on combined features.
- Prediction: Threshold-tunable classification for FAKE vs REAL.
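
The preprocessing and feature engineering steps above can be sketched with the standard library alone. The regular expressions, the `CLICHES` list, and the function names are hypothetical and are not taken from `src/clean_text.py` or `src/features.py`. The sentiment score is omitted because the project does not say which sentiment tool it uses. The sketch also assumes the behavioral features are computed on the raw text, since cleaning strips the punctuation and casing they depend on.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
PUNCT_RE = re.compile(r"[^\w\s]")
CAPS_RE = re.compile(r"\b[A-Z]{2,}\b")

# Hypothetical cliché list; only the first phrase appears in this README.
CLICHES = ("best product ever", "highly recommend", "must buy")


def behavioral_features(raw):
    """Count signals that cleaning would erase, so run this on the raw text."""
    lowered = raw.lower()
    return {
        "exclamations": raw.count("!"),
        "all_caps_words": len(CAPS_RE.findall(raw)),
        "cliche_hits": sum(lowered.count(phrase) for phrase in CLICHES),
    }


def clean_text(raw):
    """Remove HTML tags, URLs, and punctuation, then lowercase."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    text = URL_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


review = "I got this for FREE, <b>best product ever</b>!!! See https://example.com/deal"
print(behavioral_features(review))
print(clean_text(review))
```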
The web app UI, built with Streamlit, offers the following:
- Paste or type any review text.
- Adjust decision threshold for sensitivity.
- Get immediate prediction with fake probability.
- Built-in tips to help interpret the model.
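
The adjustable decision threshold amounts to comparing the model's fake probability against a cutoff. The helper below is a hypothetical sketch of that rule, not the app's code. It assumes a review is labelled FAKE when its probability is at or above the threshold, and the `0.62` probability is a made-up value.

```python
def classify(fake_probability, threshold=0.5):
    """Return "FAKE" if the probability reaches the threshold, else "REAL"."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return "FAKE" if fake_probability >= threshold else "REAL"


# The same review flips label as the threshold moves past its probability.
fake_probability = 0.62
for threshold in (0.3, 0.5, 0.7):
    print(threshold, classify(fake_probability, threshold))
```

Lowering the threshold flags more reviews as fake (higher sensitivity, more false alarms), while raising it flags fewer.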
On the bundled sample data (balanced, synthetic), the model achieves ROC AUC ≈ 1.00 and average precision (AP) ≈ 1.00.
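
To make the two reported numbers concrete, here is a small pure-Python sketch of how ROC AUC and average precision can be computed from labels and scores. It is illustrative only and is not the project's evaluation code. The labels and scores are made-up toy values, and the AP function assumes no tied scores.

```python
def roc_auc(labels, scores):
    """Probability that a random fake (1) outscores a random real (0); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at each positive, ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive label")
    return sum(precisions) / len(precisions)


# Toy example: one real review (0.7) outranks one fake review (0.6).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(roc_auc(labels, scores), average_precision(labels, scores))

# Perfectly separated scores give 1.0 for both metrics.
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]),
      average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```

A value of 1.00 for both metrics means every fake review in the sample received a higher score than every real one.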
python -m venv .venv
# Activate
.venv\Scripts\activate # (Windows)
# source .venv/bin/activate # (macOS/Linux)
pip install -r requirements.txt
# Train the model
python src/train.py --csv data/reviews_sample.csv --outdir outputs
# Predict a single review
python src/predict.py --pipeline outputs/pipeline.joblib --text "I got this for free, best product ever!!!"
# Launch the app
streamlit run app/streamlit_app.py
- Excessive punctuation, emotional exaggeration, or ALL-CAPS usage strongly correlates with fake reviews.
- Real reviews tend to use a neutral tone and include product-specific feedback.
- The combination of linguistic + behavioral features improves reliability over text-only models.
- Integrate a larger, real-world labeled dataset.
- Replace TF-IDF with contextual embeddings (BERT/SentenceTransformer).
- Deploy via Streamlit Cloud or Hugging Face Spaces.
- Add explainability (SHAP/LIME) for feature-level insights.