This project analyzes Amazon product reviews to detect fake reviews and perform customer segmentation using machine learning techniques.
- Project Overview
- Dependencies
- Data Preprocessing
- Customer Segmentation
- Fake Review Detection
- Results and Analysis
The project aims to:
- Analyze customer behavior through review patterns
- Segment customers based on their reviewing patterns
- Detect potentially fake reviews using unsupervised learning
- Perform sentiment analysis on reviews
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
- Text Processing:
# NLTK preprocessing
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
- Custom Functions:
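from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet

stop_words = set(stopwords.words('english'))
# Assumed mapping (not shown here): first letter of the Penn Treebank tag -> WordNet POS
pos_dict = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}

def token_stop_pos(text):
    tags = pos_tag(word_tokenize(text))
    newlist = []
    for word, tag in tags:
        if word.lower() not in stop_words:
            newlist.append((word, pos_dict.get(tag[0])))
    return newlist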
- Feature Engineering:
- Review counts per customer
- Average expenditure
- Positive/negative review ratio
- Review length statistics
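The features listed above can be derived with a single pandas `groupby`. The following is a minimal sketch, not the notebook's actual code: the column names (`customer_id`, `price`, `rating`, `review_text`) and the rule that a rating of 4 or more counts as positive are assumptions made for illustration, and the positive/negative ratio is expressed here as the share of positive reviews.

```python
import pandas as pd

# Hypothetical review table; the column names are illustrative assumptions.
reviews = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b", "c"],
    "price":       [10.0, 20.0, 30.0, 5.0, 15.0, 8.0],
    "rating":      [5, 4, 1, 2, 1, 5],
    "review_text": ["love it", "works well", "broke fast",
                    "bad", "really bad item", "great"],
})

# Assumption: a rating of 4 or more counts as a positive review.
reviews["is_positive"] = reviews["rating"] >= 4
# Review length measured in whitespace-separated words.
reviews["review_length"] = reviews["review_text"].str.split().str.len()

# One row per customer with the aggregated behavioral features.
customer_features = reviews.groupby("customer_id").agg(
    review_count=("rating", "size"),
    avg_expenditure=("price", "mean"),
    positive_ratio=("is_positive", "mean"),
    mean_review_length=("review_length", "mean"),
    max_review_length=("review_length", "max"),
)
print(customer_features)
```

Named aggregation keeps each output column tied to an explicit source column and function, which makes the feature table easy to extend.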
- K-means Clustering:
kmeans = KMeans(n_clusters=6, random_state=47)
clusters = kmeans.fit_predict(dfs1[columns])
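K-means relies on Euclidean distance, so features on very different scales (for example review counts versus ratios) can dominate the clustering. The dependency list imports `StandardScaler`, but the snippet above does not show whether `dfs1` was already scaled. The sketch below assumes it was not and standardizes first. The data is synthetic, the feature names are hypothetical, and `n_init=10` is set explicitly only to keep behavior stable across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(47)

# Synthetic stand-in for the per-customer feature table (dfs1 above).
features = pd.DataFrame({
    "review_count": rng.integers(1, 200, size=300),
    "avg_expenditure": rng.uniform(5, 500, size=300),
    "positive_ratio": rng.uniform(0, 1, size=300),
})

# Standardize each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)

# Same cluster count and seed as the snippet above.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=47)
features["cluster"] = kmeans.fit_predict(scaled)

# Cluster sizes, ordered by cluster label.
print(features["cluster"].value_counts().sort_index())
```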
- Text Vectorization:
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9)
text_features = tfidf.fit_transform(dfl['summaryreview_lemma'].values)
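To illustrate what these two parameters do, here is a small sketch on a made-up three-review corpus (the texts are not from the dataset). `ngram_range=(1, 2)` adds word pairs next to single words, and `max_df=0.9` drops any term that appears in more than 90% of the documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only.
docs = [
    "great product fast shipping",
    "terrible product broke quickly",
    "great quality product",
]

tfidf = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9)
text_features = tfidf.fit_transform(docs)

vocab = tfidf.vocabulary_
# "product" occurs in every document, so max_df=0.9 removes it.
print("product" in vocab)
# Bigrams are kept as features of their own.
print("great quality" in vocab)
# Sparse matrix: one row per document, one column per retained term.
print(text_features.shape)
```

Note that bigrams containing a pruned word (such as "great product") are kept, because the document-frequency cutoff is applied to each n-gram separately.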
- Anomaly Detection:
from sklearn.ensemble import IsolationForest
isolation_forest = IsolationForest(contamination=0.1)
outlier_labels = isolation_forest.fit_predict(outlier_detection_df)
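`fit_predict` returns `-1` for rows the forest considers outliers and `1` for inliers, and `contamination=0.1` sets the score threshold so that roughly 10% of rows are flagged. A minimal sketch on synthetic data follows. The array `X` only stands in for `outlier_detection_df`, and `random_state` is added here as an assumption so that repeated runs flag the same rows.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for outlier_detection_df: 200 rows, 3 numeric features.
X = rng.normal(size=(200, 3))

forest = IsolationForest(contamination=0.1, random_state=0)
labels = forest.fit_predict(X)  # -1 = outlier, 1 = inlier

# Boolean mask that can be attached to the review table as a flag column.
is_outlier = labels == -1
print(f"flagged {is_outlier.sum()} of {len(X)} rows")
```

Without a fixed `random_state`, the set of flagged reviews can change slightly between runs, which matters when the flags are inspected manually.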
- Cluster 0: Moderate reviewers
- Cluster 1: Negative reviewers (potentially fake)
- Cluster 2: High-volume reviewers
- Clusters 3-5: Various authentic patterns
- Extreme sentiment scores
- Unusual review lengths
- Irregular voting patterns
- Suspicious customer behavior
- Customer behavior patterns can effectively identify suspicious reviewing activity
- Combined analysis of text features and numerical metrics improves fake review detection
- Unsupervised techniques (K-means clustering and Isolation Forest) can segment customers and flag anomalous reviews without labeled data
- Include more features for analysis
- Implement supervised learning with labeled data
- Add real-time detection capabilities
- Enhance visualization techniques
For detailed implementation and code examples, please refer to the Jupyter notebook.
The dataset used in this project is not included in this repository.
👉 You can access the original dataset from the following source:
- Ni, J., Li, J., & McAuley, J. (2019, November). Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 188-197).
- Amazon Product Data by Julian McAuley: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/
If you're using a modified or custom-labeled version of the dataset, please contact the author for more information.
Zahra Hasannejad
🌐 GitHub: Zahra Hasannejad