This project performs comprehensive customer segmentation analysis on an online retail dataset using machine learning clustering techniques and RFM (Recency, Frequency, Monetary) analysis. The goal is to identify distinct customer segments to drive better customer relationship management strategies and business insights.
Source: Online Retail Dataset (Excel format)
- Records: ~540K transactions
- Time Period: 2010-2011
- Primary Market: United Kingdom (with international customers)
- Features:
- InvoiceNo: Transaction identifier
- StockCode: Product code
- Description: Product description
- Quantity: Product quantity
- InvoiceDate: Transaction date
- UnitPrice: Product price
- CustomerID: Customer identifier
- Country: Customer location
- Data Loading: Import and initial exploration of the retail dataset
- Data Cleaning:
- Handle missing values (135K missing CustomerIDs)
- Fix negative quantities by converting to absolute values
- Remove zero unit prices
- Detect and remove outliers (Quantity > 20K, UnitPrice > 10K)
- Feature Engineering:
- Calculate profit (Quantity × UnitPrice)
- Extract year and month from transaction dates
- Text preprocessing for product descriptions
- Temporal Analysis:
- Yearly and monthly profit trends
- Peak sales periods identification
- Geographic Analysis:
- Country-wise order distribution
- Revenue analysis by geography
- Customer distribution across countries
- Product Analysis:
- Top and least selling products
- Price distribution analysis
- Customer Behavior Analysis:
- Highest spending customers
- Most frequent buyers
- Purchase patterns
- Recency: Days since last purchase
- Frequency: Number of transactions
- Monetary: Total amount spent
-
K-Means Clustering
- Optimal cluster determination using Elbow Method
- Silhouette Score analysis
- Final Model: 5 clusters with StandardScaler preprocessing
-
Hierarchical Clustering
- Agglomerative clustering with Ward linkage
- Dimensionality reduction using t-SNE
- Euclidean distance metric
The analysis identifies 5 distinct customer segments based on purchasing behavior:
- Champions: High recency, frequency, and monetary values
- Loyal Customers: High frequency and monetary, moderate recency
- Potential Loyalists: Recent customers with good monetary value
- New Customers: Recent but low frequency/monetary
- At-Risk: Low recency, previously high-value customers
- 3D scatter plots for RFM visualization
- Cluster distribution plots
- Geographic heatmaps
- Temporal trend analysis
- Customer segment characteristics
- Market Dominance: UK represents ~90% of transactions and revenue
- Seasonal Patterns: Strong seasonality with peaks in certain months
- Customer Concentration: Small percentage of customers drive majority of revenue
- Geographic Diversity: 37 countries represented, with significant international presence
- Product Portfolio: Wide variety of products with clear bestsellers
- pandas, numpy: Data manipulation
- matplotlib, seaborn: Data visualization
- scikit-learn: Machine learning algorithms
- sklearn.preprocessing.StandardScaler: Feature scaling
- sklearn.cluster: K-Means and Agglomerative clustering
- sklearn.decomposition.PCA: Dimensionality reduction
- sklearn.manifold.TSNE: Non-linear dimensionality reduction- Optimal Clusters: 5 (determined via Elbow Method and Silhouette Analysis)
- Silhouette Score: Optimized for best cluster separation
- Preprocessing: StandardScaler for feature normalization
- Personalized Campaigns: Targeted marketing based on customer segments
- Retention Programs: Focus on at-risk customers
- Loyalty Programs: Reward champions and loyal customers
- Cross-selling Opportunities: Based on customer segment preferences
- Pricing Strategy: Segment-specific pricing models
- Inventory Management: Stock optimization based on customer demand patterns
- Personalization: Customized product recommendations
- Service Levels: Differentiated service based on customer value
- Communication Strategy: Segment-appropriate messaging
-
Additional Data Sources:
- Demographics (age, gender, income)
- Psychographics (lifestyle, personality)
- Geographic granularity (city-level analysis)
-
Advanced Analytics:
- Customer Lifetime Value (CLV) modeling
- Churn prediction models
- Recommendation systems
-
Real-time Segmentation:
- Dynamic customer segment updates
- Real-time personalization engine
- Automated marketing triggers
- K-Means Sensitivity: Requires pre-defining cluster count
- Outlier Impact: Sensitive to extreme values despite preprocessing
- Static Analysis: Segments based on historical data only
- Missing Demographics: Limited customer profile information
CS434.ipynb: Main Jupyter notebook containing complete analysisOnline Retail dataset.xlsx: Source dataset (not included in repo)Project Goal.pdf: Project objectives and requirementsREADME.md: This documentation file
-
Environment Setup:
pip install pandas numpy matplotlib seaborn scikit-learn
-
Data Preparation:
- Download the Online Retail dataset
- Update file paths in the notebook
- Mount Google Drive (if using Colab)
-
Execution:
- Run cells sequentially in the Jupyter notebook
- Adjust parameters as needed for your specific use case
Course: CS434 - Machine Learning/Data Science
Analysis Period: 2010-2011 Online Retail Data
Implementation: Python with scikit-learn
This project demonstrates practical application of unsupervised machine learning techniques for business intelligence and customer analytics, providing actionable insights for retail strategy optimization.