This project compares the performance of clustering algorithms on the Wholesale Customers dataset from the UCI Machine Learning Repository. The analysis combines different preprocessing techniques, several cluster counts, and multiple evaluation metrics to determine the most effective clustering configuration.
- Name: Wholesale Customers Dataset
- Source: UCI Machine Learning Repository
- Number of Features: 7
- Number of Records: 440
- Description: The dataset contains annual spending in monetary units on various product categories for customers from a wholesale distributor.

Clustering algorithms compared:
- K-Means
- Hierarchical Clustering (HCLUST)
- MeanShift

Preprocessing techniques:
- No Processing
- Normalization
- Transformation
- PCA
- Transformation + Normalization (T+N)
- Transformation + Normalization + PCA (T+N+PCA)

Numbers of clusters tested:
- 3 clusters
- 4 clusters
- 5 clusters

Evaluation metrics:
- Silhouette Score
- Calinski-Harabasz Index
- Davies-Bouldin Score
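
The grid described above (preprocessing technique, algorithm, number of clusters, three metrics) can be sketched as follows. This is a minimal illustration written with scikit-learn rather than PyCaret, and it runs on synthetic data, not the Wholesale Customers dataset. The mapping of "Normalization" to z-score scaling, "Transformation" to a Yeo-Johnson power transform, and PCA to two components are assumptions, as are all function and variable names.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans, MeanShift
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic stand-in for the spending columns (assumption: NOT the real data).
X, _ = make_blobs(n_samples=150, centers=4, n_features=6, random_state=0)


def make_preprocessors():
    """Assumed interpretations of the preprocessing names in this README."""
    def power():
        return PowerTransformer(method="yeo-johnson", standardize=False)

    return {
        "No Processing": None,
        "Normalization": StandardScaler(),
        "Transformation": power(),
        "PCA": PCA(n_components=2),
        "T+N": make_pipeline(power(), StandardScaler()),
        "T+N+PCA": make_pipeline(power(), StandardScaler(), PCA(n_components=2)),
    }


def make_model(name, k):
    if name == "K-Means":
        return KMeans(n_clusters=k, n_init=10, random_state=0)
    if name == "HCLUST":
        return AgglomerativeClustering(n_clusters=k)
    # scikit-learn's MeanShift infers the number of clusters itself,
    # so k is ignored for this algorithm.
    return MeanShift()


rows = []
for prep_name, prep in make_preprocessors().items():
    X_prep = X if prep is None else prep.fit_transform(X)
    for algo in ("K-Means", "HCLUST", "MeanShift"):
        for k in (3, 4, 5):
            labels = make_model(algo, k).fit_predict(X_prep)
            n_found = len(set(labels))
            # All three metrics need between 2 and n_samples - 1 clusters.
            if not 2 <= n_found <= len(X_prep) - 1:
                continue
            rows.append({
                "preprocessing": prep_name,
                "algorithm": algo,
                "requested_k": k,
                "found_k": n_found,
                "silhouette": silhouette_score(X_prep, labels),
                "calinski_harabasz": calinski_harabasz_score(X_prep, labels),
                "davies_bouldin": davies_bouldin_score(X_prep, labels),
            })

print(f"{len(rows)} configurations evaluated")
```

Higher Silhouette and Calinski-Harabasz values indicate better-separated clusters, while a lower Davies-Bouldin score is better. Because the metrics are computed on the preprocessed data, scores from different preprocessing techniques are not measured in the same feature space.
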

Best configuration found:

| Criterion | Best Value |
|---|---|
| Algorithm | MeanShift |
| Number of Clusters | 3 |
| Silhouette Score | 0.9076 |
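
The table reports the best Silhouette Score, so the winning configuration appears to be the row with the highest silhouette value. A minimal sketch of that selection using only the standard library is shown below. The column names and the numeric values are hypothetical placeholders, not the contents of the project's results file.

```python
import csv
import io

# Hypothetical results table: column names and values are placeholders.
RESULTS_CSV = """algorithm,preprocessing,clusters,silhouette
K-Means,No Processing,3,0.10
HCLUST,Normalization,4,0.20
MeanShift,PCA,3,0.30
"""

rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))

# A higher Silhouette Score is better, so take the maximum.
best = max(rows, key=lambda row: float(row["silhouette"]))

print("Best Algorithm:", best["algorithm"])
print("Best Clusters:", best["clusters"])
print("Best Silhouette:", best["silhouette"])
```

To rank by Davies-Bouldin Score instead, `min` would replace `max`, since lower values are better for that metric.
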
All evaluations were performed using the PyCaret library.
All model evaluations are also visualized using grouped bar plots for:
- Silhouette Score
- Calinski-Harabasz Index
- Davies-Bouldin Score

The plots show each model's performance across the preprocessing techniques and numbers of clusters.
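
A grouped bar plot of this kind can be sketched with matplotlib as below: one group per preprocessing technique and one bar per cluster count. The scores are placeholder values chosen only to make the plot render, and the title, subset of techniques, and file name are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

preprocessing = ["No Processing", "Normalization", "T+N"]
cluster_counts = [3, 4, 5]

# Placeholder scores: rows are preprocessing techniques, columns are cluster counts.
scores = np.array([
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
    [0.3, 0.4, 0.5],
])

x = np.arange(len(preprocessing))
width = 0.8 / len(cluster_counts)

fig, ax = plt.subplots(figsize=(7, 4))
for j, k in enumerate(cluster_counts):
    # Shift each cluster count's bars so the group is centred on its tick.
    offset = (j - (len(cluster_counts) - 1) / 2) * width
    ax.bar(x + offset, scores[:, j], width, label=f"{k} clusters")

ax.set_xticks(x)
ax.set_xticklabels(preprocessing)
ax.set_ylabel("Silhouette Score")
ax.set_title("K-Means (placeholder values)")
ax.legend()
fig.tight_layout()
# fig.savefig("kmeans_silhouette.png")  # hypothetical output file name
```
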
- `clustering_results.csv`: final result table with all configurations
- Saved plots for each metric/model
- Jupyter Notebook / Colab Notebook for reproducibility