Customer segmentation using K-Means clustering on mall customer data.
This project applies K-Means clustering with K-Means++ initialization to segment mall customers by annual income and spending score. The elbow method is used to choose the number of clusters (k = 5), and the resulting clusters are visualized as a 2D scatter plot.
Implementations are provided in both Python and R.
Mall_Customers.csv — 200 records with the following columns:
| Column | Description |
|---|---|
| CustomerID | Unique identifier |
| Genre | Gender (Male / Female) |
| Age | Customer age |
| Annual Income (k$) | Yearly income in thousands of dollars |
| Spending Score (1-100) | Score assigned by the mall |
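
As a rough illustration of how the two clustering features can be pulled out of a file with this column layout, here is a minimal sketch using pandas. The sample rows, the `FEATURES` list, and the `load_features` helper are assumptions made for this example. They are not taken from `kmeans.py` or from the real dataset.

```python
import io

import pandas as pd

# Illustrative rows only: these values are made up to mirror the column
# layout described above and are NOT taken from Mall_Customers.csv.
SAMPLE_CSV = """CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
1,Male,25,20,35
2,Female,31,45,80
3,Female,47,70,15
4,Male,38,90,60
"""

# The two columns used for clustering, referenced by their exact header names.
FEATURES = ["Annual Income (k$)", "Spending Score (1-100)"]


def load_features(source):
    """Read a CSV with the columns listed above and return the two features as an array."""
    df = pd.read_csv(source)
    return df[FEATURES].to_numpy()


# Hypothetical helper applied to the in-memory sample.
X = load_features(io.StringIO(SAMPLE_CSV))
print(X.shape)  # one row per customer, two feature columns
```

Passing the path `"Mall_Customers.csv"` instead of the in-memory sample should work the same way, provided the file headers match the table above.
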
- Load the dataset and select features (Annual Income, Spending Score)
- Run K-Means for k = 1..10 and record WCSS (Within-Cluster Sum of Squares)
- Plot the elbow curve to identify optimal k
- Fit K-Means (K-Means++ initialization) with k = 5 and visualize the resulting clusters
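
The steps above can be sketched with scikit-learn roughly as follows. This is a minimal sketch, not the contents of `kmeans.py`: it uses synthetic blobs from `make_blobs` as a stand-in for the income and spending score features, and the `n_init` and `random_state` values are assumptions chosen for reproducibility. Plotting is left out to keep the example short.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the two features (assumption: not the real mall data).
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=42)

# Run K-Means for k = 1..10 and record WCSS.
# scikit-learn exposes WCSS as the fitted model's `inertia_` attribute.
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    model.fit(X)
    wcss.append(model.inertia_)

# Print the values that would be plotted as the elbow curve.
for k, value in enumerate(wcss, start=1):
    print(f"k={k:2d}  WCSS={value:.1f}")

# Fit the final model with k = 5 and get one cluster label per point.
final_model = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42)
labels = final_model.fit_predict(X)
```

Plotting `wcss` against `range(1, 11)` gives the elbow curve, and `labels` can be used to color the points in the 2D scatter plot.
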
| Tool | Purpose |
|---|---|
| 🐍 Python 3 | Primary implementation |
| 📊 scikit-learn | KMeans clustering |
| 🔢 NumPy | Numerical operations |
| 📈 Matplotlib | Visualization |
| 🐼 pandas | Data loading |
| 📉 R | Alternative implementation |

```bash
# Install dependencies
pip install numpy pandas matplotlib scikit-learn

# Run the clustering
python kmeans.py
```

For the R version:

```bash
# Requires: cluster package
Rscript kmeans.R
```

```
├── kmeans.py                        # Python K-Means implementation
├── kmeans.R                         # R K-Means implementation
├── data_preprocessing_template.py   # Generic preprocessing template (Python)
├── data_preprocessing_template.R    # Generic preprocessing template (R)
├── Mall_Customers.csv               # Dataset
└── README.md
```

- `data_preprocessing_template.py` references a generic `Data.csv` that is not included; it is a reusable template, not specific to this project.
- The R script shadows the built-in `kmeans` function by assigning the result to a variable named `kmeans`.
