Simon project overhaul #5

Open
wants to merge 6 commits into base: main
Binary file removed Final Project Rubric - Machine Learning.xlsx
26 changes: 14 additions & 12 deletions README.md
@@ -1,17 +1,19 @@
# machine_learning_project-unsupervised-learning
# Unsupervised Learning Project

## Project Outcomes
- Unsupervised Learning: apply unsupervised learning techniques to a wholesale dataset. The project involves four main parts: exploratory data analysis and pre-processing, KMeans clustering, hierarchical clustering, and PCA.
### Duration:
Approximately 1 hour and 40 minutes
### Project Description:
In this project, we will apply unsupervised learning techniques to a real-world data set and use data visualization tools to communicate the insights gained from the analysis.
## Project/Goals
(fill in your project description and goals here)

The data set for this project is the "Wholesale Data" dataset containing information about various products sold by a grocery store.
The project will involve the following tasks:
## Process
### (fill in your step 1)
- (fill in your step 1 details)

- Exploratory data analysis and pre-processing: We will import and clean the data sets, analyze and visualize the relationships between the different variables, handle missing values and outliers, and perform feature engineering as needed.
- Unsupervised learning: We will use the Wholesale Data dataset to perform k-means clustering, hierarchical clustering, and principal component analysis (PCA) to identify patterns and group similar data points together. We will determine the optimal number of clusters and communicate the insights gained through data visualization.
### (fill in your step 2)

The ultimate goal of the project is to gain insights from the dataset and communicate these insights to stakeholders, using appropriate visualizations and metrics to support informed decisions on the business questions asked.
## Results
- (fill in your answers to the [prompts in `assignment.md`](assignment.md#5b---results))

## Challenges
- (fill in your challenges ([prompt in `assignment.md`](assignment.md#5c---challenges)))

## Future Goals
- (what would you do if you had more time/data/resources? ([prompt in `assignment.md`](assignment.md#5d---future-goals)))
78 changes: 78 additions & 0 deletions assignment.md
@@ -0,0 +1,78 @@
Welcome to the Unsupervised Learning project! In this project, you'll apply a few different unsupervised learning techniques to gain insights from a real-world dataset.

## Context
Clustering is a core method in unsupervised learning that allows us to uncover hidden patterns and groupings in data *without* relying on labels. In this project, you will explore different clustering techniques and compare their effectiveness.
Optionally, you'll use Principal Component Analysis (PCA) to reduce dimensionality and visualize the clusters you identify.

## Learning Objectives
By completing this project, you will have a chance to:
- Refine and apply your skills in cleaning and preparing datasets for unsupervised learning
- Practice applying clustering algorithms (K-Means, Hierarchical Clustering, DBSCAN)
- Practice evaluating clustering performance using metrics and visualizations
- (stretch) Learn some additional approaches for clustering visualization, including using PCA for dimensionality reduction, PCA biplots, and silhouette plots

## Business Case
Imagine you are a data scientist working for a company looking to segment its customer base. You have been given a dataset containing various customer metrics, such as business type, location, and spending history. Your task is to use clustering techniques to identify distinct customer groups and provide actionable insights based on these clusters.

# Part 1 - EDA
- Explore the dataset you've been given
- Look for missing values, outliers, inconsistencies, etc.
- Examine the distribution of numerical and categorical features
- Normalize/standardize the data if required for clustering algorithms
Complete the `1 - EDA.ipynb` notebook and fill in **Step 1** in your `README.md` to demonstrate your completion of these tasks.
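As a rough sketch, the EDA and scaling steps might look like the following. The tiny inline frame and its column names are illustrative stand-ins for the wholesale CSV you were given:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the wholesale dataset -- in the notebook you would
# load your real file instead, e.g. with pd.read_csv(...).
df = pd.DataFrame({
    "Fresh":   [12669, 7057, 6353, 13265, 22615],
    "Milk":    [9656, 9810, 8808, 1196, 5410],
    "Grocery": [7561, 9568, 7684, 4221, 7198],
})

print(df.isna().sum())    # missing values per column
print(df.describe())      # ranges and quartiles help spot outliers

# Distance-based methods (K-Means, DBSCAN, PCA) are scale-sensitive,
# so standardize the numeric features before clustering.
X = StandardScaler().fit_transform(df)
```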

# Part 2 - K-Means Clustering
- Apply the K-Means algorithm to the dataset
- Use the elbow method to determine the optimal number of clusters
- Analyze the characteristics of the created clusters

Complete the `2 - K-Means Clustering.ipynb` notebook and fill in **Step 2** in your `README.md` to demonstrate your completion of these tasks.
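A minimal sketch of the elbow loop, run on synthetic blobs standing in for your preprocessed features (the `k` range matches the suggestion above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the preprocessed wholesale features.
X, _ = make_blobs(n_samples=200, centers=4, random_state=42)

inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squares (distortion)

# Plot k against inertias.values() and look for the bend ("elbow"),
# where adding clusters stops buying much reduction in distortion.
```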

# Part 3 - Hierarchical Clustering
- Perform hierarchical clustering on the dataset
- Use a dendrogram to determine the optimal number of clusters
- Analyze the characteristics of the created clusters
Complete the `3 - Hierarchical Clustering.ipynb` notebook and fill in **Step 3** in your `README.md` to demonstrate your completion of these tasks.
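A sketch of the dendrogram workflow with SciPy (synthetic data; the cut at 3 clusters is illustrative, not a recommendation for your dataset):

```python
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=42)

Z = linkage(X, method="ward")  # agglomerative merge tree
# In the notebook, dendrogram(Z) draws the tree; a long vertical gap
# between successive merges suggests a good height at which to cut.
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
```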

# Part 4 - DBSCAN
- Apply the DBSCAN algorithm to the dataset
- Experiment with different hyperparameters to optimize clustering results
- Identify and describe the features of any noise points (outliers) detected by the algorithm
Complete the `4 - DBSCAN.ipynb` notebook and fill in **Step 4** in your `README.md` to demonstrate your completion of these tasks.
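A sketch of the DBSCAN step, on synthetic blobs with one planted outlier. The `eps` and `min_samples` values here are starting points to tune, not recommendations for your data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.5, random_state=42)
X = np.vstack([X, [[10.0, 10.0]]])  # plant one obvious outlier

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
noise_mask = db.labels_ == -1       # DBSCAN marks noise points with label -1
print("noise points:", noise_mask.sum())
```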

# Part 5 - Results and Evaluation
#### 5a - Evaluation
- Compare the results of K-Means, Hierarchical Clustering, and DBSCAN
- Use metrics such as Silhouette Score, Rand Score, Mutual Information Score, etc.
- Compare the three clustering methods as evaluated by your chosen metrics
- Evaluate how well each method aligns with the business case, and draw a conclusion about the most appropriate clustering technique for the dataset
- Complete the `5 - Evaluation.ipynb` notebook to demonstrate your completion of these tasks
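One way to sketch the comparison is to fit all three models and score each labeling with the Silhouette Score (synthetic data here; the actual values will depend on your dataset and hyperparameters):

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=42),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=1.0, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    if len(set(labels)) > 1:  # silhouette needs at least two labels
        scores[name] = silhouette_score(X, labels)
print(scores)
```

Note that Rand Score and Mutual Information compare two labelings, so without ground-truth labels they are most useful for measuring agreement between your clustering methods.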
#### 5b - Results
- Answer the following prompts [in your `README.md`](README.md#results):
1. Briefly discuss the strengths and weaknesses of each clustering method for this specific dataset
2. Which clustering method performed best, and why?
3. Were there significant differences in the clusters identified by your preferred algorithm? Describe them
4. What actionable insights can be derived from the clustering results?
<br>
*eg:*
- "frequent repeat customers tend to purchase multiple types of products"
- "X large customer exhibits rare customer behavior and may require adjusted marketing strategy"
- "customers in cluster Y frequently purchase many of the same products as each other - a recommender system could be effective for marketing to them"
#### 5c - Challenges
- Describe challenges you faced while completing the project [in your `README.md`](README.md#challenges) (troubleshooting, understanding of material before project, timeframe, etc)

#### 5d - Future Goals
- Describe potential future goals for this project [in your `README.md`](README.md#future-goals) (you should definitely return to this project after you graduate!)


# (stretch) Part 6 - PCA and Visualization

### 6a - PCA
- Apply PCA to reduce the dataset to two dimensions
- Visualize clusters as 2D scatter plots using PCA-transformed data
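A minimal PCA projection sketch (synthetic 6-feature data; in the notebook, colour the scatter by your saved cluster labels):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=42)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)             # project onto the first two components
print(pca.explained_variance_ratio_)  # variance captured by PC1 and PC2
# In the notebook: plt.scatter(X2[:, 0], X2[:, 1], c=cluster_labels)
```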

### 6b - PCA Biplots
- Create a PCA biplot to analyze the relationships between dataset features and clusters
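A biplot overlays the PCA scatter with one arrow per original feature (the loadings). A rough sketch, using synthetic data, placeholder feature names, and an arbitrary arrow scale:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=100, n_features=4, centers=3, random_state=42)
pca = PCA(n_components=2).fit(X)
points = pca.transform(X)
loadings = pca.components_.T  # shape (n_features, 2): one arrow per feature

fig, ax = plt.subplots()
ax.scatter(points[:, 0], points[:, 1], s=10)
for i, (dx, dy) in enumerate(loadings):
    ax.arrow(0, 0, dx * 3, dy * 3, color="red")    # *3 is an arbitrary scale
    ax.annotate(f"feature_{i}", (dx * 3, dy * 3))  # placeholder names
fig.savefig("biplot.png")
```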

### 6c - Silhouette Plots
- Create [Silhouette Plots](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html) to increase your understanding of your clusters' Silhouette Scores, and see if any of your opinions on the performance of the different clustering models change
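The linked scikit-learn example builds the plot from per-sample scores; the core computation can be sketched as follows (synthetic data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

samples = silhouette_samples(X, labels)    # one score per point
for c in np.unique(labels):
    blade = np.sort(samples[labels == c])  # one sorted "blade" per cluster
    # plt.barh over `blade` draws that cluster's section of the plot
    print(c, round(blade.mean(), 3))
```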
File renamed without changes.
Empty file added images/.gitkeep
Empty file.
127 changes: 127 additions & 0 deletions notebooks/1 - EDA.ipynb
@@ -0,0 +1,127 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# EDA\n",
"**Goal:** prepare the dataset for clustering by identifying and addressing any data quality issues, and ensuring the data is properly scaled and formatted"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Understand dataset\n",
"- load the dataset and inspect its structure\n",
"- check datatypes and ensure they match the columns' intended uses"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# replace with imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Missing values\n",
"- check for missing values\n",
"- decide how to handle missing data (drop? fill? impute?)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Analyze feature distributions\n",
"- for numerical features\n",
" - use histograms to visualize distributions\n",
" - use box plots to identify outliers\n",
"- for categorical features\n",
"- use bar charts to understand frequencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Feature relationships\n",
"- calculate and visualize the correlation matrix for numerical features\n",
"- use scatter plots or pair plots to explore relationships between key/interesting features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. Prep for clustering\n",
"- is Normalization or Standardization needed? if yes, use it for appropriate features\n",
"- do any categorical features require one-hot encoding or label encoding? if yes, do so"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Save cleaned dataset\n",
"- save your cleaned and preprocessed dataset to an appropriate file type in your `data` directory"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
126 changes: 126 additions & 0 deletions notebooks/2 - K-Means Clustering.ipynb
@@ -0,0 +1,126 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# K-Means Clustering\n",
"**Goal:** Apply K-Means clustering to the dataset, determining the optimal number of clusters, and analyze cluster characteristics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Load preprocessed dataset\n",
"- load the dataset cleaned/preprocessed in the previous notebook"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Apply K-Means\n",
"- run the K-Means algorithm with an initial range of `k` (eg 2-10)\n",
"- save the results for each value of `k`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Determine optimal number of clusters (elbow method)\n",
"- create a distortion plot from your saved clustering results\n",
"- identify the \"elbow point\" where adding more clusters yields diminishing returns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Analyze cluster characteristics\n",
"- assign cluster labels (from your chosen number of clusters) to the dataset\n",
"- examine the summary statistics of each cluster to identify distinguishing features (include these features in the K-Means step in your `README.md`)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. Visualize clusters\n",
"- create scatter plots or pair plots to visualize the clusters "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Save clustered data\n",
"- save the results of your clustering to a file for later comparison"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.4"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}