Simon project overhaul #5

Open
wants to merge 6 commits into base: main
Binary file removed Final Project Rubric - Machine Learning.xlsx
26 changes: 14 additions & 12 deletions README.md
@@ -1,17 +1,19 @@
# machine_learning_project-unsupervised-learning
# Unsupervised Learning Project

## Project Outcomes
- Unsupervised Learning: apply unsupervised learning techniques to a wholesale dataset. The project involves four main parts: exploratory data analysis and pre-processing, KMeans clustering, hierarchical clustering, and PCA.
### Duration:
Approximately 1 hour and 40 minutes
### Project Description:
In this project, we will apply unsupervised learning techniques to a real-world data set and use data visualization tools to communicate the insights gained from the analysis.
## Project/Goals
(fill in your project description and goals here)

The data set for this project is the "Wholesale Data" dataset containing information about various products sold by a grocery store.
The project will involve the following tasks:
## Process
### (fill in your step 1)
- (fill in your step 1 details)

- Exploratory data analysis and pre-processing: We will import and clean the data sets, analyze and visualize the relationships between the different variables, handle missing values and outliers, and perform feature engineering as needed.
- Unsupervised learning: We will use the Wholesale Data dataset to perform k-means clustering, hierarchical clustering, and principal component analysis (PCA) to identify patterns and group similar data points together. We will determine the optimal number of clusters and communicate the insights gained through data visualization.
### (fill in your step 2)

The ultimate goal of the project is to gain insights from the dataset and communicate these insights to stakeholders, using appropriate visualizations and metrics to support informed decisions on the business questions asked.
## Results
- (fill in your answers to the [prompts in `assignment.md`](assignment.md#5b---results))

## Challenges
- (fill in your challenges ([prompt in `assignment.md`](assignment.md#5c---challenges)))

## Future Goals
- (what would you do if you had more time/data/resources? ([prompt in `assignment.md`](assignment.md#5d---future-goals)))
78 changes: 78 additions & 0 deletions assignment.md
@@ -0,0 +1,78 @@
Welcome to the Unsupervised Learning project! In this project, you'll apply a few different unsupervised learning techniques to gain insights from a real-world dataset.

## Context
Clustering is a core method in unsupervised learning that allows us to uncover hidden patterns and groupings in data *without* relying on labels. In this project, you will explore different clustering techniques and compare their effectiveness.
Optionally, you'll use Principal Component Analysis (PCA) to reduce dimensionality and visualize the clusters you identify.

## Learning Objectives
By completing this project, you will have a chance to:
- Refine and apply your skills in cleaning and preparing datasets for unsupervised learning
- Practice applying clustering algorithms (K-Means, Hierarchical Clustering, DBSCAN)
- Practice evaluating clustering performance using metrics and visualizations
- (stretch) Learn some additional approaches for clustering visualization, including using PCA for dimensionality reduction, PCA biplots, and silhouette plots

## Business Case
Imagine you are a data scientist working for a company looking to segment its customer base. You have been given a dataset containing various customer metrics, such as business type, location, and spending history. Your task is to use clustering techniques to identify distinct customer groups and provide actionable insights based on these clusters.

# Part 1 - EDA
- Explore the dataset you've been given
- Look for missing values, outliers, inconsistencies, etc.
- Examine the distribution of numerical and categorical features
- Normalize/standardize the data if required for clustering algorithms
Complete the `1 - EDA.ipynb` notebook and fill in **Step 1** in your `README.md` to demonstrate your completion of these tasks.
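As a rough sketch, the EDA and scaling steps might look like the following. The tiny inline frame and its column names are illustrative stand-ins for the wholesale CSV you were given:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the wholesale dataset -- in the notebook you would
# load your real file instead, e.g. with pd.read_csv(...).
df = pd.DataFrame({
    "Fresh":   [12669, 7057, 6353, 13265, 22615],
    "Milk":    [9656, 9810, 8808, 1196, 5410],
    "Grocery": [7561, 9568, 7684, 4221, 7198],
})

print(df.isna().sum())    # missing values per column
print(df.describe())      # ranges and quartiles help spot outliers

# Distance-based methods (K-Means, DBSCAN, PCA) are scale-sensitive,
# so standardize the numeric features before clustering.
X = StandardScaler().fit_transform(df)
```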

# Part 2 - K-Means Clustering
- Apply the K-Means algorithm to the dataset
- Use the elbow method to determine the optimal number of clusters
- Analyze the characteristics of the created clusters

Complete the `2 - K-Means Clustering.ipynb` notebook and fill in **Step 2** in your `README.md` to demonstrate your completion of these tasks.
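A minimal sketch of the elbow loop, run on synthetic blobs standing in for your preprocessed features (the `k` range matches the suggestion above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the preprocessed wholesale features.
X, _ = make_blobs(n_samples=200, centers=4, random_state=42)

inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squares (distortion)

# Plot k against inertias.values() and look for the bend ("elbow"),
# where adding clusters stops buying much reduction in distortion.
```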

# Part 3 - Hierarchical Clustering
- Perform hierarchical clustering on the dataset
- Use a dendrogram to determine the optimal number of clusters
- Analyze the characteristics of the created clusters
Complete the `3 - Hierarchical Clustering.ipynb` notebook and fill in **Step 3** in your `README.md` to demonstrate your completion of these tasks.
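A sketch of the dendrogram workflow with SciPy (synthetic data; the cut at 3 clusters is illustrative, not a recommendation for your dataset):

```python
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=42)

Z = linkage(X, method="ward")  # agglomerative merge tree
# In the notebook, dendrogram(Z) draws the tree; a long vertical gap
# between successive merges suggests a good height at which to cut.
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
```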

# Part 4 - DBSCAN
- Apply the DBSCAN algorithm to the dataset
- Experiment with different hyperparameters to optimize clustering results
- Identify and describe the features of any noise points (outliers) detected by the algorithm
Complete the `4 - DBSCAN.ipynb` notebook and fill in **Step 4** in your `README.md` to demonstrate your completion of these tasks.
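A sketch of the DBSCAN step, on synthetic blobs with one planted outlier. The `eps` and `min_samples` values here are starting points to tune, not recommendations for your data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.5, random_state=42)
X = np.vstack([X, [[10.0, 10.0]]])  # plant one obvious outlier

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
noise_mask = db.labels_ == -1       # DBSCAN marks noise points with label -1
print("noise points:", noise_mask.sum())
```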

# Part 5 - Results and Evaluation
#### 5a - Evaluation
- Compare the results of K-Means, Hierarchical Clustering, and DBSCAN
- Use metrics such as Silhouette Score, Rand Score, Mutual Information Score, etc.
- Compare the three clustering methods as evaluated by your chosen metrics
- Evaluate how well each method aligns with the business case, and draw a conclusion about the most appropriate clustering technique for the dataset
- Complete the `5 - Evaluation.ipynb` notebook to demonstrate your completion of these tasks
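One way to sketch the comparison is to fit all three models and score each labeling with the Silhouette Score (synthetic data here; the actual values will depend on your dataset and hyperparameters):

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=42),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=1.0, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    if len(set(labels)) > 1:  # silhouette needs at least two labels
        scores[name] = silhouette_score(X, labels)
print(scores)
```

Note that Rand Score and Mutual Information compare two labelings, so without ground-truth labels they are most useful for measuring agreement between your clustering methods.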
#### 5b - Results
- Answer the following prompts [in your `README.md`](README.md#results):
1. Briefly discuss the strengths and weaknesses of each clustering method for this specific dataset
2. Which clustering method performed best, and why?
3. Were there significant differences in the clusters identified by your preferred algorithm? Describe them
4. What actionable insights can be derived from the clustering results?
<br>
*eg:*
- "frequent repeat customers tend to purchase multiple types of products"
- "X large customer exhibits rare customer behavior and may require adjusted marketing strategy"
- "customers in cluster Y frequently purchase many of the same products as each other - a recommender system could be effective for marketing to them"
#### 5c - Challenges
- Describe challenges you faced while completing the project [in your `README.md`](README.md#challenges) (troubleshooting, understanding of material before project, timeframe, etc)

#### 5d - Future Goals
- Describe potential future goals for this project [in your `README.md`](README.md#future-goals) (you should definitely return to this project after you graduate!)


# (stretch) Part 6 - PCA and Visualization

### 6a - PCA
- Apply PCA to reduce the dataset to two dimensions
- Visualize clusters as 2D scatter plots using PCA-transformed data
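A minimal PCA projection sketch (synthetic 6-feature data; in the notebook, colour the scatter by your saved cluster labels):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=42)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)             # project onto the first two components
print(pca.explained_variance_ratio_)  # variance captured by PC1 and PC2
# In the notebook: plt.scatter(X2[:, 0], X2[:, 1], c=cluster_labels)
```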

### 6b - PCA Biplots
- Create a PCA biplot to analyze the relationships between dataset features and clusters
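A biplot overlays the PCA scatter with one arrow per original feature (the loadings). A rough sketch, using synthetic data, placeholder feature names, and an arbitrary arrow scale:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=100, n_features=4, centers=3, random_state=42)
pca = PCA(n_components=2).fit(X)
points = pca.transform(X)
loadings = pca.components_.T  # shape (n_features, 2): one arrow per feature

fig, ax = plt.subplots()
ax.scatter(points[:, 0], points[:, 1], s=10)
for i, (dx, dy) in enumerate(loadings):
    ax.arrow(0, 0, dx * 3, dy * 3, color="red")    # *3 is an arbitrary scale
    ax.annotate(f"feature_{i}", (dx * 3, dy * 3))  # placeholder names
fig.savefig("biplot.png")
```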

### 6c - Silhouette Plots
- Create [Silhouette Plots](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html) to increase your understanding of your clusters' Silhouette Scores, and see if any of your opinions on the performance of the different clustering models change
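The linked scikit-learn example builds the plot from per-sample scores; the core computation can be sketched as follows (synthetic data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

samples = silhouette_samples(X, labels)    # one score per point
for c in np.unique(labels):
    blade = np.sort(samples[labels == c])  # one sorted "blade" per cluster
    # plt.barh over `blade` draws that cluster's section of the plot
    print(c, round(blade.mean(), 3))
```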
File renamed without changes.
Empty file added images/.gitkeep
Empty file.
127 changes: 127 additions & 0 deletions notebooks/1 - EDA.ipynb
@@ -0,0 +1,127 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# EDA\n",
"**Goal:** prepare the dataset for clustering by identifying and addressing any data quality issues, and ensuring the data is properly scaled and formatted"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Understand dataset\n",
"- load the dataset and inspect its structure\n",
"- check datatypes and ensure they match the columns' intended uses"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# replace with imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Missing values\n",
"- check for missing values\n",
"- decide how to handle missing data (drop? fill? impute?)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Analyze feature distributions\n",
"- for numerical features\n",
" - use histograms to visualize distributions\n",
" - use box plots to identify outliers\n",
"- for categorical features\n",
"- use bar charts to understand frequencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Feature relationships\n",
"- calculate and visualize the correlation matrix for numerical features\n",
"- use scatter plots or pair plots to explore relationships between key/interesting features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. Prep for clustering\n",
"- is Normalization or Standardization needed? if yes, use it for appropriate features\n",
"- do any categorical features require one-hot encoding or label encoding? if yes, do so"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Save cleaned dataset\n",
"- save your cleaned and preprocessed dataset to an appropriate file type in your `data` directory"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
126 changes: 126 additions & 0 deletions notebooks/2 - K-Means Clustering.ipynb
@@ -0,0 +1,126 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# K-Means Clustering\n",
"**Goal:** Apply K-Means clustering to the dataset, determining the optimal number of clusters, and analyze cluster characteristics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Load preprocessed dataset\n",
"- load the dataset cleaned/preprocessed in the previous notebook"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Apply K-Means\n",
"- run the K-Means algorithm with an initial range of `k` (eg 2-10)\n",
"- save the results for each value of `k`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Determine optimal number of clusters (elbow method)\n",
"- create a distortion plot from your saved clustering results\n",
"- identify the \"elbow point\" where adding more clusters yields diminishing returns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Analyze cluster characteristics\n",
"- assign cluster labels (from your chosen number of clusters) to the dataset\n",
"- examine the summary statistics of each cluster to identify distinguishing features (include these features in the K-Means step in your `README.md`)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. Visualize clusters\n",
"- create scatter plots or pair plots to visualize the clusters "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Save clustered data\n",
"- save the results of your clustering to a file for later comparison"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.4"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}