diff --git a/Final Project - Description - UnsupervisedLearning.docx b/Final Project - Description - UnsupervisedLearning.docx
deleted file mode 100644
index dc6df9e..0000000
Binary files a/Final Project - Description - UnsupervisedLearning.docx and /dev/null differ
diff --git a/Final Project Rubric - Machine Learning.xlsx b/Final Project Rubric - Machine Learning.xlsx
deleted file mode 100644
index dabe978..0000000
Binary files a/Final Project Rubric - Machine Learning.xlsx and /dev/null differ
diff --git a/README.md b/README.md
index 82cef74..89280da 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,19 @@
-# machine_learning_project-unsupervised-learning
+# Unsupervised Learning Project
-## Project Outcomes
-- Unsupervised Learning: perform unsupervised learning techniques on a wholesale data dataset. The project involves four main parts: exploratory data analysis and pre-processing, KMeans clustering, hierarchical clustering, and PCA.
-### Duration:
-Approximately 1 hour and 40 minutes
-### Project Description:
-In this project, we will apply unsupervised learning techniques to a real-world data set and use data visualization tools to communicate the insights gained from the analysis.
+## Project/Goals
+(fill in your project description and goals here)
-The data set for this project is the "Wholesale Data" dataset containing information about various products sold by a grocery store.
-The project will involve the following tasks:
+## Process
+### (fill in your step 1)
+- (fill in your step 1 details)
-- Exploratory data analysis and pre-processing: We will import and clean the data sets, analyze and visualize the relationships between the different variables, handle missing values and outliers, and perform feature engineering as needed.
-- Unsupervised learning: We will use the Wholesale Data dataset to perform k-means clustering, hierarchical clustering, and principal component analysis (PCA) to identify patterns and group similar data points together. We will determine the optimal number of clusters and communicate the insights gained through data visualization.
+### (fill in your step 2)
-The ultimate goal of the project is to gain insights from the data sets and communicate these insights to stakeholders using appropriate visualizations and metrics to make informed decisions based on the business questions asked."
+## Results
+- (fill in your answers to the [prompts in `assignment.md`](assignment.md#5b---results))
+## Challenges
+- (fill in your challenges ([prompt in `assignment.md`](assignment.md#5c---challenges)))
+
+## Future Goals
+- (what would you do if you had more time/data/resources? ([prompt in `assignment.md`](assignment.md#5d---future-goals)))
\ No newline at end of file
diff --git a/assignment.md b/assignment.md
new file mode 100644
index 0000000..d91a311
--- /dev/null
+++ b/assignment.md
@@ -0,0 +1,78 @@
+Welcome to the Unsupervised Learning project! In this project, you'll apply a few different Unsupervised Learning techniques to gain insights from a real-world dataset.
+
+## Context
+Clustering is a core method in unsupervised learning that allows us to uncover hidden patterns and groupings in data *without* relying on labels. In this project, you will explore different clustering techniques and compare their effectiveness.
+Optionally, you'll use Principal Component Analysis (PCA) to reduce dimensionality and visualize the clusters you identify.
+
+## Learning Objectives
+By completing this project, you will have a chance to:
+- Refine and apply your skills in cleaning and preparing datasets for unsupervised learning
+- Practice applying clustering algorithms (K-Means, Hierarchical Clustering, DBSCAN)
+- Practice evaluating clustering performance using metrics and visualizations
+- (stretch) Learn some additional approaches for clustering visualization, including using PCA for dimensionality reduction, PCA biplots, and silhouette plots
+
+## Business Case
+Imagine you are a data scientist working for a company looking to segment its customer base. You have been given a dataset containing various customer metrics, such as business type, location, and spending history. Your task is to use clustering techniques to identify distinct customer groups and provide actionable insights based on these clusters.
+
+# Part 1 - EDA
+- Explore the dataset you've been given
+ - Look for missing values, outliers, inconsistencies, etc
+ - Examine the distribution of numerical and categorical features
+ - Normalize/standardize the data if required for clustering algorithms
+Complete the `1 - EDA.ipynb` notebook and fill in **Step 1** in your `README.md` to demonstrate your completion of these tasks.
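
A minimal sketch of the scaling step, assuming you standardize with scikit-learn's `StandardScaler` (the column names below are illustrative stand-ins, not the actual `Wholesale_Data.csv` columns):

```python
# Hedged sketch: standardizing numerical features before clustering.
# Column names and values are illustrative, not from Wholesale_Data.csv.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Fresh": [12669, 7057, 6353, 13265],
    "Milk": [9656, 9810, 8808, 1196],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# each column now has mean ~0 and unit variance, so no single
# large-magnitude feature dominates distance-based clustering
print(scaled.describe())
```

Standardization matters here because K-Means, hierarchical clustering, and DBSCAN are all distance-based.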
+
+# Part 2 - K-Means Clustering
+- Apply the K-Means algorithm to the dataset
+- Use the elbow method to determine the optimal number of clusters
+- Analyze the characteristics of the created clusters
+
+Complete the `2 - K-Means Clustering.ipynb` notebook and fill in **Step 2** in your `README.md` to demonstrate your completion of these tasks
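
The elbow method above can be sketched as follows, assuming scikit-learn's `KMeans`; `X` is a toy stand-in for your preprocessed dataset:

```python
# Hedged sketch of the elbow method; X is synthetic stand-in data,
# not the wholesale dataset.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances

# inertia drops steeply until k reaches the underlying cluster count,
# then flattens -- the "elbow" you look for on the distortion plot
for k, v in inertias.items():
    print(k, round(v, 1))
```

In the notebook you would plot `inertias` (k on the x-axis, inertia on the y-axis) and pick the k where the curve bends.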
+
+# Part 3 - Hierarchical Clustering
+- Perform hierarchical clustering on the dataset
+- Use a dendrogram to determine the optimal number of clusters
+- Analyze the characteristics of the created clusters
+Complete the `3 - Hierarchical Clustering.ipynb` notebook and fill in **Step 3** in your `README.md` to demonstrate your completion of these tasks
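
One possible shape for this part, assuming SciPy's `linkage`/`fcluster` (a sketch on toy data, not the wholesale dataset):

```python
# Hedged sketch: agglomerative clustering with a dendrogram-based cut.
# X is synthetic stand-in data; linkage method and cluster count are
# illustrative choices.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

Z = linkage(X, method="ward")  # also try "average", "complete"
# in the notebook: scipy.cluster.hierarchy.dendrogram(Z) to pick a cut

labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
print(sorted(set(labels)))
```

Plotting the dendrogram for each linkage method is what lets you compare them and justify the cut.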
+
+# Part 4 - DBSCAN
+- Apply the DBSCAN algorithm to the dataset
+- Experiment with different hyperparameters to optimize clustering results
+- Identify (describe the features of) any noise points (outliers) detected by the algorithm
+Complete the `4 - DBSCAN.ipynb` notebook and fill in **Step 4** in your `README.md` to demonstrate your completion of these tasks
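
A sketch of the DBSCAN workflow, assuming scikit-learn's `DBSCAN`; the `eps` and `min_samples` values below are illustrative starting points, not tuned for the wholesale data:

```python
# Hedged sketch: DBSCAN with noise detection on synthetic stand-in data.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=1)

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
labels = db.labels_  # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Unlike K-Means, DBSCAN does not take a cluster count; the number of clusters (and the noise set) falls out of `eps` and `min_samples`, which is why the hyperparameter experimentation matters.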
+
+# Part 5 - Results and Evaluation
+#### 5a - Evaluation
+- Compare the results of K-Means, Hierarchical Clustering, and DBSCAN
+ - Use metrics such as Silhouette Score, Rand Score, Mutual Information Score, etc
+ - Compare the three clustering methods as evaluated by your chosen metrics
+ - Evaluate how well each method aligns with the business case, and draw a conclusion about the most appropriate clustering technique for the dataset
+ - Complete the `5 - Evaluation.ipynb` notebook to demonstrate your completion of these tasks
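
A sketch of the comparison, assuming scikit-learn's `silhouette_score` (higher is better); `X` and the fitted models are toy stand-ins for your three saved clustering results:

```python
# Hedged sketch: comparing clustering methods by silhouette score
# on synthetic stand-in data; cluster counts are illustrative.
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = {}
scores["kmeans"] = silhouette_score(
    X, KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X))
scores["hierarchical"] = silhouette_score(
    X, AgglomerativeClustering(n_clusters=4).fit_predict(X))

db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
if len(set(db_labels)) > 1:  # silhouette needs at least 2 labels
    scores["dbscan"] = silhouette_score(X, db_labels)

for name, s in scores.items():
    print(f"{name}: {s:.3f}")
```

Note that Rand Score and Mutual Information Score compare two labelings against each other, so for this project they are most useful for measuring agreement *between* methods rather than against ground truth.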
+#### 5b - Results
+- Answer the following prompts [in your `README.md`](README.md#results):
+ 1. Briefly discuss the strengths and weaknesses of each clustering method for this specific dataset
+ 2. Which clustering method performed best, and why?
+ 3. Were there significant differences in the clusters identified by your preferred algorithm? Describe them
+ 4. What actionable insights can be derived from the clustering results?
+
+ *eg:*
+ - "frequent repeat customers tend to purchase multiple types of products"
+ - "X large customer exhibits rare customer behavior and may require adjusted marketing strategy"
+ - "customers in cluster Y frequently purchase many of the same products as each other - a recommender system could be effective for marketing to them"
+#### 5c - Challenges
+- Describe challenges you faced while completing the project [in your `README.md`](README.md#challenges) (troubleshooting, understanding of material before project, timeframe, etc)
+
+#### 5d - Future Goals
+- Describe potential future goals for this project [in your `README.md`](README.md#future-goals) (you should definitely return to this project after you graduate!)
+
+
+# (stretch) Part 6 - PCA and Visualization
+
+### 6a - PCA
+- Apply PCA to reduce the dataset to two dimensions
+- Visualize clusters as 2D scatter plots using PCA-transformed data
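
The two steps above can be sketched as follows, assuming scikit-learn's `PCA`; `X` is a stand-in for your scaled dataset:

```python
# Hedged sketch: PCA to two components for cluster visualization.
# X is synthetic stand-in data with 6 features.
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=6, centers=3, random_state=0)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)  # shape (200, 2), ready for a scatter plot

# how much of the original variance the 2D projection retains
print(X2.shape, pca.explained_variance_ratio_.round(2))
```

In the notebook you would scatter `X2[:, 0]` against `X2[:, 1]`, colored by each method's cluster labels.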
+
+### 6b - PCA Biplots
+- Create a PCA biplot to analyze the relationships between dataset features and clusters
+
+### 6c - Silhouette Plots
+- Create [Silhouette Plots](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html) to increase your understanding of your clusters' Silhouette Scores, and see if any of your opinions on the performance of the different clustering models change
\ No newline at end of file
diff --git a/Wholesale_Data.csv b/data/Wholesale_Data.csv
similarity index 100%
rename from Wholesale_Data.csv
rename to data/Wholesale_Data.csv
diff --git a/images/.gitkeep b/images/.gitkeep
new file mode 100644
index 0000000..e69de29
diff --git a/notebooks/1 - EDA.ipynb b/notebooks/1 - EDA.ipynb
new file mode 100644
index 0000000..542235d
--- /dev/null
+++ b/notebooks/1 - EDA.ipynb
@@ -0,0 +1,127 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# EDA\n",
+ "**Goal:** prepare the dataset for clustering by identifying and addressing any data quality issues, and ensuring the data is properly scaled and formatted"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Understand dataset\n",
+ "- load the dataset and inspect its structure\n",
+ "- check datatypes and ensure they match the columns' intended uses"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# replace with imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Missing values\n",
+ "- check for missing values\n",
+ "- decide how to handle missing data (drop? fill? impute?)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 3. Analyze feature distributions\n",
+ "- for numerical features\n",
+ " - use histograms to visualize distributions\n",
+ " - use box plots to identify outliers\n",
+ "- for categorical features\n",
+ "  - use bar charts to understand frequencies"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 4. Feature relationships\n",
+ "- calculate and visualize the correlation matrix for numerical features\n",
+ "- use scatter plots or pair plots to explore relationships between key/interesting features"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 5. Prep for clustering\n",
+ "- is Normalization or Standardization needed? if yes, use it for appropriate features\n",
+ "- do any categorical features require one-hot encoding or label encoding? if yes, do so"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6. Save cleaned dataset\n",
+ "- save your cleaned and preprocessed dataset to an appropriate file type in your `data` directory"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/notebooks/2 - K-Means Clustering.ipynb b/notebooks/2 - K-Means Clustering.ipynb
new file mode 100644
index 0000000..4281fbd
--- /dev/null
+++ b/notebooks/2 - K-Means Clustering.ipynb
@@ -0,0 +1,126 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# K-Means Clustering\n",
+ "**Goal:** Apply K-Means clustering to the dataset, determine the optimal number of clusters, and analyze cluster characteristics"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Load preprocessed dataset\n",
+ "- load the dataset cleaned/preprocessed in the previous notebook"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Apply K-Means\n",
+ "- run the K-Means algorithm with an initial range of `k` (eg 2-10)\n",
+ "- save the results for each value of `k`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 3. Determine optimal number of clusters (elbow method)\n",
+ "- create a distortion plot from your saved clustering results\n",
+ "- identify the \"elbow point\" where adding more clusters yields diminishing returns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 4. Analyze cluster characteristics\n",
+ "- assign cluster labels (from your chosen number of clusters) to the dataset\n",
+ "- examine the summary statistics of each cluster to identify distinguishing features (include these features in the K-Means step in your `README.md`)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 5. Visualize clusters\n",
+ "- create scatter plots or pair plots to visualize the clusters "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6. Save clustered data\n",
+ "- save the results of your clustering to a file for later comparison"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.12.4"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/notebooks/3 - Hierarchical Clustering.ipynb b/notebooks/3 - Hierarchical Clustering.ipynb
new file mode 100644
index 0000000..49d5cf4
--- /dev/null
+++ b/notebooks/3 - Hierarchical Clustering.ipynb
@@ -0,0 +1,113 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Hierarchical Clustering\n",
+ "**Goal:** Use hierarchical clustering to identify groupings in the dataset, and analyze the characteristics of the resulting clusters."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Load preprocessed dataset\n",
+ "- load the dataset prepared in the EDA notebook"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Perform hierarchical clustering\n",
+ "- try agglomerative clustering with a few different linkage methods (eg average, complete, ward, etc)\n",
+ "- generate dendrograms to visualize the clustering hierarchy\n",
+ "- choose the linkage method that produces a dendrogram with clear, meaningful splits"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 3. Determine optimal number of clusters\n",
+ "- use the dendrogram with your chosen linkage method to decide on a number of clusters (longest vertical line that doesn't cross a horizontal split)\n",
+ "- add your chosen number of clusters to your agglomerative clustering model hyperparameters"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 4. Analyze cluster characteristics\n",
+ "- assign cluster labels to the dataset\n",
+ "- examine the summary statistics of each cluster to identify distinguishing features (include these features in the Hierarchical Clustering step in your `README.md`)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 5. Visualize\n",
+ "- create scatter plots, pair plots, and/or a dendrogram to visualize final clusters"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6. Save clustered data\n",
+ "- save the results of your clustering to a file for later comparison"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.12.4"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/notebooks/4 - DBSCAN.ipynb b/notebooks/4 - DBSCAN.ipynb
new file mode 100644
index 0000000..de2cf5a
--- /dev/null
+++ b/notebooks/4 - DBSCAN.ipynb
@@ -0,0 +1,119 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# DBSCAN\n",
+ "**Goal:** Apply DBSCAN, identify noise points, and interpret clustering results"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Load preprocessed dataset\n",
+ "- load the dataset prepared in the EDA notebook"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Run DBSCAN\n",
+ "- experiment with different values of `eps` and `min_samples`\n",
+ " - start with defaults and adjust based on clustering results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 3. Identify noise points\n",
+ "- highlight data points labeled as noise (`-1`)\n",
+ "- examine the characteristics of noise points to determine their relevance"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 4. Analyze cluster characteristics\n",
+ "- assign DBSCAN cluster labels to the dataset\n",
+ "- examine the key features of each cluster (add these to the DBSCAN step in your `README.md`)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 5. Visualize\n",
+ "- create scatter plots or pair plots to visualize clusters and noise points"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6. Save clustered data\n",
+ "- save the results of your clustering to a file for later comparison"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "name": "python",
+ "version": "3.12.4"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/notebooks/5 - Evaluation.ipynb b/notebooks/5 - Evaluation.ipynb
new file mode 100644
index 0000000..03f3a04
--- /dev/null
+++ b/notebooks/5 - Evaluation.ipynb
@@ -0,0 +1,76 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Evaluation\n",
+ "**Goal:** Compare the performance of K-Means, Hierarchical Clustering, and DBSCAN, and draw a conclusion about the best method for the dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Load clustered data\n",
+ "- load the 3 datasets containing the results from the three different clustering notebooks"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Evaluate clustering methods\n",
+ "- Calculate evaluation metrics for each clustering method\n",
+ " - [Silhouette Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)\n",
+ " - [Rand Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.rand_score.html)\n",
+ " - [Mutual Information Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html#sklearn.metrics.mutual_info_score)\n",
+ " - others you find relevant in the [SciKit Learn Documentation](https://scikit-learn.org/1.5/api/sklearn.metrics.html#module-sklearn.metrics.cluster)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 3. Compare results\n",
+ "- create a [table](https://www.markdownguide.org/extended-syntax/#tables) with the evaluation metrics for all three clustering methods ([here's a handy site](https://www.tablesgenerator.com/markdown_tables) that may help construct your table in markdown)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "(*Move on to steps 5b, 5c, 5d in `assignment.md`*)"
+ ]
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/notebooks/6 - (stretch) PCA and visualization.ipynb b/notebooks/6 - (stretch) PCA and visualization.ipynb
new file mode 100644
index 0000000..473a871
--- /dev/null
+++ b/notebooks/6 - (stretch) PCA and visualization.ipynb
@@ -0,0 +1,159 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# PCA and (Extra) Visualization\n",
+ "**Goal:** Use Principal Component Analysis (PCA) to reduce dimensionality for visualization, and evaluate cluster performance using PCA biplots and silhouette plots\n",
+ "***"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6a - PCA for visualization\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Apply PCA\n",
+ "- fit a PCA model on the dataset to reduce it to two principal components\n",
+ "- transform the dataset to 2D with the PCA model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Visualize clusters in 2D\n",
+ "- create scatter plots of the PCA-transformed data\n",
+ "- colour the points based on the clustering labels from each method\n",
+ "- add K-Means cluster centroids (for practice)\n",
+ "- add legends, titles, etc\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6b - PCA Biplots"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Understand PCA biplots\n",
+ "- review the following resources as needed to understand PCA Biplots\n",
+ " - [How to Read PCA Biplots and Scree Plots - BioTuring](https://bioturing.medium.com/how-to-read-pca-biplots-and-scree-plots-186246aae063)\n",
+ " - [Principal Components: BiPlots - Forrest Young's Notes](https://www.uv.es/visualstats/vista-frames/help/lecturenotes/lecture13/biplot.html)\n",
+ " - [How to Make PCA Biplots in Python - JC Chouinard](https://www.jcchouinard.com/python-pca-biplots-machine-learning/)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Create a PCA biplot\n",
+ "- plot the PCA-reduced dataset as a scatter plot (first as 2D)\n",
+ "- add feature vectors (loadings) as arrows pointing in the direction of their contribution\n",
+ "- annotate feature vectors and PCs (axes in this case)\n",
+ "- optionally, make a 3D biplot as described farther down in [this tutorial](https://www.jcchouinard.com/python-pca-biplots-machine-learning/)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 3. Interpret biplot\n",
+ "- observe which features contribute most to each principal component\n",
+ "- relate the directions of feature vectors to clusters identified by your earlier clustering"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 6c - Silhouette Plots"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1. Calculate silhouette scores\n",
+ "- calculate the silhouette scores for the clusters created by each clustering method (or go back to the Evaluation notebook where you calculated them previously)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Generate silhouette plots\n",
+ "- follow the [Scikit-learn silhouette plot example](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html) to create silhouette plots for each clustering method to see how each individual point contributes to a cluster's overall silhouette score\n",
+ "- optionally, go back and generate silhouette plots for different cluster counts for K-Means and Hierarchical clustering, and see if your opinion on optimal cluster count changes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "name": "python"
+ },
+ "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}