Introduction to clustering

Clustering is a type of Unsupervised Learning that presumes a dataset is unlabelled or that its inputs are not matched with predefined outputs. It uses various algorithms to sort through unlabeled data and provide groupings according to patterns it discerns in the data.

🎥 Click the image above for a video. While you're studying machine learning with clustering, enjoy some Nigerian Dance Hall tracks - this is a highly rated song from 2014 by PSquare.

Pre-lecture quiz

Introduction

Clustering is very useful for data exploration. Let's see if it can help discover trends and patterns in the way Nigerian audiences consume music.

✅ Take a minute to think about the uses of clustering. In real life, clustering happens whenever you have a pile of laundry and need to sort out your family members' clothes 🧦👕👖🩲. In data science, clustering happens when you try to analyze a user's preferences or determine the characteristics of any unlabeled dataset. Clustering, in a way, helps make sense of chaos, like a sock drawer.

🎥 Click the image above for a video: MIT's John Guttag introduces clustering

In a professional setting, clustering can be used to determine things like market segmentation, for example which age groups buy which items. Another use is anomaly detection, perhaps to detect fraud in a dataset of credit card transactions. Or you might use clustering to find tumors in a batch of medical scans.

✅ Think for a minute about how you might have encountered clustering 'in the wild', in a banking, e-commerce, or business setting.

🎓 Interestingly, cluster analysis originated in the fields of Anthropology and Psychology in the 1930s. Can you imagine how it might have been used?

Alternatively, you could use it to group search results - by shopping links, images, or reviews, for example. Clustering is useful when you have a large dataset that you want to reduce and analyze in more detail, so the technique can be used to learn about data before other models are built.

✅ Once your data is organized in clusters, you assign each one a cluster ID. This technique can be useful for preserving a dataset's privacy: you can refer to a data point by its cluster ID rather than by more revealing identifiable data. Can you think of other reasons why you would refer to a cluster ID rather than other elements of the cluster to identify it?

Deepen your understanding of clustering techniques in this Learn module

Getting started with clustering

Scikit-learn offers a large array of methods to perform clustering. The one you choose will depend on your use case. According to the documentation, each method has its own benefits. Here is a simplified table of the methods supported by Scikit-learn and their appropriate use cases:

Method name	Use case
K-Means	general purpose, inductive
Affinity propagation	many, uneven clusters, inductive
Mean-shift	many, uneven clusters, inductive
Spectral clustering	few, even clusters, transductive
Ward hierarchical clustering	many, constrained clusters, transductive
Agglomerative clustering	many, constrained, non Euclidean distances, transductive
DBSCAN	non-flat geometry, uneven clusters, transductive
OPTICS	non-flat geometry, uneven clusters with variable density, transductive
Gaussian mixtures	flat geometry, inductive
BIRCH	large dataset with outliers, inductive

🎓 How we create clusters has a lot to do with how we gather the data points into groups. Let's unpack some vocabulary:

🎓 'Transductive' vs. 'inductive'

Transductive inference is derived from observed training cases that map to specific test cases. Inductive inference is derived from training cases that map to general rules, which are only then applied to test cases.

Example: Imagine you have a dataset that is only partially labelled. Some items are 'records', some are 'cds', and some are blank. Your job is to provide labels for the blanks. If you choose an inductive approach, you'd train a model looking for 'records' and 'cds', and apply those labels to the unlabeled data. This approach will have trouble classifying things that are actually 'cassettes'. A transductive approach, on the other hand, handles this unknown data more effectively because it groups similar items together and then applies a label to each group. In this case, clusters might reflect 'round musical things' and 'square musical things'.
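
One way to see the distinction in Scikit-learn is to check which estimators can label points they were never fitted on. The sketch below is only an illustration on made-up toy points (not the song data): K-Means keeps its learned centroids and exposes `predict()`, while DBSCAN only labels the points passed to `fit_predict()`.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Toy 2-D points forming two well-separated groups (illustrative data only).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [5.0, 5.0], [5.2, 5.1], [5.1, 4.9]])
X_new = np.array([[0.1, 0.1], [5.0, 5.2]])

# Inductive: K-Means learns centroids (a general rule) that can label unseen points.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
new_labels = kmeans.predict(X_new)

# Transductive: DBSCAN only labels the points it was fitted on; it has no predict().
dbscan_labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X_train)

can_predict = {"KMeans": hasattr(kmeans, "predict"),
               "DBSCAN": hasattr(DBSCAN(), "predict")}
print(can_predict)
```

The `eps`, `min_samples`, and `n_clusters` values here are arbitrary choices for the toy data, not recommendations for the lesson's dataset.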

🎓 'Non-flat' vs. 'flat' geometry

Derived from mathematical terminology, non-flat vs. flat geometry refers to how distances between points are measured, by either 'flat' (Euclidean) or 'non-flat' (non-Euclidean) geometrical methods.

'Flat' in this context refers to Euclidean geometry (parts of which are taught as 'plane' geometry), and non-flat refers to non-Euclidean geometry. What does geometry have to do with machine learning? Well, as two fields that are rooted in mathematics, there must be a common way to measure distances between points in clusters, and that can be done in a 'flat' or 'non-flat' way, depending on the nature of the data. Euclidean distances are measured as the length of a line segment between two points. Non-Euclidean distances are measured along a curve. If your data, visualized, seems not to exist on a plane, you might need to use a specialized algorithm to handle it.
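
To make the two ways of measuring concrete, here is a small sketch that compares the straight-line distance between two points with the distance along the curve they sit on. The points are an assumed toy example (a quarter turn apart on the unit circle), chosen only because both distances are easy to check by hand.

```python
import math

# Two points on the unit circle, a quarter turn apart.
p = (1.0, 0.0)
q = (0.0, 1.0)

# 'Flat' (Euclidean) distance: length of the straight segment between p and q.
euclidean = math.dist(p, q)

# 'Non-flat' distance: length of the arc along the circle between p and q.
# For unit vectors the arc length equals the angle between them.
dot = p[0] * q[0] + p[1] * q[1]
arc = math.acos(max(-1.0, min(1.0, dot)))

print(f"straight line: {euclidean:.4f}, along the curve: {arc:.4f}")
```

The path along the curve is longer than the straight segment, so two points that look close in a flat measurement can be farther apart when the data lives on a curved shape.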

Infographic by Dasani Madipalli

🎓 'Distances'

Clusters are defined by their distance matrix, that is, the distances between points. This distance can be measured in a few ways. Euclidean clusters are defined by the average of the point values, and contain a 'centroid' or center point. Distances are thus measured by the distance to that centroid. Non-Euclidean distances refer to 'clustroids', the point closest to the other points. Clustroids in turn can be defined in various ways.
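
As a rough illustration of the difference, the sketch below computes both for a made-up cluster of four points. It assumes one common definition of a clustroid (the data point with the smallest total distance to the others, often called a medoid); as noted above, other definitions exist.

```python
import numpy as np

# Toy cluster of four 2-D points (illustrative data only).
points = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])

# Centroid: the coordinate-wise mean; it need not be one of the data points.
centroid = points.mean(axis=0)

# Clustroid (here: medoid): the actual data point with the smallest
# total distance to all other points in the cluster.
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
clustroid = points[pairwise.sum(axis=1).argmin()]

print("centroid:", centroid)
print("clustroid:", clustroid)
```

Here the centroid falls between the data points, while the clustroid is always one of them, which is why clustroids remain usable when averaging the points makes no sense.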

🎓 'Constrained'

Constrained Clustering introduces 'semi-supervised' learning into this unsupervised method. The relationships between points are flagged as 'cannot-link' or 'must-link', so some rules are forced on the dataset.

Example: If an algorithm is set free on a batch of unlabelled or semi-labelled data, the clusters it produces may be of poor quality. In the example above, the clusters might group 'round music things', 'square music things', 'triangular things', and 'cookies'. If given some constraints, or rules to follow ("the item must be made of plastic", "the item needs to be able to produce music"), this can help 'constrain' the algorithm to make better choices.
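
The idea of must-link and cannot-link rules can be sketched as a simple check on a finished cluster assignment. The helper `count_violations` and the four items below are hypothetical, written only for this illustration; this is not a Scikit-learn API and not a constrained clustering algorithm in itself.

```python
from typing import Sequence, Tuple

def count_violations(labels: Sequence[int],
                     must_link: Sequence[Tuple[int, int]],
                     cannot_link: Sequence[Tuple[int, int]]) -> int:
    """Count pairwise constraints that a cluster assignment breaks."""
    broken_must = sum(labels[i] != labels[j] for i, j in must_link)
    broken_cannot = sum(labels[i] == labels[j] for i, j in cannot_link)
    return broken_must + broken_cannot

# Hypothetical items: 0 = record, 1 = CD, 2 = cassette, 3 = cookie.
must_link = [(0, 1)]     # the record and the CD both produce music
cannot_link = [(1, 3)]   # the CD and the cookie must not share a cluster

shape_based = [0, 0, 1, 0]   # groups every round thing together, cookie included
music_based = [0, 0, 0, 1]   # groups the music media, leaves the cookie alone

print(count_violations(shape_based, must_link, cannot_link))
print(count_violations(music_based, must_link, cannot_link))
```

A constrained algorithm would search for assignments that keep this count at zero (or as low as possible) while still grouping similar points.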

🎓 'Density'

Data that is 'noisy' is considered to be 'dense'. The distances between points in each of its clusters may prove, on examination, to be more or less dense, or 'crowded', and thus this data needs to be analyzed with the appropriate clustering method. This article demonstrates the difference between using K-Means clustering vs. HDBSCAN algorithms to explore a noisy dataset with uneven cluster density.
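
A density-based method can be sketched on a tiny made-up dataset: two crowded groups plus one isolated point. This example uses Scikit-learn's DBSCAN rather than HDBSCAN, and the `eps` and `min_samples` values are assumptions picked to suit these toy points.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of four points each, plus one far-away point (toy data).
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [10.0, -10.0],
])

# Points with at least min_samples neighbours within eps form dense regions;
# points that belong to no dense region are labelled -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)
```

The isolated point receives the label -1, which is how DBSCAN marks noise instead of forcing every point into a cluster.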

Clustering algorithms

There are over 100 clustering algorithms, and their use depends on the nature of the data at hand. Let's discuss some of the major ones:

Hierarchical clustering. If an object is classified by its proximity to a nearby object, rather than to one farther away, clusters are formed based on their members' distance to and from other objects. Scikit-learn's agglomerative clustering is hierarchical.

Infographic by Dasani Madipalli
Centroid clustering. This popular algorithm requires the choice of 'k', or the number of clusters to form, after which the algorithm determines the center point of each cluster and gathers data around that point. K-means clustering is a popular version of centroid clustering. The center is determined by the nearest mean, hence the name. The squared distance from each point to its cluster center is minimized.

Infographic by Dasani Madipalli
Distribution-based clustering. Based on statistical modeling, distribution-based clustering centers on determining the probability that a data point belongs to a cluster, and assigning it accordingly. Gaussian mixture methods belong to this type.
Density-based clustering. Data points are assigned to clusters based on their density, or how they group around each other. Data points far from the group are considered outliers or noise. DBSCAN, Mean-shift and OPTICS belong to this type of clustering.
Grid-based clustering. For multi-dimensional datasets, a grid is created and the data is divided among the grid's cells, thereby creating clusters.
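
As a rough sketch of how several of these families look in Scikit-learn, the code below fits one estimator per family on synthetic blobs from `make_blobs` (not the song data). The mapping of family names to estimators and all parameter values are assumptions for illustration; grid-based clustering is left out of this sketch.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data: 150 points around three centers.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# One representative estimator per clustering family described above.
models = {
    "hierarchical": AgglomerativeClustering(n_clusters=3),
    "centroid": KMeans(n_clusters=3, n_init=10, random_state=0),
    "distribution": GaussianMixture(n_components=3, random_state=0),
    "density": DBSCAN(eps=0.8, min_samples=5),
}

# Every estimator here supports fit_predict, returning one label per point.
labels = {name: model.fit_predict(X) for name, model in models.items()}
for name, assigned in labels.items():
    print(name, sorted(set(assigned.tolist())))
```

Note that the first three estimators need the number of clusters up front, while DBSCAN derives it from the density parameters and may also report noise as -1.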

Exercise - cluster your data

Clustering as a technique is greatly aided by proper visualization, so let's get started by visualizing our music data. This exercise will help us decide which clustering method we can use most effectively for the nature of this data.

Open the notebook.ipynb file in this folder.
Import the Seaborn package for good data visualization.
```
!pip install seaborn
```

Append the song data from nigerian-songs.csv. Load up a dataframe with some data about the songs. Get ready to explore this data by importing the libraries and dumping out the data:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("../data/nigerian-songs.csv")
df.head()

Check the first few lines of data:

	name	album	artist	artist_top_genre	release_date	length	popularity	danceability	acousticness	energy	instrumentalness	liveness	loudness	speechiness	tempo	time_signature
0	Sparky	Mandy & The Jungle	Cruel Santino	alternative r&b	2019	144000	48	0.666	0.851	0.42	0.534	0.11	-6.699	0.0829	133.015	5
1	shuga rush	EVERYTHING YOU HEARD IS TRUE	Odunsi (The Engine)	afropop	2020	89488	30	0.71	0.0822	0.683	0.000169	0.101	-5.64	0.36	129.993	3
2	LITT!	LITT!	AYLØ	indie r&b	2018	207758	40	0.836	0.272	0.564	0.000537	0.11	-7.127	0.0424	130.005	4
3	Confident / Feeling Cool	Enjoy Your Life	Lady Donli	nigerian pop	2019	175135	14	0.894	0.798	0.611	0.000187	0.0964	-4.961	0.113	111.087	4
4	wanted you	rare.	Odunsi (The Engine)	afropop	2018	152049	25	0.702	0.116	0.833	0.91	0.348	-6.044	0.0447	105.115	4

Get some information about the dataframe by calling info():

df.info()

The output looks like this:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              530 non-null    object 
 1   album             530 non-null    object 
 2   artist            530 non-null    object 
 3   artist_top_genre  530 non-null    object 
 4   release_date      530 non-null    int64  
 5   length            530 non-null    int64  
 6   popularity        530 non-null    int64  
 7   danceability      530 non-null    float64
 8   acousticness      530 non-null    float64
 9   energy            530 non-null    float64
 10  instrumentalness  530 non-null    float64
 11  liveness          530 non-null    float64
 12  loudness          530 non-null    float64
 13  speechiness       530 non-null    float64
 14  tempo             530 non-null    float64
 15  time_signature    530 non-null    int64  
dtypes: float64(8), int64(4), object(4)
memory usage: 66.4+ KB

Double-check for null values by calling isnull() and verifying that the sum is 0:

df.isnull().sum()

Looking good:

name                0
album               0
artist              0
artist_top_genre    0
release_date        0
length              0
popularity          0
danceability        0
acousticness        0
energy              0
instrumentalness    0
liveness            0
loudness            0
speechiness         0
tempo               0
time_signature      0
dtype: int64

Describe the data:

df.describe()

	release_date	length	popularity	danceability	acousticness	energy	instrumentalness	liveness	loudness	speechiness	tempo	time_signature
count	530	530	530	530	530	530	530	530	530	530	530	530
mean	2015.390566	222298.1698	17.507547	0.741619	0.265412	0.760623	0.016305	0.147308	-4.953011	0.130748	116.487864	3.986792
std	3.131688	39696.82226	18.992212	0.117522	0.208342	0.148533	0.090321	0.123588	2.464186	0.092939	23.518601	0.333701
min	1998	89488	0	0.255	0.000665	0.111	0	0.0283	-19.362	0.0278	61.695	3
25%	2014	199305	0	0.681	0.089525	0.669	0	0.07565	-6.29875	0.0591	102.96125	4
50%	2016	218509	13	0.761	0.2205	0.7845	0.000004	0.1035	-4.5585	0.09795	112.7145	4
75%	2017	242098.5	31	0.8295	0.403	0.87575	0.000234	0.164	-3.331	0.177	125.03925	4
max	2020	511738	73	0.966	0.954	0.995	0.91	0.811	0.582	0.514	206.007	5

🤔 If we are working with clustering, an unsupervised method that does not require labeled data, why are we showing this data with labels? In the data exploration phase they come in handy, but they are not necessary for the clustering algorithms to work. You could just as well remove the column headers and refer to the data by column number.

Look at the general values of the data. Note that popularity can be '0', which indicates songs that have no ranking. Let's remove those shortly.

Use a barplot to find out the most popular genres:

import seaborn as sns

top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top[:5].index,y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')

✅ If you'd like to see more top values, change the top[:5] slice to a bigger value, or remove it to see all.

Note, when the top genre is described as 'Missing', that means that Spotify did not classify it, so let's get rid of it.

Get rid of missing data by filtering it out:

df = df[df['artist_top_genre'] != 'Missing']
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')

Now recheck the genres:

By far, the top three genres dominate this dataset. Let's concentrate on afro dancehall, afropop, and nigerian pop, and additionally filter the dataset to remove anything with a 0 popularity value (meaning it was not classified with a popularity in the dataset and can be considered noise for our purposes):

df = df[(df['artist_top_genre'] == 'afro dancehall') | (df['artist_top_genre'] == 'afropop') | (df['artist_top_genre'] == 'nigerian pop')]
df = df[(df['popularity'] > 0)]
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')
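
The same genre and popularity filter can be written a little more compactly with `isin()`. The sketch below runs on a small hypothetical dataframe (the rows are invented stand-ins, not taken from nigerian-songs.csv) so that it is self-contained.

```python
import pandas as pd

# Hypothetical stand-in for the song data (not rows from nigerian-songs.csv).
songs = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "artist_top_genre": ["afropop", "Missing", "nigerian pop", "afro dancehall", "afropop"],
    "popularity": [30, 12, 0, 25, 0],
})

top_genres = ["afro dancehall", "afropop", "nigerian pop"]

# isin() replaces the chain of == comparisons joined with |,
# and & combines it with the popularity > 0 condition in one step.
filtered = songs[songs["artist_top_genre"].isin(top_genres) & (songs["popularity"] > 0)]
print(filtered)
```

Keeping the genre names in a list makes it easier to widen or narrow the selection later without rewriting the condition.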

Do a quick test to see if the data correlates in any particularly strong way:
```
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
```
The only strong correlation is between energy and loudness, which is not too surprising, given that loud music is usually pretty energetic. Otherwise, the correlations are relatively weak. It will be interesting to see what a clustering algorithm can make of this data.
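
Reading the strongest pair off a heatmap by eye can be backed up with a few lines of pandas. This is a sketch on invented toy values (the column names mirror the dataset, the numbers do not), showing one way to pull the most strongly correlated pair out of a correlation matrix.

```python
import numpy as np
import pandas as pd

# Toy values: loudness rises in step with energy, tempo does not.
toy = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8, 1.0],
    "loudness": [-10.0, -8.0, -6.0, -4.0, -2.0],
    "tempo": [120.0, 90.0, 130.0, 100.0, 110.0],
})

corrmat = toy.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corrmat.where(np.triu(np.ones(corrmat.shape, dtype=bool), k=1))

# Rank pairs by absolute correlation and take the strongest one.
strongest_pair = upper.abs().stack().idxmax()
print(strongest_pair)
```

Applied to the `corrmat` computed from the song data, the same two lines would name the pair that stands out in the heatmap.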

🎓 Note that correlation does not imply causation! We have proof of correlation but no proof of causation. An amusing web site has some visuals that emphasize this point.

Is there any convergence in this dataset around a song's perceived popularity and danceability? A FacetGrid shows that there are concentric circles that line up, regardless of genre. Could it be that Nigerian tastes converge at a certain level of danceability for this genre?

✅ Try different datapoints (energy, loudness, speechiness) and more or different musical genres. What can you discover? Take a look at the df.describe() table to see the general spread of the data points.

Exercise - data distribution

Are these three genres significantly different in the perception of their danceability, based on their popularity?

Examine our top three genres' data distribution for popularity and danceability along a given x and y axis.
```
sns.set_theme(style="ticks")

g = sns.jointplot(
    data=df,
    x="popularity", y="danceability", hue="artist_top_genre",
    kind="kde",
)
```
You can discover concentric circles around a general point of convergence, showing the distribution of points.

🎓 Note that this example uses a KDE (Kernel Density Estimate) graph that represents the data using a continuous probability density curve. This allows us to interpret data when working with multiple distributions.
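
To show what a kernel density estimate does under the hood, here is a minimal one-dimensional sketch in NumPy. The function `gaussian_kde_1d`, the sample values, and the fixed bandwidth are all assumptions for illustration; Seaborn selects its own bandwidth and works in two dimensions for the joint plot.

```python
import numpy as np

def gaussian_kde_1d(samples: np.ndarray, grid: np.ndarray, bandwidth: float) -> np.ndarray:
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical danceability-like values (illustrative only).
samples = np.array([0.55, 0.62, 0.70, 0.74, 0.76, 0.81, 0.83, 0.90])
grid = np.linspace(0.0, 1.5, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.05)

# A density should integrate to roughly 1 over a wide enough range.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Each sample contributes a small smooth bump, and the sum of the bumps is the continuous curve; a larger bandwidth gives a smoother curve with less detail.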

In general, the three genres align loosely in terms of their popularity and danceability. Determining clusters in this loosely-aligned data will be a challenge:

Create a scatter plot:

sns.FacetGrid(df, hue="artist_top_genre", height=5) \
   .map(plt.scatter, "popularity", "danceability") \
   .add_legend()

A scatterplot of the same axes shows a similar pattern of convergence

In general, for clustering, you can use scatterplots to show clusters of data, so mastering this type of visualization is very useful. In the next lesson, we will take this filtered data and use k-means clustering to discover groups in this data that seem to overlap in interesting ways.

🚀Challenge

In preparation for the next lesson, make a chart about the various clustering algorithms you might discover and use in a production environment. What kinds of problems is the clustering trying to address?

Post-lecture quiz

Review & Self Study

Before you apply clustering algorithms, as we have learned, it's a good idea to understand the nature of your dataset. Read more on this topic here

This helpful article walks you through the different ways that various clustering algorithms behave, given different data shapes.

Assignment

Research other visualizations for clustering

Disclaimer:
This document was translated using the AI translation service Co-op Translator. While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
