
Unsupervised ML: Dimensionality Reduction


Unsupervised Machine Learning

Scikit-Learn: Dimensionality Reduction


Dimensionality reduction is a technique that reduces the number of features in a dataset, typically without the need for labels, and is useful for visualization, data compression, and as a preprocessing step for other machine learning tasks.

The most popular Python library for dimensionality reduction is scikit-learn (sklearn). It provides three main modules for dimensionality reduction algorithms:

  1. Decomposition algorithms (sklearn.decomposition)
  2. Manifold learning algorithms (sklearn.manifold)
  3. Discriminant analysis (sklearn.discriminant_analysis)

Some of the most popular dimensionality reduction algorithms included in scikit-learn are:

  • Linear discriminant analysis (LDA) is a supervised dimensionality reduction algorithm that projects data points onto a lower-dimensional subspace chosen to separate the different classes as well as possible (contrasted with PCA in the first sketch after this list).
  • Principal component analysis (PCA) is a linear dimensionality reduction algorithm that projects data points onto a lower-dimensional subspace that preserves as much of the variance of the data as possible.
  • Incremental principal component analysis (IPCA) is a replacement for PCA on datasets too large to fit in memory: it processes the data in small batches, so memory use depends on the number of features and the batch size rather than on the number of samples (see the batching sketch after this list).
  • Kernel PCA (KPCA) is a nonlinear dimensionality reduction algorithm that projects data points onto a lower-dimensional subspace using a kernel function.
  • Sparse principal component analysis (SPCA) is a variant of PCA that constrains the principal components to be sparse, which makes them easier to interpret.
  • PCA using randomized SVD approximates the singular value decomposition with a randomized algorithm, which reduces the computational cost when only the first few components are needed.
  • Nonnegative matrix factorization (NMF) decomposes a nonnegative data matrix into the product of a nonnegative basis matrix and a nonnegative coefficient matrix.
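To make the supervised/unsupervised contrast above concrete, here is a minimal sketch that reduces the Iris dataset to two dimensions with both PCA and LDA (the dataset and n_components=2 are illustrative choices, not part of the original text):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# PCA is unsupervised: it ignores the labels y and keeps the
# directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it uses y to find directions that best
# separate the three iris species
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (150, 2) (150, 2)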
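Incremental PCA is fit batch by batch through its partial_fit method, which is what keeps memory use bounded. Below is a minimal sketch on synthetic data; the array sizes and batch count are arbitrary stand-ins for a dataset too large to fit in memory:

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))  # stand-in for a large dataset

ipca = IncrementalPCA(n_components=2)
for batch in np.array_split(X, 20):  # stream the data in 20 chunks
    ipca.partial_fit(batch)          # only one batch is processed at a time

X_reduced = ipca.transform(X)
print(X_reduced.shape)  # (10000, 2)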

The choice of which algorithm to use depends on the specific application. For example, PCA works well when the structure in the data is linear, kernel PCA is better suited to nonlinearly structured data, and LDA is appropriate when class labels are available and the goal is to separate those classes, as the sketch below illustrates.
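Here is a minimal sketch of that distinction, using scikit-learn's make_circles toy data (the dataset and the gamma value are illustrative assumptions): two concentric circles have no linear structure for PCA to exploit, while kernel PCA with an RBF kernel can unfold them.

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: the classes are not linearly separable
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# A linear projection leaves the circles nested inside each other
X_linear = PCA(n_components=2).fit_transform(X)

# An RBF kernel maps the data so the two circles become separable;
# gamma=10 is an illustrative choice, not a tuned value
X_kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)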


To use scikit-learn for dimensionality reduction: import the libraries, load the data, choose an algorithm, fit it to the data, transform the data, and evaluate the results.

For example:

import numpy as np
from sklearn.decomposition import PCA

# Load the data (assumes data.csv is a comma-separated numeric file)
data = np.loadtxt('data.csv', delimiter=',')

# Choose a dimensionality reduction algorithm
pca = PCA(n_components=2)

# Fit the algorithm to the data
pca.fit(data)

# Transform the data
reduced_data = pca.transform(data)

# Evaluate the results
print('The explained variance ratio is:', pca.explained_variance_ratio_)
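Continuing the example above, the explained variance ratio can also guide how many components to keep. A common heuristic (an illustrative choice, not something the workflow above prescribes) is to retain enough components to cover 95% of the variance; scikit-learn supports this directly when n_components is a float between 0 and 1:

# Fit PCA without fixing n_components to inspect all variance ratios
full_pca = PCA().fit(data)
print('Cumulative explained variance:',
      np.cumsum(full_pca.explained_variance_ratio_))

# Passing a float keeps the smallest number of components that
# explains at least that fraction of the total variance
pca_95 = PCA(n_components=0.95)
reduced_95 = pca_95.fit_transform(data)
print('Components kept for 95% variance:', pca_95.n_components_)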

References


Please see the Jupyter Notebook Example.


Created: 04/23/2023 (C. Lizárraga); Last update: 02/16/2025 (C. Lizárraga)

CC BY-NC-SA 4.0
