
Unsupervised ML: Dimensionality Reduction


Unsupervised Machine Learning

Scikit-Learn: Dimensionality Reduction


Dimensionality reduction is a technique that reduces the number of features in a dataset, typically without the need for labels, and is useful for visualization, data compression, and as a preprocessing step for other machine learning tasks.

The most popular Python library for dimensionality reduction is scikit-learn (sklearn). It provides three main modules for dimensionality reduction algorithms:

  1. Decomposition algorithms (sklearn.decomposition)
  2. Manifold learning algorithms (sklearn.manifold)
  3. Discriminant analysis (sklearn.discriminant_analysis)

Some of the most popular dimensionality reduction algorithms included in scikit-learn are:

  • Linear discriminant analysis (LDA) is a supervised dimensionality reduction algorithm that projects data points onto a lower-dimensional subspace chosen to separate the different classes as well as possible (contrasted with PCA in the first sketch after this list).
  • Principal component analysis (PCA) is a linear dimensionality reduction algorithm that projects data points onto a lower-dimensional subspace that preserves as much of the variance of the data as possible.
  • Incremental principal component analysis (IPCA) is a replacement for PCA on datasets too large to fit in memory: it processes the data in small batches, so memory use depends on the number of features and the batch size rather than on the number of samples (see the batching sketch after this list).
  • Kernel PCA (KPCA) is a nonlinear dimensionality reduction algorithm that projects data points onto a lower-dimensional subspace using a kernel function.
  • Sparse principal component analysis (SPCA) is a variant of PCA that constrains the principal components to be sparse, which makes them easier to interpret.
  • PCA using randomized SVD approximates the singular value decomposition with a randomized algorithm, which reduces the computational cost when only the first few components are needed.
  • Nonnegative matrix factorization (NMF) decomposes a nonnegative data matrix into the product of a nonnegative basis matrix and a nonnegative coefficient matrix.
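To make the supervised/unsupervised contrast above concrete, here is a minimal sketch that reduces the Iris dataset to two dimensions with both PCA and LDA (the dataset and n_components=2 are illustrative choices, not part of the original text):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# PCA is unsupervised: it ignores the labels y and keeps the
# directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: it uses y to find directions that best
# separate the three iris species
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (150, 2) (150, 2)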
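Incremental PCA is fit batch by batch through its partial_fit method, which is what keeps memory use bounded. Below is a minimal sketch on synthetic data; the array sizes and batch count are arbitrary stand-ins for a dataset too large to fit in memory:

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))  # stand-in for a large dataset

ipca = IncrementalPCA(n_components=2)
for batch in np.array_split(X, 20):  # stream the data in 20 chunks
    ipca.partial_fit(batch)          # only one batch is processed at a time

X_reduced = ipca.transform(X)
print(X_reduced.shape)  # (10000, 2)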

The choice of which algorithm to use depends on the specific application. For example, PCA works well when the structure in the data is linear, kernel PCA is better suited to nonlinearly structured data, and LDA is appropriate when class labels are available and the goal is to separate those classes, as the sketch below illustrates.
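Here is a minimal sketch of that distinction, using scikit-learn's make_circles toy data (the dataset and the gamma value are illustrative assumptions): two concentric circles have no linear structure for PCA to exploit, while kernel PCA with an RBF kernel can unfold them.

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: the classes are not linearly separable
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# A linear projection leaves the circles nested inside each other
X_linear = PCA(n_components=2).fit_transform(X)

# An RBF kernel maps the data so the two circles become separable;
# gamma=10 is an illustrative choice, not a tuned value
X_kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)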


To use scikit-learn for dimensionality reduction: import the libraries, load the data, choose an algorithm, fit it to the data, transform the data, and evaluate the results.

For example:

import numpy as np
from sklearn.decomposition import PCA

# Load the data (assumes data.csv is a comma-separated numeric file)
data = np.loadtxt('data.csv', delimiter=',')

# Choose a dimensionality reduction algorithm
pca = PCA(n_components=2)

# Fit the algorithm to the data
pca.fit(data)

# Transform the data
reduced_data = pca.transform(data)

# Evaluate the results
print('The explained variance ratio is:', pca.explained_variance_ratio_)
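Continuing the example above, the explained variance ratio can also guide how many components to keep. A common heuristic (an illustrative choice, not something the workflow above prescribes) is to retain enough components to cover 95% of the variance; scikit-learn supports this directly when n_components is a float between 0 and 1:

# Fit PCA without fixing n_components to inspect all variance ratios
full_pca = PCA().fit(data)
print('Cumulative explained variance:',
      np.cumsum(full_pca.explained_variance_ratio_))

# Passing a float keeps the smallest number of components that
# explains at least that fraction of the total variance
pca_95 = PCA(n_components=0.95)
reduced_95 = pca_95.fit_transform(data)
print('Components kept for 95% variance:', pca_95.n_components_)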

References


Please see the Jupyter Notebook Example.


Created: 04/23/2023 (C. Lizárraga); Last update: 02/16/2025 (C. Lizárraga)

CC BY-NC-SA 4.0
