Unsupervised ML: Dimensionality Reduction
Dimensionality reduction is an unsupervised technique that reduces the number of features in a dataset without the need for labels. It is useful for visualization, data compression, and as a preprocessing step for machine learning models.
The most popular library for dimensionality reduction is scikit-learn (sklearn). It provides three main modules for dimensionality reduction algorithms:
- Decomposition algorithms (sklearn.decomposition)
- Manifold learning algorithms (sklearn.manifold)
- Discriminant analysis (sklearn.discriminant_analysis)
Some of the most popular dimensionality reduction algorithms included in scikit-learn are:
- Linear discriminant analysis (LDA) is a supervised dimensionality reduction algorithm that projects data points onto a lower-dimensional subspace that separates the different classes as well as possible (see the sketch after this list).
- Principal component analysis (PCA) is a linear dimensionality reduction algorithm that projects data points onto a lower-dimensional subspace that preserves as much of the variance of the data as possible (a full worked example appears below).
- Incremental principal component analysis (IPCA) is a memory-efficient alternative to PCA for large datasets: it fits essentially the same model while processing the data in mini-batches (see the sketch after this list).
- Kernel PCA (KPCA) is a nonlinear dimensionality reduction algorithm that projects data points onto a lower-dimensional subspace using a kernel function (see the sketch after this list).
- Sparse principal component analysis (SPCA) is a variant of PCA that extracts components with few nonzero loadings, which makes the components easier to interpret.
- PCA using randomized SVD approximates the leading principal components at a much lower computational cost than an exact singular value decomposition, trading a small amount of accuracy for speed (see the follow-up to the worked example below).
- Nonnegative matrix factorization (NMF) is a dimensionality reduction algorithm that decomposes a nonnegative data matrix into a nonnegative basis matrix and a coefficient matrix (see the sketch after this list).
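A minimal LDA sketch, using scikit-learn's bundled Iris dataset so it is self-contained (the dataset choice and n_components=2 are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA is supervised: fitting requires the class labels y, and it can
# produce at most (n_classes - 1) components -- here 3 - 1 = 2.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (150, 2)
```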
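A sketch of incremental PCA, assuming the data arrives (or is read) in batches; here np.array_split merely simulates that batching on a small dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA

X, _ = load_iris(return_X_y=True)

# Feed the data in mini-batches so only one batch needs to be in memory
# at a time; partial_fit updates the model incrementally.
ipca = IncrementalPCA(n_components=2)
for batch in np.array_split(X, 3):
    ipca.partial_fit(batch)

X_ipca = ipca.transform(X)
```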
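A kernel PCA sketch on data with a nonlinear structure that plain PCA cannot capture; the synthetic dataset, the RBF kernel, and gamma=10 are illustrative assumptions:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: no linear projection separates them.
X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Project onto two kernel principal components; in this transformed
# space the two circles become linearly separable.
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)
```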
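A minimal NMF sketch on a synthetic matrix (the random data and n_components=2 are assumptions; NMF requires every entry of the data matrix to be nonnegative):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 6))  # nonnegative by construction

# Factor X into two nonnegative matrices: X is approximated by W @ H.
nmf = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=0)
W = nmf.fit_transform(X)  # per-sample coefficients, shape (100, 2)
H = nmf.components_       # basis components, shape (2, 6)
```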
The choice of algorithm depends on the specific application. For example, PCA is a good choice when the structure in the data is mostly linear and no labels are available, LDA is a good choice when class labels are available and the goal is to separate those classes, and Kernel PCA is a good choice when the data has a nonlinear structure.
To use scikit-learn for dimensionality reduction: import the libraries, load the data, choose an algorithm, fit it to the data, transform the data, and evaluate the results.
For example:

```python
import numpy as np
from sklearn.decomposition import PCA

# Load the data (assumes a numeric CSV with no header row)
data = np.loadtxt('data.csv', delimiter=',')

# Choose a dimensionality reduction algorithm
pca = PCA(n_components=2)

# Fit the algorithm to the data
pca.fit(data)

# Transform the data
reduced_data = pca.transform(data)

# Evaluate the results
print('The explained variance ratio is:', pca.explained_variance_ratio_)
```
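As a follow-up, the randomized-SVD variant mentioned earlier only requires changing the svd_solver parameter; this sketch continues with the same `data` array and imports as the example above:

```python
# Approximate the leading components with randomized SVD; much faster
# on large matrices, at the cost of a small approximation error.
pca_rand = PCA(n_components=2, svd_solver='randomized', random_state=0)
reduced_rand = pca_rand.fit_transform(data)
```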
- Scikit-Learn Dimensionality Reduction Documentation (sklearn.decomposition)
- What is Dimensionality Reduction? J. Murel, E. Kavlakoglu. IBM.
- Dimensionality Reduction for Machine Learning. N. Barla. Neptune.ai.
- Introduction to Machine Learning with Scikit-Learn. Carpentries lesson.
- 6 Dimensionality Reduction Algorithms With Python. Jason Brownlee. Machine Learning Mastery.
Please see the Jupyter Notebook Example.
Created: 04/23/2023 (C. Lizárraga); Last update: 02/16/2025 (C. Lizárraga)
UArizona DataLab, Data Science Institute, 2025