randomized-svd is a lightweight, high-performance Python library for computing the Randomized Singular Value Decomposition (rSVD).
It is designed to handle massive matrices efficiently by projecting them onto a small random subspace before computing the SVD. This approach is significantly faster than deterministic methods (such as LAPACK's `dgesdd`) while maintaining high numerical accuracy for low-rank approximations.
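The core idea fits in a few lines of NumPy. The sketch below is the textbook algorithm from Halko et al. (2011), shown for intuition only; the library's actual implementation adds oversampling, optional power iterations, and shape-aware dispatching:

```python
import numpy as np

def rsvd_sketch(A, k, seed=0):
    """Textbook randomized SVD (Halko et al., 2011) -- illustration only."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k))  # random test matrix (n x k)
    Q, _ = np.linalg.qr(A @ Omega)                # orthonormal basis for the sampled range
    B = Q.T @ A                                   # project A down to a small (k x n) matrix
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat, S, Vt                       # lift the left factors back to m dimensions
```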
Original Research: This library is the engineering implementation of the thesis *A Randomized Algorithm for SVD Calculation* (M. Fedrigo). You can read the full theoretical background in `docs/thesis.pdf`.
- **Smart Dispatching**: Automatically selects the optimal algorithm strategy for "Tall-and-Skinny" ($m \ge n$) vs. "Short-and-Fat" ($m < n$) matrices to minimize the memory footprint (see the sketch after this list).
- **Randomized PCA (New)**: Performs Principal Component Analysis with Virtual Centering. It computes PCA on sparse matrices without ever materializing the dense centered matrix ($X - \mu$), saving gigabytes of RAM.
- **Scikit-Learn Compatible**: Features a `RandomizedSVD` class wrapper that acts as a drop-in replacement for `TruncatedSVD`. Works natively with Pipelines and GridSearchCV.
- **Sparse Matrix Support**: Natively supports `scipy.sparse` matrices (CSR/CSC). It performs matrix multiplications without ever densifying the data.
- **Automatic Denoising**: Includes an implementation of the Gavish-Donoho method for optimal hard thresholding. Unlike basic heuristics, it solves the Marchenko-Pastur integral numerically to find the optimal rank, supporting both known noise levels (`sigma`) and automatic noise estimation.
- **Robustness**: Uses internal oversampling to ensure the target singular vectors are captured correctly, minimizing the probability of approximation errors.
- **Production Ready**: Fully type-hinted, unit-tested, and packaged with modern standards (`pyproject.toml`).
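The "Smart Dispatching" mentioned above boils down to always factoring along the smaller dimension. A hypothetical sketch of the strategy (not the library's exact code), where `rank_k_svd` is any routine returning `(U, S, Vt)` with `k` components, such as `rsvd_sketch` above:

```python
import numpy as np

def dispatch_sketch(A, k, rank_k_svd):
    """Shape-aware dispatch: factor the transpose when the matrix is short-and-fat."""
    m, n = A.shape
    if m >= n:                       # tall-and-skinny: factor A directly
        return rank_k_svd(A, k)
    U, S, Vt = rank_k_svd(A.T, k)    # short-and-fat: factor A^T instead
    return Vt.T, S, U.T              # A^T = U S V^T  =>  A = V S U^T
```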
Choose the installation method based on your needs.
If you just want to use the library in your own project:
- Activate your project's virtual environment (recommended).
- Install from PyPI:

```bash
pip install randomized-svd
```
If you want to run tests, modify the code, or contribute:
- Clone the repository:

```bash
git clone https://github.com/massimofedrigo/randomized-svd.git
cd randomized-svd
```
- Create and Activate a Virtual Environment:

  Linux / macOS:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

  Windows (PowerShell):

  ```powershell
  python -m venv venv
  .\venv\Scripts\activate
  ```
- Install in Editable Mode: This installs the package along with testing dependencies (`pytest`, etc.) and reflects code changes immediately.

  ```bash
  pip install -e ".[dev]"
  ```
Compute the approximate SVD of a generic matrix.

```python
import numpy as np
from randomized_svd import rsvd
# Generate a large random matrix (1000 x 500)
X = np.random.randn(1000, 500)
# Compute rSVD with target rank k=10
U, S, Vt = rsvd(X, t=10)
print(f"U shape: {U.shape}") # (1000, 10)
print(f"S shape: {S.shape}") # (10, 10)
print(f"Vt shape: {Vt.shape}") # (10, 500)Use the Gavish-Donoho optimal threshold to remove white noise from a signal.
```python
import numpy as np
from randomized_svd import rsvd, optimal_rank
# 1. Create a synthetic signal + noise
X_true = np.random.randn(1000, 5) @ np.random.randn(5, 500) # Rank 5
X_noisy = X_true + 1.0 * np.random.randn(1000, 500) # Add Noise (sigma=1.0)
# 2. Run rSVD (with a generous 't' to capture the tail)
U, S, Vt = rsvd(X_noisy, t=100)
# 3. Find optimal rank automatically
# sigma=None triggers automatic noise level estimation from the data median
k_opt = optimal_rank(m=1000, n=500, S=S, sigma=None)
print(f"Optimal Rank Found: {k_opt}") # Should be close to 5
# 4. Truncate to clean signal
X_clean = U[:, :k_opt] @ np.diag(S[:k_opt]) @ Vt[:k_opt, :]
```

For Recommender Systems or NLP, matrices are often huge but sparse. `rsvd` handles `scipy.sparse` objects directly, saving RAM: the algorithm only touches the matrix through matrix products, which sparse formats support natively.
```python
import scipy.sparse as sp
from randomized_svd import rsvd
# Create a massive 10k x 10k matrix with 1% density
X_sparse = sp.rand(10000, 10000, density=0.01, format='csr')
# Compute SVD directly (fast & memory efficient)
U, S, Vt = rsvd(X_sparse, t=50)
print(f"Computed {len(S)} singular values without OOM errors.")For matrices with slowly decaying singular values (flat spectrum), standard randomized projections might miss some information. You can improve accuracy using Power Iterations (p) and Oversampling.
- `p` (Power Iterations): Sharpens the spectrum by implicitly working with $(AA^\top)^p A$, whose singular values are $\sigma_i^{2p+1}$; this makes the dominant singular values "pop out" more clearly.
- `oversampling`: Projects onto a slightly larger subspace (of dimension $t + \text{oversampling}$) to create a safety buffer, which is truncated back to $t$ at the end.
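Mechanically, the two options modify the basic sketch from the introduction roughly as follows (again a hypothetical illustration, not the library's actual code):

```python
import numpy as np

def rsvd_power_sketch(A, k, p=2, oversampling=10, seed=0):
    """Randomized SVD with power iterations and oversampling (illustrative only)."""
    rng = np.random.default_rng(seed)
    ell = k + oversampling                       # enlarged sketch size (safety buffer)
    Q, _ = np.linalg.qr(A @ rng.standard_normal((A.shape[1], ell)))
    for _ in range(p):                           # power iterations: alternate A^T / A,
        Q, _ = np.linalg.qr(A.T @ Q)             # re-orthonormalizing at each step
        Q, _ = np.linalg.qr(A @ Q)               # for numerical stability
    B = Q.T @ A                                  # small (ell x n) projection
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_hat)[:, :k], S[:k], Vt[:k, :]  # truncate the buffer back to k
```

The library exposes the same knobs as keyword arguments: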
```python
import numpy as np
from randomized_svd import rsvd
X = np.random.randn(1000, 500)
# Compute rSVD with:
# - Target rank t=10
# - Power iterations p=2 (Recommended for slowly decaying spectra)
# - Oversampling=20 (Compute 30 components internally, return best 10)
U, S, Vt = rsvd(X, t=10, p=2, oversampling=20)
```

Use the `RandomizedSVD` class to integrate seamlessly with Scikit-Learn pipelines.
```python
import numpy as np
from randomized_svd import RandomizedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data
X = np.random.randn(1000, 500)
# Create a pipeline: Scale data -> Reduce Dimensions
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rsvd', RandomizedSVD(n_components=10, random_state=42))
])
# Fit and Transform in one go
X_reduced = pipeline.fit_transform(X)
```

Perform PCA on sparse data without centering explicitly (which would destroy sparsity). `rpca` handles this automatically.
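Why this works: centering only shifts matrix-vector products by a rank-one term, $(X - \mathbf{1}\mu^\top)\,v = Xv - \mathbf{1}(\mu^\top v)$, so the centered matrix never has to exist in memory. A minimal illustration of the identity (hypothetical code, not the library's internals):

```python
import numpy as np
import scipy.sparse as sp

X = sp.rand(1000, 200, density=0.01, format='csr')
mu = np.asarray(X.mean(axis=0)).ravel()  # column means, shape (200,)
v = np.random.randn(200)

dense_path = (X.toarray() - mu) @ v      # materializes the dense centered matrix
virtual_path = X @ v - (mu @ v)          # sparse product + rank-one correction

assert np.allclose(dense_path, virtual_path)
```

With `rpca`, the sparse matrix goes in directly: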
```python
import scipy.sparse as sp
from randomized_svd import rpca
# Sparse matrix with non-zero mean
X_sparse = sp.rand(1000, 1000, density=0.01, format='csr')
# Compute PCA directly
# This avoids creating the dense (X - mean) matrix in memory
U, S, Vt = rpca(X_sparse, t=10)
```

The project follows a modern src-layout to prevent import errors and ensure clean packaging.
```
randomized-svd/
├── .github/workflows/      # CI/CD pipelines
├── docs/                   # Thesis PDF and extra documentation
├── examples/               # Jupyter Notebooks (Demos & Benchmarks)
├── src/                    # Source code
│   └── randomized_svd/     # Package source
│       ├── __init__.py
│       ├── core.py         # Main rSVD logic (Facade & Implementations)
│       ├── pca.py          # Randomized PCA & Virtual Centering
│       ├── sklearn.py      # Scikit-Learn Wrapper
│       └── threshold.py    # Gavish-Donoho & Marchenko-Pastur logic
├── tests/                  # Pytest suite
├── Dockerfile              # Reproducible testing environment
├── pyproject.toml          # Dependencies and metadata
└── README.md
```
To ensure reproducibility across different machines and operating systems, we provide a Dockerfile.
Note: Docker is primarily used here for running the test suite in an isolated, clean environment. For using the library in your own projects, the standard `pip install` (above) is recommended.
Build the image:

```bash
docker build -t randomized-svd-test .
```

Run the test suite:

```bash
docker run randomized-svd-test
```
Benchmarks were run on an Intel i7 with 16 GB RAM.

| Matrix Size | Method | Time (s) | Speedup |
|---|---|---|---|
| 5000 x 5000 | rSVD (k=50) | 0.82 | ~12x |
| 5000 x 5000 | NumPy SVD | 9.94 | - |

See `examples/2_benchmark_performance.ipynb` for the full reproduction script.
We use pytest for unit testing, covering:
- Invariance: Output dimensions match mathematical expectations.
- Accuracy: Reconstruction error on low-rank matrices is negligible.
- Orthogonality: The $U$ and $V$ matrices are verified to be orthogonal.
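For a flavor of these checks, here is a hypothetical test in the spirit of the suite (not copied from it):

```python
import numpy as np
from randomized_svd import rsvd

def test_orthogonality():
    X = np.random.randn(200, 100)
    U, S, Vt = rsvd(X, t=10)
    # Columns of U and rows of Vt should be orthonormal
    np.testing.assert_allclose(U.T @ U, np.eye(10), atol=1e-8)
    np.testing.assert_allclose(Vt @ Vt.T, np.eye(10), atol=1e-8)
```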
Run tests locally (requires the dev installation):

```bash
pytest -v
```
- Fedrigo, M. (2024). *A Randomized Algorithm for SVD Calculation*. Available in `docs/thesis.pdf`.
- Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. *SIAM Review*, 53(2), 217-288.
- Gavish, M., & Donoho, D. L. (2014). The optimal hard threshold for singular values is $4/\sqrt{3}$. *IEEE Transactions on Information Theory*, 60(8), 5040-5053.
- Brunton, S. L., & Kutz, J. N. (2019). *Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control*. Cambridge University Press.
This project is licensed under the MIT License - see the LICENSE file for details.
Author: Massimo Fedrigo
Portfolio & Research: massimofedrigo.com
Contact: contact@massimofedrigo.com