
randomized-svd: Fast Randomized SVD implemented in Python


randomized-svd is a lightweight, high-performance Python library for computing the Randomized Singular Value Decomposition (rSVD).

It is designed to handle massive matrices efficiently by projecting them onto a small random subspace before computing the SVD. This approach is significantly faster than deterministic methods (like LAPACK's dgesdd) while maintaining high numerical accuracy for low-rank approximations.
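The projection idea can be sketched in a few lines of NumPy. This is a simplified illustration of the classic randomized scheme (Halko, Martinsson & Tropp, 2011), not the library's exact implementation; `rsvd_sketch` and its parameters are illustrative names.

```python
import numpy as np

def rsvd_sketch(X, k, oversample=10):
    """Minimal randomized SVD sketch:
    1) sample the column space of X with a random test matrix,
    2) compute a deterministic SVD on the small projected matrix,
    3) lift the left factor back to the original space."""
    Omega = np.random.randn(X.shape[1], k + oversample)
    Q, _ = np.linalg.qr(X @ Omega)           # orthonormal basis for range(X @ Omega)
    U_small, S, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return (Q @ U_small)[:, :k], S[:k], Vt[:k, :]

np.random.seed(0)
X = np.random.randn(200, 5) @ np.random.randn(5, 100)   # exactly rank 5
U, S, Vt = rsvd_sketch(X, k=5)

err = np.linalg.norm(X - U @ np.diag(S) @ Vt) / np.linalg.norm(X)
print(f"relative reconstruction error: {err:.2e}")       # near machine precision
```

Because the expensive SVD runs on the small `(k + oversample) x n` matrix rather than on `X` itself, the cost drops dramatically whenever `k` is much smaller than the matrix dimensions.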

Original Research: This library is the engineering implementation of the thesis "A Randomized Algorithm for SVD Calculation" (M. Fedrigo). You can read the full theoretical background in docs/thesis.pdf.


πŸš€ Key Features

  • Smart Dispatching: Automatically selects the optimal algorithm strategy for "Tall-and-Skinny" ($m \ge n$) vs "Short-and-Fat" ($m < n$) matrices to minimize memory footprint.
  • Randomized PCA (New): Performs Principal Component Analysis with Virtual Centering. It computes PCA on sparse matrices without ever materializing the dense centered matrix ($X - \mu$), saving gigabytes of RAM.
  • Scikit-Learn Compatible: Features a RandomizedSVD class wrapper that acts as a drop-in replacement for TruncatedSVD. Works natively with Pipelines and GridSearchCV.
  • Sparse Matrix Support: Natively supports scipy.sparse matrices (CSR/CSC). It performs matrix multiplications without ever densifying the data.
  • Automatic Denoising: Includes an implementation of the Gavish-Donoho method for optimal hard thresholding. Unlike basic heuristics, it solves the Marchenko-Pastur integral numerically to find the optimal rank, supporting both known noise levels (sigma) and automatic noise estimation.
  • Robustness: Uses internal Oversampling to ensure the target singular vectors are captured correctly, minimizing the probability of approximation errors.
  • Production Ready: Fully type-hinted, unit-tested, and packaged with modern standards (pyproject.toml).

πŸ›  Installation

Choose the installation method based on your needs.

πŸ“¦ For Users (Standard Usage)

If you just want to use the library in your own project:

  1. Activate your project's virtual environment (recommended).
  2. Install from PyPI:
pip install randomized-svd

πŸ’» For Developers (Contributing)

If you want to run tests, modify the code, or contribute:

  1. Clone the repository:
git clone https://github.com/massimofedrigo/randomized-svd.git
cd randomized-svd
  2. Create and Activate a Virtual Environment: Linux / macOS:
python3 -m venv venv
source venv/bin/activate

Windows (PowerShell):

python -m venv venv
.\venv\Scripts\activate
  3. Install in Editable Mode: This installs the package along with testing dependencies (pytest, etc.) and reflects code changes immediately.
pip install -e ".[dev]"

⚑ Quick Start

1. Basic Decomposition

Compute the approximated SVD of a generic matrix.

import numpy as np
from randomized_svd import rsvd

# Generate a large random matrix (1000 x 500)
X = np.random.randn(1000, 500)

# Compute rSVD with target rank t=10
U, S, Vt = rsvd(X, t=10)

print(f"U shape: {U.shape}")    # (1000, 10)
print(f"S shape: {S.shape}")    # (10,) - a 1-D vector of singular values
print(f"Vt shape: {Vt.shape}")  # (10, 500)

2. Automatic Noise Reduction (Denoising)

Use the Gavish-Donoho optimal threshold to remove white noise from a signal.

import numpy as np
from randomized_svd import rsvd, optimal_rank

# 1. Create a synthetic signal + noise
X_true = np.random.randn(1000, 5) @ np.random.randn(5, 500)  # Rank 5
X_noisy = X_true + 1.0 * np.random.randn(1000, 500)          # Add Noise (sigma=1.0)

# 2. Run rSVD (with a generous 't' to capture the tail)
U, S, Vt = rsvd(X_noisy, t=100)

# 3. Find optimal rank automatically
# sigma=None triggers automatic noise level estimation from the data median
k_opt = optimal_rank(m=1000, n=500, S=S, sigma=None)
print(f"Optimal Rank Found: {k_opt}")  # Should be close to 5

# 4. Truncate to clean signal
X_clean = U[:, :k_opt] @ np.diag(S[:k_opt]) @ Vt[:k_opt, :]
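For reference, the Gavish-Donoho threshold for a known noise level sigma has a closed form (from their 2014 paper). The sketch below illustrates the coefficient, not the library's solver; `gd_threshold` is an illustrative name, and the automatic sigma-estimation path is omitted.

```python
import numpy as np

def gd_threshold(m, n, sigma):
    """Optimal hard threshold for the singular values of an m x n matrix
    contaminated with white noise of known standard deviation sigma.
    beta is the aspect ratio (<= 1); for square matrices the coefficient
    reduces to the famous 4/sqrt(3)."""
    beta = min(m, n) / max(m, n)
    lam = np.sqrt(2 * (beta + 1)
                  + 8 * beta / (beta + 1 + np.sqrt(beta**2 + 14 * beta + 1)))
    return lam * np.sqrt(max(m, n)) * sigma

# Square case: the coefficient is 4/sqrt(3) ~= 2.309
print(gd_threshold(1000, 1000, 1.0) / np.sqrt(1000))  # ~2.309
```

Singular values below this threshold are more likely noise than signal and are discarded; the unknown-sigma case replaces sigma with a median-based estimate from the singular values themselves.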

3. Sparse Matrices (Memory Efficient)

For Recommender Systems or NLP, matrices are often huge but sparse. rsvd handles scipy.sparse objects directly, saving RAM.

import scipy.sparse as sp
from randomized_svd import rsvd

# Create a massive 10k x 10k matrix with 1% density
X_sparse = sp.rand(10000, 10000, density=0.01, format='csr')

# Compute SVD directly (fast & memory efficient)
U, S, Vt = rsvd(X_sparse, t=50)

print(f"Computed {len(S)} singular values without OOM errors.")
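Why sparsity survives: the algorithm only ever touches X through matrix products, and scipy.sparse evaluates those against dense factors without densifying X. A minimal illustration of the two products involved (assuming the standard sketch-then-project structure, not the library's internals):

```python
import numpy as np
import scipy.sparse as sp

np.random.seed(0)
m = n = 5000
X = sp.rand(m, n, density=0.01, format='csr')   # ~250k stored values, not 25M

# The only operations involving X are two products, and both yield
# small dense arrays while X itself stays sparse.
Omega = np.random.randn(n, 60)
Y = X @ Omega                    # dense (5000, 60): sketch of the column space
Q, _ = np.linalg.qr(Y)
B = (X.T @ Q).T                  # dense (60, 5000); same as Q.T @ X, written
                                 # with the sparse operand on the left
print(Y.shape, B.shape)
```

Peak memory is dominated by the thin `(m, k)` and `(k, n)` dense intermediates, not by the full matrix.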

4. High Accuracy Mode

For matrices with slowly decaying singular values (flat spectrum), standard randomized projections might miss some information. You can improve accuracy using Power Iterations (p) and Oversampling.

  • p (Power Iterations): Implicitly works with $(XX^T)^p X$ instead of $X$, raising the singular values to the power $2p+1$ so the dominant ones "pop out" more clearly.
  • oversampling: Projects onto a slightly larger subspace (of dimension t + oversampling) to create a safety buffer, which is truncated back to t at the end.
import numpy as np
from randomized_svd import rsvd

X = np.random.randn(1000, 500)

# Compute rSVD with:
# - Target rank t=10
# - Power iterations p=2 (Recommended for slowly decaying spectra)
# - Oversampling=20 (Compute 30 components internally, return best 10)
U, S, Vt = rsvd(X, t=10, p=2, oversampling=20)
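The effect of p is easy to see on a synthetic matrix with a slowly decaying spectrum. The NumPy sketch below (illustrative, not the library's code) measures how much of X a randomized basis captures with and without power iterations:

```python
import numpy as np

np.random.seed(0)

def range_finder(X, k, p):
    """Randomized range finder with p power iterations; re-orthonormalizing
    after every multiply keeps the iteration numerically stable."""
    Q, _ = np.linalg.qr(X @ np.random.randn(X.shape[1], k))
    for _ in range(p):
        Q, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Q)
    return Q

# Build a 400 x 300 matrix whose singular values decay slowly (flat spectrum)
U0, _ = np.linalg.qr(np.random.randn(400, 300))
V0, _ = np.linalg.qr(np.random.randn(300, 300))
X = (U0 * (1.0 / np.arange(1, 301) ** 0.5)) @ V0.T

errors = {}
for p in (0, 2):
    Q = range_finder(X, k=20, p=p)
    errors[p] = np.linalg.norm(X - Q @ (Q.T @ X)) / np.linalg.norm(X)
    print(f"p={p}: relative range-capture error {errors[p]:.4f}")
```

With p=2 the captured subspace is noticeably closer to the true dominant subspace, at the cost of two extra passes over the matrix per iteration.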

5. Scikit-Learn API (Pipelines & GridSearch)

Use the RandomizedSVD class to integrate seamlessly with Scikit-Learn pipelines.

from randomized_svd import RandomizedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create a pipeline: Scale data -> Reduce Dimensions
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rsvd', RandomizedSVD(n_components=10, random_state=42))
])

# Fit and Transform in one go
X_reduced = pipeline.fit_transform(X)

6. Randomized PCA (Virtual Centering)

Perform PCA on sparse data without centering explicitly (which would destroy sparsity). rpca handles this automatically.

import scipy.sparse as sp
from randomized_svd import rpca

# Sparse matrix with non-zero mean
X_sparse = sp.rand(1000, 1000, density=0.01, format='csr')

# Compute PCA directly
# This avoids creating the dense (X - mean) matrix in memory
U, S, Vt = rpca(X_sparse, t=10)
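Virtual centering rests on one identity: if $X_c = X - \mathbf{1}\mu^T$ is the centered matrix, then $X_c \Omega = X\Omega - \mathbf{1}(\mu^T\Omega)$, so every product the algorithm needs is the sparse product plus a rank-one correction. A minimal NumPy/SciPy illustration of the identity (not the library's internals):

```python
import numpy as np
import scipy.sparse as sp

np.random.seed(0)
X = sp.rand(1000, 200, density=0.05, format='csr')
mu = np.asarray(X.mean(axis=0)).ravel()        # column means, shape (200,)
Omega = np.random.randn(200, 10)

# Product with the centered matrix, WITHOUT materializing X - mu:
# (X - 1 mu^T) @ Omega = X @ Omega - outer(1, mu @ Omega)
Y_virtual = X @ Omega - np.outer(np.ones(1000), mu @ Omega)

# Check against the explicit dense centered matrix
Y_dense = (X.toarray() - mu) @ Omega
print(np.allclose(Y_virtual, Y_dense))  # True
```

The dense centered matrix is built here only to verify the identity; in practice only the sparse X, the mean vector, and the thin dense intermediates ever exist in memory.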

πŸ— Project Structure

The project follows a modern src-layout to prevent import errors and ensure clean packaging.

randomized-svd/
β”œβ”€β”€ .github/workflows/    # CI/CD pipelines
β”œβ”€β”€ docs/                 # Thesis PDF and extra documentation
β”œβ”€β”€ examples/             # Jupyter Notebooks (Demos & Benchmarks)
β”œβ”€β”€ src/                  # Source code
β”‚   └── randomized_svd/   # Package source
β”‚       β”œβ”€β”€ __init__.py
β”‚       β”œβ”€β”€ core.py       # Main rSVD logic (Facade & Implementations)
β”‚       β”œβ”€β”€ pca.py        # Randomized PCA & Virtual Centering
β”‚       β”œβ”€β”€ sklearn.py    # Scikit-Learn Wrapper
β”‚       └── threshold.py  # Gavish-Donoho & Marchenko-Pastur logic
β”œβ”€β”€ tests/                # Pytest suite
β”œβ”€β”€ Dockerfile            # Reproducible testing environment
β”œβ”€β”€ pyproject.toml        # Dependencies and metadata
└── README.md


🐳 Docker Support

To ensure reproducibility across different machines and operating systems, we provide a Dockerfile.

Note: Docker is primarily used here for running the test suite in an isolated, clean environment. For using the library in your own projects, the standard pip install (above) is recommended.

Build the image:

docker build -t randomized-svd-test .

Run the test suite:

docker run randomized-svd-test

πŸ“ˆ Performance

Benchmarks run on an Intel i7, 16GB RAM.

Matrix Size    Method        Time (s)   Speedup
5000 x 5000    rSVD (k=50)   0.82       ~12x
5000 x 5000    NumPy SVD     9.94       -

See examples/2_benchmark_performance.ipynb for the full reproduction script.


πŸ§ͺ Testing

We use pytest for unit testing, covering:

  1. Invariance: Output dimensions match mathematical expectations.
  2. Accuracy: Reconstruction error on low-rank matrices is negligible.
  3. Orthogonality: The U and V factors are verified to be orthonormal.

Run tests locally (requires dev installation):

pytest -v

πŸ“š References

  1. Fedrigo, M. (2024). A Randomized Algorithm for SVD Calculation. Available in docs/thesis.pdf.
  2. Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217-288.
  3. Gavish, M., & Donoho, D. L. (2014). The optimal hard threshold for singular values is $4/\sqrt{3}$. IEEE Transactions on Information Theory, 60(8), 5040-5053.
  4. Brunton, S. L., & Kutz, J. N. (2019). Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. Cambridge University Press.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

Author: Massimo Fedrigo

Portfolio & Research: massimofedrigo.com

Contact: contact@massimofedrigo.com
