Reimplementation and Hierarchical Extensions of scVAE (Grønbech et al., 2020)
Paper • Original Code • Report • Poster
Course project – Introduction to Probabilistic Graphical Models
Master MVA (ENS Paris-Saclay)
Authors: Raphaël Rubrice · Adam Keddis · Tiffney Aina
This repository provides a from-scratch reimplementation of scVAE and extends it with hierarchical mixture models for single-cell RNA-seq data.
scRNA-seq data are:
- high-dimensional,
- sparse,
- overdispersed,
- and biologically hierarchical (lineages → subtypes).
While scVAE models counts using a Gaussian Mixture VAE with a Negative Binomial likelihood, it relies on a flat mixture prior, which cannot explicitly encode biological hierarchies.
We address this limitation by introducing two extensions:
- IndMoMVAE – Independent Mixture-of-Mixtures VAE
- MoMixVAE – Hierarchical Mixture-of-Mixtures VAE
All models are evaluated on PBMC datasets with a curated 4-level cell-type hierarchy.
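To make the Negative Binomial likelihood mentioned above concrete, here is a minimal sketch of its log-probability in the mean/dispersion parameterization. The function names `nb_log_pmf` and `nb_variance` are illustrative and are not taken from the repository; the parameterization actually used in `distributions.py` may differ.

```python
import math

def nb_log_pmf(k: int, mu: float, r: float) -> float:
    """Log-probability of count k under a Negative Binomial with mean mu > 0
    and dispersion r > 0 (assumed mean/dispersion parameterization)."""
    # log of the generalized binomial coefficient, valid for non-integer r
    log_coeff = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return (log_coeff
            + r * math.log(r / (r + mu))
            + k * math.log(mu / (r + mu)))

def nb_variance(mu: float, r: float) -> float:
    """Variance mu + mu^2 / r, which exceeds the Poisson variance mu."""
    return mu + mu * mu / r
```

The variance term `mu**2 / r` is what lets the model absorb overdispersion: as `r` grows, the distribution approaches a Poisson with the same mean.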
MixtureVAE is a flexible generalization of scVAE:
- Arbitrary latent priors (Normal, Student-t)
- Explicit categorical prior for clustering
- Modular distributions and training loops
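As an illustration of a mixture latent prior with a categorical weight per cluster, the sketch below evaluates log p(z) for factorized Normal or Student-t components. All names (`mixture_log_prior`, `weights`, `locs`, `scales`, `df`) are assumptions made for this example; the repository's own classes are not reproduced here.

```python
import math

def normal_logpdf(x, loc, scale):
    z = (x - loc) / scale
    return -0.5 * z * z - math.log(scale) - 0.5 * math.log(2 * math.pi)

def student_t_logpdf(x, loc, scale, df):
    z = (x - loc) / scale
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2 * math.log1p(z * z / df))

def logsumexp(values):
    # Subtract the maximum before exponentiating for numerical stability
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_log_prior(z, weights, locs, scales, df=None):
    """log p(z) = log sum_k pi_k p(z | k), with components that factorize
    over latent dimensions. df=None selects Normal components, otherwise
    Student-t components with df degrees of freedom."""
    terms = []
    for pi_k, loc_k, scale_k in zip(weights, locs, scales):
        if df is None:
            comp = sum(normal_logpdf(x, m, s)
                       for x, m, s in zip(z, loc_k, scale_k))
        else:
            comp = sum(student_t_logpdf(x, m, s, df)
                       for x, m, s in zip(z, loc_k, scale_k))
        terms.append(math.log(pi_k) + comp)
    return logsumexp(terms)
```

Normalizing the per-component terms (before the final log-sum-exp) gives the cluster responsibilities, which is how a mixture prior doubles as a clustering mechanism.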
IndMoMVAE:
- Multiple independent mixture branches
- Each branch learns a different partition of the data
- No hierarchical dependency between levels
This model acts as an ablation baseline to isolate the effect of hierarchy.
MoMixVAE:
- Explicit hierarchical dependencies between clustering levels
- Coarse-to-fine latent organization
- Structured variational posterior
- Hierarchical ELBO with:
  - β-scaled KL terms
  - marginal-usage regularization to prevent component collapse
This model best reflects biological lineage structure.
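The two ELBO ingredients listed above can be sketched as follows. This is one plausible form, not the repository's exact objective: the usage penalty is assumed to be the KL divergence between the batch-averaged cluster responsibilities and a uniform distribution, and the names `hierarchical_objective`, `betas`, and `lam` are hypothetical.

```python
import math

def categorical_kl(q, p):
    """KL(q || p) between two categorical distributions (0 log 0 := 0)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def marginal_usage_penalty(batch_q):
    """Assumed regularizer: KL(average responsibilities || uniform).
    It is zero when all components are used equally across the batch and
    grows as the batch collapses onto a few components."""
    n, k = len(batch_q), len(batch_q[0])
    avg = [sum(q[j] for q in batch_q) / n for j in range(k)]
    return categorical_kl(avg, [1.0 / k] * k)

def hierarchical_objective(recon_log_lik, kl_per_level, betas,
                           usage_penalties, lam):
    """Objective to maximize: reconstruction term minus beta-scaled KL
    terms (one per hierarchy level) minus the weighted usage penalties."""
    kl_term = sum(b * kl for b, kl in zip(betas, kl_per_level))
    return recon_log_lik - kl_term - lam * sum(usage_penalties)
```

For a batch where every cell is assigned to the same one of two components, the penalty equals log 2; for a perfectly balanced batch it is 0.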
To better motivate the proposed extensions, the table below summarizes the main limitations of a flat mixture prior and the corresponding solutions implemented in this project:
| Limitation | Proposed solution |
|---|---|
| Standard Gaussian prior does not allow cell-type clustering | MixtureVAE replaces the Gaussian prior with a mixture distribution to learn latent clusters |
| No hierarchical structure between clustering levels | IndMoMVAE trains multiple independent mixture branches to explore different clustering granularities |
| No joint modeling of hierarchical levels | MoMixVAE uses a mixture-of-mixtures formulation to jointly learn clusters at multiple hierarchy levels |
We use uv, a modern and fast Python package manager.

    # Install uv
    curl -LsSf https://astral.sh/uv/install.sh | sh

    # Clone the repository
    git clone https://github.com/raphaelrubrice/scVAE_mva2025.git
    cd scVAE_mva2025

    # Create and activate environment
    uv venv
    source .venv/bin/activate
    uv sync

The full PBMC processing pipeline is implemented in `data_pipeline/` and is fully reproducible.
- Download raw PBMC datasets from 10x Genomics
- Load raw 10x matrices into `AnnData` objects
- Annotate cells with a curated 4-level hierarchical cell-type taxonomy
- Freeze label vocabularies across datasets
- Shard each dataset as `.h5ad`
- Combine shards into a unified `AnnCollection`
- Build stratified train / validation / test PyTorch `DataLoader`s
The pipeline is compatible with local execution and Google Colab.
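The stratified splitting step in the pipeline above can be illustrated with a small standalone sketch. It works on a plain list of labels rather than on `AnnCollection` objects, and the function name `stratified_split`, the 80/10/10 fractions, and the seed are assumptions for this example, not values read from the repository.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists in which each label keeps
    roughly the same proportion as in the full dataset."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)

    train, val, test = [], [], []
    for lab in sorted(by_label):          # sorted for reproducibility
        idxs = by_label[lab]
        rng.shuffle(idxs)
        n_train = int(round(fractions[0] * len(idxs)))
        n_val = int(round(fractions[1] * len(idxs)))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]    # remainder goes to test
    return train, val, test
```

The resulting index lists could then be wrapped in samplers or subsets before being handed to a PyTorch `DataLoader`.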
Each cell is annotated with four hierarchical levels:
| Level | Meaning |
|---|---|
| 1 | Stem vs Non-stem |
| 2 | Major lineage (B / NK / T) |
| 3 | Intermediate lineage (CD4 / CD8) |
| 4 | Terminal subtype |
Label metadata and dataset URLs are defined in data_pipeline/src/config.py.
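As a rough illustration of what a 4-level annotation and a frozen label vocabulary might look like, the sketch below represents each cell type as a path of four labels and builds one fixed label-to-index map per level. The label names in the usage note and the helper names `freeze_vocab`, `encode`, and `is_tree_consistent` are hypothetical; the actual taxonomy lives in `data_pipeline/src/config.py`.

```python
def freeze_vocab(label_paths):
    """Build one fixed label -> index map per hierarchy level, so that
    every dataset shard encodes labels identically."""
    n_levels = len(label_paths[0])
    vocabs = []
    for level in range(n_levels):
        names = sorted({path[level] for path in label_paths})
        vocabs.append({name: i for i, name in enumerate(names)})
    return vocabs

def encode(path, vocabs):
    """Turn a label path into one integer index per level."""
    return [vocab[name] for name, vocab in zip(path, vocabs)]

def is_tree_consistent(label_paths):
    """Check that each label below the first level has a single parent."""
    parent = {}
    for path in label_paths:
        for level in range(1, len(path)):
            key = (level, path[level])
            if parent.setdefault(key, path[level - 1]) != path[level - 1]:
                return False
    return True
```

With made-up paths such as `("Non-stem", "T", "CD4", "CD4 naive")` and `("Non-stem", "B", "B", "B naive")`, `freeze_vocab` is computed once over all datasets and then reused, so an index always refers to the same cell type regardless of which shard a cell comes from.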
| Experiment | Description | Notebook |
|---|---|---|
| MixtureVAE | Trains a mixture-prior VAE on the combined PBMC dataset | PBMC_experiments_MixtureVAE.ipynb |
| IndMoMVAE | Independent mixture branches to explore multiple clustering granularities | PBMC_experiments_IndMoMVAE.ipynb |
| MoMixVAE | Joint hierarchical clustering using a mixture-of-mixtures model | PBMC_experiments_MoMixVAE.ipynb |
We evaluate models using complementary metrics:
- IWAE log-likelihood: measures generative modeling quality
- Weighted F1-score: measures clustering quality after Hungarian label alignment
- Adjusted Rand Index (ARI): measures agreement between predicted and true clusters
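For reference, the IWAE log-likelihood estimate for a single cell can be sketched as a log-mean-exp over importance weights. The function name and the way the log-weights are obtained are assumptions here; in practice each log-weight would be log p(x, z_k) - log q(z_k | x) for a latent sample z_k drawn from the encoder.

```python
import math

def iwae_log_likelihood(log_weights):
    """Estimate log p(x) as log( (1/K) * sum_k exp(log_w_k) ),
    computed with the max-subtraction trick for numerical stability."""
    m = max(log_weights)
    log_sum = m + math.log(sum(math.exp(lw - m) for lw in log_weights))
    return log_sum - math.log(len(log_weights))
```

With a single sample the estimate reduces to the standard ELBO sample; with more samples the bound tightens, which is why it is commonly used to compare generative quality.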
On the PBMC experiments, we observe that:
- Student-t latent priors improve generative quality
- MoMixVAE provides the best balance between reconstruction and clustering
- F1-score is more stable than ARI on nonlinear latent manifolds
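Because cluster indices are arbitrary, the F1-score requires mapping predicted clusters to true labels first, whereas ARI is invariant to relabeling. The sketch below finds the best one-to-one mapping by brute force over permutations, which is only practical for a handful of clusters; it stands in for the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`), which is the scalable choice. Function names are illustrative and it is assumed that there are no more predicted clusters than true classes.

```python
from collections import Counter
from itertools import permutations

def align_clusters(y_true, y_pred):
    """Map each predicted cluster id to a distinct true label so that the
    total number of matching cells is maximal (brute-force search)."""
    true_ids = sorted(set(y_true))
    pred_ids = sorted(set(y_pred))
    if len(pred_ids) > len(true_ids):
        raise ValueError("more predicted clusters than true classes")
    overlap = Counter(zip(y_pred, y_true))
    best_map, best_score = None, -1
    for perm in permutations(true_ids, len(pred_ids)):
        score = sum(overlap[(p, t)] for p, t in zip(pred_ids, perm))
        if score > best_score:
            best_map, best_score = dict(zip(pred_ids, perm)), score
    return best_map

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to class support."""
    total = 0.0
    for label, n in Counter(y_true).items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        total += n * (2 * tp / denom if denom else 0.0)
    return total / len(y_true)
```

A typical use is `mapping = align_clusters(y_true, y_pred)` followed by `weighted_f1(y_true, [mapping[p] for p in y_pred])`.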
scVAE_mva2025/
├── mixture_vae/
│ ├── mvae.py # MixtureVAE, IndMoMVAE, MoMixVAE
│ ├── distributions.py # Latent & likelihood distributions
│ ├── training.py # Training protocols
│ ├── utils.py # Metrics & evaluation
│ └── viz.py # Visualization utilities
│
├── data_pipeline/
│ ├── src/
│ │ ├── downloader.py
│ │ ├── load_anndata.py
│ │ ├── combine.py
│ │ ├── dataloader.py
│ │ └── config.py
│
├── PBMC_experiments*.ipynb
├── pyproject.toml
├── uv.lock
└── README.md
Main contributions:
- Full reimplementation of scVAE
- Hierarchical Mixture-of-Mixtures formulation
- Robust and reproducible PBMC data pipeline
- Stable training via KL warm-up and marginal regularization
- Extensive quantitative and qualitative evaluation
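The KL warm-up mentioned above is commonly implemented as a schedule that ramps the KL weight from 0 to its final value over the first training steps. The linear shape, the function name `kl_weight`, and its parameters are assumptions for this sketch; the schedule used in `training.py` may differ.

```python
def kl_weight(step: int, warmup_steps: int, beta_max: float = 1.0) -> float:
    """Linearly increase the KL weight from 0 to beta_max over
    warmup_steps optimization steps, then hold it constant."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)
```

Starting with a small KL weight lets the decoder learn to use the latent code before the prior-matching term becomes strong, which helps avoid early posterior collapse.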
References:
- Grønbech et al., scVAE: Variational Auto-Encoders for Single-Cell Gene Expression Data, Bioinformatics, 2020
- Kingma & Welling, Auto-Encoding Variational Bayes, ICLR 2014
