Variational Auto-Encoders for Single-Cell RNA-seq Data

Reimplementation and Hierarchical Extensions of scVAE (Grønbech et al., 2020)

Paper · Python 3.10+ · PyTorch · License: MIT · MVA: Introduction to Probabilistic Graphical Models

Paper · Original Code · Report · Poster

Course project – Introduction to Probabilistic Graphical Models
Master MVA (ENS Paris-Saclay)

Authors: Raphaël Rubrice · Adam Keddis · Tiffney Aina


Overview

This repository provides a from-scratch reimplementation of scVAE and extends it with hierarchical mixture models for single-cell RNA-seq data.

scRNA-seq data are:

  • high-dimensional,
  • sparse,
  • overdispersed,
  • and biologically hierarchical (lineages → subtypes).

While scVAE models counts using a Gaussian Mixture VAE with a Negative Binomial likelihood, it relies on a flat mixture prior, which cannot explicitly encode biological hierarchies.
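To make the Negative Binomial likelihood concrete, here is a minimal sketch of its log-probability in the mean / inverse-dispersion parameterization, which is one common way to handle overdispersed counts. The function name `nb_log_pmf` and this parameterization are assumptions for illustration; the repository's own likelihoods live in `mixture_vae/distributions.py` and may be parameterized differently.

```python
import math


def nb_log_pmf(k: int, mean: float, dispersion: float) -> float:
    """Log-probability of count k under a Negative Binomial.

    Assumed parameterization (illustrative only): `mean` > 0 is the expected
    count and `dispersion` > 0 is the inverse-dispersion r, so that
    variance = mean + mean**2 / r. Smaller r means more overdispersion.
    """
    r = dispersion
    # log of Gamma(k + r) / (Gamma(r) * k!)
    log_coeff = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return (
        log_coeff
        + r * math.log(r / (r + mean))
        + k * math.log(mean / (r + mean))
    )


# Example: a zero count is far more likely under strong overdispersion
# (small r) than under weak overdispersion (large r) at the same mean.
log_p_zero_overdispersed = nb_log_pmf(0, mean=3.0, dispersion=0.5)
log_p_zero_near_poisson = nb_log_pmf(0, mean=3.0, dispersion=1000.0)
```

With a large inverse-dispersion the distribution approaches a Poisson with the same mean, which is why the second value is close to -3.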

We address this limitation by introducing two extensions:

  • IndMoMVAE – Independent Mixture-of-Mixtures VAE
  • MoMixVAE – Hierarchical Mixture-of-Mixtures VAE

All models are evaluated on PBMC datasets with a curated 4-level cell-type hierarchy.


Implemented Models

1. MixtureVAE (scVAE)

A flexible generalization of scVAE:

  • Arbitrary latent priors (Normal, Student-t)
  • Explicit categorical prior for clustering
  • Modular distributions and training loops
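As an illustration of how a mixture prior with an explicit categorical component supports clustering, the sketch below evaluates the log-density of a latent point under a diagonal-Gaussian mixture and the resulting cluster responsibilities. It uses only the standard library and covers the Normal case only; the function names are hypothetical and do not mirror the API of `mixture_vae/mvae.py`.

```python
import math


def log_normal_diag(z, mu, sigma):
    """Log-density of z under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((zi - m) / s) ** 2
        for zi, m, s in zip(z, mu, sigma)
    )


def _component_log_joint(z, weights, means, sigmas):
    """log pi_k + log N(z; mu_k, sigma_k) for every component k."""
    return [
        math.log(w) + log_normal_diag(z, m, s)
        for w, m, s in zip(weights, means, sigmas)
    ]


def log_mixture_prior(z, weights, means, sigmas):
    """log p(z) = logsumexp_k(log pi_k + log N(z; mu_k, sigma_k))."""
    terms = _component_log_joint(z, weights, means, sigmas)
    top = max(terms)  # subtract the max for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def responsibilities(z, weights, means, sigmas):
    """Posterior probability of each component given z (soft cluster assignment)."""
    terms = _component_log_joint(z, weights, means, sigmas)
    top = max(terms)
    unnormalized = [math.exp(t - top) for t in terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Example: two well-separated components in a 2-D latent space.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
cluster_probs = responsibilities([0.1, -0.2], weights, means, sigmas)
```

Taking the argmax of `cluster_probs` gives a hard cluster assignment for the latent point.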

2. IndMoMVAE (Independent Mixture-of-Mixtures)

  • Multiple independent mixture branches
  • Each branch learns a different partition of the data
  • No hierarchical dependency between levels

This model acts as an ablation baseline to isolate the effect of hierarchy.


3. MoMixVAE (Hierarchical Mixture-of-Mixtures)

  • Explicit hierarchical dependencies between clustering levels
  • Coarse-to-fine latent organization
  • Structured variational posterior
  • Hierarchical ELBO with:
    • β-scaled KL terms
    • marginal-usage regularization to prevent component collapse

This model best reflects biological lineage structure.
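The regularization terms listed above can be sketched as follows for two clustering levels. This is an illustrative reading, not the repository's exact objective: the chain-rule KL for a coarse-then-fine categorical posterior, the uniform target for marginal usage, and the names `hierarchical_kl`, `marginal_usage_penalty` and `regularized_loss` are all assumptions.

```python
import math


def categorical_kl(q, p):
    """KL(q || p) for categorical distributions; assumes p > 0 wherever q > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def hierarchical_kl(q_coarse, q_fine_given_coarse, p_coarse, p_fine_given_coarse):
    """KL between two-level categorical distributions via the chain rule.

    KL(q(c) q(f|c) || p(c) p(f|c))
        = KL(q(c) || p(c)) + sum_c q(c) * KL(q(f|c) || p(f|c))
    """
    kl = categorical_kl(q_coarse, p_coarse)
    for c, qc in enumerate(q_coarse):
        kl += qc * categorical_kl(q_fine_given_coarse[c], p_fine_given_coarse[c])
    return kl


def marginal_usage_penalty(batch_q):
    """KL between the batch-averaged assignment and a uniform distribution.

    It is zero when all components are used equally on average and reaches
    log(K) when every cell collapses onto a single component.
    """
    n, k = len(batch_q), len(batch_q[0])
    marginal = [sum(q[j] for q in batch_q) / n for j in range(k)]
    return categorical_kl(marginal, [1.0 / k] * k)


def regularized_loss(recon_nll, kl_terms, beta, usage_penalties, usage_weight):
    """Negative ELBO with beta-scaled KL terms plus a usage regularizer."""
    return recon_nll + beta * sum(kl_terms) + usage_weight * sum(usage_penalties)


# Example: a collapsed batch is penalized, a balanced one is not.
collapsed = marginal_usage_penalty([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
balanced = marginal_usage_penalty([[1.0, 0.0], [0.0, 1.0]])
```

Penalizing the batch-level marginal rather than each cell's posterior leaves individual assignments free to be confident while still discouraging unused components.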

Motivation and Solutions

To better motivate the proposed extensions, the table below summarizes the main limitations of a flat mixture prior and the corresponding solutions implemented in this project:

| Limitation | Proposed solution |
|---|---|
| Standard Gaussian prior does not allow cell-type clustering | MixtureVAE replaces the Gaussian prior with a mixture distribution to learn latent clusters |
| No hierarchical structure between clustering levels | IndMoMVAE trains multiple independent mixture branches to explore different clustering granularities |
| No joint modeling of hierarchical levels | MoMixVAE uses a mixture-of-mixtures formulation to jointly learn clusters at multiple hierarchy levels |

Figure: Graphical models of MixtureVAE, IndMoMVAE and MoMixVAE.


Installation

We use uv, a modern and fast Python package manager.

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/raphaelrubrice/scVAE_mva2025.git
cd scVAE_mva2025

# Create and activate environment
uv venv
source .venv/bin/activate
uv sync

Data Pipeline (PBMC)

The full PBMC processing pipeline is implemented in data_pipeline/ and is fully reproducible.

Pipeline stages

  1. Download raw PBMC datasets from 10x Genomics
  2. Load raw 10x matrices into AnnData objects
  3. Annotate cells with a curated 4-level hierarchical cell-type taxonomy
  4. Freeze label vocabularies across datasets
  5. Shard each dataset as .h5ad
  6. Combine shards into a unified AnnCollection
  7. Build stratified train / validation / test PyTorch DataLoaders

The pipeline is compatible with local execution and Google Colab.
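The stratified split in the last pipeline stage can be sketched as an index partition that preserves label proportions in each subset. The 80/10/10 fractions, the seed and the function name `stratified_split` are assumptions for illustration; the repository builds PyTorch DataLoaders in `data_pipeline/src/dataloader.py`, which this sketch does not reproduce.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train / validation / test per label.

    Each label's indices are shuffled and cut separately, so every subset
    keeps roughly the same label proportions as the full dataset.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):  # sorted for a reproducible order
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = int(round(fractions[0] * len(indices)))
        n_val = int(round(fractions[1] * len(indices)))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Example with the three major lineages as labels.
labels = ["B"] * 50 + ["T"] * 30 + ["NK"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
```

The resulting index lists could then be wrapped in samplers or subsets before being passed to a DataLoader.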

Hierarchical labels

Each cell is annotated with four hierarchical levels:

| Level | Meaning |
|---|---|
| 1 | Stem vs Non-stem |
| 2 | Major lineage (B / NK / T) |
| 3 | Intermediate lineage (CD4 / CD8) |
| 4 | Terminal subtype |

Label metadata and dataset URLs are defined in data_pipeline/src/config.py.
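To illustrate why label vocabularies are frozen across datasets, the sketch below builds one name-to-integer mapping per hierarchy level from the union of all datasets, so that a given label always receives the same code. The level-4 names (`subtype_a`, `subtype_b`, `subtype_c`) are placeholders rather than the project's taxonomy, and the function names are hypothetical.

```python
def freeze_vocabularies(datasets, n_levels=4):
    """Build one name -> integer mapping per hierarchy level.

    `datasets` is a list of datasets, each a list of per-cell label tuples
    (level 1, level 2, level 3, level 4). Vocabularies are built from the
    union of all datasets and sorted, so the codes do not depend on which
    dataset is loaded first.
    """
    vocabs = []
    for level in range(n_levels):
        names = sorted({cell[level] for cells in datasets for cell in cells})
        vocabs.append({name: code for code, name in enumerate(names)})
    return vocabs


def encode(cell, vocabs):
    """Map a tuple of label names to a tuple of integer codes."""
    return tuple(vocab[name] for name, vocab in zip(cell, vocabs))


# Placeholder annotations for two datasets (level-4 names are made up).
dataset_a = [
    ("Non-stem", "T", "CD4", "subtype_a"),
    ("Non-stem", "T", "CD8", "subtype_b"),
]
dataset_b = [("Non-stem", "T", "CD4", "subtype_c")]

vocabs = freeze_vocabularies([dataset_a, dataset_b])
codes = encode(dataset_b[0], vocabs)
```

Because the vocabularies are computed once over all datasets, shards written separately still agree on the meaning of each integer label.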

Experiment Summary

| Experiment | Description | Notebook / Script | Run on Colab |
|---|---|---|---|
| MixtureVAE | Trains a mixture-prior VAE on the combined PBMC dataset | PBMC_experiments_MixtureVAE.ipynb | Open In Colab |
| IndMoMVAE | Independent mixture branches to explore multiple clustering granularities | PBMC_experiments_IndMoMVAE.ipynb | Open In Colab |
| MoMixVAE | Joint hierarchical clustering using a mixture-of-mixtures model | PBMC_experiments_MoMixVAE.ipynb | Open In Colab |

Evaluation Metrics

We evaluate models using complementary metrics:

  • IWAE log-likelihood
    Measures generative modeling quality

  • Weighted F1-score
    Measures clustering quality after Hungarian label alignment

  • Adjusted Rand Index (ARI)
    Measures agreement between predicted and true clusters
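As a rough sketch of two of these metrics, the code below computes the IWAE bound from importance log-weights and a weighted F1-score after aligning cluster ids to class ids. The alignment here is a brute-force search over permutations, used as a stand-in for the Hungarian algorithm and only practical for a handful of classes. It also assumes the number of clusters equals the number of classes. Function names are hypothetical and do not mirror `mixture_vae/utils.py`.

```python
import math
from itertools import permutations


def iwae_bound(log_weights):
    """log((1/K) * sum_k w_k) from K importance log-weights.

    Each log-weight is log p(x, z_k) - log q(z_k | x) for a sample z_k
    drawn from q(z | x). With K = 1 this reduces to a single-sample ELBO.
    """
    top = max(log_weights)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(lw - top) for lw in log_weights))
    return log_sum - math.log(len(log_weights))


def align_clusters(true, pred, n_classes):
    """Find the cluster -> class mapping with the most matches (brute force)."""
    best_perm, best_hits = None, -1
    for perm in permutations(range(n_classes)):
        hits = sum(1 for t, p in zip(true, pred) if perm[p] == t)
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return best_perm


def weighted_f1(true, pred, n_classes):
    """Per-class F1 averaged with weights proportional to class support."""
    total = 0.0
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += (tp + fn) * f1
    return total / len(true)


# Example: clusters are correct up to a relabeling of their ids.
true = [0, 0, 1, 1, 2, 2]
pred = [2, 2, 0, 0, 1, 1]
mapping = align_clusters(true, pred, n_classes=3)
aligned = [mapping[p] for p in pred]
score = weighted_f1(true, aligned, n_classes=3)
```

Without the alignment step the same predictions would score zero, since cluster ids are arbitrary; this is why alignment precedes the F1 computation.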

Observations

  • Student-t latent priors improve generative quality
  • MoMixVAE provides the best balance between reconstruction and clustering
  • F1-score is more stable than ARI on nonlinear latent manifolds

Project Structure

scVAE_mva2025/
├── mixture_vae/
│   ├── mvae.py              # MixtureVAE, IndMoMVAE, MoMixVAE
│   ├── distributions.py    # Latent & likelihood distributions
│   ├── training.py         # Training protocols
│   ├── utils.py            # Metrics & evaluation
│   └── viz.py              # Visualization utilities
│
├── data_pipeline/
│   ├── src/
│   │   ├── downloader.py
│   │   ├── load_anndata.py
│   │   ├── combine.py
│   │   ├── dataloader.py
│   │   └── config.py
│
├── PBMC_experiments*.ipynb
├── pyproject.toml
├── uv.lock
└── README.md

Key Contributions

  • Full reimplementation of scVAE
  • Hierarchical Mixture-of-Mixtures formulation
  • Robust and reproducible PBMC data pipeline
  • Stable training via KL warm-up and marginal regularization
  • Extensive quantitative and qualitative evaluation
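For readers unfamiliar with KL warm-up, a minimal sketch follows: the KL weight β grows from 0 to its final value over the first epochs, so that the model learns to reconstruct before the prior is fully enforced. The linear shape, the name `kl_warmup_beta` and its parameters are assumptions; the schedule actually used is defined in `mixture_vae/training.py`.

```python
def kl_warmup_beta(epoch: int, warmup_epochs: int, beta_max: float = 1.0) -> float:
    """Linearly increase the KL weight from 0 to beta_max, then hold it."""
    if warmup_epochs <= 0:
        return beta_max  # no warm-up requested
    return beta_max * min(1.0, epoch / warmup_epochs)


# Example: KL weights over 15 epochs with a 10-epoch warm-up.
schedule = [kl_warmup_beta(epoch, warmup_epochs=10) for epoch in range(15)]
```

The returned value would multiply the KL terms of the ELBO at each epoch.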

References

  • Grønbech et al., scVAE: Variational Auto-Encoders for Single-Cell Gene Expression Data, Bioinformatics, 2020
  • Kingma & Welling, Auto-Encoding Variational Bayes, ICLR 2014

About

GitHub repository for the project on VAEs for single-cell data, carried out for the "Introduction to Probabilistic Graphical Models and Deep Generative Models" course of the MVA.
