Cross-view Latent Integration via Nonparametric Gamma Shrinkage Factor Analysis - an unsupervised multi-view Bayesian factor model for integrating heterogeneous datasets with automatic factor selection.
Available soon.
See the testing notebooks in the methods folder for detailed examples.
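The quick-start snippet that follows expects `views` to be a list of NumPy arrays that share the same N rows (samples) and may differ in their number of columns D_m. A minimal sketch of preparing such a list is shown here. The random stand-in data and the `check_views` helper are hypothetical and not part of CLING; whether `ClingFA` tolerates missing values is not stated in this README, so the finite-value check is a conservative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 100 shared samples, three views with
# different numbers of features. Replace with your own arrays.
n_samples = 100
view1_data = rng.normal(size=(n_samples, 50))
view2_data = rng.normal(size=(n_samples, 30))
view3_data = rng.normal(size=(n_samples, 20))


def check_views(views):
    """Hypothetical helper: validate a list of N x D_m arrays before fitting."""
    views = [np.asarray(v, dtype=float) for v in views]
    if any(v.ndim != 2 for v in views):
        raise ValueError("each view must be a 2-D array of shape (N, D_m)")
    if len({v.shape[0] for v in views}) != 1:
        raise ValueError("all views must have the same number of rows (samples)")
    # Assumption: reject NaN/inf rather than rely on undocumented handling.
    if not all(np.isfinite(v).all() for v in views):
        raise ValueError("views contain NaN or infinite values")
    return views


views = check_views([view1_data, view2_data, view3_data])
print([v.shape for v in views])
```

Rows must be aligned across views, so that row i of every array refers to the same sample.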
from CLING.cling import ClingFA
# Load your multi-view data as list of numpy arrays
# views = [view1_data, view2_data, view3_data] # each is N x D_m
# Initialize CLING with automatic K selection
model = ClingFA.from_numpy_views(
    views=views,
    K=None,  # automatically determined
    center=True,
    init_mode="pca",
)
# Fit the model with automatic factor discovery
elbos, K_history, egamma_history = model.fit(
    max_iter=1000,
    prune_every=50,
    add_every=50,
    verbose=True,
)
# Extract results
factors = model.get_factors() # N x K latent factors
weights = model.get_weights() # list of D_m x K loading matrices
variance_explained = model.variance_explained_per_view()
Folder CLING contains the implementation of CLING and ablation variants:
- cling.py: Main CLING model with variational inference and automatic factor selection
- cling_ablation1.py: Ablation with single-Gamma prior (CLING_MGP)
- cling_ablation2.py: Ablation with ARD-style shrinkage (CLING_ARD)
Folder methods contains baseline implementations and testing notebooks:
- internal_models_cling.py: Wrapper functions for running CLING
- external_models_*.py: Baseline implementations (MOFA, MuVi, PCA, Tucker)
- test_cling_functions.ipynb: Example notebook for CLING usage
- test_*_functions.ipynb: Testing notebooks for baseline methods
See the simulations folder.
We provide code to generate synthetic multi-view data with known ground truth factors (simulations/cling_sparsity_sim.py), varying:
- Number of factors (K ∈ {1, 5, 10, 15, 20, 25})
- Noise levels (σ ∈ {0.5, 0.75, ..., 2.0})
- Sparsity levels (1-θ ∈ {0.65, 0.70, ..., 0.85})
The simulations/scripts folder contains code to run experiments across all scenarios and methods. Results and figures are saved in simulations/results and simulations/figures. The notebook simulations/plots.ipynb generates performance comparison plots.
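The simulation design described above can be illustrated with a small generator. This is a minimal sketch under stated assumptions, not the code in simulations/cling_sparsity_sim.py: it assumes a linear Gaussian factor model, treats θ as the probability that a loading is non-zero (so the expected sparsity is 1-θ), and uses a hypothetical function name and default sizes.

```python
import numpy as np


def simulate_views(n_samples=200, dims=(100, 80), n_factors=5,
                   sigma=1.0, theta=0.25, seed=0):
    """Hypothetical generator: X_m = Z W_m^T + sigma * noise for each view m."""
    rng = np.random.default_rng(seed)
    # Shared latent factors, one row per sample.
    Z = rng.normal(size=(n_samples, n_factors))
    views, weights = [], []
    for d in dims:
        # Assumption: each loading is kept with probability theta,
        # so a fraction of about 1 - theta of the loadings is exactly zero.
        mask = rng.random((d, n_factors)) < theta
        W = rng.normal(size=(d, n_factors)) * mask
        X = Z @ W.T + sigma * rng.normal(size=(n_samples, d))
        views.append(X)
        weights.append(W)
    return views, Z, weights


views, Z_true, W_true = simulate_views(n_factors=5, sigma=0.5, theta=0.25)
zero_fraction = np.mean(np.concatenate([W.ravel() for W in W_true]) == 0)
print([X.shape for X in views], Z_true.shape, round(float(zero_fraction), 2))
```

Sweeping `n_factors`, `sigma`, and `theta` over the grids listed above would reproduce the shape of the experimental design, though not necessarily its exact data-generating details.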
The paper evaluates CLING on three biological datasets. The datasets used are freely available for download:
- Evo-Devo: Developmental bulk RNA-seq across species and organs
  - Reference: Cardoso-Moreira et al., Nature, 2019
  - 5 views defined by organs (brain, cerebellum, heart, liver, testis) across 5 species
  - Download: MEFISTO study repository
- scNMT-seq: Single-cell multi-omics during mouse gastrulation
  - Reference: Argelaguet et al., Nature, 2019
  - 3 views: RNA expression, DNA methylation, chromatin accessibility
  - 1,518 single cells across developmental stages E6.5 and E7.5
  - Download: GEO accession GSE121708
- GBM: The Cancer Genome Atlas Glioblastoma Multiforme dataset
  - Reference: Brennan et al., Cell, 2013
  - 2 views: gene expression and DNA methylation
  - 360 patients
  - Download: TCGA portal or cBioPortal
In the paper's experiments, CLING consistently matches or outperforms established baselines (MOFA, MuVi, PCA, Tucker) across all evaluation metrics:
- Factor Recovery: Accurately infers the true number of latent factors
- Latent Structure: Highest Spearman correlation between true and inferred factors
- Feature Selection: Best Jaccard index for identifying important features
- Variance Explained: Comparable or superior to competing methods
- Downstream Tasks: Superior predictive performance for biological covariates
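Three of the metrics listed above (Spearman correlation for latent structure, Jaccard index for feature selection, and variance explained) can be sketched in plain NumPy. The function names, the matching rule (best absolute correlation per true factor), and the top-k rule for selecting inferred features are assumptions made for illustration; the paper's exact definitions may differ.

```python
import numpy as np


def _ranks(x):
    # Column-wise ranks; ties are not averaged, which is fine for continuous data.
    return np.argsort(np.argsort(x, axis=0), axis=0).astype(float)


def factor_recovery(Z_true, Z_hat):
    """Mean, over true factors, of the best |Spearman| with any inferred factor."""
    k = Z_true.shape[1]
    # Spearman correlation is the Pearson correlation of the ranks.
    R = np.corrcoef(_ranks(Z_true), _ranks(Z_hat), rowvar=False)[:k, k:]
    return float(np.abs(R).max(axis=1).mean())


def jaccard_index(w_true, w_hat, tol=1e-8):
    """Jaccard index between true non-zero features and the top-|w_hat| features."""
    true_set = set(np.flatnonzero(np.abs(w_true) > tol).tolist())
    # Assumption: select as many inferred features as there are true ones.
    hat_set = set(np.argsort(-np.abs(w_hat))[:len(true_set)].tolist())
    union = true_set | hat_set
    return len(true_set & hat_set) / len(union) if union else 1.0


def variance_explained(X, Z, W):
    """R^2 of the reconstruction Z W^T; assumes X is already column-centered."""
    resid = X - Z @ W.T
    return float(1.0 - (resid ** 2).sum() / (X ** 2).sum())


rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 3))
Z -= Z.mean(axis=0)
W = rng.normal(size=(10, 3))

# Factors are identifiable only up to order and sign, hence the absolute value.
print(factor_recovery(Z, -Z[:, ::-1]))
print(jaccard_index(np.array([1.0, 0.0, 2.0, 0.0]), np.array([0.9, 0.5, 0.01, 0.0])))
print(variance_explained(Z @ W.T, Z, W))
```

Taking the per-factor maximum does not enforce a one-to-one matching between true and inferred factors; a stricter variant would solve an assignment problem on the same correlation matrix.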
On GBM data, CLING identifies 54 latent factors associated with:
- Gene Expression Subtypes (Proneural, Neural, Classical, Mesenchymal, G-CIMP)
- Methylation Status (C1-C6 categories)
- Clinical Variables (MGMT status, disease-free status, patient age)
Gene set enrichment analysis reveals factors linked to:
- DNA repair pathways
- G2-M checkpoint regulation
- E2F and MYC target genes
- Proliferative programs in tumor subtypes