szczurek-lab/CLING

CLING

Cross-view Latent Integration via Nonparametric Gamma Shrinkage Factor Analysis - an unsupervised multi-view Bayesian factor model for integrating heterogeneous datasets with automatic factor selection.

Paper | Example | Data

Paper

Available soon.

Basic Usage

See the testing notebooks in the methods folder for detailed examples.

from CLING.cling import ClingFA

# Load your multi-view data as a list of NumPy arrays, one per view.
# Each array has shape N x D_m: the same N samples in every view, D_m features in view m.
# views = [view1_data, view2_data, view3_data]

# Initialize CLING with automatic K selection
model = ClingFA.from_numpy_views(
    views=views,
    K=None,  # automatically determined
    center=True,
    init_mode="pca"
)

# Fit the model with automatic factor discovery
elbos, K_history, egamma_history = model.fit(
    max_iter=1000,
    prune_every=50,
    add_every=50,
    verbose=True
)

# Extract results
factors = model.get_factors()  # N x K latent factors
weights = model.get_weights()  # list of D_m x K loading matrices
variance_explained = model.variance_explained_per_view()
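
The shape conventions used above can be sanity-checked without the library. The following is a minimal NumPy sketch and is not part of CLING: it builds toy views in the N x D_m layout, validates them, and computes a per-view R² from an N x K factor matrix and a list of D_m x K loading matrices. The helper names `check_views` and `r2_per_view` are hypothetical, and the R² definition is an assumption for illustration; `model.variance_explained_per_view()` may use a different formula.

```python
import numpy as np

def check_views(views):
    """Validate that every view is a 2-D array and that all views share N rows."""
    if any(v.ndim != 2 for v in views):
        raise ValueError("each view must be a 2-D array of shape N x D_m")
    n_samples = {v.shape[0] for v in views}
    if len(n_samples) != 1:
        raise ValueError("all views must share the same number of samples N")
    return n_samples.pop(), [v.shape[1] for v in views]

def r2_per_view(views, factors, weights):
    """Fraction of each centered view's variance reconstructed by factors @ W.T.

    Assumed definition (hypothetical, for illustration only):
    R2_m = 1 - ||Y_m - Z W_m^T||^2 / ||Y_m||^2 after column-centering Y_m and Z.
    """
    Zc = factors - factors.mean(axis=0)
    scores = []
    for Y, W in zip(views, weights):
        Yc = Y - Y.mean(axis=0)
        resid = Yc - Zc @ W.T
        scores.append(1.0 - np.sum(resid ** 2) / np.sum(Yc ** 2))
    return scores

# Toy data: 3 views sharing N = 100 samples and K = 3 latent factors.
rng = np.random.default_rng(0)
N, K, dims = 100, 3, [20, 30, 15]
Z = rng.standard_normal((N, K))
W_true = [rng.standard_normal((d, K)) for d in dims]
views = [Z @ W.T + 0.1 * rng.standard_normal((N, d)) for W, d in zip(W_true, dims)]

n, d_list = check_views(views)
r2 = r2_per_view(views, Z, W_true)
print(n, d_list, [round(s, 3) for s in r2])
```

With real output, `factors` and `weights` from the model would take the place of `Z` and `W_true`.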

CLING Code

The CLING folder contains the implementation of CLING and its ablation variants:

  • cling.py: Main CLING model with variational inference and automatic factor selection
  • cling_ablation1.py: Ablation with single-Gamma prior (CLING_MGP)
  • cling_ablation2.py: Ablation with ARD-style shrinkage (CLING_ARD)

The methods folder contains wrappers for CLING, baseline implementations, and testing notebooks:

  • internal_models_cling.py: Wrapper functions for running CLING
  • external_models_*.py: Baseline implementations (MOFA, MuVi, PCA, Tucker)
  • test_cling_functions.ipynb: Example notebook for CLING usage
  • test_*_functions.ipynb: Testing notebooks for baseline methods

Simulations

See the simulations folder.

The script simulations/cling_sparsity_sim.py generates synthetic multi-view data with known ground-truth factors, varying:

  • Number of factors (K ∈ {1, 5, 10, 15, 20, 25})
  • Noise levels (σ ∈ {0.5, 0.75, ..., 2.0})
  • Sparsity levels (1-θ ∈ {0.65, 0.70, ..., 0.85})

The simulations/scripts folder contains code to run experiments across all scenarios and methods. Results and figures are saved in simulations/results and simulations/figures. The notebook simulations/plots.ipynb generates performance comparison plots.
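
To make the simulation design concrete, here is a minimal sketch of a sparse linear factor simulation and of the scenario grid. It is not the code in simulations/cling_sparsity_sim.py. The function `simulate_views`, the Gaussian generative model, the reading of sparsity as the fraction of zero loadings, the grid step sizes (0.25 for σ, 0.05 for sparsity), and the full crossing of the three grids are all assumptions made for illustration.

```python
import itertools
import numpy as np

def simulate_views(n_samples, dims, n_factors, sigma, sparsity, seed=0):
    """Draw views Y_m = Z W_m^T + noise with sparse Gaussian loadings (hypothetical model)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_samples, n_factors))  # shared ground-truth factors
    views, weights = [], []
    for d in dims:
        W = rng.standard_normal((d, n_factors))
        # Zero out roughly a `sparsity` fraction of the loadings.
        W = W * (rng.random((d, n_factors)) >= sparsity)
        Y = Z @ W.T + sigma * rng.standard_normal((n_samples, d))
        views.append(Y)
        weights.append(W)
    return views, Z, weights

# Parameter grids as listed above; step sizes are assumed from the first two values.
K_grid = [1, 5, 10, 15, 20, 25]
sigma_grid = np.linspace(0.5, 2.0, 7)         # 0.5, 0.75, ..., 2.0
sparsity_grid = np.linspace(0.65, 0.85, 5)    # 0.65, 0.70, ..., 0.85
scenarios = list(itertools.product(K_grid, sigma_grid, sparsity_grid))

views, Z_true, W_true = simulate_views(
    n_samples=200, dims=[50, 80], n_factors=5, sigma=1.0, sparsity=0.75
)
zero_fraction = float(np.mean(np.concatenate([W.ravel() for W in W_true]) == 0))
print(len(scenarios), [v.shape for v in views], round(zero_fraction, 2))
```

Keeping the ground-truth `Z_true` and `W_true` alongside the views is what allows factor recovery and feature selection to be scored afterwards.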

Benchmarks and Real-World Datasets

The paper evaluates CLING on three biological datasets, all freely available for download:

  • Evo-Devo: Developmental bulk RNA-seq across species and organs

    • Reference: Cardoso-Moreira et al., Nature, 2019
    • 5 views defined by organs (brain, cerebellum, heart, liver, testis) across 5 species
    • Download: MEFISTO study repository

  • scNMT-seq: Single-cell multi-omics during mouse gastrulation

    • Reference: Argelaguet et al., Nature, 2019
    • 3 views: RNA expression, DNA methylation, chromatin accessibility
    • 1,518 single cells across developmental stages E6.5 and E7.5
    • Download: GEO accession GSE121708
  • GBM: The Cancer Genome Atlas Glioblastoma Multiforme dataset

    • Reference: Brennan et al., Cell, 2013
    • 2 views: gene expression and DNA methylation
    • 360 patients
    • Download: TCGA portal or cBioPortal

Performance

Across the simulations and benchmarks above, CLING matches or outperforms the established baselines (MOFA, MuVi, PCA, Tucker) on every evaluation metric:

  • Factor Recovery: Accurately infers the true number of latent factors
  • Latent Structure: Highest Spearman correlation between true and inferred factors
  • Feature Selection: Best Jaccard index for identifying important features
  • Variance Explained: Comparable or superior to competing methods
  • Downstream Tasks: Superior predictive performance for biological covariates
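
Two of the metrics above, the Spearman correlation between true and inferred factors and the Jaccard index for feature selection, can be sketched in a few lines of NumPy. This is an illustrative sketch, not the evaluation code used for the paper: the function names are hypothetical, and matching each true factor to its best inferred factor by absolute correlation is an assumption. Absolute values are used because latent factors are identifiable only up to sign and ordering.

```python
import numpy as np

def spearman_matrix(true_factors, inferred_factors):
    """Spearman correlations (rows: true, columns: inferred), as Pearson on ranks.

    Ranks come from a double argsort, so this assumes no tied values per column.
    """
    def ranks(X):
        return np.argsort(np.argsort(X, axis=0), axis=0).astype(float)

    Rt, Ri = ranks(true_factors), ranks(inferred_factors)
    Rt -= Rt.mean(axis=0)
    Ri -= Ri.mean(axis=0)
    Rt /= np.linalg.norm(Rt, axis=0)
    Ri /= np.linalg.norm(Ri, axis=0)
    return Rt.T @ Ri

def best_match_score(true_factors, inferred_factors):
    """Mean over true factors of the best |Spearman| with any inferred factor."""
    rho = np.abs(spearman_matrix(true_factors, inferred_factors))
    return float(rho.max(axis=1).mean())

def jaccard(true_support, inferred_support):
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of feature indices."""
    a, b = set(true_support), set(inferred_support)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy check: inferred factors are a permuted, sign-flipped, noisy copy of the truth.
rng = np.random.default_rng(0)
Z_true = rng.standard_normal((100, 3))
Z_hat = -Z_true[:, [2, 0, 1]] + 0.05 * rng.standard_normal((100, 3))

score = best_match_score(Z_true, Z_hat)
overlap = jaccard([0, 1, 2, 3], [2, 3, 4])  # 2 shared features out of 5 in the union
print(round(score, 3), overlap)
```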

Biological Interpretation

On GBM data, CLING identifies 54 latent factors associated with:

  • Gene Expression Subtypes (Proneural, Neural, Classical, Mesenchymal, G-CIMP)
  • Methylation Status (C1-C6 categories)
  • Clinical Variables (MGMT status, disease-free status, patient age)

Gene set enrichment analysis reveals factors linked to:

  • DNA repair pathways
  • G2-M checkpoint regulation
  • E2F and MYC target genes
  • Proliferative programs in tumor subtypes
