Skip to content

Latest commit

 

History

History
188 lines (144 loc) · 8.7 KB

File metadata and controls

188 lines (144 loc) · 8.7 KB

PRISM — Protein Regulation and Interaction-based Score Modelling

What makes a plasma protein genetically predictable — and why do some escape local genetic control?

PRISM is a nine-step R analysis pipeline that models which genomic, regulatory, and network features determine how well plasma protein abundance can be predicted from genetic variants (pQTL R²) across 1,359 human proteins. The pipeline covers feature engineering, regression and random forest modelling, STRING network analysis, mixed models, and pathway enrichment, with all downstream analyses performed on residualized outcomes that remove technical confounders.


Repository structure

PRISM/
├── R/
│   ├── run_all.R                  # Orchestrator — runs all steps in sequence
│   ├── step01_load_and_clean.R    # Data loading, merging, transforms, residualization
│   ├── step02_eda.R               # EDA and distribution plots
│   ├── step03_feature_matrix.R    # Feature groups (5 categories, 19 features), VIF checks
│   ├── step04_regression.R        # Incremental OLS + elastic net (cis / trans / genome-wide)
│   ├── step05_rf_ppi.R            # Random Forest variable importance + PPI stratification
│   ├── step06_visualisation.R     # PCA, observed vs predicted scatter plots
│   ├── step07_string_network.R    # STRING network, Louvain modules, RWR, decile analysis
│   ├── step08_mixed_model.R       # Mixed model, module permutation tests, ICC
│   ├── step09_enrichment.R        # Protein classification, GSEA/ORA, module enrichment
│   ├── step_model_panel.R         # Combined 3×3 model comparison figure (OLS/EN/RF)
│   ├── step_module_figure.R       # Combined module heatmap + enrichment figure
│   └── step_enrichment_panel.R    # Combined enrichment panel figure
├── data/                          # Raw input files (committed) + generated intermediates (gitignored)
├── figs/                          # Generated figures, organised by step (gitignored)
├── outputs/                       # Generated result CSVs (gitignored)
├── viewer/                        # FastAPI + Plotly interactive results viewer
├── logs/                          # Pipeline run logs (gitignored)
└── prism_clean_column_dictionary.md

Pipeline overview

Step Script Key outputs
01 step01_load_and_clean.R data/prism_clean.csv — merged, cleaned, INT-transformed, with residualized outcomes
02 step02_eda.R Distribution plots, cis vs trans scatter, correlation heatmap
03 step03_feature_matrix.R data/feature_matrix.csv — 19 features across 5 groups, VIF < 10
04 step04_regression.R Incremental OLS ΔR² by feature group + elastic net coefficients
05 step05_rf_ppi.R RF variable importance (OOB R²: cis=0.71, trans=0.70, gw=0.65) + PPI stratification
06 step06_visualisation.R PCA of feature space, observed vs predicted plots
07 step07_string_network.R STRING network (threshold=700), 225 Louvain modules, RWR, decile subnetwork analysis
08 step08_mixed_model.R Mixed model ICC, permutation tests (n=10,000 per module)
09 step09_enrichment.R Protein classification, GSEA/ORA (4 residualized outcomes), per-module GO:BP enrichment

Outcome variables

Variable Description
R2_cis Predictive R² from cis-only pQTL model (local genetic architecture)
R2_trans Additional R² gained by adding trans SNPs beyond cis
R2_genome_wide Predictive R² from full genome-wide SNP model
R2_ratio R²_trans / R²_cis (proteins with R²_cis > 0.001 only)

Residualized outcomes (computed in step01, used in steps 07–09):

Technical confounders (R2_cov, prot_expr, model_sparsity) are regressed out and residuals are z-scored. trans_resid additionally removes the contribution of R2_cis_int.

Residual Confounder model
cis_resid R2_cis_int ~ R2_cov + prot_expr + model_sparsity
trans_resid R2_trans_int ~ R2_cis_int + R2_cov + prot_expr + model_sparsity
gw_resid R2_genome_wide_int ~ R2_cov + prot_expr + model_sparsity
ratio_resid R2_ratio_int ~ R2_cov + prot_expr + model_sparsity

Required input data

File Size Description
data/Full_model.csv ~650 KB Feature table (genomic, regulatory, constraint, pQTL per protein)
data/R2_data.csv ~190 KB Outcome table (R²_cis, R²_trans, R²_genome_wide per protein)
data/9606.protein.links.v12.0.txt.gz 80 MB STRING v12 interaction network — download
data/9606.protein.info.v12.0.txt.gz 1.9 MB STRING v12 protein metadata — download
data/9606.protein.enrichment.terms.v12.0.txt.gz 38 MB STRING v12 functional annotations — download

Note: Raw column naming in source CSVs is non-standard — GeneSymbol contains Ensembl gene IDs and gene contains HGNC symbols. Step 01 renames these correctly.


Installation

R dependencies

# CRAN
install.packages(c(
  "ggplot2", "ggrepel", "ggridges",   # visualisation
  "glmnet",                            # elastic net
  "randomForest",                      # random forest
  "lme4",                              # mixed models
  "igraph",                            # network analysis
  "pheatmap",                          # heatmaps
  "jsonlite",                          # JSON export
  "dplyr", "readr", "scales",          # data manipulation
  "gridExtra", "grid",                 # figure layout
  "msigdbr"                            # MSigDB gene sets (step 09)
))

# Bioconductor (step 09)
if (!requireNamespace("BiocManager")) install.packages("BiocManager")
BiocManager::install(c("clusterProfiler", "org.Hs.eg.db", "enrichplot", "fgsea"))

Interactive viewer (optional)

python3 -m venv .viewer-venv
source .viewer-venv/bin/activate
pip install -r viewer/requirements.txt

Running the pipeline

# Full pipeline (~10–15 min including GSEA)
Rscript R/run_all.R 2>&1 | tee logs/pipeline_run_$(date +%Y%m%d).log

# Individual steps
Rscript R/step01_load_and_clean.R   # must run first
Rscript R/step07_string_network.R   # requires STRING files
Rscript R/step09_enrichment.R       # requires steps 01 + 07 + 08

# Figure panels (run after full pipeline)
Rscript R/step_model_panel.R
Rscript R/step_module_figure.R
Rscript R/step_enrichment_panel.R

Environment variable overrides

Variable Default Description
PRISM_DATA_DIR ./data Location of raw input CSVs and STRING files
PRISM_OUTPUT_DIR ./data Intermediate R outputs
PRISM_RESULTS_DIR ./outputs Model result CSVs
PRISM_FIGS_DIR ./figs Base figures directory

Interactive results viewer

A read-only FastAPI + Plotly dashboard for exploring pipeline outputs.

source .viewer-venv/bin/activate
uvicorn viewer.app:app --host 127.0.0.1 --port 8000

Access via SSH tunnel from your local machine:

ssh -N -L 8000:127.0.0.1:8000 <cluster-login>
# then open http://127.0.0.1:8000/

Key findings

  • pQTL architecture dominates R² prediction across all three outcomes (largest ΔR² of any feature group)
  • Signalling and immune proteins have significantly higher trans_resid (β=+0.69, FDR=2.4×10⁻¹²), while intracellular/mitochondrial proteins have lower trans_resid
  • GSEA reveals pathway-level cis escape: all 107 significant gene sets for cis_resid show negative NES — hormonal response, hemostasis, and platelet activation pathway proteins are uniformly less locally heritable than expected
  • Trans-driven proteins are STRING network hubs: high R²_trans decile proteins form denser subnetworks (D10 mean degree=2.0 vs D1=0.5)
  • Module ICC=0.011 after residualization (down from 0.022 on raw R²) — technical confounders explained ~half the apparent module clustering signal

Pipeline verification

Step Expected
01 1,359 proteins; residualized outcomes: cis adj.R²=0.091, trans adj.R²=0.559, gw adj.R²=0.291
03 All 19 features with VIF < 10
05 RF OOB R²: cis=0.712, trans=0.702, gw=0.650
07 1,357/1,359 proteins matched to STRING; 225 Louvain modules; RWR converges at ~37 iterations
08 ICC=0.011; ≥1 significant module (q_BH<0.05)
09 107 significant GSEA sets for cis_resid; 15 modules biologically labelled