PRISM — Protein Regulation and Interaction-based Score Modelling

What makes a plasma protein genetically predictable — and why do some escape local genetic control?

PRISM is a nine-step R analysis pipeline that models which genomic, regulatory, and network features determine how well plasma protein abundance can be predicted from genetic variants (pQTL R²) across 1,359 human proteins. The pipeline covers feature engineering, regression and random forest modelling, STRING network analysis, mixed models, and pathway enrichment, with all downstream analyses performed on residualized outcomes that remove technical confounders.

Repository structure

PRISM/
├── R/
│   ├── run_all.R                  # Orchestrator — runs all steps in sequence
│   ├── step01_load_and_clean.R    # Data loading, merging, transforms, residualization
│   ├── step02_eda.R               # EDA and distribution plots
│   ├── step03_feature_matrix.R    # Feature groups (5 categories, 19 features), VIF checks
│   ├── step04_regression.R        # Incremental OLS + elastic net (cis / trans / genome-wide)
│   ├── step05_rf_ppi.R            # Random Forest variable importance + PPI stratification
│   ├── step06_visualisation.R     # PCA, observed vs predicted scatter plots
│   ├── step07_string_network.R    # STRING network, Louvain modules, RWR, decile analysis
│   ├── step08_mixed_model.R       # Mixed model, module permutation tests, ICC
│   ├── step09_enrichment.R        # Protein classification, GSEA/ORA, module enrichment
│   ├── step_model_panel.R         # Combined 3×3 model comparison figure (OLS/EN/RF)
│   ├── step_module_figure.R       # Combined module heatmap + enrichment figure
│   └── step_enrichment_panel.R    # Combined enrichment panel figure
├── data/                          # Raw input files (committed) + generated intermediates (gitignored)
├── figs/                          # Generated figures, organised by step (gitignored)
├── outputs/                       # Generated result CSVs (gitignored)
├── viewer/                        # FastAPI + Plotly interactive results viewer
├── logs/                          # Pipeline run logs (gitignored)
└── prism_clean_column_dictionary.md

Pipeline overview

Step	Script	Key outputs
01	`step01_load_and_clean.R`	`data/prism_clean.csv` — merged, cleaned, INT-transformed, with residualized outcomes
02	`step02_eda.R`	Distribution plots, cis vs trans scatter, correlation heatmap
03	`step03_feature_matrix.R`	`data/feature_matrix.csv` — 19 features across 5 groups, VIF < 10
04	`step04_regression.R`	Incremental OLS ΔR² by feature group + elastic net coefficients
05	`step05_rf_ppi.R`	RF variable importance (OOB R²: cis=0.71, trans=0.70, gw=0.65) + PPI stratification
06	`step06_visualisation.R`	PCA of feature space, observed vs predicted plots
07	`step07_string_network.R`	STRING network (threshold=700), 225 Louvain modules, RWR, decile subnetwork analysis
08	`step08_mixed_model.R`	Mixed model ICC, permutation tests (n=10,000 per module)
09	`step09_enrichment.R`	Protein classification, GSEA/ORA (4 residualized outcomes), per-module GO:BP enrichment

Outcome variables

Variable	Description
`R2_cis`	Predictive R² from cis-only pQTL model (local genetic architecture)
`R2_trans`	Additional R² gained by adding trans SNPs beyond cis
`R2_genome_wide`	Predictive R² from full genome-wide SNP model
`R2_ratio`	R²_trans / R²_cis (proteins with R²_cis > 0.001 only)

Residualized outcomes (computed in step01, used in steps 07–09):

Technical confounders (R2_cov, prot_expr, model_sparsity) are regressed out and residuals are z-scored. trans_resid additionally removes the contribution of R2_cis_int.

Residual	Confounder model
`cis_resid`	`R2_cis_int ~ R2_cov + prot_expr + model_sparsity`
`trans_resid`	`R2_trans_int ~ R2_cis_int + R2_cov + prot_expr + model_sparsity`
`gw_resid`	`R2_genome_wide_int ~ R2_cov + prot_expr + model_sparsity`
`ratio_resid`	`R2_ratio_int ~ R2_cov + prot_expr + model_sparsity`

Required input data

File	Size	Description
`data/Full_model.csv`	~650 KB	Feature table (genomic, regulatory, constraint, pQTL per protein)
`data/R2_data.csv`	~190 KB	Outcome table (R²_cis, R²_trans, R²_genome_wide per protein)
`data/9606.protein.links.v12.0.txt.gz`	80 MB	STRING v12 interaction network — download
`data/9606.protein.info.v12.0.txt.gz`	1.9 MB	STRING v12 protein metadata — download
`data/9606.protein.enrichment.terms.v12.0.txt.gz`	38 MB	STRING v12 functional annotations — download

Note: Raw column naming in source CSVs is non-standard — GeneSymbol contains Ensembl gene IDs and gene contains HGNC symbols. Step 01 renames these correctly.

Installation

R dependencies

# CRAN
install.packages(c(
  "ggplot2", "ggrepel", "ggridges",   # visualisation
  "glmnet",                            # elastic net
  "randomForest",                      # random forest
  "lme4",                              # mixed models
  "igraph",                            # network analysis
  "pheatmap",                          # heatmaps
  "jsonlite",                          # JSON export
  "dplyr", "readr", "scales",          # data manipulation
  "gridExtra", "grid",                 # figure layout
  "msigdbr"                            # MSigDB gene sets (step 09)
))

# Bioconductor (step 09)
if (!requireNamespace("BiocManager")) install.packages("BiocManager")
BiocManager::install(c("clusterProfiler", "org.Hs.eg.db", "enrichplot", "fgsea"))

Interactive viewer (optional)

python3 -m venv .viewer-venv
source .viewer-venv/bin/activate
pip install -r viewer/requirements.txt

Running the pipeline

# Full pipeline (~10–15 min including GSEA)
Rscript R/run_all.R 2>&1 | tee logs/pipeline_run_$(date +%Y%m%d).log

# Individual steps
Rscript R/step01_load_and_clean.R   # must run first
Rscript R/step07_string_network.R   # requires STRING files
Rscript R/step09_enrichment.R       # requires steps 01 + 07 + 08

# Figure panels (run after full pipeline)
Rscript R/step_model_panel.R
Rscript R/step_module_figure.R
Rscript R/step_enrichment_panel.R

Environment variable overrides

Variable	Default	Description
`PRISM_DATA_DIR`	`./data`	Location of raw input CSVs and STRING files
`PRISM_OUTPUT_DIR`	`./data`	Intermediate R outputs
`PRISM_RESULTS_DIR`	`./outputs`	Model result CSVs
`PRISM_FIGS_DIR`	`./figs`	Base figures directory

Interactive results viewer

A read-only FastAPI + Plotly dashboard for exploring pipeline outputs.

source .viewer-venv/bin/activate
uvicorn viewer.app:app --host 127.0.0.1 --port 8000

Access via SSH tunnel from your local machine:

ssh -N -L 8000:127.0.0.1:8000 <cluster-login>
# then open http://127.0.0.1:8000/

Key findings

pQTL architecture dominates R² prediction across all three outcomes (largest ΔR² of any feature group)
Signalling and immune proteins have significantly higher trans_resid (β=+0.69, FDR=2.4×10⁻¹²), while intracellular/mitochondrial proteins have lower trans_resid
GSEA reveals pathway-level cis escape: all 107 significant gene sets for cis_resid show negative NES — hormonal response, hemostasis, and platelet activation pathway proteins are uniformly less locally heritable than expected
Trans-driven proteins are STRING network hubs: high R²_trans decile proteins form denser subnetworks (D10 mean degree=2.0 vs D1=0.5)
Module ICC=0.011 after residualization (down from 0.022 on raw R²) — technical confounders explained ~half the apparent module clustering signal

Pipeline verification

Step	Expected
01	1,359 proteins; residualized outcomes: cis adj.R²=0.091, trans adj.R²=0.559, gw adj.R²=0.291
03	All 19 features with VIF < 10
05	RF OOB R²: cis=0.712, trans=0.702, gw=0.650
07	1,357/1,359 proteins matched to STRING; 225 Louvain modules; RWR converges at ~37 iterations
08	ICC=0.011; ≥1 significant module (q_BH<0.05)
09	107 significant GSEA sets for cis_resid; 15 modules biologically labelled

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PRISM — Protein Regulation and Interaction-based Score Modelling

Repository structure

Pipeline overview

Outcome variables

Required input data

Installation

R dependencies

Interactive viewer (optional)

Running the pipeline

Environment variable overrides

Interactive results viewer

Key findings

Pipeline verification

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

PRISM — Protein Regulation and Interaction-based Score Modelling

Repository structure

Pipeline overview

Outcome variables

Required input data

Installation

R dependencies

Interactive viewer (optional)

Running the pipeline

Environment variable overrides

Interactive results viewer

Key findings

Pipeline verification