This repository contains a reproducible workflow for analyzing plasma proteomics data using:
- Linear modeling (limma) for differential protein expression
- Gene Set Enrichment Analysis (GSEA, preranked via `gseapy`)
The pipeline is designed for multi-cohort clinical proteomics data with repeated measures and confounding variables.
Metadata table describing each sample.
| Column | Description |
|---|---|
| Sample ID | Patient identifier (prefix indicates country: SWE, IRE, ITA, NOR) |
| TP | Timepoint (T0, T1, T2) |
| Sex | Biological sex (M / F) |
| CF ID | Unique sample identifier (used to map quantification data) |
| cancer_type | Cancer type |
| toxicity | Toxicity status (yes / no) |
| country | Country of origin |
| ... | Additional clinical covariates |
Mapping table linking CF IDs to raw file names.
| Column | Description |
|---|---|
| Sample ID | CF ID |
| R.FileName | Identifier used in quantification matrix |
Protein quantification matrix.
| Column | Description |
|---|---|
| PG.Genes | Protein / gene annotation |
| Other columns | Sample-specific intensities (mapped via meta.tsv) |
- Match quantification columns to CF IDs using `meta.tsv`
- Rename columns to CF IDs
- Multi-annotation entries (e.g. `ARHGEF5;ARHGEF5;...`) are simplified to a single entry
- Rows → samples (CF IDs)
- Columns → proteins
- Values → normalized intensities
- Remove duplicated samples
- Convert to numeric
- Log2 transform
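
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the function names `simplify_annotation` and `preprocess`, the exact-match lookup from `R.FileName` to CF ID, the "keep the first entry" rule for multi-annotation proteins, and the "keep the first occurrence" rule for duplicated samples are all assumptions made for the example.

```python
import math


def simplify_annotation(genes):
    """Collapse a ';'-separated annotation to its first non-empty entry (assumed rule)."""
    parts = [p.strip() for p in genes.split(";") if p.strip()]
    return parts[0] if parts else genes


def preprocess(quant_rows, file_to_cf):
    """Build a samples-by-proteins table of log2 intensities.

    quant_rows: list of dicts with 'PG.Genes' plus one key per R.FileName.
    file_to_cf: dict mapping R.FileName -> CF ID (as read from meta.tsv).
    Returns {cf_id: {protein: log2_intensity}}.
    """
    matrix = {}
    for row in quant_rows:
        protein = simplify_annotation(row["PG.Genes"])
        for file_name, raw in row.items():
            if file_name == "PG.Genes" or file_name not in file_to_cf:
                continue  # skip annotation column and unmapped columns
            sample = matrix.setdefault(file_to_cf[file_name], {})
            if protein in sample:
                continue  # duplicated sample or protein: keep the first value
            try:
                value = float(raw)  # convert to numeric
            except (TypeError, ValueError):
                value = float("nan")
            # log2 is undefined for non-positive or missing intensities
            sample[protein] = math.log2(value) if value > 0 else float("nan")
    return matrix


# Hypothetical toy input: run_01 and run_03 both map to CF001 (a duplicate).
quant_rows = [
    {"PG.Genes": "ARHGEF5;ARHGEF5", "run_01": "1024", "run_02": "2048", "run_03": "4096"},
    {"PG.Genes": "ALB", "run_01": "8", "run_02": "", "run_03": "16"},
]
file_to_cf = {"run_01": "CF001", "run_02": "CF002", "run_03": "CF001"}
matrix = preprocess(quant_rows, file_to_cf)
```

Missing or non-numeric intensities are kept as NaN here rather than dropped, so that filtering or imputation remains a separate, explicit decision.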
We use the limma framework for linear modeling with empirical Bayes moderation.
- Supports flexible contrasts:
  - Sex: M vs F
  - TP: T2 vs T0
  - toxicity: yes vs no
- Adjusts for confounders:
  - country
  - cancer_type
  - TP
  - toxicity
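
For orientation, the per-protein model that limma fits, and its moderated t-statistic, can be written as below. This is the standard limma formulation; how this repository's wrapper encodes the design matrix is not shown here, so the indicator coding of the contrast variable and the dummy coding of the confounders are assumptions.

```latex
% Linear model for protein g in sample i:
%   x_i    = 1 if sample i is in the first contrast group (e.g. Sex = M), else 0
%   z_{ik} = dummy-coded confounders (country, cancer_type, TP, toxicity)
y_{gi} = \alpha_g + \beta_g x_i + \sum_{k} \gamma_{gk} z_{ik} + \varepsilon_{gi}

% Empirical Bayes shrinkage of the residual variance s_g^2 (d_g degrees of freedom)
% toward a prior variance s_0^2 (d_0 degrees of freedom) estimated across all proteins:
\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g}

% Moderated t-statistic, where v_g is the unscaled variance of \hat{\beta}_g:
\tilde{t}_g = \frac{\hat{\beta}_g}{\tilde{s}_g \sqrt{v_g}}
```

Under the null hypothesis, the moderated statistic follows a t-distribution with d_0 + d_g degrees of freedom, which is what gives the method extra power when the number of samples per group is small.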
```r
# Compare males vs. females while adjusting for the listed confounders
results <- limma_differential(
  expr = expr_matrix,
  sample_info = sample_info,
  contrast_feature = "Sex",
  contrast_direction = "M_vs_F",
  confounding_factors = c("country", "cancer_type", "TP", "toxicity")
)
```
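
Preranked GSEA takes a ranked list of genes as input, so the limma results need to be turned into one. The sketch below shows one common way to do that, ranking by t-statistic. The function name `build_ranking`, the `(gene, t_statistic)` input shape, and the rule of keeping the entry with the largest absolute statistic for repeated gene names are assumptions for illustration; the columns actually returned by `limma_differential` may differ.

```python
import math


def build_ranking(results):
    """Turn (gene, t_statistic) pairs into a list sorted from most up- to most down-regulated.

    Genes with a missing statistic are dropped. If a gene appears more than
    once, the entry with the largest absolute statistic is kept (assumed rule).
    """
    best = {}
    for gene, t_stat in results:
        if t_stat is None or math.isnan(t_stat):
            continue
        if gene not in best or abs(t_stat) > abs(best[gene]):
            best[gene] = t_stat
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy results
ranking = build_ranking(
    [("ALB", -2.1), ("CRP", 4.3), ("ALB", 0.5), ("APOA1", float("nan"))]
)
```

A list like this can be written to a two-column `.rnk` file or converted to a pandas object before being passed to the preranked mode of `gseapy`.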