Skip to content

Commit 4ad44bf

Browse files
committed
Add robustness benchmark and tighten interpretation
1 parent b21bdbd commit 4ad44bf

22 files changed

Lines changed: 35746 additions & 75 deletions

.github/workflows/ci.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -83,5 +83,7 @@ jobs:
8383
run: >
8484
git diff --exit-code --
8585
results/figures/qc_library_size.png
86+
results/figures/ma_plot.png
8687
results/figures/pca_plot.png
88+
results/figures/sensitivity_lfc_scatter.png
8789
results/figures/volcano_plot.png

README.md

Lines changed: 36 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -5,23 +5,24 @@
55
[![DOI](https://zenodo.org/badge/1142001317.svg)](https://doi.org/10.5281/zenodo.18432519)
66
[![CI](https://github.com/Ekin-Kahraman/bulk-rnaseq-differential-expression/actions/workflows/ci.yml/badge.svg)](https://github.com/Ekin-Kahraman/bulk-rnaseq-differential-expression/actions/workflows/ci.yml)
77

8-
Reproducible bulk RNA-seq differential expression pipeline using DESeq2: QC, PCA, ~1,900 DE genes, ISG enrichment, and mechanistic interpretation of antiviral host responses.
8+
Reproducible bulk RNA-seq differential expression pipeline using DESeq2: QC, shrunken-effect DE analysis, pathway enrichment, and robustness benchmarking against the full QC-passed cohort.
99

1010
## Highlights
1111

1212
- Processed GEO [GSE152075](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152075) (n=484 nasopharyngeal swabs) → balanced subset (n=60) for robust DE analysis
13-
- Identified **1,902 DE genes** (FDR < 0.05, |log₂FC| > 1), dominated by interferon-stimulated genes (IFIT1/2/3, OAS3, DDX58)
14-
- Enriched pathways: GO "response to virus", KEGG "Coronavirus disease – COVID-19" (FDR = 1.5×10<sup>-40</sup>)
15-
- Full reproducible R workflow (DESeq2, clusterProfiler) with modular scripts and fixed seeds
16-
- Results align with Lieberman *et al.* (2020), who reported viral-load-dependent ISG induction
13+
- Identified **1,902 thresholded DE genes** in the balanced subset (FDR < 0.05, |log₂FC| > 1), dominated by canonical interferon-stimulated genes
14+
- Full-cohort sensitivity analysis identified **4,371 thresholded DE genes**, with **1,314** shared with the balanced analysis and **99.7%** effect-direction concordance
15+
- Enriched pathways: GO "response to virus", KEGG "Coronavirus disease – COVID-19" (FDR = 4.5×10<sup>-39</sup>)
16+
- Raw and shrunken DE outputs, analysis summary metrics, and git/session provenance are generated automatically
1717

1818
## Methods Overview
1919

2020
- Bulk RNA-seq preprocessing and quality control
21-
- Differential expression modelling with DESeq2
21+
- Differential expression modelling with DESeq2 plus `apeglm` log2 fold-change shrinkage
2222
- Variance-stabilising transformation (VST) for visualisation
2323
- Functional enrichment analysis (GO/KEGG)
24-
- Reproducible analysis workflow (pinned dependencies via `renv`, fixed random seeds)
24+
- Full-cohort robustness benchmark against the balanced subset
25+
- Reproducible analysis workflow (pinned dependencies via `renv`, fixed seeds, git/session provenance)
2526

2627
## Dataset
2728

@@ -39,7 +40,7 @@ Lieberman NAP, Peddu V, Xie H, Shrestha L, Huang M-L, Mears MC, *et al.* (2020)
3940
| Total samples | 484 (430 positive, 54 negative) |
4041
| Analysis subset | 60 (30 per group, balanced) |
4142

42-
Balanced subset controls for class imbalance and viral load heterogeneity. Subsampling uses `set.seed(123)` for reproducibility.
43+
Balanced subset controls for class imbalance and viral load heterogeneity. Subsampling uses `set.seed(123)` for reproducibility, and a separate full-cohort sensitivity analysis quantifies how much of the inferred signal persists outside the balanced subset.
4344

4445
## Results
4546

@@ -61,21 +62,13 @@ PC1 (33% variance) partially separates infected from control samples. Overlap re
6162

6263
![Volcano Plot](results/figures/volcano_plot.png)
6364

64-
**1,902 DE genes** (FDR < 0.05, |log₂FC| > 1): 1,099 upregulated, 803 downregulated
65+
**1,902 thresholded DE genes** (FDR < 0.05, |log₂FC| > 1): 1,099 upregulated, 803 downregulated
6566

66-
Results dominated by interferon-stimulated genes (ISGs) characteristic of antiviral immunity.
67+
Results are dominated by interferon-stimulated genes (ISGs) characteristic of antiviral immunity. Ranking and volcano visualization use shrunken log2 fold changes to stabilize effect-size estimates for lower-count genes while preserving the raw significance calls.
6768

68-
### Top ISGs by Effect Size
69+
### Representative Induced Genes
6970

70-
| Gene | Function | log₂FC | FDR |
71-
|:-----|:---------|-------:|----:|
72-
| IFIT1 | Translation inhibitor | 3.5 | <10<sup>-20</sup> |
73-
| IFIT2 | Translation inhibitor | 3.2 | <10<sup>-18</sup> |
74-
| IFIT3 | Translation inhibitor | 3.1 | <10<sup>-17</sup> |
75-
| OAS3 | 2'-5'-Oligoadenylate synthetase | 3.0 | <10<sup>-17</sup> |
76-
| CXCL10 | IFN-inducible chemokine | 2.9 | <10<sup>-20</sup> |
77-
| DDX58 | RIG-I (viral RNA sensor) | 2.8 | <10<sup>-19</sup> |
78-
| GBP1 | Guanylate-binding protein | 2.7 | <10<sup>-19</sup> |
71+
The most consistently induced genes include **IFIT1/2/3, CXCL10, DDX58, GBP1, OAS3, XAF1, and SIGLEC1**. These genes anchor the interpretation around interferon signaling, viral RNA sensing, and downstream antiviral effector programs rather than isolated single-gene effects.
7972

8073
### Model Diagnostics
8174

@@ -107,6 +100,12 @@ Top GO terms: cytoplasmic translation, response to virus, defense response to vi
107100

108101
Top KEGG pathway: **Coronavirus disease - COVID-19** (FDR = 4.5×10<sup>-39</sup>), followed by NOD-like receptor signalling.
109102

103+
### Robustness Check
104+
105+
![Sensitivity Scatter](results/figures/sensitivity_lfc_scatter.png)
106+
107+
The full QC-passed cohort analysis (n=484) identified **4,371 thresholded DE genes**. Of these, **1,314** overlap with the balanced-subset DE set, with **99.7%** shared effect-direction concordance and a Spearman correlation of **0.816** between shrunken effect sizes across all shared genes. This indicates that the balanced subset sharpens contrast but does not invert the core biological signal.
108+
110109
### ISG Signalling Cascade
111110

112111
![Pathway Diagram](results/figures/pathway_diagram.png)
@@ -115,7 +114,9 @@ Schematic of RIG-I → IFN → ISG antiviral cascade. Viral RNA detection by DDX
115114

116115
## Biological Interpretation
117116

118-
The transcriptional signature is consistent with innate antiviral immunity. DDX58 (RIG-I) detects viral RNA, signalling proceeds via MAVS/TBK1/IRF3/7, and type I interferon responses induce canonical ISGs (including IFIT1/2/3, OAS3, and CXCL10). This pattern is expected in acute infection and supports the observed enrichment of antiviral pathways.
117+
The dominant signal is an **upper-airway interferon-driven antiviral host response**. Canonical ISGs such as IFIT1/2/3, OAS3, DDX58, CXCL10, GBP1, and SIGLEC1 support activation of RNA-sensing and interferon-response programs that are expected during acute viral infection. The pathway results reinforce this reading, with strong enrichment for antiviral and coronavirus-associated gene sets.
118+
119+
At the same time, the interpretation should stay conservative. This signature is **consistent with** acute SARS-CoV-2 infection in nasopharyngeal samples, but it is not uniquely SARS-CoV-2-specific and should not be read as proof of cell-intrinsic mechanism or direct pathway activation in every cell type. The balanced subset was chosen to reduce class imbalance and viral-load heterogeneity; the new full-cohort sensitivity analysis shows that the main direction of effect is highly stable, which makes the interpretation stronger, but the biological claims should still be framed as a robust host-response signature rather than a definitive mechanistic model.
119120

120121
## Quick Start
121122
```sh
@@ -126,13 +127,14 @@ Rscript 000_install_dependencies.R
126127
Rscript run_all.R
127128
```
128129

129-
Analysis runtime: ~0.5 min after data download (~2GB).
130+
Analysis runtime: ~1.7 min after data download (~2GB).
130131

131132
### Notes
132133
- To re-download the GEO dataset (otherwise the pipeline reuses existing `data/*.rds` outputs): `FORCE_DOWNLOAD=true Rscript scripts/00_get_data.R`
133134
- To continue without KEGG results when the KEGG service is unavailable: `ALLOW_KEGG_FAILURE=true Rscript scripts/06_enrichment.R`
134135
- Lint: `Rscript dev/lint.R`
135136
- Tests: `Rscript -e 'testthat::test_dir("tests/testthat")'`
137+
- Benchmark + design notes: see `WORKFLOW_BENCHMARK.md`
136138
- Reproducibility details (expected outputs, network requirements): see `REPRODUCIBILITY.md`
137139

138140
## Data and Code Availability
@@ -171,6 +173,7 @@ bulk-rnaseq-differential-expression/
171173
├── 000_install_dependencies.R # Install all required packages
172174
├── CITATION.cff
173175
├── REPRODUCIBILITY.md
176+
├── WORKFLOW_BENCHMARK.md
174177
├── dev/
175178
│ ├── lint.R # Lint scripts/ via lintr
176179
│ └── snapshot_lockfile.R # Maintainer-only renv.lock refresh
@@ -189,6 +192,7 @@ bulk-rnaseq-differential-expression/
189192
│ ├── 06_enrichment.R
190193
│ ├── 07_reproducibility.R
191194
│ ├── 08_pathway_diagram.R
195+
│ ├── 09_sensitivity_analysis.R
192196
│ └── config.R # Shared analysis thresholds and helpers
193197
├── data/
194198
│ └── [RDS files]
@@ -212,6 +216,7 @@ source("scripts/05_model_diagnostics.R")
212216
source("scripts/06_enrichment.R")
213217
source("scripts/07_reproducibility.R")
214218
source("scripts/08_pathway_diagram.R")
219+
source("scripts/09_sensitivity_analysis.R")
215220
```
216221

217222
## Methods
@@ -225,10 +230,15 @@ source("scripts/08_pathway_diagram.R")
225230
### Statistical Analysis
226231
- Normalisation: DESeq2 median-of-ratios
227232
- Transformation: Variance-stabilising transformation (VST) for visualisation
228-
- Dispersion: Empirical Bayes shrinkage
233+
- Effect-size stabilisation: `apeglm` shrinkage for ranking/visualisation
229234
- Testing: Wald test with Benjamini-Hochberg correction
230235
- Thresholds: FDR < 0.05, |log₂FC| > 1 (chosen after testing >0.58 and >1.5; this gave the cleanest ISG-dominant signal)
231236

237+
### Robustness Analysis
238+
- Secondary DE run on the full QC-passed cohort (n = 484)
239+
- Summary outputs written to `results/tables/full_cohort_deseq2_results.csv` and `results/tables/analysis_summary.csv`
240+
- Effect-size concordance visualised in `results/figures/sensitivity_lfc_scatter.png`
241+
232242
### Enrichment Analysis
233243
- Gene ID conversion: Symbol → Entrez (96% mapped)
234244
- GO: Biological Process, BH-corrected
@@ -237,9 +247,10 @@ source("scripts/08_pathway_diagram.R")
237247
## Limitations
238248

239249
- Nasopharyngeal samples only; may not reflect lower respiratory tract
240-
- Subset analysis reduces power but improves class balance (full cohort runs showed noisier PCA from viral-load imbalance)
250+
- Primary inference still uses a simple `~ condition` design without explicit age/sex/viral-load covariates
251+
- The balanced subset improves comparability, but the full cohort remains heterogeneous and likely reflects cell-composition shifts as well as transcriptional regulation
241252
- Cross-sectional design; no temporal dynamics
242-
- Future extensions: full cohort analysis, batch correction assessment, or integration with scRNA-seq
253+
- Future extensions: covariate-aware modelling, batch correction assessment, cell-type deconvolution, or integration with scRNA-seq
243254

244255
## Requirements
245256

REPRODUCIBILITY.md

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ This repository is set up so a reviewer can reproduce the analysis with a small
66
- **Code**: analysis scripts in `scripts/` and an orchestrator in `run_all.R`.
77
- **Version pinning**: `renv.lock` pins CRAN + Bioconductor package versions.
88
- **Pre-computed outputs**: key figures and tables are committed under `results/` for convenience and quick verification.
9+
- **Analysis summary**: `results/tables/analysis_summary.csv` captures the main counts used in the narrative.
910

1011
## From a clean checkout (recommended)
1112
Run these commands from the repository root (i.e., a fresh clone):
@@ -46,14 +47,18 @@ ALLOW_KEGG_FAILURE=true Rscript scripts/06_enrichment.R
4647
```
4748

4849
## Determinism
49-
The balanced subset selection uses a fixed seed (`set.seed(123)` in `scripts/01_qc.R`) so repeated runs should yield the same subset and downstream results, given the same package versions.
50+
The balanced subset selection uses a fixed seed (`set.seed(123)` in `scripts/01_qc.R`) so repeated runs should yield the same subset and downstream results, given the same package versions. Figure label placement for `ggrepel`-based figures is also seeded, and `results/session_info.txt` now records the active git commit, branch, and analysis configuration.
5051

5152
## Expected Outputs
5253
After a successful run, you should see (among others):
5354
- `results/tables/deseq2_results.csv`
55+
- `results/tables/deseq2_results_shrunken.csv`
56+
- `results/tables/full_cohort_deseq2_results.csv`
57+
- `results/tables/analysis_summary.csv`
5458
- `results/figures/volcano_plot.png`
59+
- `results/figures/sensitivity_lfc_scatter.png`
5560
- `results/figures/pca_plot.png`
56-
- `results/session_info.txt` (records R + package versions for the run)
61+
- `results/session_info.txt` (records R, package versions, git commit, and config for the run)
5762

5863
## Verification Commands
5964
```sh
@@ -68,3 +73,5 @@ Rscript dev/lint.R
6873
```
6974

7075
GitHub Actions also performs a clean rebuild of the tracked analysis outputs and checks that regenerated key outputs match the committed versions.
76+
77+
For a workflow-level comparison against DESeq2, nf-core/rnaseq, `targets`, and `workflowr`, see `WORKFLOW_BENCHMARK.md`.

WORKFLOW_BENCHMARK.md

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
# Workflow Benchmark
2+
3+
This repository starts from a published GEO count matrix, so the benchmark below focuses on **differential-expression and reproducibility workflow quality**, not on FASTQ-level alignment or quantification.
4+
5+
## Reference workflows
6+
7+
- [DESeq2 vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html): recommends effect-size shrinkage for ranking and visualization.
8+
- [nf-core/rnaseq](https://nf-co.re/rnaseq/latest/): exemplifies top-tier end-to-end RNA-seq workflow engineering, especially standardized QC and reproducible execution from raw reads.
9+
- [`targets` user manual](https://books.ropensci.org/targets/): exemplifies dependency-aware skipping and pipeline orchestration.
10+
- [workflowr](https://jdblischak.github.io/workflowr/): exemplifies research provenance via git-aware reporting and session/environment capture.
11+
12+
## Current alignment
13+
14+
- **DE effect estimation**: raw DESeq2 inference plus `apeglm`-shrunken log2 fold changes for ranking, MA plotting, and volcano visualization.
15+
- **Robustness**: balanced-subset analysis is benchmarked against the full QC-passed cohort, with overlap/concordance written to `results/tables/analysis_summary.csv`.
16+
- **Reproducibility**: `renv`-pinned environment, deterministic seeds, GitHub Actions rebuilds, and explicit git/session provenance in `results/session_info.txt`.
17+
- **Artifact validation**: tracked tables are rebuilt in CI and compared against committed results; key figures are also diff-checked.
18+
19+
## Still narrower than top-tier end-to-end workflows
20+
21+
- **Upstream RNA-seq processing**: unlike `nf-core/rnaseq`, this repo does not perform raw-read QC, alignment, quantification, or MultiQC because the starting point is the GEO count matrix.
22+
- **Pipeline engine**: the workflow is still a sequential R-script orchestrator rather than a declarative DAG like `targets`.
23+
- **Model complexity**: the primary DE model remains `~ condition`; covariates such as age, sex, viral load, or inferred cell composition are not yet modeled explicitly.
24+
25+
## Why this is still a reasonable design
26+
27+
- The codebase is intentionally small and reviewable for a focused secondary analysis.
28+
- The new robustness layer addresses the highest-risk biological weakness without forcing a full re-architecture.
29+
- The remaining gaps are now explicit and documented, which makes future extensions decision-ready rather than hidden assumptions.

0 commit comments

Comments
 (0)