Commit e57cbc6

ci: pin KEGG enrichment reference
1 parent 05ff601 commit e57cbc6

7 files changed

Lines changed: 39814 additions & 48 deletions

README.md

Lines changed: 6 additions & 6 deletions
@@ -12,7 +12,7 @@ Reproducible bulk RNA-seq differential expression pipeline using DESeq2: QC, shr
 - Processed GEO [GSE152075](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152075) (n = 484 nasopharyngeal swabs) to a balanced subset (n = 60) for the primary differential expression analysis
 - Identified **1,773 thresholded DE genes** in the balanced subset (FDR < 0.05, |log₂FC| > 1), dominated by canonical interferon-stimulated genes
 - Full-cohort sensitivity analysis identified **4,378 thresholded DE genes**, with **1,266** shared with the balanced analysis and **99.8%** effect-direction concordance
-- Enriched pathways: GO "response to virus", KEGG "Coronavirus disease - COVID-19" (FDR = 2.9e-39)
+- Enriched pathways: GO "response to virus", KEGG "Coronavirus disease - COVID-19" (FDR = 4.5e-39)
 - **Extended: Viral load stratification** — COVID-positive samples stratified by N1 Ct value into high/low viral load groups with independent DE analysis and continuous ISG–Ct correlation, extending the original continuous regression approach with a group-comparison framework
 - **Extended: Sex-stratified interaction analysis** — Condition-by-sex interaction model (`~ condition * gender`) to identify genes with sex-differential transcriptional responses, complementing the original study's sex-adjusted analysis with a formal interaction test
 - Extracts full GEO covariates (viral load Ct, age, sex, sequencing batch) for covariate-aware analyses
@@ -37,7 +37,7 @@ GSE152075 (n=484, GEO)
 
 ├──→ 04 Sensitivity ─── Full cohort (n=484) DE → concordance check (99.8% sign agreement)
 ├──→ 05 Diagnostics ─── Cook's distance, dispersion, MA, volcano, scree
-├──→ 06 Enrichment ──── GO/KEGG via clusterProfiler (top KEGG: "Coronavirus disease", FDR=2.9e-39)
+├──→ 06 Enrichment ──── GO/KEGG via clusterProfiler (top KEGG: "Coronavirus disease", FDR=4.5e-39)
 ├──→ 08 Viral load ──── High/low Ct stratification → independent DE + ISG-Ct correlation
 └──→ 09 Sex interaction ── ~ condition * gender → 12 sex-biased genes (9 male, 3 female)
 
@@ -54,7 +54,7 @@ GSE152075 (n=484, GEO)
 - Full-cohort robustness benchmark against the balanced subset
 - Viral load stratification: median-split DE analysis of high vs low viral load patients
 - Sex-stratified interaction model: `~ condition * gender` to identify sex-differential host responses
-- Reproducible analysis workflow (pinned dependencies via `renv`, fixed seeds, git/session provenance)
+- Reproducible analysis workflow (pinned dependencies via `renv`, fixed seeds, pinned KEGG snapshot, git/session provenance)
 
 ## Dataset
 
@@ -130,7 +130,7 @@ Top GO terms: cytoplasmic translation, response to virus, defense response to vi
 
 ![KEGG Enrichment](results/figures/kegg_dotplot.png)
 
-Top KEGG pathway: **Coronavirus disease - COVID-19** (FDR = 2.9×10<sup>-39</sup>), followed by Ribosome and NOD-like receptor signalling.
+Top KEGG pathway: **Coronavirus disease - COVID-19** (FDR = 4.5×10<sup>-39</sup>), followed by Ribosome and NOD-like receptor signalling.
 
 ### Robustness Check
 
@@ -183,11 +183,11 @@ Rscript 000_install_dependencies.R
 Rscript run_all.R
 ```
 
-Analysis runtime: ~1.7 min after data download (~2GB).
+Analysis runtime: ~7-10 min after data download (~2GB), depending on CPU and CI runner load.
 
 ### Notes
 - To re-download the GEO dataset (otherwise the pipeline reuses existing `data/*.rds` outputs): `FORCE_DOWNLOAD=true Rscript scripts/00_get_data.R`
-- To continue without KEGG results when the KEGG service is unavailable: `ALLOW_KEGG_FAILURE=true Rscript scripts/06_enrichment.R`
+- KEGG enrichment uses the pinned human pathway snapshot in `data/reference/` so routine rebuilds do not silently change when KEGG updates upstream.
 - Lint: `Rscript dev/lint.R`
 - Tests: `Rscript -e 'testthat::test_dir("tests/testthat")'`
 - Workflow benchmark: see `WORKFLOW_BENCHMARK.md`
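
Reviewer sketch (not part of the commit): the README change above relies on the files under `data/reference/` staying byte-identical between rebuilds. One way to guard that, assuming GNU coreutils `sha256sum` is available, is to record checksums once and verify them before enrichment runs. The function names and the manifest file are hypothetical and do not exist in the repository.

```sh
#!/bin/sh
# Hypothetical helpers (assumption: not present in the repository).
# They record and verify SHA-256 checksums for pinned reference files.

record_snapshot() {
  # $1 = directory holding the pinned *.tsv files
  # $2 = manifest path to write (name and location are assumptions)
  ( cd "$1" && sha256sum -- *.tsv ) > "$2"
}

verify_snapshot() {
  # $1 = directory holding the pinned files, $2 = manifest path
  # Returns non-zero if any listed file is missing or has changed.
  manifest_dir=$(cd "$(dirname "$2")" && pwd) || return 1
  manifest="$manifest_dir/$(basename "$2")"
  ( cd "$1" && sha256sum --check --quiet "$manifest" )
}
```

A possible use would be `verify_snapshot data/reference data/reference/SHA256SUMS || exit 1` before `scripts/06_enrichment.R` runs, where `SHA256SUMS` is an assumed manifest name.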

REPRODUCIBILITY.md

Lines changed: 2 additions & 10 deletions
@@ -7,6 +7,7 @@ This repository is set up so a reviewer can reproduce the analysis with a small
 - **Version pinning**: `renv.lock` pins CRAN + Bioconductor package versions.
 - **Pre-computed outputs**: key figures and tables are committed under `results/` for convenience and quick verification.
 - **Analysis summary**: `results/tables/analysis_summary.csv` captures the main counts used in the narrative.
+- **Pinned pathway snapshot**: `data/reference/kegg_hsa_pathway_*.tsv` freezes the KEGG human pathway universe used by enrichment.
 
 ## From a clean checkout (recommended)
 Run these commands from the repository root (i.e., a fresh clone):
@@ -35,16 +36,7 @@ FORCE_DOWNLOAD=true Rscript scripts/00_get_data.R
 ```
 
 ## Network dependencies
-Some steps require network access:
-- GEO download (via `GEOquery`) in `scripts/00_get_data.R`
-- KEGG pathway annotation (via KEGG REST) in `scripts/06_enrichment.R`
-
-If you are running in a restricted environment, these steps may fail until network access is available.
-If KEGG is temporarily unavailable and you still want the pipeline to continue locally, run:
-
-```sh
-ALLOW_KEGG_FAILURE=true Rscript scripts/06_enrichment.R
-```
+The GEO download step (`scripts/00_get_data.R`) requires network access on first run. KEGG enrichment does not query live KEGG during routine analysis; it reads the pinned human pathway snapshot in `data/reference/` so exact table comparisons remain meaningful when KEGG changes upstream.
 
 ## Determinism
 The balanced subset selection uses a fixed seed (`set.seed(123)` in `scripts/01_qc.R`) so repeated runs should yield the same subset and downstream results, given the same package versions. Figure label placement for `ggrepel`-based figures is also seeded, and `results/session_info.txt` now records the active git commit, branch, and analysis configuration.
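
Reviewer sketch (not part of the commit): the "exact table comparisons" mentioned above could be checked by comparing the committed tables against a fresh rerun. The helper below is a minimal illustration using `diff -r -q`. The function name and the rerun directory are assumptions, and byte-identical output is a stricter test than the analysis may need (for example, if floating-point formatting differs across platforms).

```sh
#!/bin/sh
# Hypothetical helper (assumption: not present in the repository).
# Succeeds only if both directories contain the same files with
# byte-identical contents.

tables_match() {
  # $1 = reference tables directory (e.g. the committed results)
  # $2 = tables directory produced by a fresh run
  diff -r -q -- "$1" "$2" > /dev/null
}
```

For example, `tables_match results/tables /tmp/rerun/tables && echo "tables identical"`, where `/tmp/rerun/tables` is a placeholder for wherever the rerun wrote its output.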
