Reproducible bulk RNA-seq differential expression pipeline using DESeq2: QC, PCA, ~1,900 DE genes, ISG enrichment, and mechanistic interpretation of antiviral host responses.
+
Reproducible bulk RNA-seq differential expression pipeline using DESeq2: QC, shrunken-effect DE analysis, pathway enrichment, and robustness benchmarking against the full QC-passed cohort.
## Highlights
- Processed GEO [GSE152075](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152075) (n=484 nasopharyngeal swabs) → balanced subset (n=60) for robust DE analysis
-
- Identified **1,902 DE genes** (FDR < 0.05, |log₂FC| > 1), dominated by interferon-stimulated genes (IFIT1/2/3, OAS3, DDX58)
-
- Enriched pathways: GO "response to virus", KEGG "Coronavirus disease – COVID-19" (FDR = 1.5×10<sup>-40</sup>)
-
- Full reproducible R workflow (DESeq2, clusterProfiler) with modular scripts and fixed seeds
-
- Results align with Lieberman *et al.* (2020), who reported viral-load-dependent ISG induction
+
- Identified **1,902 thresholded DE genes** in the balanced subset (FDR < 0.05, |log₂FC| > 1), dominated by canonical interferon-stimulated genes
+
- Full-cohort sensitivity analysis identified **4,371 thresholded DE genes**, with **1,314** shared with the balanced analysis and **99.7%** effect-direction concordance
+
- Enriched pathways: GO "response to virus", KEGG "Coronavirus disease – COVID-19" (FDR = 4.5×10<sup>-39</sup>)
+
- Raw and shrunken DE outputs, analysis summary metrics, and git/session provenance are generated automatically
## Methods Overview
- Bulk RNA-seq preprocessing and quality control
-
- Differential expression modelling with DESeq2
+
- Differential expression modelling with DESeq2 plus `apeglm` log2 fold-change shrinkage
- Variance-stabilising transformation (VST) for visualisation
- Functional enrichment analysis (GO/KEGG)
-
- Reproducible analysis workflow (pinned dependencies via `renv`, fixed random seeds)
+
- Full-cohort robustness benchmark against the balanced subset
| Total samples | 484 (430 positive, 54 negative) |
| Analysis subset | 60 (30 per group, balanced) |
-
Balanced subset controls for class imbalance and viral load heterogeneity. Subsampling uses `set.seed(123)` for reproducibility.
+
Balanced subset controls for class imbalance and viral load heterogeneity. Subsampling uses `set.seed(123)` for reproducibility, and a separate full-cohort sensitivity analysis quantifies how much of the inferred signal persists outside the balanced subset.
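The seeded subsampling described here can be sketched in base R. This is a minimal illustration, not the code in `scripts/01_qc.R`: the sample sheet, its column names, and the `balanced_subset()` helper are hypothetical, and only the group sizes (430 positive, 54 negative), the 30-per-group target, and `set.seed(123)` come from the text.

```r
# Hypothetical sample sheet; column names are illustrative, not the repo's.
samples <- data.frame(
  sample_id = sprintf("S%03d", 1:484),
  condition = rep(c("positive", "negative"), times = c(430, 54)),
  stringsAsFactors = FALSE
)

# Draw n_per_group samples from each condition after fixing the RNG seed.
balanced_subset <- function(samples, n_per_group = 30, seed = 123) {
  set.seed(seed)
  by_group <- split(samples, samples$condition)
  picked <- lapply(by_group, function(g) {
    g[sample(nrow(g), n_per_group), , drop = FALSE]
  })
  do.call(rbind, picked)
}

subset_a <- balanced_subset(samples)
subset_b <- balanced_subset(samples)

table(subset_a$condition)                          # 30 negative, 30 positive
identical(subset_a$sample_id, subset_b$sample_id)  # TRUE: same seed, same subset
```

Resetting the seed inside the helper keeps the draw independent of any random calls made earlier in a session, which is one way to make the subset stable across reruns.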
## Results
@@ -61,21 +62,13 @@ PC1 (33% variance) partially separates infected from control samples. Overlap re
Results dominated by interferon-stimulated genes (ISGs) characteristic of antiviral immunity.
+
Results are dominated by interferon-stimulated genes (ISGs) characteristic of antiviral immunity. Ranking and volcano visualization use shrunken log2 fold changes to stabilize effect-size estimates for lower-count genes while preserving the raw significance calls.
| GBP1 | Guanylate-binding protein | 2.7 | <10<sup>-19</sup> |
+
The most consistently induced genes include **IFIT1/2/3, CXCL10, DDX58, GBP1, OAS3, XAF1, and SIGLEC1**. These genes anchor the interpretation around interferon signaling, viral RNA sensing, and downstream antiviral effector programs rather than isolated single-gene effects.
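The split between raw significance calls and shrunken effect sizes for ranking can be illustrated with toy tables in base R. The values and the `GENE_A`/`GENE_B` rows below are invented for illustration. `log2FoldChange` and `padj` are standard DESeq2 result column names, but applying the |log₂FC| > 1 threshold to the raw (unshrunken) estimate is an assumption, since the text does not say which estimate is thresholded.

```r
# Toy tables standing in for the raw and shrunken DESeq2 outputs (invented values).
raw <- data.frame(
  gene = c("IFIT1", "OAS3", "GENE_A", "GENE_B"),
  log2FoldChange = c(3.1, 2.4, 4.0, 0.3),
  padj = c(1e-20, 1e-12, 0.20, 0.01)
)
shrunken <- data.frame(
  gene = c("IFIT1", "OAS3", "GENE_A", "GENE_B"),
  log2FoldChange = c(3.0, 2.3, 0.6, 0.3)
)

merged <- merge(raw, shrunken, by = "gene", suffixes = c("_raw", "_shrunk"))

# Significance call from the raw results (assumed: raw LFC is thresholded).
merged$is_de <- !is.na(merged$padj) & merged$padj < 0.05 &
  abs(merged$log2FoldChange_raw) > 1

# Ranking uses the shrunken effect size, so a noisy gene with a large raw
# estimate (GENE_A) no longer outranks well-supported ISGs.
ranked <- merged[order(-abs(merged$log2FoldChange_shrunk)), ]
ranked[, c("gene", "log2FoldChange_shrunk", "padj", "is_de")]
```

In this toy case `GENE_A` has the largest raw fold change but drops to third after shrinkage, which is the behaviour shrinkage is meant to provide for lower-count genes.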
### Model Diagnostics
@@ -107,6 +100,12 @@ Top GO terms: cytoplasmic translation, response to virus, defense response to vi
Top KEGG pathway: **Coronavirus disease - COVID-19** (FDR = 4.5×10<sup>-39</sup>), followed by NOD-like receptor signalling.
The full QC-passed cohort analysis (n=484) identified **4,371 thresholded DE genes**. Of these, **1,314** overlap with the balanced-subset DE set, with **99.7%** shared effect-direction concordance and a Spearman correlation of **0.816** between shrunken effect sizes across all shared genes. This indicates that the balanced subset sharpens contrast but does not invert the core biological signal.
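The three robustness metrics reported here (DE-set overlap, effect-direction concordance, and Spearman correlation of shrunken effect sizes) can be computed along the following lines. This is a sketch on invented toy data with hypothetical column names (`lfc`, `is_de`). It assumes direction concordance is measured over genes called DE in both analyses, which the text does not state explicitly.

```r
# Toy shrunken log2 fold changes and DE calls for each analysis (invented values).
balanced <- data.frame(
  gene = c("A", "B", "C", "D", "E"),
  lfc = c(2.5, 1.8, -1.4, 0.2, 1.1),
  is_de = c(TRUE, TRUE, TRUE, FALSE, TRUE)
)
full <- data.frame(
  gene = c("A", "B", "C", "D", "F"),
  lfc = c(1.9, 1.2, -1.1, -0.1, 0.9),
  is_de = c(TRUE, TRUE, TRUE, FALSE, TRUE)
)

# Genes tested in both analyses.
shared <- merge(balanced, full, by = "gene", suffixes = c("_bal", "_full"))

# Overlap: genes called DE in both analyses.
both_de <- shared[shared$is_de_bal & shared$is_de_full, ]
n_overlap <- nrow(both_de)

# Direction concordance among overlapping DE genes (assumed definition).
direction_concordance <- mean(sign(both_de$lfc_bal) == sign(both_de$lfc_full))

# Rank correlation of effect sizes across all shared genes.
rho <- cor(shared$lfc_bal, shared$lfc_full, method = "spearman")

c(n_overlap = n_overlap, concordance = direction_concordance, spearman = rho)
```

The toy data give complete agreement; on real data the same three numbers would correspond to the 1,314, 99.7%, and 0.816 figures quoted above.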
@@ -115,7 +114,9 @@ Schematic of RIG-I → IFN → ISG antiviral cascade. Viral RNA detection by DDX
## Biological Interpretation
-
The transcriptional signature is consistent with innate antiviral immunity. DDX58 (RIG-I) detects viral RNA, signalling proceeds via MAVS/TBK1/IRF3/7, and type I interferon responses induce canonical ISGs (including IFIT1/2/3, OAS3, and CXCL10). This pattern is expected in acute infection and supports the observed enrichment of antiviral pathways.
+
The dominant signal is an **upper-airway interferon-driven antiviral host response**. Canonical ISGs such as IFIT1/2/3, OAS3, DDX58, CXCL10, GBP1, and SIGLEC1 support activation of RNA-sensing and interferon-response programs that are expected during acute viral infection. The pathway results reinforce this reading, with strong enrichment for antiviral and coronavirus-associated gene sets.
+
+
At the same time, the interpretation should stay conservative. This signature is **consistent with** acute SARS-CoV-2 infection in nasopharyngeal samples, but it is not uniquely SARS-CoV-2-specific and should not be read as proof of cell-intrinsic mechanism or direct pathway activation in every cell type. The balanced subset was chosen to reduce class imbalance and viral-load heterogeneity; the new full-cohort sensitivity analysis shows that the main direction of effect is highly stable, which makes the interpretation stronger, but the biological claims should still be framed as a robust host-response signature rather than a definitive mechanistic model.
- Nasopharyngeal samples only; may not reflect lower respiratory tract
-
- Subset analysis reduces power but improves class balance (full cohort runs showed noisier PCA from viral-load imbalance)
+
- Primary inference still uses a simple `~ condition` design without explicit age/sex/viral-load covariates
+
- The balanced subset improves comparability, but the full cohort remains heterogeneous and likely reflects cell-composition shifts as well as transcriptional regulation
- Cross-sectional design; no temporal dynamics
-
- Future extensions: full cohort analysis, batch correction assessment, or integration with scRNA-seq
+
- Future extensions: covariate-aware modelling, batch correction assessment, cell-type deconvolution, or integration with scRNA-seq
The balanced subset selection uses a fixed seed (`set.seed(123)` in `scripts/01_qc.R`) so repeated runs should yield the same subset and downstream results, given the same package versions.
+
The balanced subset selection uses a fixed seed (`set.seed(123)` in `scripts/01_qc.R`) so repeated runs should yield the same subset and downstream results, given the same package versions. Figure label placement for `ggrepel`-based figures is also seeded, and `results/session_info.txt` now records the active git commit, branch, and analysis configuration.
## Expected Outputs
After a successful run, you should see (among others):
- `results/tables/deseq2_results.csv`
+
- `results/tables/deseq2_results_shrunken.csv`
+
- `results/tables/full_cohort_deseq2_results.csv`
+
- `results/tables/analysis_summary.csv`
- `results/figures/volcano_plot.png`
+
- `results/figures/sensitivity_lfc_scatter.png`
- `results/figures/pca_plot.png`
-
- `results/session_info.txt` (records R + package versions for the run)
+
- `results/session_info.txt` (records R, package versions, git commit, and config for the run)
## Verification Commands
```sh
@@ -68,3 +73,5 @@ Rscript dev/lint.R
```
GitHub Actions also performs a clean rebuild of the tracked analysis outputs and checks that regenerated key outputs match the committed versions.
+
+
For a workflow-level comparison against DESeq2, nf-core/rnaseq, `targets`, and `workflowr`, see `WORKFLOW_BENCHMARK.md`.
This repository starts from a published GEO count matrix, so the benchmark below focuses on **differential-expression and reproducibility workflow quality**, not on FASTQ-level alignment or quantification.
+
+
## Reference workflows
+
+
- [DESeq2 vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html): recommends effect-size shrinkage for ranking and visualization.
+
- [nf-core/rnaseq](https://nf-co.re/rnaseq/latest/): exemplifies top-tier end-to-end RNA-seq workflow engineering, especially standardized QC and reproducible execution from raw reads.
+
- [`targets` user manual](https://books.ropensci.org/targets/): exemplifies dependency-aware skipping and pipeline orchestration.
+
- [workflowr](https://jdblischak.github.io/workflowr/): exemplifies research provenance via git-aware reporting and session/environment capture.
+
+
## Current alignment
+
+
- **DE effect estimation**: raw DESeq2 inference plus `apeglm`-shrunken log2 fold changes for ranking, MA plotting, and volcano visualization.
+
- **Robustness**: balanced-subset analysis is benchmarked against the full QC-passed cohort, with overlap/concordance written to `results/tables/analysis_summary.csv`.
+
- **Reproducibility**: `renv`-pinned environment, deterministic seeds, GitHub Actions rebuilds, and explicit git/session provenance in `results/session_info.txt`.
+
- **Artifact validation**: tracked tables are rebuilt in CI and compared against committed results; key figures are also diff-checked.
+
+
## Still narrower than top-tier end-to-end workflows
+
+
- **Upstream RNA-seq processing**: unlike `nf-core/rnaseq`, this repo does not perform raw-read QC, alignment, quantification, or MultiQC because the starting point is the GEO count matrix.
+
- **Pipeline engine**: the workflow is still a sequential R-script orchestrator rather than a declarative DAG like `targets`.
+
- **Model complexity**: the primary DE model remains `~ condition`; covariates such as age, sex, viral load, or inferred cell composition are not yet modeled explicitly.
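To make the model-complexity gap concrete, the following base R sketch contrasts the design matrix implied by `~ condition` with one that adds covariates. The metadata values and the `age`/`sex` column names are hypothetical, and this is not code from the repository. The variable of interest is placed last in the formula, which follows the convention the DESeq2 vignette recommends.

```r
# Hypothetical sample metadata; values and column names are illustrative only.
meta <- data.frame(
  condition = factor(c("negative", "negative", "positive", "positive"),
                     levels = c("negative", "positive")),
  sex = factor(c("F", "M", "F", "M")),
  age = c(34, 61, 45, 52)
)

# Current primary model: condition only.
design_simple <- model.matrix(~ condition, data = meta)

# Covariate-aware alternative: adjust for age and sex, condition last.
design_adjusted <- model.matrix(~ age + sex + condition, data = meta)

colnames(design_simple)    # "(Intercept)" "conditionpositive"
colnames(design_adjusted)  # "(Intercept)" "age" "sexM" "conditionpositive"
```

Under the adjusted design the `conditionpositive` coefficient is estimated conditional on age and sex, so the condition effect is no longer confounded with those covariates to the extent they are measured.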
+
+
## Why this is still a reasonable design
+
+
- The codebase is intentionally small and reviewable for a focused secondary analysis.
+
- The new robustness layer addresses the highest-risk biological weakness without forcing a full re-architecture.
+
- The remaining gaps are now explicit and documented, which makes future extensions decision-ready rather than hidden assumptions.