Commit ac8edfe

Refine repository prose for peer review
1 parent 4ad44bf commit ac8edfe

3 files changed

Lines changed: 23 additions & 23 deletions

README.md

Lines changed: 9 additions & 9 deletions
@@ -9,10 +9,10 @@ Reproducible bulk RNA-seq differential expression pipeline using DESeq2: QC, shr

 ## Highlights

-- Processed GEO [GSE152075](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152075) (n=484 nasopharyngeal swabs) balanced subset (n=60) for robust DE analysis
+- Processed GEO [GSE152075](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152075) (n = 484 nasopharyngeal swabs) to a balanced subset (n = 60) for the primary differential expression analysis
 - Identified **1,902 thresholded DE genes** in the balanced subset (FDR < 0.05, |log₂FC| > 1), dominated by canonical interferon-stimulated genes
 - Full-cohort sensitivity analysis identified **4,371 thresholded DE genes**, with **1,314** shared with the balanced analysis and **99.7%** effect-direction concordance
-- Enriched pathways: GO "response to virus", KEGG "Coronavirus disease COVID-19" (FDR = 4.5×10<sup>-39</sup>)
+- Enriched pathways: GO "response to virus", KEGG "Coronavirus disease - COVID-19" (FDR = 4.5e-39)
 - Raw and shrunken DE outputs, analysis summary metrics, and git/session provenance are generated automatically

 ## Methods Overview
@@ -104,19 +104,19 @@ Top KEGG pathway: **Coronavirus disease - COVID-19** (FDR = 4.5×10<sup>-39</sup

 ![Sensitivity Scatter](results/figures/sensitivity_lfc_scatter.png)

-The full QC-passed cohort analysis (n=484) identified **4,371 thresholded DE genes**. Of these, **1,314** overlap with the balanced-subset DE set, with **99.7%** shared effect-direction concordance and a Spearman correlation of **0.816** between shrunken effect sizes across all shared genes. This indicates that the balanced subset sharpens contrast but does not invert the core biological signal.
+The full QC-passed cohort analysis (n = 484) identified **4,371 thresholded DE genes**. Of these, **1,314** overlap with the balanced-subset DE set, with **99.7%** shared effect-direction concordance and a Spearman correlation of **0.816** between shrunken effect sizes across shared genes. The balanced subset therefore increases contrast, but the main direction of effect is preserved in the larger cohort.
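
The overlap, direction concordance, and rank correlation quoted in the sensitivity paragraph could be computed along the following lines. This is a minimal base-R sketch, not the code in `scripts/09_sensitivity_analysis.R`: the gene names, the values, and the variable names `balanced` and `full` are illustrative assumptions, and the repository may define the gene sets differently (for example, concordance over the overlapping DE genes and correlation over all genes tested in both runs).

```r
# Hypothetical shrunken log2 fold changes from two DE runs, named by gene.
# All values are made up for illustration; they are not results from GSE152075.
balanced <- c(IFIT1 = 3.2, OAS3 = 2.1, CXCL10 = 4.0, GBP1 = 1.8, GENE_X = -1.4)
full <- c(IFIT1 = 2.5, OAS3 = 1.6, CXCL10 = 3.1, GBP1 = 1.2, GENE_X = 0.3, GENE_Y = -2.0)

# Genes present in both result sets.
shared <- intersect(names(balanced), names(full))

# Fraction of shared genes whose effect has the same sign in both runs.
direction_concordance <- mean(sign(balanced[shared]) == sign(full[shared]))

# Rank correlation of effect sizes across the shared genes.
rho <- cor(balanced[shared], full[shared], method = "spearman")

cat(sprintf("shared = %d, concordance = %.3f, Spearman rho = %.3f\n",
            length(shared), direction_concordance, rho))
```

Because Spearman correlation is rank-based, it is unaffected by a monotone rescaling of effect sizes, such as uniformly larger fold changes in one of the two runs.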

 ### ISG Signalling Cascade

 ![Pathway Diagram](results/figures/pathway_diagram.png)

-Schematic of RIG-I IFN ISG antiviral cascade. Viral RNA detection by DDX58 (RIG-I) triggers interferon production and downstream activation of antiviral effectors.
+Schematic of the RIG-I -> IFN -> ISG antiviral cascade. Viral RNA detection by DDX58 (RIG-I) triggers interferon production and downstream activation of antiviral effectors.

 ## Biological Interpretation

-The dominant signal is an **upper-airway interferon-driven antiviral host response**. Canonical ISGs such as IFIT1/2/3, OAS3, DDX58, CXCL10, GBP1, and SIGLEC1 support activation of RNA-sensing and interferon-response programs that are expected during acute viral infection. The pathway results reinforce this reading, with strong enrichment for antiviral and coronavirus-associated gene sets.
+The transcriptional profile is consistent with an upper-airway interferon-driven antiviral host response. Canonical ISGs such as IFIT1/2/3, OAS3, DDX58, CXCL10, GBP1, and SIGLEC1 support activation of RNA-sensing and interferon-response programs expected during acute viral infection. The pathway results are consistent with the same interpretation, with strong enrichment for antiviral and coronavirus-associated gene sets.

-At the same time, the interpretation should stay conservative. This signature is **consistent with** acute SARS-CoV-2 infection in nasopharyngeal samples, but it is not uniquely SARS-CoV-2-specific and should not be read as proof of cell-intrinsic mechanism or direct pathway activation in every cell type. The balanced subset was chosen to reduce class imbalance and viral-load heterogeneity; the new full-cohort sensitivity analysis shows that the main direction of effect is highly stable, which makes the interpretation stronger, but the biological claims should still be framed as a robust host-response signature rather than a definitive mechanistic model.
+The interpretation should remain conservative. This signature is consistent with acute SARS-CoV-2 infection in nasopharyngeal samples, but it is not uniquely SARS-CoV-2-specific and should not be interpreted as direct proof of cell-intrinsic mechanism or pathway activation in every cell type. The balanced subset was chosen to reduce class imbalance and viral-load heterogeneity. The full-cohort sensitivity analysis shows that the direction of effect is highly stable, which strengthens the inference, but the biological claims should remain framed as a robust host-response signature rather than a definitive mechanistic model.

 ## Quick Start
 ```sh
@@ -134,7 +134,7 @@ Analysis runtime: ~1.7 min after data download (~2GB).
 - To continue without KEGG results when the KEGG service is unavailable: `ALLOW_KEGG_FAILURE=true Rscript scripts/06_enrichment.R`
 - Lint: `Rscript dev/lint.R`
 - Tests: `Rscript -e 'testthat::test_dir("tests/testthat")'`
-- Benchmark + design notes: see `WORKFLOW_BENCHMARK.md`
+- Workflow benchmark: see `WORKFLOW_BENCHMARK.md`
 - Reproducibility details (expected outputs, network requirements): see `REPRODUCIBILITY.md`

 ## Data and Code Availability
@@ -232,15 +232,15 @@ source("scripts/09_sensitivity_analysis.R")
 - Transformation: Variance-stabilising transformation (VST) for visualisation
 - Effect-size stabilisation: `apeglm` shrinkage for ranking/visualisation
 - Testing: Wald test with Benjamini-Hochberg correction
-- Thresholds: FDR < 0.05, |log₂FC| > 1 (chosen after testing >0.58 and >1.5; this gave the cleanest ISG-dominant signal)
+- Thresholds: FDR < 0.05, |log₂FC| > 1 for the reported summaries; full results tables are provided for alternative thresholding

 ### Robustness Analysis
 - Secondary DE run on the full QC-passed cohort (n = 484)
 - Summary outputs written to `results/tables/full_cohort_deseq2_results.csv` and `results/tables/analysis_summary.csv`
 - Effect-size concordance visualised in `results/figures/sensitivity_lfc_scatter.png`

 ### Enrichment Analysis
-- Gene ID conversion: Symbol Entrez (96% mapped)
+- Gene ID conversion: Symbol to Entrez (96% mapped)
 - GO: Biological Process, BH-corrected
 - KEGG: Human pathways (hsa)
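
The thresholds listed under the DE settings in this hunk (FDR < 0.05, |log₂FC| > 1) can be re-applied at other cut-offs to a results table. The sketch below uses base R only. The data frame is made up, the column names `log2FoldChange`, `pvalue`, and `padj` follow DESeq2's `results()` output, and `select_de` is a hypothetical helper rather than a function from this repository. DESeq2 computes `padj` after independent filtering by default, so calling `p.adjust` on raw p-values, as done here to keep the example self-contained, will not necessarily reproduce its values.

```r
# Made-up results table; columns mirror DESeq2's results() output.
res <- data.frame(
  gene = c("G1", "G2", "G3", "G4"),
  log2FoldChange = c(2.4, 0.6, -1.7, 1.2),
  pvalue = c(1e-8, 1e-4, 2e-3, 0.2),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment (a stand-in for DESeq2's padj column).
res$padj <- p.adjust(res$pvalue, method = "BH")

# Hypothetical helper: genes passing both the FDR and effect-size cut-offs.
select_de <- function(res, fdr = 0.05, lfc = 1) {
  keep <- !is.na(res$padj) & res$padj < fdr & abs(res$log2FoldChange) > lfc
  res$gene[keep]
}

# Compare the reported cut-off with two alternatives.
for (lfc in c(0.58, 1, 1.5)) {
  hits <- select_de(res, lfc = lfc)
  cat(sprintf("|log2FC| > %.2f: %d genes (%s)\n",
              lfc, length(hits), paste(hits, collapse = ", ")))
}
```

The `!is.na(res$padj)` guard matters with real DESeq2 output, where `padj` is `NA` for genes removed by independent filtering or flagged as outliers.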

REPRODUCIBILITY.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@

 This repository is set up so a reviewer can reproduce the analysis with a small number of commands.

-## Whats Included
+## What's Included
 - **Code**: analysis scripts in `scripts/` and an orchestrator in `run_all.R`.
 - **Version pinning**: `renv.lock` pins CRAN + Bioconductor package versions.
 - **Pre-computed outputs**: key figures and tables are committed under `results/` for convenience and quick verification.

WORKFLOW_BENCHMARK.md

Lines changed: 13 additions & 13 deletions
@@ -1,29 +1,29 @@
 # Workflow Benchmark

-This repository starts from a published GEO count matrix, so the benchmark below focuses on **differential-expression and reproducibility workflow quality**, not on FASTQ-level alignment or quantification.
+This repository starts from a published GEO count matrix, so the benchmark below focuses on differential expression and reproducibility workflow quality, not on FASTQ-level alignment or quantification.

 ## Reference workflows

 - [DESeq2 vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html): recommends effect-size shrinkage for ranking and visualization.
-- [nf-core/rnaseq](https://nf-co.re/rnaseq/latest/): exemplifies top-tier end-to-end RNA-seq workflow engineering, especially standardized QC and reproducible execution from raw reads.
+- [nf-core/rnaseq](https://nf-co.re/rnaseq/latest/): exemplifies a mature end-to-end RNA-seq workflow, especially standardized QC and reproducible execution from raw reads.
 - [`targets` user manual](https://books.ropensci.org/targets/): exemplifies dependency-aware skipping and pipeline orchestration.
 - [workflowr](https://jdblischak.github.io/workflowr/): exemplifies research provenance via git-aware reporting and session/environment capture.

 ## Current alignment

-- **DE effect estimation**: raw DESeq2 inference plus `apeglm`-shrunken log2 fold changes for ranking, MA plotting, and volcano visualization.
-- **Robustness**: balanced-subset analysis is benchmarked against the full QC-passed cohort, with overlap/concordance written to `results/tables/analysis_summary.csv`.
-- **Reproducibility**: `renv`-pinned environment, deterministic seeds, GitHub Actions rebuilds, and explicit git/session provenance in `results/session_info.txt`.
-- **Artifact validation**: tracked tables are rebuilt in CI and compared against committed results; key figures are also diff-checked.
+- DE effect estimation: raw DESeq2 inference plus `apeglm`-shrunken log2 fold changes for ranking, MA plotting, and volcano visualization.
+- Robustness: balanced-subset analysis is benchmarked against the full QC-passed cohort, with overlap and concordance written to `results/tables/analysis_summary.csv`.
+- Reproducibility: `renv`-pinned environment, deterministic seeds, GitHub Actions rebuilds, and explicit git/session provenance in `results/session_info.txt`.
+- Artifact validation: tracked tables are rebuilt in CI and compared against committed results; key figures are also diff-checked.

-## Still narrower than top-tier end-to-end workflows
+## Remaining gaps relative to broader workflows

-- **Upstream RNA-seq processing**: unlike `nf-core/rnaseq`, this repo does not perform raw-read QC, alignment, quantification, or MultiQC because the starting point is the GEO count matrix.
-- **Pipeline engine**: the workflow is still a sequential R-script orchestrator rather than a declarative DAG like `targets`.
-- **Model complexity**: the primary DE model remains `~ condition`; covariates such as age, sex, viral load, or inferred cell composition are not yet modeled explicitly.
+- Upstream RNA-seq processing: unlike `nf-core/rnaseq`, this repo does not perform raw-read QC, alignment, quantification, or MultiQC because the starting point is the GEO count matrix.
+- Pipeline engine: the workflow remains a sequential R-script orchestrator rather than a declarative DAG like `targets`.
+- Model complexity: the primary DE model remains `~ condition`; covariates such as age, sex, viral load, or inferred cell composition are not yet modeled explicitly.

-## Why this is still a reasonable design
+## Scope and rationale

 - The codebase is intentionally small and reviewable for a focused secondary analysis.
-- The new robustness layer addresses the highest-risk biological weakness without forcing a full re-architecture.
-- The remaining gaps are now explicit and documented, which makes future extensions decision-ready rather than hidden assumptions.
+- The robustness layer addresses the main biological risk introduced by the balanced-subset design without requiring a full re-architecture.
+- The remaining gaps are explicit, documented, and suitable targets for future extension.
