- Processed GEO [GSE152075](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152075) (n=484 nasopharyngeal swabs) → balanced subset (n=60) for robust DE analysis
+
- Processed GEO [GSE152075](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152075) (n = 484 nasopharyngeal swabs) to a balanced subset (n = 60) for the primary differential expression analysis
- Identified **1,902 thresholded DE genes** in the balanced subset (FDR < 0.05, |log₂FC| > 1), dominated by canonical interferon-stimulated genes
- Full-cohort sensitivity analysis identified **4,371 thresholded DE genes**, with **1,314** shared with the balanced analysis and **99.7%** effect-direction concordance
-
- Enriched pathways: GO "response to virus", KEGG "Coronavirus disease – COVID-19" (FDR = 4.5×10<sup>-39</sup>)
+
- Enriched pathways: GO "response to virus", KEGG "Coronavirus disease - COVID-19" (FDR = 4.5e-39)
- Raw and shrunken DE outputs, analysis summary metrics, and git/session provenance are generated automatically
The full QC-passed cohort analysis (n=484) identified **4,371 thresholded DE genes**. Of these, **1,314** overlap with the balanced-subset DE set, with **99.7%** shared effect-direction concordance and a Spearman correlation of **0.816** between shrunken effect sizes across all shared genes. This indicates that the balanced subset sharpens contrast but does not invert the core biological signal.
+
The full QC-passed cohort analysis (n = 484) identified **4,371 thresholded DE genes**. Of these, **1,314** overlap with the balanced-subset DE set, with **99.7%** shared effect-direction concordance and a Spearman correlation of **0.816** between shrunken effect sizes across shared genes. The balanced subset therefore increases contrast, but the main direction of effect is preserved in the larger cohort.
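The comparison described above (overlap of thresholded DE sets, effect-direction concordance, and rank correlation of shrunken effect sizes) can be sketched as follows. This is an illustrative Python sketch, not the repository's R code: the function names, the dictionary inputs, and the toy values are hypothetical, and it assumes that concordance is computed over genes thresholded in both analyses while the Spearman correlation is computed over every gene present in both result tables.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def compare_de(lfc_balanced, lfc_full, sig_balanced, sig_full):
    """Compare two DE analyses.

    lfc_*: dict mapping gene -> shrunken log2 fold change (hypothetical input shape).
    sig_*: set of genes passing the DE thresholds in each analysis.
    """
    overlap = sig_balanced & sig_full
    same_direction = sum(
        1 for g in overlap if (lfc_balanced[g] > 0) == (lfc_full[g] > 0)
    )
    shared = sorted(set(lfc_balanced) & set(lfc_full))
    # Spearman correlation is the Pearson correlation of the ranks.
    rho = pearson(
        average_ranks([lfc_balanced[g] for g in shared]),
        average_ranks([lfc_full[g] for g in shared]),
    )
    return {
        "n_overlap": len(overlap),
        "direction_concordance": same_direction / len(overlap) if overlap else float("nan"),
        "spearman": rho,
    }


# Toy values for illustration only; these are not results from GSE152075.
lfc_balanced = {"IFIT1": 3.0, "OAS3": 2.5, "CXCL10": 4.0, "GENE_A": -1.5, "GENE_B": 0.2}
lfc_full = {"IFIT1": 2.0, "OAS3": 1.8, "CXCL10": 2.2, "GENE_A": -1.2, "GENE_B": -0.1}
sig_balanced = {"IFIT1", "OAS3", "CXCL10", "GENE_A"}
sig_full = {"IFIT1", "CXCL10", "GENE_A", "GENE_B"}

result = compare_de(lfc_balanced, lfc_full, sig_balanced, sig_full)
print(result)
```

Keeping the three metrics in one function makes it explicit which gene set each metric is computed over, which is the main source of ambiguity when comparing a subset analysis against a full cohort.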
Schematic of RIG-I → IFN → ISG antiviral cascade. Viral RNA detection by DDX58 (RIG-I) triggers interferon production and downstream activation of antiviral effectors.
+
Schematic of the RIG-I -> IFN -> ISG antiviral cascade. Viral RNA detection by DDX58 (RIG-I) triggers interferon production and downstream activation of antiviral effectors.
## Biological Interpretation
-
The dominant signal is an **upper-airway interferon-driven antiviral host response**. Canonical ISGs such as IFIT1/2/3, OAS3, DDX58, CXCL10, GBP1, and SIGLEC1 support activation of RNA-sensing and interferon-response programs that are expected during acute viral infection. The pathway results reinforce this reading, with strong enrichment for antiviral and coronavirus-associated gene sets.
+
The transcriptional profile is consistent with an upper-airway interferon-driven antiviral host response. Canonical ISGs such as IFIT1/2/3, OAS3, DDX58, CXCL10, GBP1, and SIGLEC1 support activation of RNA-sensing and interferon-response programs expected during acute viral infection. The pathway results are consistent with the same interpretation, with strong enrichment for antiviral and coronavirus-associated gene sets.
-
At the same time, the interpretation should stay conservative. This signature is **consistent with** acute SARS-CoV-2 infection in nasopharyngeal samples, but it is not uniquely SARS-CoV-2-specific and should not be read as proof of cell-intrinsic mechanism or direct pathway activation in every cell type. The balanced subset was chosen to reduce class imbalance and viral-load heterogeneity; the new full-cohort sensitivity analysis shows that the main direction of effect is highly stable, which makes the interpretation stronger, but the biological claims should still be framed as a robust host-response signature rather than a definitive mechanistic model.
+
The interpretation should remain conservative. This signature is consistent with acute SARS-CoV-2 infection in nasopharyngeal samples, but it is not uniquely SARS-CoV-2-specific and should not be interpreted as direct proof of cell-intrinsic mechanism or pathway activation in every cell type. The balanced subset was chosen to reduce class imbalance and viral-load heterogeneity. The full-cohort sensitivity analysis shows that the direction of effect is highly stable, which strengthens the inference, but the biological claims should remain framed as a robust host-response signature rather than a definitive mechanistic model.
## Quick Start
```sh
@@ -134,7 +134,7 @@ Analysis runtime: ~1.7 min after data download (~2GB).
- To continue without KEGG results when the KEGG service is unavailable: `ALLOW_KEGG_FAILURE=true Rscript scripts/06_enrichment.R`
This repository starts from a published GEO count matrix, so the benchmark below focuses on **differential-expression and reproducibility workflow quality**, not on FASTQ-level alignment or quantification.
+
This repository starts from a published GEO count matrix, so the benchmark below focuses on differential-expression and reproducibility workflow quality, not on FASTQ-level alignment or quantification.
## Reference workflows
- [DESeq2 vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html): recommends effect-size shrinkage for ranking and visualization.
-
- [nf-core/rnaseq](https://nf-co.re/rnaseq/latest/): exemplifies top-tier end-to-end RNA-seq workflow engineering, especially standardized QC and reproducible execution from raw reads.
+
- [nf-core/rnaseq](https://nf-co.re/rnaseq/latest/): exemplifies a mature end-to-end RNA-seq workflow, especially standardized QC and reproducible execution from raw reads.
- [`targets` user manual](https://books.ropensci.org/targets/): exemplifies dependency-aware skipping and pipeline orchestration.
- [workflowr](https://jdblischak.github.io/workflowr/): exemplifies research provenance via git-aware reporting and session/environment capture.
## Current alignment
-
- **DE effect estimation**: raw DESeq2 inference plus `apeglm`-shrunken log2 fold changes for ranking, MA plotting, and volcano visualization.
-
- **Robustness**: balanced-subset analysis is benchmarked against the full QC-passed cohort, with overlap/concordance written to `results/tables/analysis_summary.csv`.
-
- **Reproducibility**: `renv`-pinned environment, deterministic seeds, GitHub Actions rebuilds, and explicit git/session provenance in `results/session_info.txt`.
-
- **Artifact validation**: tracked tables are rebuilt in CI and compared against committed results; key figures are also diff-checked.
+
- DE effect estimation: raw DESeq2 inference plus `apeglm`-shrunken log2 fold changes for ranking, MA plotting, and volcano visualization.
+
- Robustness: balanced-subset analysis is benchmarked against the full QC-passed cohort, with overlap and concordance written to `results/tables/analysis_summary.csv`.
+
- Reproducibility: `renv`-pinned environment, deterministic seeds, GitHub Actions rebuilds, and explicit git/session provenance in `results/session_info.txt`.
+
- Artifact validation: tracked tables are rebuilt in CI and compared against committed results; key figures are also diff-checked.
-
## Still narrower than top-tier end-to-end workflows
+
## Remaining gaps relative to broader workflows
-
- **Upstream RNA-seq processing**: unlike `nf-core/rnaseq`, this repo does not perform raw-read QC, alignment, quantification, or MultiQC because the starting point is the GEO count matrix.
-
- **Pipeline engine**: the workflow is still a sequential R-script orchestrator rather than a declarative DAG like `targets`.
-
- **Model complexity**: the primary DE model remains `~ condition`; covariates such as age, sex, viral load, or inferred cell composition are not yet modeled explicitly.
+
- Upstream RNA-seq processing: unlike `nf-core/rnaseq`, this repo does not perform raw-read QC, alignment, quantification, or MultiQC because the starting point is the GEO count matrix.
+
- Pipeline engine: the workflow remains a sequential R-script orchestrator rather than a declarative DAG like `targets`.
+
- Model complexity: the primary DE model remains `~ condition`; covariates such as age, sex, viral load, or inferred cell composition are not yet modeled explicitly.
-
## Why this is still a reasonable design
+
## Scope and rationale
- The codebase is intentionally small and reviewable for a focused secondary analysis.
-
- The new robustness layer addresses the highest-risk biological weakness without forcing a full re-architecture.
-
- The remaining gaps are now explicit and documented, which makes future extensions decision-ready rather than hidden assumptions.
+
- The robustness layer addresses the main biological risk introduced by the balanced-subset design without requiring a full re-architecture.
+
- The remaining gaps are explicit, documented, and suitable targets for future extension.