Releases: Ekin-Kahraman/bulk-rnaseq-differential-expression
v2.1.0 — Covariate-adjusted model + workflow diagram
Changes since v2.0.0
- Primary DE model adjusted for sex covariate (~ condition + gender). Addresses the main biological weakness: the earlier model was unadjusted even though sex metadata was available. Results regenerated: 1,773 DE genes (was 1,902), 99.8% sign concordance with full cohort.
- Workflow diagram added to README (ASCII pipeline visualization).
- NA gender filtering added before balanced sampling and full-cohort analysis.
- All figures, tables, and tests regenerated with the covariate-adjusted model. CI passes including full pipeline rebuild.
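The NA filtering and the adjusted design above can be sketched in base R. This is a minimal illustration, not the repository's code: the metadata values are invented, and only the column names condition and gender are taken from the design formula in these notes. In DESeq2 the same formula would be passed as the design argument of DESeqDataSetFromMatrix().

```r
# Toy sample metadata; values are invented for illustration only.
meta <- data.frame(
  sample    = paste0("S", 1:6),
  condition = factor(
    c("negative", "positive", "negative", "positive", "negative", "positive"),
    levels = c("negative", "positive")
  ),
  gender    = c("F", "M", NA, "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Drop samples with missing gender first, so every remaining sample
# has a value for each variable in the design formula.
meta_complete <- meta[!is.na(meta$gender), ]
meta_complete$gender <- factor(meta_complete$gender)

# Covariate-adjusted design: the condition effect is estimated
# alongside a sex term rather than on its own.
design <- model.matrix(~ condition + gender, data = meta_complete)
print(colnames(design))
```

Filtering before balanced sampling matters because a sample dropped afterwards would leave the two condition groups unequal in size.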
v2.0.0 — Extended covariate analyses
What's New
Extended Analyses
- Viral load stratification: High vs low Ct differential expression with ISG dose-response correlation (Script 10)
- Sex-stratified interaction model: condition x gender interaction effects identifying sex-biased genes (Script 11)
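As a rough illustration of what a condition x gender interaction model estimates, the base R sketch below builds the design matrix for a toy balanced layout. The factor levels are assumptions made for the example, and Script 11 may specify the model differently.

```r
# Toy metadata covering all four condition x gender combinations twice.
meta <- data.frame(
  condition = factor(rep(c("negative", "positive"), times = 4),
                     levels = c("negative", "positive")),
  gender    = factor(rep(c("F", "M"), each = 4))
)

# `condition * gender` expands to both main effects plus their interaction.
design <- model.matrix(~ condition * gender, data = meta)

# The interaction column captures how the infection effect in males
# differs from the infection effect in females (the reference level).
interaction_term <- grep(":", colnames(design), value = TRUE)
print(interaction_term)
```

A gene with a significant coefficient on that column responds to infection differently by sex, which is the sense of "sex-biased" used in these notes.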
Key Results
- 1,510 DE genes between high/low viral load groups
- 12 genes with significant condition x sex interaction (9 male-biased, 3 female-biased)
- ISG dose-dependent gradient confirmed across viral load strata
Infrastructure
- Cross-platform numeric tolerance for CI reproducibility
- Updated KEGG pathway database compatibility
- All linting issues resolved
PRs Merged
- #3: Add viral load stratification and sex-stratified interaction analyses
- #4: Rename Novel to Extended in script comments
- #5: Fix lint spacing in expression operators
- #6: Update KEGG pathway table for current database version
- #7: Widen numeric tolerance for cross-platform reproducibility
- #8: Bump tolerance to 1e-3 for cross-platform p-value drift
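The tolerance change in #7 and #8 can be pictured with a small base R check. The helper name values_match and the numbers are hypothetical, and the repository's tests may use a different comparison (for example testthat's expect_equal() with a tolerance argument); the 1e-3 default mirrors the value named in #8.

```r
# Hypothetical helper: TRUE when two numeric vectors agree within `tol`.
# all.equal() compares the mean relative difference against the tolerance.
values_match <- function(observed, expected, tol = 1e-3) {
  isTRUE(all.equal(observed, expected, tolerance = tol))
}

expected <- c(0.0412, 0.0031, 0.5000)   # invented reference p-values
observed <- expected * (1 + 2e-4)       # small platform-dependent drift

loose  <- values_match(observed, expected)               # within 1e-3
strict <- values_match(observed, expected, tol = 1e-6)   # drift exceeds 1e-6
print(c(loose = loose, strict = strict))
```

A looser tolerance accepts floating-point drift between operating systems and BLAS builds while still catching genuine changes in results.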
v1.1.2
Archival release of a reproducible bulk RNA-seq differential expression workflow for nasopharyngeal SARS-CoV-2 host-response analysis using GEO GSE152075.
Scope
This repository starts from the published count matrix and sample metadata provided through GEO. It does not perform raw-read processing, alignment, or quantification. The focus is downstream differential expression, enrichment analysis, reproducibility, and result validation.
Contents
- Quality control and balanced subset construction
- PCA and exploratory visualization
- DESeq2 differential expression analysis
- apeglm log2 fold-change shrinkage for ranking and visualization
- GO Biological Process and KEGG pathway enrichment
- Full-cohort sensitivity analysis
- Committed figures and result tables
- Pinned software environment via renv
- GitHub Actions checks for environment consistency, linting, tests, and rebuild validation
Main outputs
- Balanced primary analysis: 1,902 thresholded differentially expressed genes
- Full-cohort sensitivity analysis: 4,371 thresholded differentially expressed genes
- Shared thresholded genes: 1,314
- Shared effect-direction concordance: 99.7%
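The shared-gene count and effect-direction concordance above can be computed along the following lines. This is a sketch on invented data: the gene IDs and fold changes are placeholders, and the column name log2FoldChange is assumed from the usual DESeq2 results layout rather than taken from the repository's tables.

```r
# Toy thresholded DE tables; gene IDs and fold changes are invented.
balanced <- data.frame(
  gene = c("G1", "G2", "G3", "G4"),
  log2FoldChange = c(2.1, -1.4, 1.8, -2.0)
)
full <- data.frame(
  gene = c("G2", "G3", "G4", "G5"),
  log2FoldChange = c(-1.1, 1.5, 1.2, 3.0)
)

# Genes passing thresholds in both analyses.
shared <- merge(balanced, full, by = "gene",
                suffixes = c("_balanced", "_full"))

# Concordance: share of those genes whose fold change has the same sign.
same_sign <- sign(shared$log2FoldChange_balanced) ==
  sign(shared$log2FoldChange_full)
n_shared <- nrow(shared)
concordance_pct <- 100 * mean(same_sign)
cat(n_shared, "shared genes;", round(concordance_pct, 1), "% concordant\n")
```

Concordance is measured only over the shared set, so it indicates agreement in direction, not agreement in which genes pass the thresholds.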
Reproducibility
From the repository root:
Rscript 000_install_dependencies.R
Rscript run_all.R
Rscript -e 'renv::status()'
Rscript dev/lint.R
Rscript -e 'testthat::test_dir("tests/testthat")'
Positioning
Relative to broader workflow standards, the repository includes pinned dependencies, deterministic seeds, committed derived outputs, CI rebuild checks, and explicit session provenance. It remains intentionally lightweight and reviewable, while leaving upstream FASTQ-level processing and covariate-rich modeling out of scope.
Data source
Lieberman NAP et al. (2020). In vivo antiviral host transcriptional response to SARS-CoV-2 by viral load, sex, and age. PLoS Biology 18(9): e3000849.
GEO accession: GSE152075
DOI: https://doi.org/10.1371/journal.pbio.3000849
Robustness and reproducibility update
Summary
This release improves reproducibility for the SARS-CoV-2 host-response bulk RNA-seq pipeline (GSE152075), with no intended changes to the core analysis design.
What changed
- Hardened data ingestion in scripts/00_get_data.R:
  - stricter sample ID validation
  - explicit checks for count/metadata overlap
  - controlled handling of unknown condition labels
- Improved QC safety in scripts/01_qc.R:
  - fail-fast checks for sample/metadata alignment
  - explicit handling when one condition has too few samples
- Improved enrichment robustness in scripts/06_enrichment.R:
  - explicit stop when no DE genes pass thresholds
  - graceful KEGG failure handling for transient network/service issues
  - deterministic writing of expected enrichment output tables
- Expanded smoke tests in tests/testthat/test-smoke.R:
  - p-value sanity checks
  - verification of key derived enrichment tables
- Documentation update in README.md:
  - added “Data and Code Availability”
  - added “Peer Review Checklist”
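The graceful KEGG failure handling listed above could look roughly like the base R pattern below. Everything here is hypothetical: safe_enrich, flaky_kegg, and ok_kegg are stand-ins invented for the example, and the actual logic in scripts/06_enrichment.R may differ.

```r
# Hypothetical wrapper: run an enrichment call and return NULL on error,
# so a transient service failure does not abort the whole pipeline.
safe_enrich <- function(enrich_fun, ...) {
  tryCatch(
    enrich_fun(...),
    error = function(e) {
      warning("KEGG enrichment skipped: ", conditionMessage(e), call. = FALSE)
      NULL
    }
  )
}

# Stand-ins for a network-backed call that fails and one that succeeds.
flaky_kegg <- function(genes) stop("service unavailable")
ok_kegg <- function(genes) {
  data.frame(pathway = "example_pathway", n = length(genes))
}

res_fail <- suppressWarnings(safe_enrich(flaky_kegg, c("G1", "G2")))
res_ok   <- safe_enrich(ok_kegg, c("G1", "G2"))
print(is.null(res_fail))
print(res_ok)
```

Returning NULL lets the caller decide what to write in place of the missing result, such as an empty table with the expected columns, so downstream files always exist.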
Validation
- renv::status() clean
- Rscript dev/lint.R passes
- testthat::test_dir("tests/testthat") passes
- GitHub Actions CI passes on main
Data and citation
- Source dataset: GEO GSE152075
- Repository DOI (Zenodo concept DOI): 10.5281/zenodo.18432519
- CITATION.cff included for citation metadata
v1.1.0 - Reproducibility and bug fixes
What's Changed
- Added run_all.R for one-command pipeline execution
- Added 000_install_dependencies.R for easy setup
- Added input validation to all scripts
- Added results tables (CSV files)
- Fixed library size description (20M reads)
- Fixed NaN handling in heatmap scaling
- Fixed scree plot assignment before ggsave
- Removed temp files from repo
Quick Start
source("000_install_dependencies.R")
source("run_all.R")
v1.0.1 - MA plot fix and documentation updates
Fixed MA plot rendering issue.
v1.0.0 - Bulk RNA-seq differential expression pipeline
Reproducible bulk RNA-seq analysis pipeline for SARS-CoV-2 host response (GEO GSE152075).
Key results:
- 1,902 differentially expressed genes (FDR < 0.05, |log₂FC| > 1)
- Top pathway: Coronavirus disease - COVID-19 (FDR = 1.5×10⁻⁴⁰)
- 529 enriched GO terms, 28 KEGG pathways
Pipeline:
- Quality control and filtering
- PCA and exploratory analysis
- DESeq2 differential expression
- GO/KEGG pathway enrichment
- Model diagnostics and visualization
Reference: Lieberman et al. (2020) PLoS Biology
License: MIT