- End-to-end bulk RNA-seq pipeline in Nextflow DSL2: raw FASTQ reads through quality control, alignment, gene quantification, and differential expression. Every step runs in its own Docker container.
+ End-to-end bulk RNA-seq pipeline in Nextflow DSL2: raw FASTQ reads through quality control, adapter trimming, genome alignment, gene quantification, differential expression, and aggregated QC reporting. Every step runs in its own Docker container.

- Designed for the GSE152075 SARS-CoV-2 nasopharyngeal dataset (6 samples: 3 COVID-positive, 3 negative). For the full covariate-adjusted statistical analysis on 484 samples, see [bulk-rnaseq-differential-expression](https://github.com/Ekin-Kahraman/bulk-rnaseq-differential-expression).
+ Demonstrated on the [Himes et al. (2014)](https://doi.org/10.1371/journal.pone.0099625) airway smooth muscle dataset — dexamethasone-treated vs untreated human airway cells. For covariate-adjusted analysis on a larger COVID-19 cohort, see [bulk-rnaseq-differential-expression](https://github.com/Ekin-Kahraman/bulk-rnaseq-differential-expression).
- All containers sourced from [BioContainers](https://biocontainers.pro/). No custom Dockerfiles.
+ All containers sourced from [BioContainers](https://biocontainers.pro/).
+
+ ## Dataset
+
+ **Himes et al. (2014)** — RNA-seq of human airway smooth muscle cells treated with dexamethasone (a glucocorticoid anti-inflammatory). GEO accession [GSE52778](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778). This dataset is used in the [DESeq2 vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) and the [Bioconductor RNA-seq workflow](https://www.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html).
- │ ├── create_test_data.py Generate synthetic test data
- │ ├── samplesheet.csv Test sample metadata
- │ ├── genome.fa Synthetic genome (50 genes)
- │ └── genes.gtf Synthetic annotation
- ├── LICENSE
- └── README.md
- ```
-
## Design Decisions

- - **HISAT2 over STAR** — runs on 8GB RAM. STAR requires 32GB for the human genome index. Anyone can clone and run this pipeline.
- - **BioContainers, not custom Dockerfiles** — industry standard, maintained by the community, reproducible without building.
- - **Separate samtools process** — HISAT2 and samtools in their own containers. Clean separation of concerns.
- - **Test profile** — synthetic 50-gene genome with reads sampled from the reference. Runs in ~2 minutes. Verifies the pipeline without downloading 30GB of real data.
- - **DESeq2 dispersion fallback** — handles small test datasets where standard dispersion fitting fails. Uses gene-wise estimates when the mean-dispersion trend cannot be fitted.
- - **Configurable strandedness** — `--strandedness` parameter for featureCounts. Default reverse-stranded (standard for Illumina dUTP protocols), unstranded for test data.
+ - **HISAT2 over STAR** — runs on 8GB RAM. STAR requires 32GB for the human genome. Accessible on any machine.
+ - **BioContainers** — published, maintained Docker containers. No custom builds.
+ - **Configurable reference level** — `--ref_condition` sets the DESeq2 baseline. Works with any experimental design.
+ - **Adaptive gene filter** — automatically adjusts minimum count threshold based on library size (stringent for real data, permissive for test data).
+ - **Test profile** — synthetic 50-gene genome with genome-sampled reads. Verifies the full pipeline in ~2 minutes.
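
The "Adaptive gene filter" item added in this diff describes the behaviour only at a high level, and the pipeline's own implementation is not shown here. As a rough illustration, the Python sketch below shows one way a count threshold could scale with library size. It assumes a CPM-style rule in which a target count at a reference sequencing depth is rescaled by the median library size. The function names, the default `target_count` of 10, and the `reference_depth` of 10 million reads are all hypothetical choices for this example, not values taken from the repository.

```python
from statistics import median


def adaptive_min_count(library_sizes, target_count=10,
                       reference_depth=10_000_000, floor=1):
    """Scale a minimum-count threshold by the median library size.

    All defaults are illustrative assumptions, not the pipeline's settings.
    """
    scaled = target_count * median(library_sizes) / reference_depth
    # Never drop below `floor`, so tiny test libraries still filter all-zero genes.
    return max(floor, round(scaled))


def filter_genes(counts, min_samples):
    """Keep genes that reach the adaptive threshold in at least `min_samples` samples.

    `counts` maps gene ID -> list of raw counts, one entry per sample.
    Returns the filtered mapping and the threshold that was applied.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size of a sample = total counts across all genes in that sample.
    library_sizes = [sum(row[i] for row in counts.values())
                     for i in range(n_samples)]
    threshold = adaptive_min_count(library_sizes)
    kept = {gene: row for gene, row in counts.items()
            if sum(c >= threshold for c in row) >= min_samples}
    return kept, threshold


if __name__ == "__main__":
    # Tiny synthetic matrix, comparable in spirit to the 50-gene test profile.
    toy = {
        "geneA": [5, 3, 4, 6],
        "geneB": [0, 0, 1, 0],
        "geneC": [100, 120, 90, 110],
    }
    kept, threshold = filter_genes(toy, min_samples=2)
    print(threshold, sorted(kept))
```

Under these assumptions, deep real-data libraries yield a stricter threshold, while very small synthetic libraries fall back to the floor of one read, which matches the "stringent for real data, permissive for test data" behaviour the README describes.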