Commit 7515205

RNA-seq Nextflow pipeline: FastQC, fastp, HISAT2, featureCounts, DESeq2, MultiQC
End-to-end bulk RNA-seq pipeline in Nextflow DSL2. 7 processes, each in its own BioContainers Docker container. Designed for GSE152075 (SARS-CoV-2 nasopharyngeal RNA-seq). Includes synthetic test data (50-gene genome, 4 samples) that verifies the full pipeline in ~2 minutes without downloading real data.
0 parents  commit 7515205

26 files changed

Lines changed: 1137 additions & 0 deletions

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
work/
results/
.nextflow/
.nextflow.log*
*.html

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Ekin Kahraman

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
# RNA-seq Nextflow Pipeline

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Nextflow](https://img.shields.io/badge/Nextflow-%E2%89%A524.0-brightgreen)](https://www.nextflow.io/)

End-to-end bulk RNA-seq pipeline in Nextflow DSL2 that takes raw FASTQ reads through quality control, alignment, gene quantification, and differential expression. Every step runs in its own Docker container.

Designed for a 6-sample subset (3 COVID-positive, 3 negative) of the GSE152075 SARS-CoV-2 nasopharyngeal dataset. For the full covariate-adjusted statistical analysis on 484 samples, see [bulk-rnaseq-differential-expression](https://github.com/Ekin-Kahraman/bulk-rnaseq-differential-expression).

## Workflow

```
FASTQ (paired-end)


FastQC ───────── Raw read quality assessment


fastp ────────── Adapter trimming, quality filtering


HISAT2 ───────── Align to GRCh38 reference genome


samtools ─────── Sort and index BAM


featureCounts ── Gene-level quantification


DESeq2 ───────── Differential expression (positive vs negative)


MultiQC ──────── Aggregate QC report
```

## Processes

| Process | Tool | Container |
|---------|------|-----------|
| FASTQC_RAW | FastQC 0.12.1 | `quay.io/biocontainers/fastqc` |
| FASTP | fastp 0.23.4 | `quay.io/biocontainers/fastp` |
| HISAT2_ALIGN | HISAT2 2.2.1 | `quay.io/biocontainers/hisat2` |
| SAMTOOLS_SORT | samtools 1.21 | `quay.io/biocontainers/samtools` |
| FEATURECOUNTS | Subread 2.0.6 | `quay.io/biocontainers/subread` |
| DESEQ2 | DESeq2 1.42 + ggplot2 | `quay.io/biocontainers/bioconductor-deseq2` |
| MULTIQC | MultiQC 1.27 | `quay.io/biocontainers/multiqc` |

All containers sourced from [BioContainers](https://biocontainers.pro/). No custom Dockerfiles.

## Quick Start

**Prerequisites:** [Nextflow](https://www.nextflow.io/) (>=24.0), [Docker](https://www.docker.com/), Java (>=11)

### Test data (synthetic, runs in ~2 minutes)

```bash
git clone https://github.com/Ekin-Kahraman/rnaseq-nextflow-pipeline.git
cd rnaseq-nextflow-pipeline
python test/create_test_data.py
nextflow run main.nf -profile test,docker \
    --genome_index "$(pwd)/test/genome" \
    --gtf "$(pwd)/test/genes.gtf"
```

### Real data (GSE152075)

```bash
# Download HISAT2 GRCh38 index (~4GB)
# Download FASTQ files from SRA (see assets/samplesheet.csv for accessions)

nextflow run main.nf -profile docker \
    --genome_index /path/to/grch38/genome \
    --gtf /path/to/gencode.v38.annotation.gtf \
    --samplesheet assets/samplesheet.csv
```

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--samplesheet` | `assets/samplesheet.csv` | CSV with columns: sample_id, fastq_1, fastq_2, condition |
| `--genome_index` | required | HISAT2 index prefix |
| `--gtf` | required | Gene annotation GTF |
| `--outdir` | `results` | Output directory |
| `--strandedness` | `2` (reverse) | featureCounts strandedness (0/1/2) |

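The `--strandedness` values correspond to the `-s` option of featureCounts (0 unstranded, 1 forward, 2 reverse). As a rough illustration of how such a value could end up on a command line, here is a minimal Python sketch. The helper name `featurecounts_args`, the file names, and the exact set of flags are assumptions for illustration; the command actually used in `main.nf` is not shown in this README and may differ.

```python
# Sketch only: not the command from main.nf, which this README does not show.
STRANDEDNESS = {0: "unstranded", 1: "forward", 2: "reverse"}


def featurecounts_args(bam, gtf, out, strandedness=2, paired=True):
    """Build a featureCounts argument list (hypothetical helper)."""
    if strandedness not in STRANDEDNESS:
        raise ValueError(f"strandedness must be 0, 1 or 2, got {strandedness!r}")
    args = ["featureCounts", "-s", str(strandedness), "-a", gtf, "-o", out]
    if paired:
        # -p marks the input as paired-end; --countReadPairs counts fragments
        # instead of individual reads (needed with Subread 2.0.2 and later).
        args += ["-p", "--countReadPairs"]
    args.append(bam)
    return args


if __name__ == "__main__":
    # File names here are placeholders.
    print(" ".join(featurecounts_args("sample.bam", "genes.gtf", "counts.txt")))
```

Rejecting values outside 0/1/2 early keeps a mistyped parameter from silently producing a count matrix with the wrong strand setting.
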
## Output

```
results/
├── fastqc_raw/    Raw read QC reports (HTML + ZIP)
├── fastp/         Trimming reports (JSON)
├── hisat2/        Alignment logs
├── bam/           Sorted BAM files
├── counts/        Gene count matrix (featureCounts)
├── deseq2/        DE results CSV, volcano plot, PCA plot
└── multiqc/       Aggregated QC report
```

## Project Structure

```
rnaseq-nextflow-pipeline/
├── main.nf                    Pipeline (7 processes, Nextflow DSL2)
├── nextflow.config            Parameters, containers, profiles
├── assets/
│   └── samplesheet.csv        Sample metadata (SRA accessions)
├── test/
│   ├── create_test_data.py    Generate synthetic test data
│   ├── samplesheet.csv        Test sample metadata
│   ├── genome.fa              Synthetic genome (50 genes)
│   └── genes.gtf              Synthetic annotation
├── LICENSE
└── README.md
```

## Design Decisions

- **HISAT2 over STAR**: HISAT2 runs on 8 GB of RAM, whereas STAR requires 32 GB for the human genome index, so anyone can clone and run this pipeline.
- **BioContainers, not custom Dockerfiles**: widely used, community-maintained images that are reproducible without a local build.
- **Separate samtools process**: HISAT2 and samtools each run in their own container, which keeps alignment and BAM processing cleanly separated.
- **Test profile**: a synthetic 50-gene genome with reads sampled from the reference. It runs in ~2 minutes and verifies the pipeline without downloading 30 GB of real data.
- **DESeq2 dispersion fallback**: handles small test datasets where standard dispersion fitting fails by using gene-wise estimates when the mean-dispersion trend cannot be fitted.
- **Configurable strandedness**: the `--strandedness` parameter sets featureCounts strandedness. The default is reverse-stranded (standard for Illumina dUTP protocols); the test data is unstranded.

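The test-profile idea of sampling reads from a reference can be sketched in a few lines of Python. This is an illustrative sketch, not the contents of `test/create_test_data.py`: the function names, the 50 bp read length, the 150 bp fragment size, and the constant quality string are all assumptions.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_read_pair(reference, read_len=50, insert_size=150, rng=random):
    """Pick one random fragment and return its two paired-end reads."""
    if len(reference) < insert_size or insert_size < read_len:
        raise ValueError("need read_len <= insert_size <= len(reference)")
    start = rng.randrange(len(reference) - insert_size + 1)
    fragment = reference[start:start + insert_size]
    read1 = fragment[:read_len]                        # forward strand, 5' end
    read2 = reverse_complement(fragment[-read_len:])   # opposite strand, other end
    return read1, read2


def fastq_record(name, seq, qual_char="I"):
    """Format one FASTQ record with a constant (assumed) quality string."""
    return f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n"


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the test data is reproducible
    reference = "".join(rng.choice("ACGT") for _ in range(1000))
    r1, r2 = sample_read_pair(reference, rng=rng)
    print(fastq_record("read_0/1", r1), end="")
    print(fastq_record("read_0/2", r2), end="")
```

Because both mates come from the same fragment of the reference, an aligner should place them as a proper pair, which is what makes this kind of data useful as a smoke test.
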
## Licence

MIT

assets/samplesheet.csv

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
sample_id,fastq_1,fastq_2,condition
SRR11886868,data/SRR11886868_1.fastq.gz,data/SRR11886868_2.fastq.gz,positive
SRR11886869,data/SRR11886869_1.fastq.gz,data/SRR11886869_2.fastq.gz,positive
SRR11886870,data/SRR11886870_1.fastq.gz,data/SRR11886870_2.fastq.gz,positive
SRR11886871,data/SRR11886871_1.fastq.gz,data/SRR11886871_2.fastq.gz,negative
SRR11886872,data/SRR11886872_1.fastq.gz,data/SRR11886872_2.fastq.gz,negative
SRR11886873,data/SRR11886873_1.fastq.gz,data/SRR11886873_2.fastq.gz,negative
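
A samplesheet in this layout is easy to sanity-check before launching a run. The following is a minimal Python sketch, assuming only the four column names shown in the file; the helper names `load_samplesheet` and `replicates_per_condition` are hypothetical and are not part of the pipeline.

```python
import csv
import io
from collections import Counter

REQUIRED_COLUMNS = ["sample_id", "fastq_1", "fastq_2", "condition"]


def load_samplesheet(handle):
    """Read a samplesheet and check required columns and unique sample IDs."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {missing}")
    rows = list(reader)
    counts = Counter(row["sample_id"] for row in rows)
    duplicates = [sample for sample, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(f"duplicate sample_id values: {duplicates}")
    return rows


def replicates_per_condition(rows):
    """Count samples per condition, e.g. to confirm each group has replicates."""
    return Counter(row["condition"] for row in rows)


if __name__ == "__main__":
    # Two rows copied from the samplesheet above, inlined to keep the sketch self-contained.
    sheet = (
        "sample_id,fastq_1,fastq_2,condition\n"
        "SRR11886868,data/SRR11886868_1.fastq.gz,data/SRR11886868_2.fastq.gz,positive\n"
        "SRR11886871,data/SRR11886871_1.fastq.gz,data/SRR11886871_2.fastq.gz,negative\n"
    )
    rows = load_samplesheet(io.StringIO(sheet))
    print(dict(replicates_per_condition(rows)))
```

Pointing `load_samplesheet` at an open file handle for the real sheet would work the same way as the inlined string.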
