Bulk RNA-seq pipeline in Nextflow DSL2. Takes paired-end FASTQ reads from raw sequencing output through to differential expression results - QC, trimming, alignment, counting, and DESeq2 - with each step containerised via Docker or Singularity.
Designed around the Himes et al. (2014) airway smooth muscle dataset (dexamethasone vs untreated, GEO GSE52778). This dataset is used in the DESeq2 vignette and the Bioconductor RNA-seq workflow. For the full covariate-adjusted analysis on a COVID-19 cohort, see bulk-rnaseq-differential-expression.
- Full synthetic smoke test in GitHub Actions, including containerised FastQC, fastp, HISAT2, samtools, featureCounts, DESeq2 and MultiQC.
- Docker, Singularity and AWS Batch profiles in nextflow.config.
- Containerised FastAPI report portal under cloud/report-portal/ for S3-hosted reports and Postgres run metadata.
- Render Blueprint at render.yaml for a deployable FastAPI plus Postgres report portal.
- Live Render smoke deployment: https://rnaseq-report-portal.onrender.com/health.
- nextflow_schema.json for parameter discovery in Seqera Platform and other launch tooling.
- Nextflow execution report, timeline, trace and DAG written to results/pipeline_info/ on every run.
- scripts/validate_outputs.py checks count matrices, DESeq2 output, plots, MultiQC and run metadata in CI.
FASTQ (paired-end)
│
▼
FastQC ──────── Raw read quality assessment
│
▼
fastp ────────── Adapter trimming, quality filtering
│
▼
HISAT2 ──────── Align to GRCh38 reference genome
│
▼
samtools ─────── Sort and index BAM
│
▼
featureCounts ── Gene-level quantification (Gencode v38)
│
▼
DESeq2 ──────── Differential expression + PCA + volcano plot
│
▼
MultiQC ─────── Aggregate QC report across all samples
| Process | Tool | Container |
|---|---|---|
| FASTQC_RAW | FastQC 0.12.1 | quay.io/biocontainers/fastqc |
| FASTP | fastp 0.23.4 | quay.io/biocontainers/fastp |
| HISAT2_ALIGN | HISAT2 2.2.1 | quay.io/biocontainers/hisat2 |
| SAMTOOLS_SORT | samtools 1.21 | quay.io/biocontainers/samtools |
| FEATURECOUNTS | Subread 2.0.6 | quay.io/biocontainers/subread |
| DESEQ2 | DESeq2 1.42 + ggplot2 | quay.io/biocontainers/bioconductor-deseq2 |
| MULTIQC | MultiQC 1.27 | quay.io/biocontainers/multiqc |
All containers sourced from BioContainers.
| Sample | SRA | Condition | Donor |
|---|---|---|---|
| N61311_untreated | SRR1039508 | untreated | N61311 |
| N61311_Dex | SRR1039509 | dexamethasone | N61311 |
| N052611_untreated | SRR1039512 | untreated | N052611 |
| N052611_Dex | SRR1039513 | dexamethasone | N052611 |
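The sample table above maps directly onto the samplesheet columns the pipeline documents (sample_id, fastq_1, fastq_2, condition). The following is a minimal sketch, not the repository's own tooling: it renders samplesheet text for the four demo samples and builds ENA download URLs. The URL layout is generalised from the single example URL in the run instructions, and the local data/ file naming (including the _2 mate) is an assumption; ena_fastq_url and samplesheet_csv are hypothetical helper names.

```python
import csv
import io

# Sample table from this README: (sample_id, SRA run accession, condition).
SAMPLES = [
    ("N61311_untreated", "SRR1039508", "untreated"),
    ("N61311_Dex", "SRR1039509", "dexamethasone"),
    ("N052611_untreated", "SRR1039512", "untreated"),
    ("N052611_Dex", "SRR1039513", "dexamethasone"),
]

ENA_BASE = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"


def ena_fastq_url(accession, mate):
    """Build an ENA FASTQ URL for a 10-character run accession.

    Assumption: the layout <first 6 chars>/00<last digit>/<accession>/ is
    generalised from the one example URL shown in this README.
    """
    if len(accession) != 10:
        raise ValueError("this sketch only handles 10-character accessions")
    if mate not in (1, 2):
        raise ValueError("mate must be 1 or 2")
    return (
        f"{ENA_BASE}/{accession[:6]}/00{accession[-1]}/"
        f"{accession}/{accession}_{mate}.fastq.gz"
    )


def samplesheet_csv(samples=SAMPLES, data_dir="data"):
    """Render samplesheet text with the documented column order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", "fastq_1", "fastq_2", "condition"])
    for sample_id, accession, condition in samples:
        writer.writerow([
            sample_id,
            f"{data_dir}/{accession}_1.fastq.gz",  # assumed local file name
            f"{data_dir}/{accession}_2.fastq.gz",  # assumed local file name
            condition,
        ])
    return buffer.getvalue()


if __name__ == "__main__":
    print(samplesheet_csv(), end="")
    for _, accession, _ in SAMPLES:
        for mate in (1, 2):
            print(ena_fastq_url(accession, mate))
```

The repository already ships assets/samplesheet.csv, so a generator like this is only useful when adapting the pipeline to other accessions.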
Prerequisites: Nextflow (>=24.0), Docker, Java (>=11)
git clone https://github.com/Ekin-Kahraman/rnaseq-nextflow-pipeline.git
cd rnaseq-nextflow-pipeline
python test/create_test_data.py
nextflow run main.nf -profile test,docker \
--genome_index "$(pwd)/test/genome" \
--gtf "$(pwd)/test/genes.gtf"
python scripts/validate_outputs.py results

# 1. Download HISAT2 GRCh38 index (~4GB)
mkdir -p genome && cd genome
wget https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
tar xzf grch38_genome.tar.gz
# 2. Download Gencode v38 GTF
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
gunzip gencode.v38.annotation.gtf.gz
cd ..
# 3. Download FASTQ files from ENA (see assets/samplesheet.csv for accessions)
mkdir -p data
# Example: wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz -O data/SRR1039508_1.fastq.gz
# 4. Run
nextflow run main.nf -profile docker \
--genome_index genome/grch38/genome \
--gtf genome/gencode.v38.annotation.gtf

See docs/cloud.md for AWS Batch and Seqera Platform launch notes.
nextflow run Ekin-Kahraman/rnaseq-nextflow-pipeline \
-profile awsbatch \
--aws_queue rnaseq-job-queue \
--aws_region eu-west-2 \
--aws_workdir s3://my-rnaseq-bucket/work \
--samplesheet s3://my-rnaseq-bucket/inputs/samplesheet.csv \
--genome_index s3://my-rnaseq-bucket/reference/grch38/genome \
--gtf s3://my-rnaseq-bucket/reference/gencode.v38.annotation.gtf \
--outdir s3://my-rnaseq-bucket/results/airway

The optional cloud report portal registers cloud runs and returns signed S3 URLs for Nextflow reports, timelines, traces, DAGs and MultiQC output. It is a small FastAPI service backed by Postgres in production and SQLite for local testing. The root route renders a browser dashboard and /docs exposes the API.
cd cloud/report-portal
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

Run the local Postgres stack:
cd cloud/report-portal
docker compose up --build

Deploy shape:
render.yaml -> Docker FastAPI service + managed Postgres + S3 presigned report links
Live smoke deployment:
- Dashboard: https://rnaseq-report-portal.onrender.com/
- Health: https://rnaseq-report-portal.onrender.com/health
- Seeded artefact metadata: https://rnaseq-report-portal.onrender.com/runs/synthetic-ci-001/artifacts/report
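The seeded artefact link above suggests a route of the form /runs/<run_id>/artifacts/<kind>. The sketch below builds such URLs and fetches them with the standard library only. The route shape is an assumption generalised from that one link (the portal's /docs page is the authoritative API listing), and artifact_url and fetch_text are hypothetical helper names rather than part of the portal code.

```python
from urllib.parse import quote, urljoin
from urllib.request import urlopen

# Base URL of the live smoke deployment listed in this README.
PORTAL = "https://rnaseq-report-portal.onrender.com/"


def artifact_url(run_id, kind, base=PORTAL):
    """Build <base>/runs/<run_id>/artifacts/<kind>.

    Assumption: the route shape is inferred from the single seeded example
    link; other run IDs or artefact kinds may not exist on the server.
    """
    path = f"runs/{quote(run_id, safe='')}/artifacts/{quote(kind, safe='')}"
    return urljoin(base, path)


def fetch_text(url, timeout=30.0):
    """GET a portal URL and return the response body decoded as UTF-8."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call fetch_text(...) on it to query the live service.
    print(artifact_url("synthetic-ci-001", "report"))
```

Calling fetch_text(urljoin(PORTAL, "health")) would exercise the health route; no response format is assumed here, so the body is returned as plain text.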
| Parameter | Default | Description |
|---|---|---|
| --samplesheet | assets/samplesheet.csv | CSV: sample_id, fastq_1, fastq_2, condition |
| --genome_index | required | HISAT2 index prefix |
| --gtf | required | Gene annotation GTF |
| --outdir | results | Output directory |
| --strandedness | 2 (reverse) | featureCounts strandedness (0/1/2) |
| --ref_condition | untreated | DESeq2 reference level |
| --aws_queue | none | AWS Batch queue for -profile awsbatch |
| --aws_region | eu-west-2 | AWS region for -profile awsbatch |
| --aws_workdir | none | S3 work directory for -profile awsbatch |
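When launching the pipeline from a script, the parameter table above can be encoded as defaults and checked before Nextflow starts. This is a minimal sketch under stated assumptions: build_command is a hypothetical helper, it covers only the local (non-AWS) parameters, and it passes every value explicitly even though Nextflow would apply the same defaults itself.

```python
import shlex

# Defaults copied from the parameter table; None marks a required value.
DEFAULTS = {
    "samplesheet": "assets/samplesheet.csv",
    "genome_index": None,
    "gtf": None,
    "outdir": "results",
    "strandedness": 2,
    "ref_condition": "untreated",
}


def build_command(profile="docker", **overrides):
    """Assemble a `nextflow run main.nf` argument list (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    missing = [name for name, value in params.items() if value is None]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    if params["strandedness"] not in (0, 1, 2):
        raise ValueError("strandedness must be 0, 1 or 2")
    command = ["nextflow", "run", "main.nf", "-profile", profile]
    for name, value in params.items():
        command += [f"--{name}", str(value)]
    return command


if __name__ == "__main__":
    command = build_command(
        genome_index="genome/grch38/genome",
        gtf="genome/gencode.v38.annotation.gtf",
    )
    # Print a shell-quoted command line; pass `command` to subprocess.run to execute.
    print(shlex.join(command))
```

Failing fast on a missing --genome_index or --gtf avoids starting a run that would stop at the alignment or counting step.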
results/
├── fastqc_raw/ Raw read QC reports
├── fastp/ Trimming reports (JSON)
├── hisat2/ Alignment logs
├── bam/ Sorted BAM files
├── counts/ Gene count matrix
├── deseq2/ DE results, volcano plot, PCA plot
├── multiqc/ Aggregated QC report
└── pipeline_info/ Nextflow report, timeline, trace, DAG
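A quick structural check of the output tree above can be sketched as follows. This is not scripts/validate_outputs.py, whose contents are not shown here; it only confirms that each listed subdirectory exists and holds at least one file, and missing_or_empty is a hypothetical helper name.

```python
from pathlib import Path

# Subdirectories listed in the output tree in this README.
EXPECTED_DIRS = [
    "fastqc_raw",
    "fastp",
    "hisat2",
    "bam",
    "counts",
    "deseq2",
    "multiqc",
    "pipeline_info",
]


def missing_or_empty(outdir):
    """Return expected subdirectories that are absent or contain no files."""
    root = Path(outdir)
    problems = []
    for name in EXPECTED_DIRS:
        directory = root / name
        # rglob also finds files in nested per-sample folders.
        has_file = directory.is_dir() and any(
            path.is_file() for path in directory.rglob("*")
        )
        if not has_file:
            problems.append(name)
    return problems


if __name__ == "__main__":
    problems = missing_or_empty("results")
    print("all expected outputs present" if not problems else f"missing or empty: {problems}")
```

A check like this catches a step that silently published nothing; it says nothing about whether the counts or DESeq2 tables are well-formed.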
- HISAT2 over STAR - HISAT2's graph FM index fits in ~8GB RAM vs STAR's ~32GB for the human genome. Both are splice-aware aligners with comparable accuracy for well-annotated genomes; HISAT2 was chosen to keep the pipeline runnable on standard hardware.
- featureCounts over htseq-count - faster on multi-sample runs (native multithreading) and produces near-identical counts for standard gene-level quantification.
- BioContainers - published containers from the Bioconda ecosystem. No custom Dockerfiles to maintain.
- Docker and Singularity - -profile docker for local, -profile singularity for HPC where Docker is typically unavailable.
- AWS Batch profile - -profile awsbatch runs the same containerised workflow on managed cloud compute with S3 work and output paths.
- Report portal separated from compute - Nextflow stays responsible for execution; the FastAPI portal only stores run metadata and signs S3 artefact links, which keeps the cloud proof small and auditable.
- Render Blueprint - render.yaml defines the web service, managed Postgres database, demo seed run and AWS secret placeholders as reviewable infrastructure-as-code.
- Run metadata by default - Nextflow report, timeline, trace and DAG are emitted on every run so failures and performance can be audited after the fact.
- Reverse-stranded default - --strandedness 2 because most modern Illumina dUTP protocols produce reverse-stranded libraries. Users with unstranded preps should set --strandedness 0. The Bioconductor RNA-seq workflow counts the airway data as not strand-specific, so confirm the library type before relying on the default for that dataset.
- Configurable contrast - --ref_condition sets the DESeq2 reference level. Defaults to "untreated" for the airway dataset.
- Test profile - synthetic 50-gene genome with reads sampled from the reference sequence. Verifies the full pipeline in ~2 minutes without downloading real data.
- 2 samples per condition in the demo - underpowered for reliable DE. The DESeq2 step runs and produces output, but with n=2 the results are illustrative, not statistically robust. Proper analysis requires ≥3 replicates per condition.
- CI uses synthetic data - the public CI proves the full software path, not the biological conclusion. Real Himes/GSE52778 runs require external FASTQs, GRCh38 HISAT2 index and Gencode annotation files.
- AWS Batch proof status - the profile and report portal are implemented, but no public real AWS Batch run artefact is committed yet. The live report portal is the current cloud proof path until a real Batch run is published.
- No STAR option - only HISAT2 is implemented. Adding STAR as an alternative aligner would allow benchmarking on the same data.
MIT