RNA-seq Nextflow Pipeline

Bulk RNA-seq pipeline in Nextflow DSL2. Takes paired-end FASTQ reads from raw sequencing output through to differential expression results - QC, trimming, alignment, counting, and DESeq2 - with each step containerised via Docker or Singularity.

Designed around the Himes et al. (2014) airway smooth muscle dataset (dexamethasone vs untreated, GEO GSE52778). This dataset is used in the DESeq2 vignette and the Bioconductor RNA-seq workflow. For the full covariate-adjusted analysis on a COVID-19 cohort, see bulk-rnaseq-differential-expression.

Engineering Evidence

Full synthetic smoke test in GitHub Actions, including containerised FastQC, fastp, HISAT2, samtools, featureCounts, DESeq2 and MultiQC.
Docker, Singularity and AWS Batch profiles in nextflow.config.
Containerised FastAPI report portal under cloud/report-portal/ for S3-hosted reports and Postgres run metadata.
Render Blueprint at render.yaml for a deployable FastAPI plus Postgres report portal.
Live Render smoke deployment: https://rnaseq-report-portal.onrender.com/health.
nextflow_schema.json for parameter discovery in Seqera Platform and other launch tooling.
Nextflow execution report, timeline, trace and DAG written to results/pipeline_info/ on every run.
scripts/validate_outputs.py checks count matrices, DESeq2 output, plots, MultiQC and run metadata in CI.

Workflow

FASTQ (paired-end)
    │
    ▼
 FastQC ──────── Raw read quality assessment
    │
    ▼
 fastp ────────── Adapter trimming, quality filtering
    │
    ▼
 HISAT2 ──────── Align to GRCh38 reference genome
    │
    ▼
 samtools ─────── Sort and index BAM
    │
    ▼
 featureCounts ── Gene-level quantification (Gencode v38)
    │
    ▼
 DESeq2 ──────── Differential expression + PCA + volcano plot
    │
    ▼
 MultiQC ─────── Aggregate QC report across all samples

Processes

Process	Tool	Container
FASTQC_RAW	FastQC 0.12.1	`quay.io/biocontainers/fastqc`
FASTP	fastp 0.23.4	`quay.io/biocontainers/fastp`
HISAT2_ALIGN	HISAT2 2.2.1	`quay.io/biocontainers/hisat2`
SAMTOOLS_SORT	samtools 1.21	`quay.io/biocontainers/samtools`
FEATURECOUNTS	Subread 2.0.6	`quay.io/biocontainers/subread`
DESEQ2	DESeq2 1.42 + ggplot2	`quay.io/biocontainers/bioconductor-deseq2`
MULTIQC	MultiQC 1.27	`quay.io/biocontainers/multiqc`

All containers are sourced from BioContainers.

Samples

Sample	SRA	Condition	Donor
N61311_untreated	SRR1039508	untreated	N61311
N61311_Dex	SRR1039509	dexamethasone	N61311
N052611_untreated	SRR1039512	untreated	N052611
N052611_Dex	SRR1039513	dexamethasone	N052611
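
The table above maps onto the samplesheet columns listed under Parameters (sample_id, fastq_1, fastq_2, condition). A minimal sketch of generating that CSV follows. The `build_samplesheet` helper and the `data/<run>_1.fastq.gz` naming are assumptions for illustration; the shipped assets/samplesheet.csv is the authoritative version.

```python
import csv
import io

# Rows copied from the Samples table: (sample_id, SRA run, condition).
SAMPLES = [
    ("N61311_untreated", "SRR1039508", "untreated"),
    ("N61311_Dex", "SRR1039509", "dexamethasone"),
    ("N052611_untreated", "SRR1039512", "untreated"),
    ("N052611_Dex", "SRR1039513", "dexamethasone"),
]


def build_samplesheet(samples, fastq_dir="data"):
    """Return samplesheet CSV text with the four documented columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", "fastq_1", "fastq_2", "condition"])
    for sample_id, run, condition in samples:
        # Assumed file naming: <run>_1.fastq.gz and <run>_2.fastq.gz in fastq_dir.
        writer.writerow([
            sample_id,
            f"{fastq_dir}/{run}_1.fastq.gz",
            f"{fastq_dir}/{run}_2.fastq.gz",
            condition,
        ])
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_samplesheet(SAMPLES), end="")
```

The donor column is not part of the documented samplesheet, so this sketch drops it.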

Quick Start

Prerequisites: Nextflow (>=24.0), Docker, Java (>=11)

Test data (synthetic, ~2 minutes)

git clone https://github.com/Ekin-Kahraman/rnaseq-nextflow-pipeline.git
cd rnaseq-nextflow-pipeline
python test/create_test_data.py
nextflow run main.nf -profile test,docker \
    --genome_index "$(pwd)/test/genome" \
    --gtf "$(pwd)/test/genes.gtf"
python scripts/validate_outputs.py results

Real data (airway dataset)

# 1. Download HISAT2 GRCh38 index (~4GB)
mkdir -p genome && cd genome
wget https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
tar xzf grch38_genome.tar.gz

# 2. Download Gencode v38 GTF
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
gunzip gencode.v38.annotation.gtf.gz
cd ..

# 3. Download FASTQ files from ENA (see assets/samplesheet.csv for accessions)
mkdir -p data
# Example: wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz -O data/SRR1039508_1.fastq.gz

# 4. Run
nextflow run main.nf -profile docker \
    --genome_index genome/grch38/genome \
    --gtf genome/gencode.v38.annotation.gtf
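
For step 3, the example URL follows ENA's directory convention for run accessions. The sketch below builds both mate URLs from an accession. The `ena_fastq_urls` name is hypothetical, and the layout is inferred from the example URL plus ENA's published convention, so check the result against the ENA file report before a bulk download.

```python
def ena_fastq_urls(run_accession):
    """Return the two paired-end FASTQ URLs for a run on the ENA FTP server."""
    prefix = run_accession[:6]
    base = f"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{prefix}"
    # Accessions longer than nine characters get an extra sub-directory:
    # the characters after the ninth, zero-padded to three digits.
    extra = run_accession[9:]
    if extra:
        base = f"{base}/{extra.zfill(3)}"
    return [
        f"{base}/{run_accession}/{run_accession}_{mate}.fastq.gz"
        for mate in (1, 2)
    ]


if __name__ == "__main__":
    for run in ("SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513"):
        for url in ena_fastq_urls(run):
            print(url)
```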

Cloud execution

See docs/cloud.md for AWS Batch and Seqera Platform launch notes.

nextflow run Ekin-Kahraman/rnaseq-nextflow-pipeline \
    -profile awsbatch \
    --aws_queue rnaseq-job-queue \
    --aws_region eu-west-2 \
    --aws_workdir s3://my-rnaseq-bucket/work \
    --samplesheet s3://my-rnaseq-bucket/inputs/samplesheet.csv \
    --genome_index s3://my-rnaseq-bucket/reference/grch38/genome \
    --gtf s3://my-rnaseq-bucket/reference/gencode.v38.annotation.gtf \
    --outdir s3://my-rnaseq-bucket/results/airway

Report portal

The optional cloud report portal registers cloud runs and returns signed S3 URLs for Nextflow reports, timelines, traces, DAGs and MultiQC output. It is a small FastAPI service backed by Postgres in production and SQLite for local testing. The root route renders a browser dashboard and /docs exposes the API.

cd cloud/report-portal
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

Run the local Postgres stack:

cd cloud/report-portal
docker compose up --build

Deploy shape:

render.yaml -> Docker FastAPI service + managed Postgres + S3 presigned report links

Live smoke deployment:

Dashboard: https://rnaseq-report-portal.onrender.com/
Health: https://rnaseq-report-portal.onrender.com/health
Seeded artefact metadata: https://rnaseq-report-portal.onrender.com/runs/synthetic-ci-001/artifacts/report

Parameters

Parameter	Default	Description
`--samplesheet`	`assets/samplesheet.csv`	CSV: sample_id, fastq_1, fastq_2, condition
`--genome_index`	required	HISAT2 index prefix
`--gtf`	required	Gene annotation GTF
`--outdir`	`results`	Output directory
`--strandedness`	`2` (reverse)	featureCounts strandedness (0/1/2)
`--ref_condition`	`untreated`	DESeq2 reference level
`--aws_queue`	none	AWS Batch queue for `-profile awsbatch`
`--aws_region`	`eu-west-2`	AWS region for `-profile awsbatch`
`--aws_workdir`	none	S3 work directory for `-profile awsbatch`
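
Since `--ref_condition` must match a value in the samplesheet's condition column, a pre-flight check can catch mismatches before launching. This is a minimal sketch, not the pipeline's own validation: `check_samplesheet` and its messages are hypothetical, and only the four column names and the `untreated` default come from the table above.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample_id", "fastq_1", "fastq_2", "condition"]


def check_samplesheet(csv_text, ref_condition="untreated"):
    """Return a list of problems found in samplesheet text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    rows = list(reader)
    seen = set()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        for col in REQUIRED_COLUMNS:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
        if row["sample_id"] in seen:
            problems.append(f"line {line_no}: duplicate sample_id {row['sample_id']}")
        seen.add(row["sample_id"])

    conditions = {row["condition"] for row in rows}
    if ref_condition not in conditions:
        problems.append(f"ref_condition '{ref_condition}' not in samplesheet")
    if len(conditions) < 2:
        problems.append("need at least two conditions for a contrast")
    return problems
```

An empty list means the sheet has the documented columns, unique sample IDs, and a reference level that DESeq2 can actually use.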

Output

results/
├── fastqc_raw/       Raw read QC reports
├── fastp/            Trimming reports (JSON)
├── hisat2/           Alignment logs
├── bam/              Sorted BAM files
├── counts/           Gene count matrix
├── deseq2/           DE results, volcano plot, PCA plot
├── multiqc/          Aggregated QC report
└── pipeline_info/    Nextflow report, timeline, trace, DAG
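
The layout above can be checked after a run with a few lines of Python. This is a simplified sketch under stated assumptions: `missing_outputs` is a hypothetical helper that only tests for non-empty directories, whereas scripts/validate_outputs.py is described as checking the contents as well.

```python
from pathlib import Path

# Directory names taken from the output tree above.
EXPECTED_DIRS = [
    "fastqc_raw",
    "fastp",
    "hisat2",
    "bam",
    "counts",
    "deseq2",
    "multiqc",
    "pipeline_info",
]


def missing_outputs(outdir):
    """Return expected subdirectories that are absent or empty under outdir."""
    root = Path(outdir)
    missing = []
    for name in EXPECTED_DIRS:
        path = root / name
        # A directory that exists but holds no files counts as missing.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_outputs("results")
    print("all outputs present" if not gaps else f"missing: {', '.join(gaps)}")
```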

Design Decisions

HISAT2 over STAR - HISAT2's graph FM index fits in ~8GB RAM vs STAR's ~32GB for the human genome. Both are splice-aware aligners with comparable accuracy for well-annotated genomes; HISAT2 was chosen to keep the pipeline runnable on standard hardware.
featureCounts over htseq-count - faster on multi-sample runs (native multithreading) and produces near-identical counts for standard gene-level quantification.
BioContainers - published containers from the Bioconda ecosystem. No custom Dockerfiles to maintain.
Docker and Singularity - -profile docker for local, -profile singularity for HPC where Docker is typically unavailable.
AWS Batch profile - -profile awsbatch runs the same containerised workflow on managed cloud compute with S3 work and output paths.
Report portal separated from compute - Nextflow stays responsible for execution; the FastAPI portal only stores run metadata and signs S3 artefact links, which keeps the cloud proof small and auditable.
Render Blueprint - render.yaml defines the web service, managed Postgres database, demo seed run and AWS secret placeholders as reviewable infrastructure-as-code.
Run metadata by default - Nextflow report, timeline, trace and DAG are emitted on every run so failures and performance can be audited after the fact.
Reverse-stranded default - --strandedness 2 because most modern Illumina dUTP protocols produce reverse-stranded libraries. Users with unstranded preps should set --strandedness 0.
Configurable contrast - --ref_condition sets the DESeq2 reference level. Defaults to "untreated" for the airway dataset.
Test profile - synthetic 50-gene genome with reads sampled from the reference sequence. Verifies the full pipeline in ~2 minutes without downloading real data.
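
To illustrate the test profile's idea of reads sampled from the reference sequence, here is a minimal sketch of drawing error-free paired-end reads from a sequence. It is not the code in test/create_test_data.py: the function names, read length, fragment length and constant quality string are all assumptions, and strand-specific library behaviour is not modelled.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pairs(reference, n_pairs, read_len=50, fragment_len=150, seed=0):
    """Sample error-free paired-end reads from a reference sequence."""
    if fragment_len > len(reference) or read_len > fragment_len:
        raise ValueError("need read_len <= fragment_len <= len(reference)")
    rng = random.Random(seed)  # fixed seed keeps test data reproducible
    pairs = []
    for _ in range(n_pairs):
        start = rng.randint(0, len(reference) - fragment_len)
        fragment = reference[start:start + fragment_len]
        read1 = fragment[:read_len]  # forward read from the fragment start
        read2 = reverse_complement(fragment[-read_len:])  # mate from the far end
        pairs.append((read1, read2))
    return pairs


def fastq_record(name, seq):
    """Format one FASTQ record with a constant high quality string."""
    return f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n"


if __name__ == "__main__":
    rng = random.Random(1)
    reference = "".join(rng.choice("ACGT") for _ in range(1000))
    for i, (r1, r2) in enumerate(simulate_pairs(reference, 3)):
        print(fastq_record(f"read{i}/1", r1), end="")
        print(fastq_record(f"read{i}/2", r2), end="")
```

Because the reads are exact substrings of the reference, any failure to align them in a smoke test points at the tooling rather than the data.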

Limitations

2 samples per condition in the demo - underpowered for reliable DE. The DESeq2 step runs and produces output, but with n=2 the results are illustrative, not statistically robust. Proper analysis requires ≥3 replicates per condition.
CI uses synthetic data - the public CI proves the full software path, not the biological conclusion. Real Himes/GSE52778 runs require external FASTQs, GRCh38 HISAT2 index and Gencode annotation files.
AWS Batch proof status - the profile and report portal are implemented, but no public real AWS Batch run artefact is committed yet. The live report portal is the current cloud proof path until a real Batch run is published.
No STAR option - only HISAT2 is implemented. Adding STAR as an alternative aligner would allow benchmarking on the same data.

Licence

MIT
