RNA-seq Nextflow Pipeline

Bulk RNA-seq pipeline in Nextflow DSL2. It takes paired-end FASTQ reads from raw sequencing output through to differential expression results (QC, trimming, alignment, counting and DESeq2), with each step containerised via Docker or Singularity.

Designed around the Himes et al. (2014) airway smooth muscle dataset (dexamethasone vs untreated, GEO GSE52778). This dataset is used in the DESeq2 vignette and the Bioconductor RNA-seq workflow. For the full covariate-adjusted analysis on a COVID-19 cohort, see bulk-rnaseq-differential-expression.

Engineering Evidence

  • Full synthetic smoke test in GitHub Actions, including containerised FastQC, fastp, HISAT2, samtools, featureCounts, DESeq2 and MultiQC.
  • Docker, Singularity and AWS Batch profiles in nextflow.config.
  • Containerised FastAPI report portal under cloud/report-portal/ for S3-hosted reports and Postgres run metadata.
  • Render Blueprint at render.yaml for a deployable FastAPI plus Postgres report portal.
  • Live Render smoke deployment: https://rnaseq-report-portal.onrender.com/health.
  • nextflow_schema.json for parameter discovery in Seqera Platform and other launch tooling.
  • Nextflow execution report, timeline, trace and DAG written to results/pipeline_info/ on every run.
  • scripts/validate_outputs.py checks count matrices, DESeq2 output, plots, MultiQC and run metadata in CI.

Workflow

FASTQ (paired-end)
    │
    ▼
 FastQC ──────── Raw read quality assessment
    │
    ▼
 fastp ────────── Adapter trimming, quality filtering
    │
    ▼
 HISAT2 ──────── Align to GRCh38 reference genome
    │
    ▼
 samtools ─────── Sort and index BAM
    │
    ▼
 featureCounts ── Gene-level quantification (Gencode v38)
    │
    ▼
 DESeq2 ──────── Differential expression + PCA + volcano plot
    │
    ▼
 MultiQC ─────── Aggregate QC report across all samples

Processes

Process         Tool                    Container
FASTQC_RAW      FastQC 0.12.1           quay.io/biocontainers/fastqc
FASTP           fastp 0.23.4            quay.io/biocontainers/fastp
HISAT2_ALIGN    HISAT2 2.2.1            quay.io/biocontainers/hisat2
SAMTOOLS_SORT   samtools 1.21           quay.io/biocontainers/samtools
FEATURECOUNTS   Subread 2.0.6           quay.io/biocontainers/subread
DESEQ2          DESeq2 1.42 + ggplot2   quay.io/biocontainers/bioconductor-deseq2
MULTIQC         MultiQC 1.27            quay.io/biocontainers/multiqc

All containers are sourced from BioContainers.

Samples

Sample              SRA          Condition       Donor
N61311_untreated    SRR1039508   untreated       N61311
N61311_Dex          SRR1039509   dexamethasone   N61311
N052611_untreated   SRR1039512   untreated       N052611
N052611_Dex         SRR1039513   dexamethasone   N052611

Quick Start

Prerequisites: Nextflow (>=24.0), Docker, Java (>=11)

Test data (synthetic, ~2 minutes)

git clone https://github.com/Ekin-Kahraman/rnaseq-nextflow-pipeline.git
cd rnaseq-nextflow-pipeline
python test/create_test_data.py
nextflow run main.nf -profile test,docker \
    --genome_index "$(pwd)/test/genome" \
    --gtf "$(pwd)/test/genes.gtf"
python scripts/validate_outputs.py results

Real data (airway dataset)

# 1. Download HISAT2 GRCh38 index (~4GB)
mkdir -p genome && cd genome
wget https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
tar xzf grch38_genome.tar.gz

# 2. Download Gencode v38 GTF
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
gunzip gencode.v38.annotation.gtf.gz
cd ..

# 3. Download FASTQ files from ENA (see assets/samplesheet.csv for accessions)
mkdir -p data
# Example: wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz -O data/SRR1039508_1.fastq.gz

# 4. Run
nextflow run main.nf -profile docker \
    --genome_index genome/grch38/genome \
    --gtf genome/gencode.v38.annotation.gtf
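The example wget line in step 3 follows ENA's FTP directory layout. As a rough sketch (not part of the repository), the helper below rebuilds that URL pattern for any run accession. The function name `ena_fastq_urls` is hypothetical, and the subdirectory rule for accessions longer than nine characters reflects ENA's published layout as understood here, so check the generated URLs against the ENA file report before a bulk download.

```python
# Sketch: build paired-end FASTQ URLs on the ENA FTP server for a run accession.
# `ena_fastq_urls` is a hypothetical helper, not a function shipped with the pipeline.

ENA_FASTQ_BASE = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"


def ena_fastq_urls(accession: str) -> list[str]:
    """Return the _1 and _2 FASTQ URLs for a paired-end run accession."""
    if not (9 <= len(accession) <= 12 and accession[:3].isalpha() and accession[3:].isdigit()):
        raise ValueError(f"unexpected run accession: {accession!r}")

    prefix = accession[:6]          # e.g. "SRR103"
    extra_digits = accession[9:]    # digits beyond the ninth character, e.g. "8"

    if extra_digits:
        # Longer accessions get a zero-padded three-digit subdirectory, e.g. "008".
        run_dir = f"{ENA_FASTQ_BASE}/{prefix}/{extra_digits.zfill(3)}/{accession}"
    else:
        run_dir = f"{ENA_FASTQ_BASE}/{prefix}/{accession}"

    return [f"{run_dir}/{accession}_{mate}.fastq.gz" for mate in (1, 2)]


# The four runs listed in the Samples table.
for run in ["SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513"]:
    for url in ena_fastq_urls(run):
        print(url)
```

Each printed URL can be passed to wget with `-O data/<filename>`, as in the commented example in step 3.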

Cloud execution

See docs/cloud.md for AWS Batch and Seqera Platform launch notes.

nextflow run Ekin-Kahraman/rnaseq-nextflow-pipeline \
    -profile awsbatch \
    --aws_queue rnaseq-job-queue \
    --aws_region eu-west-2 \
    --aws_workdir s3://my-rnaseq-bucket/work \
    --samplesheet s3://my-rnaseq-bucket/inputs/samplesheet.csv \
    --genome_index s3://my-rnaseq-bucket/reference/grch38/genome \
    --gtf s3://my-rnaseq-bucket/reference/gencode.v38.annotation.gtf \
    --outdir s3://my-rnaseq-bucket/results/airway

Report portal

The optional cloud report portal registers cloud runs and returns signed S3 URLs for Nextflow reports, timelines, traces, DAGs and MultiQC output. It is a small FastAPI service backed by Postgres in production and SQLite for local testing. The root route renders a browser dashboard and /docs exposes the API.

cd cloud/report-portal
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

Run the local Postgres stack:

cd cloud/report-portal
docker compose up --build

Deploy shape:

render.yaml -> Docker FastAPI service + managed Postgres + S3 presigned report links

Live smoke deployment: https://rnaseq-report-portal.onrender.com/health

Parameters

Parameter         Default                  Description
--samplesheet     assets/samplesheet.csv   CSV: sample_id, fastq_1, fastq_2, condition
--genome_index    required                 HISAT2 index prefix
--gtf             required                 Gene annotation GTF
--outdir          results                  Output directory
--strandedness    2 (reverse)              featureCounts strandedness (0 = unstranded, 1 = forward, 2 = reverse)
--ref_condition   untreated                DESeq2 reference level
--aws_queue       none                     AWS Batch queue for -profile awsbatch
--aws_region      eu-west-2                AWS region for -profile awsbatch
--aws_workdir     none                     S3 work directory for -profile awsbatch
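Because --samplesheet and --ref_condition have to agree with each other, a quick pre-flight check can catch mistakes before any alignment starts. The sketch below is illustrative only and is not the pipeline's own validation. It assumes the four column names from the table above; the FASTQ paths in the example (`data/<SRR>_1.fastq.gz` and `_2`) are placeholders, and `validate_samplesheet` is a hypothetical name.

```python
# Sketch: pre-flight check of a samplesheet before launching the pipeline.
# `validate_samplesheet` and the example paths are illustrative assumptions.
import csv
import io

REQUIRED_COLUMNS = ["sample_id", "fastq_1", "fastq_2", "condition"]


def validate_samplesheet(text: str, ref_condition: str = "untreated") -> list[dict]:
    """Parse samplesheet CSV text and raise ValueError on common mistakes."""
    reader = csv.DictReader(io.StringIO(text))
    header = [h.strip() for h in (reader.fieldnames or [])]
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows, seen = [], set()
    for line_no, raw in enumerate(reader, start=2):
        # Strip whitespace; ignore surplus unnamed fields (key None).
        row = {k.strip(): (v or "").strip() for k, v in raw.items() if k is not None}
        empty = [c for c in REQUIRED_COLUMNS if not row.get(c)]
        if empty:
            raise ValueError(f"line {line_no}: empty value for {empty}")
        if row["sample_id"] in seen:
            raise ValueError(f"line {line_no}: duplicate sample_id {row['sample_id']!r}")
        seen.add(row["sample_id"])
        rows.append(row)

    conditions = {r["condition"] for r in rows}
    if ref_condition not in conditions:
        raise ValueError(f"reference condition {ref_condition!r} not in {sorted(conditions)}")
    if len(conditions) < 2:
        raise ValueError("need at least two conditions for a contrast")
    return rows


EXAMPLE = """sample_id,fastq_1,fastq_2,condition
N61311_untreated,data/SRR1039508_1.fastq.gz,data/SRR1039508_2.fastq.gz,untreated
N61311_Dex,data/SRR1039509_1.fastq.gz,data/SRR1039509_2.fastq.gz,dexamethasone
N052611_untreated,data/SRR1039512_1.fastq.gz,data/SRR1039512_2.fastq.gz,untreated
N052611_Dex,data/SRR1039513_1.fastq.gz,data/SRR1039513_2.fastq.gz,dexamethasone
"""

rows = validate_samplesheet(EXAMPLE, ref_condition="untreated")
print(f"{len(rows)} samples OK")
```

The check deliberately stops at the first problem, since a malformed samplesheet is easier to fix one error at a time.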

Output

results/
├── fastqc_raw/       Raw read QC reports
├── fastp/            Trimming reports (JSON)
├── hisat2/           Alignment logs
├── bam/              Sorted BAM files
├── counts/           Gene count matrix
├── deseq2/           DE results, volcano plot, PCA plot
├── multiqc/          Aggregated QC report
└── pipeline_info/    Nextflow report, timeline, trace, DAG
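A quick way to confirm a run produced the layout above is to check that every expected subdirectory exists and is non-empty. The sketch below is a simplified stand-in, not the repository's scripts/validate_outputs.py, which checks file contents as well. The directory names come from the tree above; `missing_outputs` is a hypothetical helper, and the demo uses a throwaway temporary directory in place of a real results folder.

```python
# Sketch: report which expected output subdirectories are absent or empty.
# Simplified illustration; `missing_outputs` is a hypothetical helper.
import tempfile
from pathlib import Path

EXPECTED_SUBDIRS = [
    "fastqc_raw", "fastp", "hisat2", "bam",
    "counts", "deseq2", "multiqc", "pipeline_info",
]


def missing_outputs(outdir: str) -> list[str]:
    """Return the expected subdirectories that are missing or contain no entries."""
    root = Path(outdir)
    missing = []
    for name in EXPECTED_SUBDIRS:
        subdir = root / name
        if not (subdir.is_dir() and any(subdir.iterdir())):
            missing.append(name)
    return missing


# Demo on a temporary directory where only two steps "ran".
with tempfile.TemporaryDirectory() as tmp:
    for name in ("fastqc_raw", "fastp"):
        step_dir = Path(tmp) / name
        step_dir.mkdir()
        (step_dir / "placeholder.txt").write_text("demo\n")
    demo_missing = missing_outputs(tmp)
    print("missing or empty:", demo_missing)
```

On a real run, call `missing_outputs("results")` (or whatever --outdir was set to); an empty list means every step wrote something.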

Design Decisions

  • HISAT2 over STAR - HISAT2's hierarchical graph FM index needs roughly 8 GB of RAM for the human genome, versus roughly 32 GB for STAR. Both are splice-aware aligners with comparable accuracy on well-annotated genomes; HISAT2 was chosen to keep the pipeline runnable on standard hardware.
  • featureCounts over htseq-count - faster on multi-sample runs (native multithreading) and produces near-identical counts for standard gene-level quantification.
  • BioContainers - published containers from the Bioconda ecosystem. No custom Dockerfiles to maintain.
  • Docker and Singularity - -profile docker for local, -profile singularity for HPC where Docker is typically unavailable.
  • AWS Batch profile - -profile awsbatch runs the same containerised workflow on managed cloud compute with S3 work and output paths.
  • Report portal separated from compute - Nextflow stays responsible for execution; the FastAPI portal only stores run metadata and signs S3 artefact links, which keeps the cloud proof small and auditable.
  • Render Blueprint - render.yaml defines the web service, managed Postgres database, demo seed run and AWS secret placeholders as reviewable infrastructure-as-code.
  • Run metadata by default - Nextflow report, timeline, trace and DAG are emitted on every run so failures and performance can be audited after the fact.
  • Reverse-stranded default - --strandedness 2 because the airway dataset (and most modern Illumina dUTP protocols) produces reverse-stranded libraries. Users with older unstranded preps should set --strandedness 0.
  • Configurable contrast - --ref_condition sets the DESeq2 reference level. Defaults to "untreated" for the airway dataset.
  • Test profile - synthetic 50-gene genome with reads sampled from the reference sequence. Verifies the full pipeline in ~2 minutes without downloading real data.
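To illustrate the "run metadata by default" point, the sketch below reads a Nextflow trace file and lists the tasks that did not finish cleanly. It assumes the trace is tab-separated and includes Nextflow's default `name`, `status` and `exit` columns. The sample rows, task labels and the `failed_tasks` helper are invented for illustration and are not output from a real run of this pipeline.

```python
# Sketch: list tasks in a Nextflow trace that did not complete successfully.
# The sample trace text below is made up; real files live under results/pipeline_info/.
import csv
import io

OK_STATUSES = {"COMPLETED", "CACHED"}


def failed_tasks(trace_text: str) -> list[tuple[str, str, str]]:
    """Return (name, status, exit) for every task whose status is not OK."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [
        (row["name"], row["status"], row["exit"])
        for row in reader
        if row["status"] not in OK_STATUSES
    ]


# Illustrative trace with a subset of the default columns.
SAMPLE_TRACE = "\n".join([
    "task_id\thash\tname\tstatus\texit\tduration",
    "1\tab/12cd34\tFASTQC_RAW (N61311_untreated)\tCOMPLETED\t0\t35s",
    "2\tef/56ab78\tFASTP (N61311_untreated)\tCACHED\t0\t20s",
    "3\t9a/bc01de\tHISAT2_ALIGN (N61311_Dex)\tFAILED\t137\t4m 2s",
]) + "\n"

for name, status, exit_code in failed_tasks(SAMPLE_TRACE):
    print(f"{name}: {status} (exit {exit_code})")
```

For a real run, read the trace file from results/pipeline_info/ and pass its text to `failed_tasks`. The exact file name depends on the trace settings in nextflow.config.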

Limitations

  • Two samples per condition in the demo - underpowered for reliable DE. The DESeq2 step runs and produces output, but with n=2 per condition the results are illustrative, not statistically robust. A proper analysis needs at least 3 replicates per condition.
  • CI uses synthetic data - the public CI proves the full software path, not the biological conclusion. Real Himes/GSE52778 runs require external FASTQs, GRCh38 HISAT2 index and Gencode annotation files.
  • AWS Batch proof status - the profile and report portal are implemented, but no public real AWS Batch run artefact is committed yet. The live report portal is the current cloud proof path until a real Batch run is published.
  • No STAR option - only HISAT2 is implemented. Adding STAR as an alternative aligner would allow benchmarking on the same data.

Licence

MIT