| 1 | +# RNA-seq Nextflow Pipeline |
| 2 | + |
| 3 | +[](LICENSE) |
| 4 | +[](https://www.nextflow.io/) |
| 5 | + |
| 6 | +End-to-end bulk RNA-seq pipeline in Nextflow DSL2: raw FASTQ reads through quality control, alignment, gene quantification, and differential expression. Every step runs in its own Docker container. |
| 7 | + |
| 8 | +Designed around a six-sample subset (3 COVID-positive, 3 negative) of the GSE152075 SARS-CoV-2 nasopharyngeal dataset. For the full covariate-adjusted statistical analysis on 484 samples, see [bulk-rnaseq-differential-expression](https://github.com/Ekin-Kahraman/bulk-rnaseq-differential-expression). |
| 9 | + |
| 10 | +## Workflow |
| 11 | + |
| 12 | +``` |
| 13 | +FASTQ (paired-end) |
| 14 | + │ |
| 15 | + ▼ |
| 16 | + FastQC ──────── Raw read quality assessment |
| 17 | + │ |
| 18 | + ▼ |
| 19 | + fastp ────────── Adapter trimming, quality filtering |
| 20 | + │ |
| 21 | + ▼ |
| 22 | + HISAT2 ──────── Align to GRCh38 reference genome |
| 23 | + │ |
| 24 | + ▼ |
| 25 | + samtools ─────── Sort and index BAM |
| 26 | + │ |
| 27 | + ▼ |
| 28 | + featureCounts ── Gene-level quantification |
| 29 | + │ |
| 30 | + ▼ |
| 31 | + DESeq2 ──────── Differential expression (positive vs negative) |
| 32 | + │ |
| 33 | + ▼ |
| 34 | + MultiQC ─────── Aggregate QC report |
| 35 | +``` |
| 36 | + |
| 37 | +## Processes |
| 38 | + |
| 39 | +| Process | Tool | Container | |
| 40 | +|---------|------|-----------| |
| 41 | +| FASTQC_RAW | FastQC 0.12.1 | `quay.io/biocontainers/fastqc` | |
| 42 | +| FASTP | fastp 0.23.4 | `quay.io/biocontainers/fastp` | |
| 43 | +| HISAT2_ALIGN | HISAT2 2.2.1 | `quay.io/biocontainers/hisat2` | |
| 44 | +| SAMTOOLS_SORT | samtools 1.21 | `quay.io/biocontainers/samtools` | |
| 45 | +| FEATURECOUNTS | Subread 2.0.6 | `quay.io/biocontainers/subread` | |
| 46 | +| DESEQ2 | DESeq2 1.42 + ggplot2 | `quay.io/biocontainers/bioconductor-deseq2` | |
| 47 | +| MULTIQC | MultiQC 1.27 | `quay.io/biocontainers/multiqc` | |
| 48 | + |
| 49 | +All containers are sourced from [BioContainers](https://biocontainers.pro/); there are no custom Dockerfiles. |
| 50 | + |
| 51 | +## Quick Start |
| 52 | + |
| 53 | +**Prerequisites:** [Nextflow](https://www.nextflow.io/) (>=24.0), [Docker](https://www.docker.com/), Java (>=11) |
| 54 | + |
| 55 | +### Test data (synthetic, runs in ~2 minutes) |
| 56 | + |
| 57 | +```bash |
| 58 | +git clone https://github.com/Ekin-Kahraman/rnaseq-nextflow-pipeline.git |
| 59 | +cd rnaseq-nextflow-pipeline |
| 60 | +python test/create_test_data.py |
| 61 | +nextflow run main.nf -profile test,docker \ |
| 62 | + --genome_index "$(pwd)/test/genome" \ |
| 63 | + --gtf "$(pwd)/test/genes.gtf" |
| 64 | +``` |
| 65 | + |
| 66 | +### Real data (GSE152075) |
| 67 | + |
| 68 | +```bash |
| 69 | +# Download HISAT2 GRCh38 index (~4 GB) |
| 70 | +# Download FASTQ files from SRA (see assets/samplesheet.csv for accessions) |
| 71 | + |
| 72 | +nextflow run main.nf -profile docker \ |
| 73 | + --genome_index /path/to/grch38/genome \ |
| 74 | + --gtf /path/to/gencode.v38.annotation.gtf \ |
| 75 | + --samplesheet assets/samplesheet.csv |
| 76 | +``` |
| 77 | + |
| 78 | +## Parameters |
| 79 | + |
| 80 | +| Parameter | Default | Description | |
| 81 | +|-----------|---------|-------------| |
| 82 | +| `--samplesheet` | `assets/samplesheet.csv` | CSV with columns: sample_id, fastq_1, fastq_2, condition | |
| 83 | +| `--genome_index` | required | HISAT2 index prefix | |
| 84 | +| `--gtf` | required | Gene annotation GTF | |
| 85 | +| `--outdir` | `results` | Output directory | |
| 86 | +| `--strandedness` | `2` (reverse) | featureCounts strandedness: `0` = unstranded, `1` = forward, `2` = reverse | |
| 87 | + |
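A malformed samplesheet is a common cause of early failures, so it can help to check it before launching a run. The following is a minimal sketch in Python, not part of the pipeline itself. The column names come from the table above; the function name `validate_samplesheet`, the duplicate-ID check, and the two-condition check are assumptions added for illustration.

```python
import csv
import io

# Column names taken from the Parameters table above.
REQUIRED_COLUMNS = ["sample_id", "fastq_1", "fastq_2", "condition"]


def validate_samplesheet(handle):
    """Parse a samplesheet from an open text handle and return its rows.

    Raises ValueError on missing columns, empty fields, duplicate sample IDs,
    or fewer than two distinct conditions (assumed necessary for a
    positive-vs-negative comparison).
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")

    rows, seen_ids = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        empty = [col for col in REQUIRED_COLUMNS if not (row[col] or "").strip()]
        if empty:
            raise ValueError(f"line {line_no}: empty field(s): {', '.join(empty)}")
        if row["sample_id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate sample_id {row['sample_id']!r}")
        seen_ids.add(row["sample_id"])
        rows.append(row)

    if len({row["condition"] for row in rows}) < 2:
        raise ValueError("need at least two distinct conditions")
    return rows


# Hypothetical two-sample sheet; the file names are placeholders.
example = io.StringIO(
    "sample_id,fastq_1,fastq_2,condition\n"
    "S1,S1_R1.fastq.gz,S1_R2.fastq.gz,positive\n"
    "S2,S2_R1.fastq.gz,S2_R2.fastq.gz,negative\n"
)
print(len(validate_samplesheet(example)), "samples look structurally valid")
```

To check a real file, pass an open handle, for example `validate_samplesheet(open("assets/samplesheet.csv", newline=""))`. The sketch checks structure only; it does not confirm that the FASTQ files exist.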
| 88 | +## Output |
| 89 | + |
| 90 | +``` |
| 91 | +results/ |
| 92 | +├── fastqc_raw/ Raw read QC reports (HTML + ZIP) |
| 93 | +├── fastp/ Trimming reports (JSON) |
| 94 | +├── hisat2/ Alignment logs |
| 95 | +├── bam/ Sorted BAM files |
| 96 | +├── counts/ Gene count matrix (featureCounts) |
| 97 | +├── deseq2/ DE results CSV, volcano plot, PCA plot |
| 98 | +└── multiqc/ Aggregated QC report |
| 99 | +``` |
| 100 | + |
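After a run, a quick check that every expected subdirectory exists and is non-empty can catch a silently skipped step. The sketch below is a Python helper written for illustration, not something shipped with the pipeline. The subdirectory names are taken from the tree above; the function name `missing_outputs` and the placeholder file `gene_counts.txt` are hypothetical.

```python
import tempfile
from pathlib import Path

# Subdirectory names as listed in the Output tree above.
EXPECTED_SUBDIRS = ["fastqc_raw", "fastp", "hisat2", "bam", "counts", "deseq2", "multiqc"]


def missing_outputs(outdir):
    """Return the expected subdirectories that are absent or empty under outdir."""
    root = Path(outdir)
    missing = []
    for name in EXPECTED_SUBDIRS:
        subdir = root / name
        if not subdir.is_dir() or not any(subdir.iterdir()):
            missing.append(name)
    return missing


# Demonstration on a throwaway directory in which only counts/ has content.
with tempfile.TemporaryDirectory() as tmp:
    counts = Path(tmp) / "counts"
    counts.mkdir()
    (counts / "gene_counts.txt").write_text("placeholder\n")  # hypothetical file name
    print("missing:", missing_outputs(tmp))
```

For a real run, call `missing_outputs("results")`, or pass whatever path was given to `--outdir`.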
| 101 | +## Project Structure |
| 102 | + |
| 103 | +``` |
| 104 | +rnaseq-nextflow-pipeline/ |
| 105 | +├── main.nf Pipeline (7 processes, Nextflow DSL2) |
| 106 | +├── nextflow.config Parameters, containers, profiles |
| 107 | +├── assets/ |
| 108 | +│ └── samplesheet.csv Sample metadata (SRA accessions) |
| 109 | +├── test/ |
| 110 | +│ ├── create_test_data.py Generate synthetic test data |
| 111 | +│ ├── samplesheet.csv Test sample metadata |
| 112 | +│ ├── genome.fa Synthetic genome (50 genes) |
| 113 | +│ └── genes.gtf Synthetic annotation |
| 114 | +├── LICENSE |
| 115 | +└── README.md |
| 116 | +``` |
| 117 | + |
| 118 | +## Design Decisions |
| 119 | + |
| 120 | +- **HISAT2 over STAR**: HISAT2 runs on 8 GB of RAM, whereas STAR requires about 32 GB for the human genome index, so the pipeline can be cloned and run on modest hardware. |
| 121 | +- **BioContainers, not custom Dockerfiles**: community-maintained images that are widely used in bioinformatics and can be pulled directly, so nothing has to be built locally. |
| 122 | +- **Separate samtools process**: HISAT2 and samtools each run in their own container, which keeps alignment separate from BAM sorting and indexing. |
| 123 | +- **Test profile**: a synthetic 50-gene genome with reads sampled from the reference. It runs in ~2 minutes and verifies the pipeline without downloading 30 GB of real data. |
| 124 | +- **DESeq2 dispersion fallback**: on small test datasets the standard mean-dispersion trend sometimes cannot be fitted; in that case the pipeline uses gene-wise dispersion estimates instead. |
| 125 | +- **Configurable strandedness**: the `--strandedness` parameter sets featureCounts strandedness. The default is reverse-stranded (standard for Illumina dUTP protocols); the test data is unstranded. |
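To make the "reads sampled from the reference" idea in the test-profile bullet concrete, here is a minimal Python sketch of drawing paired-end reads from a random genome. It is an illustration under stated assumptions, not the contents of `test/create_test_data.py`: every function name, the read and fragment lengths, and the constant quality string are hypothetical.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def random_genome(length, rng):
    """Build a random ACGT string of the requested length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def sample_read_pair(genome, read_len, fragment_len, rng):
    """Pick a random fragment and return its two end reads.

    Read 1 is the fragment's 5' end on the forward strand; read 2 is the
    reverse complement of the fragment's 3' end, mimicking paired-end layout.
    """
    start = rng.randrange(len(genome) - fragment_len + 1)
    fragment = genome[start:start + fragment_len]
    return fragment[:read_len], reverse_complement(fragment[-read_len:])


def fastq_record(name, seq, quality_char="I"):
    """Format one FASTQ record with a constant placeholder quality string."""
    return f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n"


rng = random.Random(42)  # fixed seed so the synthetic data is reproducible
genome = random_genome(2000, rng)
read1, read2 = sample_read_pair(genome, read_len=50, fragment_len=200, rng=rng)
print(fastq_record("pair0/1", read1), end="")
print(fastq_record("pair0/2", read2), end="")
```

Because each read is an exact substring of the reference or its reverse complement, an aligner should place it without mismatches, which is what makes this kind of data convenient for a fast smoke test.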
| 126 | + |
| 127 | +## Licence |
| 128 | + |
| 129 | +MIT |