
Commit 1dafc79

Update README for airway dataset, add CI badge, dataset table
1 parent 60ee740 commit 1dafc79

2 files changed

Lines changed: 46 additions & 42 deletions

README.md

Lines changed: 46 additions & 40 deletions
Original file line number | Diff line number | Diff line change
@@ -1,11 +1,12 @@
11
# RNA-seq Nextflow Pipeline
22

3+
[![CI](https://github.com/Ekin-Kahraman/rnaseq-nextflow-pipeline/actions/workflows/ci.yml/badge.svg)](https://github.com/Ekin-Kahraman/rnaseq-nextflow-pipeline/actions/workflows/ci.yml)
34
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
45
[![Nextflow](https://img.shields.io/badge/Nextflow-%E2%89%A524.0-brightgreen)](https://www.nextflow.io/)
56

6-
End-to-end bulk RNA-seq pipeline in Nextflow DSL2: raw FASTQ reads through quality control, alignment, gene quantification, and differential expression. Every step runs in its own Docker container.
7+
End-to-end bulk RNA-seq pipeline in Nextflow DSL2: raw FASTQ reads through quality control, adapter trimming, genome alignment, gene quantification, differential expression, and aggregated QC reporting. Every step runs in its own Docker container.
78

8-
Designed for the GSE152075 SARS-CoV-2 nasopharyngeal dataset (6 samples: 3 COVID-positive, 3 negative). For the full covariate-adjusted statistical analysis on 484 samples, see [bulk-rnaseq-differential-expression](https://github.com/Ekin-Kahraman/bulk-rnaseq-differential-expression).
9+
Demonstrated on the [Himes et al. (2014)](https://doi.org/10.1371/journal.pone.0099625) airway smooth muscle dataset — dexamethasone-treated vs untreated human airway cells. For covariate-adjusted analysis on a larger COVID-19 cohort, see [bulk-rnaseq-differential-expression](https://github.com/Ekin-Kahraman/bulk-rnaseq-differential-expression).
910

1011
## Workflow
1112

@@ -25,13 +26,13 @@ FASTQ (paired-end)
2526
samtools ─────── Sort and index BAM
2627
2728
28-
featureCounts ── Gene-level quantification
29+
featureCounts ── Gene-level quantification (Gencode v38)
2930
3031
31-
DESeq2 ──────── Differential expression (positive vs negative)
32+
DESeq2 ──────── Differential expression + PCA + volcano plot
3233
3334
34-
MultiQC ─────── Aggregate QC report
35+
MultiQC ─────── Aggregate QC report across all samples
3536
```
3637

3738
## Processes
@@ -46,13 +47,24 @@ FASTQ (paired-end)
4647
| DESEQ2 | DESeq2 1.42 + ggplot2 | `quay.io/biocontainers/bioconductor-deseq2` |
4748
| MULTIQC | MultiQC 1.27 | `quay.io/biocontainers/multiqc` |
4849

49-
All containers sourced from [BioContainers](https://biocontainers.pro/). No custom Dockerfiles.
50+
All containers sourced from [BioContainers](https://biocontainers.pro/).
51+
52+
## Dataset
53+
54+
**Himes et al. (2014)** — RNA-seq of human airway smooth muscle cells treated with dexamethasone (a glucocorticoid anti-inflammatory). GEO accession [GSE52778](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778). This dataset is used in the [DESeq2 vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) and the [Bioconductor RNA-seq workflow](https://www.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html).
55+
56+
| Sample | Accession | Condition | Donor |
57+
|--------|-----------|-----------|-------|
58+
| N61311_untreated | SRR1039508 | untreated | N61311 |
59+
| N61311_Dex | SRR1039509 | dexamethasone | N61311 |
60+
| N052611_untreated | SRR1039512 | untreated | N052611 |
61+
| N052611_Dex | SRR1039513 | dexamethasone | N052611 |
5062

5163
## Quick Start
5264

5365
**Prerequisites:** [Nextflow](https://www.nextflow.io/) (>=24.0), [Docker](https://www.docker.com/), Java (>=11)
5466

55-
### Test data (synthetic, runs in ~2 minutes)
67+
### Test data (synthetic, ~2 minutes)
5668

5769
```bash
5870
git clone https://github.com/Ekin-Kahraman/rnaseq-nextflow-pipeline.git
@@ -63,66 +75,60 @@ nextflow run main.nf -profile test,docker \
6375
--gtf "$(pwd)/test/genes.gtf"
6476
```
6577

66-
### Real data (GSE152075)
78+
### Real data (airway dataset)
6779

6880
```bash
69-
# Download HISAT2 GRCh38 index (~4GB)
70-
# Download FASTQ files from SRA (see assets/samplesheet.csv for accessions)
81+
# 1. Download HISAT2 GRCh38 index (~4GB)
82+
mkdir -p genome && cd genome
83+
wget https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
84+
tar xzf grch38_genome.tar.gz
85+
86+
# 2. Download Gencode v38 GTF
87+
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
88+
gunzip gencode.v38.annotation.gtf.gz
89+
cd ..
90+
91+
# 3. Download FASTQ files from ENA (see assets/samplesheet.csv for accessions)
92+
mkdir -p data
93+
# Example: wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz -O data/SRR1039508_1.fastq.gz
7194

95+
# 4. Run
7296
nextflow run main.nf -profile docker \
73-
--genome_index /path/to/grch38/genome \
74-
--gtf /path/to/gencode.v38.annotation.gtf \
75-
--samplesheet assets/samplesheet.csv
97+
--genome_index genome/grch38/genome \
98+
--gtf genome/gencode.v38.annotation.gtf
7699
```
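
The example `wget` line in step 3 follows ENA's FTP directory layout. A minimal Python sketch of that pattern is below, which prints one `wget` command per mate for the four runs listed in the Dataset table. The helper names `ena_fastq_url` and `download_plan` are hypothetical and not part of this repository. The rule used for the zero-padded subdirectory (`008` for `SRR1039508`) is an assumption generalised from the single example above, so check a URL before bulk downloading.

```python
# Sketch only: ena_fastq_url and download_plan are hypothetical helpers,
# not part of the pipeline. The directory rule is inferred from the single
# example URL in the README.

ENA_FASTQ_ROOT = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"

# Run accessions from the Dataset table.
ACCESSIONS = ["SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513"]


def ena_fastq_url(accession: str, read: int) -> str:
    """Build the ENA FTP URL for one mate (1 or 2) of a paired-end run."""
    if read not in (1, 2):
        raise ValueError("read must be 1 or 2")
    prefix = accession[:6]
    # Assumption: accessions longer than 9 characters live in an extra
    # subdirectory made of the trailing characters, zero-padded to 3 digits.
    extra = accession[9:]
    subdir = f"{extra.zfill(3)}/" if extra else ""
    return f"{ENA_FASTQ_ROOT}/{prefix}/{subdir}{accession}/{accession}_{read}.fastq.gz"


def download_plan(accessions, outdir="data"):
    """Return (url, destination) pairs for both mates of every run."""
    return [
        (ena_fastq_url(acc, read), f"{outdir}/{acc}_{read}.fastq.gz")
        for acc in accessions
        for read in (1, 2)
    ]


if __name__ == "__main__":
    for url, dest in download_plan(ACCESSIONS):
        print(f"wget {url} -O {dest}")
```

The destination paths match the `data/<accession>_<read>.fastq.gz` layout used in `assets/samplesheet.csv`.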
77100

78101
## Parameters
79102

80103
| Parameter | Default | Description |
81104
|-----------|---------|-------------|
82-
| `--samplesheet` | `assets/samplesheet.csv` | CSV with columns: sample_id, fastq_1, fastq_2, condition |
105+
| `--samplesheet` | `assets/samplesheet.csv` | CSV: sample_id, fastq_1, fastq_2, condition |
83106
| `--genome_index` | required | HISAT2 index prefix |
84107
| `--gtf` | required | Gene annotation GTF |
85108
| `--outdir` | `results` | Output directory |
86109
| `--strandedness` | `2` (reverse) | featureCounts strandedness (0/1/2) |
110+
| `--ref_condition` | `untreated` | DESeq2 reference level |
87111

88112
## Output
89113

90114
```
91115
results/
92-
├── fastqc_raw/ Raw read QC reports (HTML + ZIP)
116+
├── fastqc_raw/ Raw read QC reports
93117
├── fastp/ Trimming reports (JSON)
94118
├── hisat2/ Alignment logs
95119
├── bam/ Sorted BAM files
96-
├── counts/ Gene count matrix (featureCounts)
97-
├── deseq2/ DE results CSV, volcano plot, PCA plot
120+
├── counts/ Gene count matrix
121+
├── deseq2/ DE results, volcano plot, PCA plot
98122
└── multiqc/ Aggregated QC report
99123
```
100124

101-
## Project Structure
102-
103-
```
104-
rnaseq-nextflow-pipeline/
105-
├── main.nf Pipeline (7 processes, Nextflow DSL2)
106-
├── nextflow.config Parameters, containers, profiles
107-
├── assets/
108-
│ └── samplesheet.csv Sample metadata (SRA accessions)
109-
├── test/
110-
│ ├── create_test_data.py Generate synthetic test data
111-
│ ├── samplesheet.csv Test sample metadata
112-
│ ├── genome.fa Synthetic genome (50 genes)
113-
│ └── genes.gtf Synthetic annotation
114-
├── LICENSE
115-
└── README.md
116-
```
117-
118125
## Design Decisions
119126

120-
- **HISAT2 over STAR** — runs on 8GB RAM. STAR requires 32GB for the human genome index. Anyone can clone and run this pipeline.
121-
- **BioContainers, not custom Dockerfiles** — industry standard, maintained by the community, reproducible without building.
122-
- **Separate samtools process** — HISAT2 and samtools in their own containers. Clean separation of concerns.
123-
- **Test profile** — synthetic 50-gene genome with reads sampled from the reference. Runs in ~2 minutes. Verifies the pipeline without downloading 30GB of real data.
124-
- **DESeq2 dispersion fallback** — handles small test datasets where standard dispersion fitting fails. Uses gene-wise estimates when the mean-dispersion trend cannot be fitted.
125-
- **Configurable strandedness**: `--strandedness` parameter for featureCounts. Default reverse-stranded (standard for Illumina dUTP protocols), unstranded for test data.
127+
- **HISAT2 over STAR** — runs on 8GB RAM. STAR requires 32GB for the human genome. Accessible on any machine.
128+
- **BioContainers** — published, maintained Docker containers. No custom builds.
129+
- **Configurable reference level**: `--ref_condition` sets the DESeq2 baseline. Works with any experimental design.
130+
- **Adaptive gene filter** — automatically adjusts minimum count threshold based on library size (stringent for real data, permissive for test data).
131+
- **Test profile** — synthetic 50-gene genome with genome-sampled reads. Verifies the full pipeline in ~2 minutes.
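
The README does not show how the adaptive gene filter picks its threshold. The sketch below shows one plausible way to tie a minimum count to library size, purely to illustrate the idea. The functions `adaptive_min_count` and `filter_genes`, the counts-per-million cutoff of 0.5, and the two-sample requirement are all assumptions, not the pipeline's actual rule.

```python
# Hypothetical illustration of a library-size-aware gene filter.
# None of these names or defaults come from the pipeline itself.
from statistics import median


def adaptive_min_count(library_sizes, cpm_cutoff=0.5, floor=1):
    """Raw count equal to `cpm_cutoff` counts per million in the median library.

    Large real libraries give a stricter threshold; tiny synthetic
    libraries fall back to `floor`, so test genes are not all removed.
    """
    threshold = cpm_cutoff * median(library_sizes) / 1_000_000
    return max(floor, round(threshold))


def filter_genes(counts, min_count, min_samples=2):
    """Keep genes with at least `min_count` reads in `min_samples` samples.

    `counts` maps gene ID to a list of per-sample read counts.
    """
    return {
        gene: row
        for gene, row in counts.items()
        if sum(c >= min_count for c in row) >= min_samples
    }


if __name__ == "__main__":
    real = adaptive_min_count([20_000_000, 24_000_000])
    test = adaptive_min_count([2_000, 3_000])
    print(f"real-data threshold: {real}, test-data threshold: {test}")
```

Under these assumed defaults, libraries of about 20 million reads require roughly ten reads per gene, while the synthetic test libraries only require one.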
126132

127133
## Licence
128134

assets/samplesheet.csv

Lines changed: 0 additions & 2 deletions
@@ -1,5 +1,3 @@
11
sample_id,fastq_1,fastq_2,condition
22
SRR1039508,data/SRR1039508_1.fastq.gz,data/SRR1039508_2.fastq.gz,untreated
33
SRR1039509,data/SRR1039509_1.fastq.gz,data/SRR1039509_2.fastq.gz,dexamethasone
4-
SRR1039512,data/SRR1039512_1.fastq.gz,data/SRR1039512_2.fastq.gz,untreated
5-
SRR1039513,data/SRR1039513_1.fastq.gz,data/SRR1039513_2.fastq.gz,dexamethasone
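
After this change the samplesheet holds two runs. A small pre-flight check along the following lines could confirm that a sheet has the four columns named in the Parameters table and report how many samples each condition has. This is a sketch in Python using only the standard library. `read_samplesheet` and `samples_per_condition` are hypothetical helpers, and the pipeline's own parsing in `main.nf` may differ.

```python
# Sketch of a samplesheet check; the helpers are hypothetical and are not
# taken from main.nf.
import csv
import io
from collections import Counter

REQUIRED_COLUMNS = ["sample_id", "fastq_1", "fastq_2", "condition"]

# Contents of assets/samplesheet.csv after this commit.
SHEET = """sample_id,fastq_1,fastq_2,condition
SRR1039508,data/SRR1039508_1.fastq.gz,data/SRR1039508_2.fastq.gz,untreated
SRR1039509,data/SRR1039509_1.fastq.gz,data/SRR1039509_2.fastq.gz,dexamethasone
"""


def read_samplesheet(text):
    """Parse CSV text and fail early on missing columns, blanks, or duplicates."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {missing}")
    rows = list(reader)
    for row in rows:
        empty = [c for c in REQUIRED_COLUMNS if not (row[c] or "").strip()]
        if empty:
            raise ValueError(f"row {row['sample_id']!r} has empty fields: {empty}")
    duplicates = [s for s, n in Counter(r["sample_id"] for r in rows).items() if n > 1]
    if duplicates:
        raise ValueError(f"duplicate sample_id values: {duplicates}")
    return rows


def samples_per_condition(rows):
    """Count samples in each condition."""
    return Counter(row["condition"] for row in rows)


if __name__ == "__main__":
    for condition, n in samples_per_condition(read_samplesheet(SHEET)).items():
        print(f"{condition}: {n} sample(s)")
```

The per-condition counts are worth checking before a run, since DESeq2 estimates dispersion from replicates within each condition.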
