Update README for airway dataset, add CI badge and dataset table, trim samplesheet to two samples

Ekin-Kahraman · Ekin-Kahraman · commit 1dafc797bec6 · 2026-04-05T03:10:11.000+01:00
diff --git a/README.md b/README.md
@@ -1,11 +1,12 @@
 # RNA-seq Nextflow Pipeline
 
+[![CI](https://github.com/Ekin-Kahraman/rnaseq-nextflow-pipeline/actions/workflows/ci.yml/badge.svg)](https://github.com/Ekin-Kahraman/rnaseq-nextflow-pipeline/actions/workflows/ci.yml)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 [![Nextflow](https://img.shields.io/badge/Nextflow-%E2%89%A524.0-brightgreen)](https://www.nextflow.io/)
 
-End-to-end bulk RNA-seq pipeline in Nextflow DSL2: raw FASTQ reads through quality control, alignment, gene quantification, and differential expression. Every step runs in its own Docker container.
+End-to-end bulk RNA-seq pipeline in Nextflow DSL2: raw FASTQ reads through quality control, adapter trimming, genome alignment, gene quantification, differential expression, and aggregated QC reporting. Every step runs in its own Docker container.
 
-Designed for the GSE152075 SARS-CoV-2 nasopharyngeal dataset (6 samples: 3 COVID-positive, 3 negative). For the full covariate-adjusted statistical analysis on 484 samples, see [bulk-rnaseq-differential-expression](https://github.com/Ekin-Kahraman/bulk-rnaseq-differential-expression).
+Demonstrated on the [Himes et al. (2014)](https://doi.org/10.1371/journal.pone.0099625) airway smooth muscle dataset — dexamethasone-treated vs untreated human airway cells. For covariate-adjusted analysis on a larger COVID-19 cohort, see [bulk-rnaseq-differential-expression](https://github.com/Ekin-Kahraman/bulk-rnaseq-differential-expression).
 
 ## Workflow
 
@@ -25,13 +26,13 @@ FASTQ (paired-end)
  samtools ─────── Sort and index BAM
     │
     ▼
- featureCounts ── Gene-level quantification
+ featureCounts ── Gene-level quantification (Gencode v38)
     │
     ▼
- DESeq2 ──────── Differential expression (positive vs negative)
+ DESeq2 ──────── Differential expression + PCA + volcano plot
     │
     ▼
- MultiQC ─────── Aggregate QC report
+ MultiQC ─────── Aggregate QC report across all samples
 ```
 
 ## Processes
@@ -46,13 +47,24 @@ FASTQ (paired-end)
 | DESEQ2 | DESeq2 1.42 + ggplot2 | `quay.io/biocontainers/bioconductor-deseq2` |
 | MULTIQC | MultiQC 1.27 | `quay.io/biocontainers/multiqc` |
 
-All containers sourced from [BioContainers](https://biocontainers.pro/). No custom Dockerfiles.
+All containers sourced from [BioContainers](https://biocontainers.pro/).
+
+## Dataset
+
+**Himes et al. (2014)** — RNA-seq of human airway smooth muscle cells treated with dexamethasone (a glucocorticoid anti-inflammatory). GEO accession [GSE52778](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778). This dataset is used in the [DESeq2 vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) and the [Bioconductor RNA-seq workflow](https://www.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html).
+
+| Sample | Accession | Condition | Donor |
+|--------|-----------|-----------|-------|
+| N61311_untreated | SRR1039508 | untreated | N61311 |
+| N61311_Dex | SRR1039509 | dexamethasone | N61311 |
+| N052611_untreated | SRR1039512 | untreated | N052611 |
+| N052611_Dex | SRR1039513 | dexamethasone | N052611 |
 
 ## Quick Start
 
 **Prerequisites:** [Nextflow](https://www.nextflow.io/) (>=24.0), [Docker](https://www.docker.com/), Java (>=11)
 
-### Test data (synthetic, runs in ~2 minutes)
+### Test data (synthetic, ~2 minutes)
 
 ```bash
 git clone https://github.com/Ekin-Kahraman/rnaseq-nextflow-pipeline.git
@@ -63,66 +75,60 @@ nextflow run main.nf -profile test,docker \
     --gtf "$(pwd)/test/genes.gtf"
 ```
 
-### Real data (GSE152075)
+### Real data (airway dataset)
 
 ```bash
-# Download HISAT2 GRCh38 index (~4GB)
-# Download FASTQ files from SRA (see assets/samplesheet.csv for accessions)
+# 1. Download HISAT2 GRCh38 index (~4GB)
+mkdir -p genome && cd genome
+wget https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
+tar xzf grch38_genome.tar.gz
+
+# 2. Download Gencode v38 GTF
+wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
+gunzip gencode.v38.annotation.gtf.gz
+cd ..
+
+# 3. Download FASTQ files from ENA (see assets/samplesheet.csv for accessions)
+mkdir -p data
+# Example: wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR103/008/SRR1039508/SRR1039508_1.fastq.gz -O data/SRR1039508_1.fastq.gz
 
+# 4. Run
 nextflow run main.nf -profile docker \
-    --genome_index /path/to/grch38/genome \
-    --gtf /path/to/gencode.v38.annotation.gtf \
-    --samplesheet assets/samplesheet.csv
+    --genome_index genome/grch38/genome \
+    --gtf genome/gencode.v38.annotation.gtf
 ```
 
 ## Parameters
 
 | Parameter | Default | Description |
 |-----------|---------|-------------|
-| `--samplesheet` | `assets/samplesheet.csv` | CSV with columns: sample_id, fastq_1, fastq_2, condition |
+| `--samplesheet` | `assets/samplesheet.csv` | CSV: sample_id, fastq_1, fastq_2, condition |
 | `--genome_index` | required | HISAT2 index prefix |
 | `--gtf` | required | Gene annotation GTF |
 | `--outdir` | `results` | Output directory |
 | `--strandedness` | `2` (reverse) | featureCounts strandedness (0/1/2) |
+| `--ref_condition` | `untreated` | DESeq2 reference level |
 
 ## Output
 
 ```
 results/
-├── fastqc_raw/       Raw read QC reports (HTML + ZIP)
+├── fastqc_raw/       Raw read QC reports
 ├── fastp/            Trimming reports (JSON)
 ├── hisat2/           Alignment logs
 ├── bam/              Sorted BAM files
-├── counts/           Gene count matrix (featureCounts)
-├── deseq2/           DE results CSV, volcano plot, PCA plot
+├── counts/           Gene count matrix
+├── deseq2/           DE results, volcano plot, PCA plot
 └── multiqc/          Aggregated QC report
 ```
 
-## Project Structure
-
-```
-rnaseq-nextflow-pipeline/
-├── main.nf              Pipeline (7 processes, Nextflow DSL2)
-├── nextflow.config      Parameters, containers, profiles
-├── assets/
-│   └── samplesheet.csv  Sample metadata (SRA accessions)
-├── test/
-│   ├── create_test_data.py   Generate synthetic test data
-│   ├── samplesheet.csv       Test sample metadata
-│   ├── genome.fa             Synthetic genome (50 genes)
-│   └── genes.gtf             Synthetic annotation
-├── LICENSE
-└── README.md
-```
-
 ## Design Decisions
 
-- **HISAT2 over STAR** — runs on 8GB RAM. STAR requires 32GB for the human genome index. Anyone can clone and run this pipeline.
-- **BioContainers, not custom Dockerfiles** — industry standard, maintained by the community, reproducible without building.
-- **Separate samtools process** — HISAT2 and samtools in their own containers. Clean separation of concerns.
-- **Test profile** — synthetic 50-gene genome with reads sampled from the reference. Runs in ~2 minutes. Verifies the pipeline without downloading 30GB of real data.
-- **DESeq2 dispersion fallback** — handles small test datasets where standard dispersion fitting fails. Uses gene-wise estimates when the mean-dispersion trend cannot be fitted.
-- **Configurable strandedness** — `--strandedness` parameter for featureCounts. Default reverse-stranded (standard for Illumina dUTP protocols), unstranded for test data.
+- **HISAT2 over STAR**: HISAT2 runs on 8GB RAM, whereas STAR requires 32GB for the human genome, so the pipeline stays usable on a laptop-class machine.
+- **BioContainers**: published, maintained Docker containers. No custom builds.
+- **Configurable reference level**: `--ref_condition` sets the DESeq2 baseline, so the contrast is not tied to one dataset's condition labels.
+- **Adaptive gene filter**: automatically adjusts the minimum count threshold based on library size (stringent for real data, permissive for test data).
+- **Test profile**: synthetic 50-gene genome with reads sampled from that genome. Verifies the full pipeline in ~2 minutes.
 
 ## Licence
 
diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,5 +1,3 @@
 sample_id,fastq_1,fastq_2,condition
 SRR1039508,data/SRR1039508_1.fastq.gz,data/SRR1039508_2.fastq.gz,untreated
 SRR1039509,data/SRR1039509_1.fastq.gz,data/SRR1039509_2.fastq.gz,dexamethasone
-SRR1039512,data/SRR1039512_1.fastq.gz,data/SRR1039512_2.fastq.gz,untreated
-SRR1039513,data/SRR1039513_1.fastq.gz,data/SRR1039513_2.fastq.gz,dexamethasone
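
Reviewer note on step 3 of the new Quick Start: the README gives a single example `wget` line and points to `assets/samplesheet.csv` for the remaining accessions. A minimal Python sketch of how the full set of download commands could be generated is below. It assumes that `sample_id` is the run accession (true for the rows kept in this commit) and that ENA's FASTQ directory layout generalises from the pattern in the README example (`SRR103/008/SRR1039508/`). The helper names `ena_fastq_urls` and `download_plan` are hypothetical and not part of the repository.

```python
import csv
import io

# Root taken from the example URL in the README's Quick Start.
ENA_FASTQ_ROOT = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"

# Samplesheet contents as they stand after this commit.
SAMPLESHEET = """\
sample_id,fastq_1,fastq_2,condition
SRR1039508,data/SRR1039508_1.fastq.gz,data/SRR1039508_2.fastq.gz,untreated
SRR1039509,data/SRR1039509_1.fastq.gz,data/SRR1039509_2.fastq.gz,dexamethasone
"""


def ena_fastq_urls(accession):
    """Return (read1_url, read2_url) for a paired-end run accession.

    Assumed layout: <first 6 chars>/<zero-padded trailing digits>/<accession>/,
    where the padded subdirectory only exists for accessions longer than
    nine characters (e.g. SRR1039508 -> SRR103/008/SRR1039508).
    """
    prefix = accession[:6]
    extra = accession[9:]  # characters beyond the ninth, if any
    subdir = f"{extra.zfill(3)}/" if extra else ""
    base = f"{ENA_FASTQ_ROOT}/{prefix}/{subdir}{accession}"
    return tuple(f"{base}/{accession}_{mate}.fastq.gz" for mate in (1, 2))


def download_plan(samplesheet_text):
    """Pair each ENA URL with the local path the samplesheet expects."""
    plan = []
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        url_1, url_2 = ena_fastq_urls(row["sample_id"])
        plan.append((url_1, row["fastq_1"]))
        plan.append((url_2, row["fastq_2"]))
    return plan


if __name__ == "__main__":
    # Print wget commands in the same form as the README example.
    for url, dest in download_plan(SAMPLESHEET):
        print(f"wget {url} -O {dest}")
```

Driving the downloads from the samplesheet keeps the local paths in `fastq_1`/`fastq_2` and the fetched files in step, which matters here because the README table lists four samples while the committed samplesheet now lists two.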