A comprehensive, expert-level Snakemake pipeline for RNA-seq analysis supporting both single-end (SE) and paired-end (PE) reads with proper strandedness handling and flexible quantification strategies.
-
Proper Strandedness Support - Maps strandedness metadata to tool-specific parameters:
- Salmon:
-l U(unstranded),-l ISR(fr-firststrand),-l ISF(fr-secondstrand) - HTSeq:
-s no(unstranded),-s reverse(fr-firststrand),-s yes(fr-secondstrand) - STAR:
--outSAMstrandField intronMotifintegration
- Salmon:
-
Single-End & Paired-End Detection - Automatic conditional logic:
- Single-end:
salmon quant -r reads.fq ... - Paired-end:
salmon quant -1 r1.fq -2 r2.fq ...
- Single-end:
-
Metadata-Driven Configuration - Master list schema with:
- Required: sample, runID, fastq1, strandedness (unstranded/fr-firststrand/fr-secondstrand)
- Optional: fastq2 (nullable for SE), read_length, condition, tissue, batch, replicate, sequencing_platform, adapter_sequence
-
Enhanced Error Handling & Validation:
- Pre-flight input validation (file existence, parameter correctness)
- Timestamped logging (logs/pipeline_YYYYMMDD_HHMMSS.log)
- Clear error messages with troubleshooting guidance
-
Flexible Alignment Workflows:
- STAR + featureCounts/HTSeq/Arriba (full alignment)
- Salmon (pseudoalignment - fast & accurate)
- Switch via
ALIGNMENT_TOOLin config.yaml
-
Quality Control:
- FastQC (pre/post-trimming)
- MultiQC (unified QC report)
- Optional: RSeQC-style analysis
-
Comprehensive Quantification:
- Salmon pseudoalignment (TPM/counts)
- HTSeq-count (exon-level resolution, configurable strandedness)
- featureCounts (gene-level summary)
- Optional: Arriba fusion detection
-
Visualization:
- BigWig generation for genome browser visualization
- deepTools integration for visualization
Create master_list.txt with required columns:
sample runID fastq1 fastq2 strandedness condition tissue
WT_rep1 run1 path/to/WT_1.fq.gz path/to/WT_2.fq.gz fr-firststrand wildtype lung
KO_rep1 run1 path/to/KO_1.fq.gz path/to/KO_2.fq.gz fr-secondstrand knockout lung
SE_sample run1 path/to/SE.fq.gz NA unstranded control brain
Strandedness Options:
unstranded: Non-strand-specific libraries (dUTP, NSR, NONE methods)fr-firststrand: First-strand synthesis (dUTP, NSR, NONE-stranded, Illumina TruSeq)fr-secondstrand: Second-strand synthesis (Illumina Standard, ISP methods)
Note: For single-end reads, set fastq2 to NA
Edit config.yaml:
ALIGNMENT_TOOL: "salmon" # or "star"
MIN_SCAFFOLD_EXON_LENGTH: 50
# Optional tools
DO_ARRIBA: false
DO_BIGWIG: true
DO_FEATURECOUNTS: true
DO_HTSEQ: trueTest run (dry-run):
./Go_pipeline.sh 8 trueFull run with 8 cores:
./Go_pipeline.sh 8Help:
./Go_pipeline.sh --helpEnvironment Variables:
LATENCY_WAIT=120 ./Go_pipeline.sh 8 # Adjust latency (default: 120s)SCRIPTS/functions.py (168 lines)
- Master list loading with selective dtype handling
- Metadata accessors:
get_fastq_pair(),get_strandedness(),get_sample_metadata() - Input/output aggregation for Snakemake rules
Snakefile (560 lines)
- Helper functions for strandedness mapping
- Rules: FastQC, Cutadapt, STAR, Salmon, HTSeq, featureCounts, Arriba, BigWig, MultiQC
- Dynamic rule selection based on alignment tool configuration
Go_pipeline.sh (181 lines)
- Input validation (CORES, file existence, parameter correctness)
- Auto-logging with timestamped output
- Dry-run capability for testing
- Clear error messages with troubleshooting steps
SCRIPTS/master_list.schema.yaml (82 lines)
- JSON Schema Draft-07 validation
- Required field patterns and constraints
- Optional metadata fields with bioinformatics context
Problem Solved: Previous versions used auto-detect for Salmon (-l A) and hardcoded unstranded for HTSeq (-s no), leading to incorrect quantification (up to 50% error for strand-specific libraries).
Solution Implemented:
- Strandedness is now a required field in master_list.txt
- Helper functions map strandedness to tool-specific parameters:
get_salmon_libtype(sample): Returns U, ISR, or ISFget_htseq_strand(sample): Returns no, reverse, or yes
- Rules pull strandedness dynamically and apply correct parameters
Impact: Accurate quantification for all library preparation methods
Problem Solved: Pipeline assumed paired-end reads; SE samples failed silently or with cryptic errors.
Solution Implemented:
- Salmon rule detects read type dynamically:
if [ -z "$R2" ]; then salmon quant -r reads.fq ... else salmon quant -1 r1.fq -2 r2.fq ... fi
- Schema allows fastq2 to be nullable (NA for SE)
- Input functions handle missing pairs gracefully
Impact: Mixed SE/PE projects now supported seamlessly
Deduplication: Reduced input function code from 4 separate implementations to single get_fastq_pair() helper
Error Handling:
- Pre-flight validation catches issues before execution
- Timestamped logs for debugging
- Clear error messages guide users to solutions
Metadata Support:
- Condition, tissue, batch, replicate fields enable batch effect tracking
- Optional platform/adapter info for contamination analysis
# config.yaml
ALIGNMENT_TOOL: "salmon"
DO_ARRIBA: false
DO_BIGWIG: false
DO_HTSEQ: false
DO_FEATURECOUNTS: falseOutput: Salmon quantifications (TPM/counts)
# config.yaml
ALIGNMENT_TOOL: "star"
DO_ARRIBA: false
DO_HTSEQ: true
DO_FEATURECOUNTS: falseOutput: Aligned BAM files + HTSeq counts (strand-aware)
# config.yaml
ALIGNMENT_TOOL: "star"
DO_ARRIBA: true
DO_HTSEQ: true
DO_BIGWIG: trueOutput: Alignments + Fusions + HTSeq counts + Visualization
Each tool has a dedicated conda environment in ENVS/:
salmon.yaml # Salmon pseudoalignment
star.yaml # STAR aligner
htseq.yaml # HTSeq counting
featurecounts.yaml # Subread featureCounts
samtools.yaml # BAM manipulation
deeptools.yaml # BigWig generation
arriba.yaml # Fusion detectionActivate manually:
conda env create -f ENVS/salmon.yaml
conda activate salmonEdit Snakefile rule sections for compute resource allocation:
rule salmon_quants:
threads: 8
resources:
mem_gb=16,
runtime_hours=4Pipeline creates timestamped logs:
logs/
├── pipeline_20250101_120000.log # Main pipeline output
├── salmon_quants.log # Rule-specific logs
├── htseq_counts.log
└── ...
Required Fields:
sample: Sample identifier (alphanumeric, -, _)runID: Technical replicate batch IDfastq1: Path to forward reads (gzipped FASTQ)strandedness: Library preparation strand type
Optional Fields (batch effect tracking):
fastq2: Reverse reads (leave NA for single-end)condition: Experimental condition (e.g., wildtype, knockout)tissue: Tissue/cell type sourcebatch: Sequencing batch for batch correctionreplicate: Biological replicate numberread_length: Sequencing read length (default: 150)sequencing_platform: Platform used (Illumina, PacBio, etc.)adapter_sequence: Expected adapter sequence (for contamination check)
Salmon (Recommended for Speed):
- ~30M reads/hour on single core
- Memory: ~2-4 GB per sample
- Total runtime: 15-30 minutes for 100M reads (8 cores)
STAR + HTSeq:
- More precise exon-level quantification
- Higher memory (8-16 GB for large genomes)
- Runtime: 2-4 hours for 100M reads (8 cores)
Multi-Run Aggregation:
- Multiple runID entries per sample are automatically merged
- Useful for technical replicates from different sequencing lanes
Check master_list.txt validity:
# Schema is JSON Schema Draft-07 compatible
# Use any JSON Schema validator with SCRIPTS/master_list.schema.yamlPipeline automatically validates:
- File existence (fastq1, fastq2, genome, annotation)
- Strandedness values (unstranded/fr-firststrand/fr-secondstrand)
- Core count (positive integer)
- Configuration syntax
"strandedness not found in master_list.txt"
- Ensure master_list.txt has a
strandednesscolumn - Check for typos in column header
- Validate against SCRIPTS/master_list.schema.yaml
"HTSeq gave unexpected results"
- Verify strandedness is correctly specified
- Check annotation GTF format compatibility
- Review HTSeq logs in logs/ directory
"Salmon SE/PE detection failed"
- Confirm fastq2 is properly NA for SE samples
- Check file paths are correct
- Verify FASTQ format (gzipped or uncompressed)
Pipeline Hangs
- Increase LATENCY_WAIT:
LATENCY_WAIT=300 ./Go_pipeline.sh 8 - Check disk space (NGS can be large)
- Review logs for stalled rules
If you use this pipeline, please cite:
- Snakemake: Mölder, F., et al. (2021)
- Salmon: Patro, R., et al. (2017)
- STAR: Dobin, A., et al. (2013)
- HTSeq: Anders, S., et al. (2015)
See LICENSE file for details
Masoud Zaka
