Snakemake_RNASeq - Production-Ready RNA-seq Pipeline (v2.1)

Overview

A comprehensive, expert-level Snakemake pipeline for RNA-seq analysis supporting both single-end (SE) and paired-end (PE) reads with proper strandedness handling and flexible quantification strategies.

Key Features

🎯 Critical Bioinformatics Improvements (v2.1)

Proper Strandedness Support - Maps strandedness metadata to tool-specific parameters:
- Salmon: -l U (unstranded), -l ISR (fr-firststrand), -l ISF (fr-secondstrand)
- HTSeq: -s no (unstranded), -s reverse (fr-firststrand), -s yes (fr-secondstrand)
- STAR: --outSAMstrandField intronMotif integration
Single-End & Paired-End Detection - Automatic conditional logic:
- Single-end: salmon quant -r reads.fq ...
- Paired-end: salmon quant -1 r1.fq -2 r2.fq ...
Metadata-Driven Configuration - Master list schema with:
- Required: sample, runID, fastq1, strandedness (unstranded/fr-firststrand/fr-secondstrand)
- Optional: fastq2 (nullable for SE), read_length, condition, tissue, batch, replicate, sequencing_platform, adapter_sequence
Enhanced Error Handling & Validation:
- Pre-flight input validation (file existence, parameter correctness)
- Timestamped logging (logs/pipeline_YYYYMMDD_HHMMSS.log)
- Clear error messages with troubleshooting guidance

🔧 Pipeline Features

Flexible Alignment Workflows:
- STAR + featureCounts/HTSeq/Arriba (full alignment)
- Salmon (pseudoalignment - fast & accurate)
- Switch via ALIGNMENT_TOOL in config.yaml
Quality Control:
- FastQC (pre/post-trimming)
- MultiQC (unified QC report)
- Optional: RSeQC-style analysis
Comprehensive Quantification:
- Salmon pseudoalignment (TPM/counts)
- HTSeq-count (exon-level resolution, configurable strandedness)
- featureCounts (gene-level summary)
- Optional: Arriba fusion detection
Visualization:
- BigWig generation for genome browser visualization
- deepTools integration for visualization

Quick Start

1. Setup Master List

Create master_list.txt with required columns:

sample	runID	fastq1	fastq2	strandedness	condition	tissue
WT_rep1	run1	path/to/WT_1.fq.gz	path/to/WT_2.fq.gz	fr-firststrand	wildtype	lung
KO_rep1	run1	path/to/KO_1.fq.gz	path/to/KO_2.fq.gz	fr-secondstrand	knockout	lung
SE_sample	run1	path/to/SE.fq.gz	NA	unstranded	control	brain

Strandedness Options:

unstranded: Non-strand-specific libraries (dUTP, NSR, NONE methods)
fr-firststrand: First-strand synthesis (dUTP, NSR, NONE-stranded, Illumina TruSeq)
fr-secondstrand: Second-strand synthesis (Illumina Standard, ISP methods)

Note: For single-end reads, set fastq2 to NA

2. Configure Pipeline

Edit config.yaml:

ALIGNMENT_TOOL: "salmon"  # or "star"
MIN_SCAFFOLD_EXON_LENGTH: 50

# Optional tools
DO_ARRIBA: false
DO_BIGWIG: true
DO_FEATURECOUNTS: true
DO_HTSEQ: true

3. Run Pipeline

Test run (dry-run):

./Go_pipeline.sh 8 true

Full run with 8 cores:

./Go_pipeline.sh 8

Help:

./Go_pipeline.sh --help

Environment Variables:

LATENCY_WAIT=120 ./Go_pipeline.sh 8  # Adjust latency (default: 120s)

Code Structure

Core Components

SCRIPTS/functions.py (168 lines)

Master list loading with selective dtype handling
Metadata accessors: get_fastq_pair(), get_strandedness(), get_sample_metadata()
Input/output aggregation for Snakemake rules

Snakefile (560 lines)

Helper functions for strandedness mapping
Rules: FastQC, Cutadapt, STAR, Salmon, HTSeq, featureCounts, Arriba, BigWig, MultiQC
Dynamic rule selection based on alignment tool configuration

Go_pipeline.sh (181 lines)

Input validation (CORES, file existence, parameter correctness)
Auto-logging with timestamped output
Dry-run capability for testing
Clear error messages with troubleshooting steps

SCRIPTS/master_list.schema.yaml (82 lines)

JSON Schema Draft-07 validation
Required field patterns and constraints
Optional metadata fields with bioinformatics context

Expert-Level Improvements (v2.1)

Strandedness Handling (CRITICAL)

Problem Solved: Previous versions used auto-detect for Salmon (-l A) and hardcoded unstranded for HTSeq (-s no), leading to incorrect quantification (up to 50% error for strand-specific libraries).

Solution Implemented:

Strandedness is now a required field in master_list.txt
Helper functions map strandedness to tool-specific parameters:
- get_salmon_libtype(sample): Returns U, ISR, or ISF
- get_htseq_strand(sample): Returns no, reverse, or yes
Rules pull strandedness dynamically and apply correct parameters

Impact: Accurate quantification for all library preparation methods

Single-End Read Support

Problem Solved: Pipeline assumed paired-end reads; SE samples failed silently or with cryptic errors.

Solution Implemented:

Salmon rule detects read type dynamically:

if [ -z "$R2" ]; then
  salmon quant -r reads.fq ...
else
  salmon quant -1 r1.fq -2 r2.fq ...
fi

Schema allows fastq2 to be nullable (NA for SE)
Input functions handle missing pairs gracefully

Impact: Mixed SE/PE projects now supported seamlessly

Code Quality Enhancements

Deduplication: Reduced input function code from 4 separate implementations to single get_fastq_pair() helper

Error Handling:

Pre-flight validation catches issues before execution
Timestamped logs for debugging
Clear error messages guide users to solutions

Metadata Support:

Condition, tissue, batch, replicate fields enable batch effect tracking
Optional platform/adapter info for contamination analysis

Workflow Examples

Example 1: Salmon-Only (Fast)

# config.yaml
ALIGNMENT_TOOL: "salmon"
DO_ARRIBA: false
DO_BIGWIG: false
DO_HTSEQ: false
DO_FEATURECOUNTS: false

Output: Salmon quantifications (TPM/counts)

Example 2: STAR + HTSeq (Detailed)

# config.yaml
ALIGNMENT_TOOL: "star"
DO_ARRIBA: false
DO_HTSEQ: true
DO_FEATURECOUNTS: false

Output: Aligned BAM files + HTSeq counts (strand-aware)

Example 3: STAR + Fusion Detection

# config.yaml
ALIGNMENT_TOOL: "star"
DO_ARRIBA: true
DO_HTSEQ: true
DO_BIGWIG: true

Output: Alignments + Fusions + HTSeq counts + Visualization

Technical Details

Environment Setup

Each tool has a dedicated conda environment in ENVS/:

salmon.yaml     # Salmon pseudoalignment
star.yaml       # STAR aligner
htseq.yaml      # HTSeq counting
featurecounts.yaml  # Subread featureCounts
samtools.yaml   # BAM manipulation
deeptools.yaml  # BigWig generation
arriba.yaml     # Fusion detection

Activate manually:

conda env create -f ENVS/salmon.yaml
conda activate salmon

Resource Configuration

Edit Snakefile rule sections for compute resource allocation:

rule salmon_quants:
    threads: 8
    resources:
        mem_gb=16,
        runtime_hours=4

Log Structure

Pipeline creates timestamped logs:

logs/
├── pipeline_20250101_120000.log      # Main pipeline output
├── salmon_quants.log                 # Rule-specific logs
├── htseq_counts.log
└── ...

Master List Schema

Required Fields:

sample: Sample identifier (alphanumeric, -, _)
runID: Technical replicate batch ID
fastq1: Path to forward reads (gzipped FASTQ)
strandedness: Library preparation strand type

Optional Fields (batch effect tracking):

fastq2: Reverse reads (leave NA for single-end)
condition: Experimental condition (e.g., wildtype, knockout)
tissue: Tissue/cell type source
batch: Sequencing batch for batch correction
replicate: Biological replicate number
read_length: Sequencing read length (default: 150)
sequencing_platform: Platform used (Illumina, PacBio, etc.)
adapter_sequence: Expected adapter sequence (for contamination check)

Performance Notes

Salmon (Recommended for Speed):

~30M reads/hour on single core
Memory: ~2-4 GB per sample
Total runtime: 15-30 minutes for 100M reads (8 cores)

STAR + HTSeq:

More precise exon-level quantification
Higher memory (8-16 GB for large genomes)
Runtime: 2-4 hours for 100M reads (8 cores)

Multi-Run Aggregation:

Multiple runID entries per sample are automatically merged
Useful for technical replicates from different sequencing lanes

Validation

Check master_list.txt validity:

# Schema is JSON Schema Draft-07 compatible
# Use any JSON Schema validator with SCRIPTS/master_list.schema.yaml

Pipeline automatically validates:

File existence (fastq1, fastq2, genome, annotation)
Strandedness values (unstranded/fr-firststrand/fr-secondstrand)
Core count (positive integer)
Configuration syntax

Troubleshooting

"strandedness not found in master_list.txt"

Ensure master_list.txt has a strandedness column
Check for typos in column header
Validate against SCRIPTS/master_list.schema.yaml

"HTSeq gave unexpected results"

Verify strandedness is correctly specified
Check annotation GTF format compatibility
Review HTSeq logs in logs/ directory

"Salmon SE/PE detection failed"

Confirm fastq2 is properly NA for SE samples
Check file paths are correct
Verify FASTQ format (gzipped or uncompressed)

Pipeline Hangs

Increase LATENCY_WAIT: LATENCY_WAIT=300 ./Go_pipeline.sh 8
Check disk space (NGS can be large)
Review logs for stalled rules

Citation

If you use this pipeline, please cite:

Snakemake: Mölder, F., et al. (2021)
Salmon: Patro, R., et al. (2017)
STAR: Dobin, A., et al. (2013)
HTSeq: Anders, S., et al. (2015)

License

See LICENSE file for details

Author

Masoud Zaka

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
ENVS		ENVS
SCRIPTS		SCRIPTS
Go_pipeline.sh		Go_pipeline.sh
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile
config.yaml		config.yaml
master_list_example.txt		master_list_example.txt
rnaseq_workflow.png		rnaseq_workflow.png

Folders and files

Latest commit

History

Repository files navigation

Snakemake_RNASeq - Production-Ready RNA-seq Pipeline (v2.1)

Overview

Key Features

🎯 Critical Bioinformatics Improvements (v2.1)

🔧 Pipeline Features

Quick Start

1. Setup Master List

2. Configure Pipeline

3. Run Pipeline

Code Structure

Core Components

Expert-Level Improvements (v2.1)

Strandedness Handling (CRITICAL)

Single-End Read Support

Code Quality Enhancements

Workflow Examples

Example 1: Salmon-Only (Fast)

Example 2: STAR + HTSeq (Detailed)

Example 3: STAR + Fusion Detection

Technical Details

Environment Setup

Resource Configuration

Log Structure

Master List Schema

Performance Notes

Validation

Troubleshooting

Citation

License

Author

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages