Skip to content

masoodzaka/Snakemake_RNASeq

Repository files navigation

Snakemake_RNASeq - Production-Ready RNA-seq Pipeline (v2.1)

Overview

A comprehensive, expert-level Snakemake pipeline for RNA-seq analysis supporting both single-end (SE) and paired-end (PE) reads with proper strandedness handling and flexible quantification strategies.

Key Features

🎯 Critical Bioinformatics Improvements (v2.1)

  • Proper Strandedness Support - Maps strandedness metadata to tool-specific parameters:

    • Salmon: -l U (unstranded), -l ISR (fr-firststrand), -l ISF (fr-secondstrand)
    • HTSeq: -s no (unstranded), -s reverse (fr-firststrand), -s yes (fr-secondstrand)
    • STAR: --outSAMstrandField intronMotif integration
  • Single-End & Paired-End Detection - Automatic conditional logic:

    • Single-end: salmon quant -r reads.fq ...
    • Paired-end: salmon quant -1 r1.fq -2 r2.fq ...
  • Metadata-Driven Configuration - Master list schema with:

    • Required: sample, runID, fastq1, strandedness (unstranded/fr-firststrand/fr-secondstrand)
    • Optional: fastq2 (nullable for SE), read_length, condition, tissue, batch, replicate, sequencing_platform, adapter_sequence
  • Enhanced Error Handling & Validation:

    • Pre-flight input validation (file existence, parameter correctness)
    • Timestamped logging (logs/pipeline_YYYYMMDD_HHMMSS.log)
    • Clear error messages with troubleshooting guidance

🔧 Pipeline Features

  • Flexible Alignment Workflows:

    • STAR + featureCounts/HTSeq/Arriba (full alignment)
    • Salmon (pseudoalignment - fast & accurate)
    • Switch via ALIGNMENT_TOOL in config.yaml
  • Quality Control:

    • FastQC (pre/post-trimming)
    • MultiQC (unified QC report)
    • Optional: RSeQC-style analysis
  • Comprehensive Quantification:

    • Salmon pseudoalignment (TPM/counts)
    • HTSeq-count (exon-level resolution, configurable strandedness)
    • featureCounts (gene-level summary)
    • Optional: Arriba fusion detection
  • Visualization:

    • BigWig generation for genome browser visualization
    • deepTools integration for visualization

Quick Start

1. Setup Master List

Create master_list.txt with required columns:

sample	runID	fastq1	fastq2	strandedness	condition	tissue
WT_rep1	run1	path/to/WT_1.fq.gz	path/to/WT_2.fq.gz	fr-firststrand	wildtype	lung
KO_rep1	run1	path/to/KO_1.fq.gz	path/to/KO_2.fq.gz	fr-secondstrand	knockout	lung
SE_sample	run1	path/to/SE.fq.gz	NA	unstranded	control	brain

Strandedness Options:

  • unstranded: Non-strand-specific libraries (dUTP, NSR, NONE methods)
  • fr-firststrand: First-strand synthesis (dUTP, NSR, NONE-stranded, Illumina TruSeq)
  • fr-secondstrand: Second-strand synthesis (Illumina Standard, ISP methods)

Note: For single-end reads, set fastq2 to NA

2. Configure Pipeline

Edit config.yaml:

ALIGNMENT_TOOL: "salmon"  # or "star"
MIN_SCAFFOLD_EXON_LENGTH: 50

# Optional tools
DO_ARRIBA: false
DO_BIGWIG: true
DO_FEATURECOUNTS: true
DO_HTSEQ: true

3. Run Pipeline

Test run (dry-run):

./Go_pipeline.sh 8 true

Full run with 8 cores:

./Go_pipeline.sh 8

Help:

./Go_pipeline.sh --help

Environment Variables:

LATENCY_WAIT=120 ./Go_pipeline.sh 8  # Adjust latency (default: 120s)

Code Structure

Core Components

SCRIPTS/functions.py (168 lines)

  • Master list loading with selective dtype handling
  • Metadata accessors: get_fastq_pair(), get_strandedness(), get_sample_metadata()
  • Input/output aggregation for Snakemake rules

Snakefile (560 lines)

  • Helper functions for strandedness mapping
  • Rules: FastQC, Cutadapt, STAR, Salmon, HTSeq, featureCounts, Arriba, BigWig, MultiQC
  • Dynamic rule selection based on alignment tool configuration

Go_pipeline.sh (181 lines)

  • Input validation (CORES, file existence, parameter correctness)
  • Auto-logging with timestamped output
  • Dry-run capability for testing
  • Clear error messages with troubleshooting steps

SCRIPTS/master_list.schema.yaml (82 lines)

  • JSON Schema Draft-07 validation
  • Required field patterns and constraints
  • Optional metadata fields with bioinformatics context

Expert-Level Improvements (v2.1)

Strandedness Handling (CRITICAL)

Problem Solved: Previous versions used auto-detect for Salmon (-l A) and hardcoded unstranded for HTSeq (-s no), leading to incorrect quantification (up to 50% error for strand-specific libraries).

Solution Implemented:

  1. Strandedness is now a required field in master_list.txt
  2. Helper functions map strandedness to tool-specific parameters:
    • get_salmon_libtype(sample): Returns U, ISR, or ISF
    • get_htseq_strand(sample): Returns no, reverse, or yes
  3. Rules pull strandedness dynamically and apply correct parameters

Impact: Accurate quantification for all library preparation methods

Single-End Read Support

Problem Solved: Pipeline assumed paired-end reads; SE samples failed silently or with cryptic errors.

Solution Implemented:

  1. Salmon rule detects read type dynamically:
    if [ -z "$R2" ]; then
      salmon quant -r reads.fq ...
    else
      salmon quant -1 r1.fq -2 r2.fq ...
    fi
  2. Schema allows fastq2 to be nullable (NA for SE)
  3. Input functions handle missing pairs gracefully

Impact: Mixed SE/PE projects now supported seamlessly

Code Quality Enhancements

Deduplication: Reduced input function code from 4 separate implementations to single get_fastq_pair() helper

Error Handling:

  • Pre-flight validation catches issues before execution
  • Timestamped logs for debugging
  • Clear error messages guide users to solutions

Metadata Support:

  • Condition, tissue, batch, replicate fields enable batch effect tracking
  • Optional platform/adapter info for contamination analysis

Workflow Examples

Example 1: Salmon-Only (Fast)

# config.yaml
ALIGNMENT_TOOL: "salmon"
DO_ARRIBA: false
DO_BIGWIG: false
DO_HTSEQ: false
DO_FEATURECOUNTS: false

Output: Salmon quantifications (TPM/counts)

Example 2: STAR + HTSeq (Detailed)

# config.yaml
ALIGNMENT_TOOL: "star"
DO_ARRIBA: false
DO_HTSEQ: true
DO_FEATURECOUNTS: false

Output: Aligned BAM files + HTSeq counts (strand-aware)

Example 3: STAR + Fusion Detection

# config.yaml
ALIGNMENT_TOOL: "star"
DO_ARRIBA: true
DO_HTSEQ: true
DO_BIGWIG: true

Output: Alignments + Fusions + HTSeq counts + Visualization

Technical Details

Environment Setup

Each tool has a dedicated conda environment in ENVS/:

salmon.yaml     # Salmon pseudoalignment
star.yaml       # STAR aligner
htseq.yaml      # HTSeq counting
featurecounts.yaml  # Subread featureCounts
samtools.yaml   # BAM manipulation
deeptools.yaml  # BigWig generation
arriba.yaml     # Fusion detection

Activate manually:

conda env create -f ENVS/salmon.yaml
conda activate salmon

Resource Configuration

Edit Snakefile rule sections for compute resource allocation:

rule salmon_quants:
    threads: 8
    resources:
        mem_gb=16,
        runtime_hours=4

Log Structure

Pipeline creates timestamped logs:

logs/
├── pipeline_20250101_120000.log      # Main pipeline output
├── salmon_quants.log                 # Rule-specific logs
├── htseq_counts.log
└── ...

Master List Schema

Required Fields:

  • sample: Sample identifier (alphanumeric, -, _)
  • runID: Technical replicate batch ID
  • fastq1: Path to forward reads (gzipped FASTQ)
  • strandedness: Library preparation strand type

Optional Fields (batch effect tracking):

  • fastq2: Reverse reads (leave NA for single-end)
  • condition: Experimental condition (e.g., wildtype, knockout)
  • tissue: Tissue/cell type source
  • batch: Sequencing batch for batch correction
  • replicate: Biological replicate number
  • read_length: Sequencing read length (default: 150)
  • sequencing_platform: Platform used (Illumina, PacBio, etc.)
  • adapter_sequence: Expected adapter sequence (for contamination check)

Performance Notes

Salmon (Recommended for Speed):

  • ~30M reads/hour on single core
  • Memory: ~2-4 GB per sample
  • Total runtime: 15-30 minutes for 100M reads (8 cores)

STAR + HTSeq:

  • More precise exon-level quantification
  • Higher memory (8-16 GB for large genomes)
  • Runtime: 2-4 hours for 100M reads (8 cores)

Multi-Run Aggregation:

  • Multiple runID entries per sample are automatically merged
  • Useful for technical replicates from different sequencing lanes

Validation

Check master_list.txt validity:

# Schema is JSON Schema Draft-07 compatible
# Use any JSON Schema validator with SCRIPTS/master_list.schema.yaml

Pipeline automatically validates:

  • File existence (fastq1, fastq2, genome, annotation)
  • Strandedness values (unstranded/fr-firststrand/fr-secondstrand)
  • Core count (positive integer)
  • Configuration syntax

Troubleshooting

"strandedness not found in master_list.txt"

  • Ensure master_list.txt has a strandedness column
  • Check for typos in column header
  • Validate against SCRIPTS/master_list.schema.yaml

"HTSeq gave unexpected results"

  • Verify strandedness is correctly specified
  • Check annotation GTF format compatibility
  • Review HTSeq logs in logs/ directory

"Salmon SE/PE detection failed"

  • Confirm fastq2 is properly NA for SE samples
  • Check file paths are correct
  • Verify FASTQ format (gzipped or uncompressed)

Pipeline Hangs

  • Increase LATENCY_WAIT: LATENCY_WAIT=300 ./Go_pipeline.sh 8
  • Check disk space (NGS can be large)
  • Review logs for stalled rules

Citation

If you use this pipeline, please cite:

  • Snakemake: Mölder, F., et al. (2021)
  • Salmon: Patro, R., et al. (2017)
  • STAR: Dobin, A., et al. (2013)
  • HTSeq: Anders, S., et al. (2015)

License

See LICENSE file for details

Author

Masoud Zaka

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors