Skip to content

masoodzaka/Snakemake_smallRNASeq

Repository files navigation

Snakemake Small RNA-seq Pipeline

A comprehensive bioinformatics workflow for processing and analyzing small RNA sequencing (sRNA-seq) data with a focus on microRNA (miRNA) discovery and quantification. This pipeline integrates quality control, sequence alignment, and expression quantification using industry-standard tools.

Author: Masood Zaka
License: MIT
Repository: https://github.com/masoodzaka/Snakemake_smallRNASeq

Pipeline Workflow


Table of Contents


Overview

Small RNA sequencing (sRNA-seq) enables genome-wide discovery and quantification of small non-coding RNAs including microRNAs (miRNAs), siRNAs, piRNAs, and other regulatory RNAs. This pipeline is designed for:

  • miRNA discovery and quantification using miRBase annotation
  • End-to-end processing from raw FASTQ to expression matrices
  • Quality assessment at multiple checkpoints
  • Dual quantification methods combining alignment-based (featureCounts) and pseudo-alignment (Salmon) approaches
  • Reproducible research with containerized environments and version-controlled dependencies

Typical Use Cases

  • Small RNA profiling in disease studies (e.g., cancer, diabetes)
  • miRNA biomarker discovery
  • Comparative expression analysis across conditions/tissues
  • Investigation of circulating miRNAs in body fluids

Pipeline Workflow

Raw FASTQ reads
    ↓
[FastQC] - Initial Quality Assessment
    ↓
[Cutadapt] - 3' Adapter Removal & Length Filtering
    ↓
[FastQC] - Post-trimming Quality Control
    ↓
[Bowtie] - Alignment to Reference Genome
    ↓
[Samtools] - BAM Conversion, Sorting & Indexing
    ↓
    ├─→ [Samtools Filter] - Extract miRNA Regions (BED)
    │       ↓
    │   ├─→ [Bedtools] - Annotation to miRNA coordinates
    │   │
    │   └─→ [featureCounts] - miRNA Expression Quantification
    │
    └─→ [BAM to FastQ] - Convert for Salmon processing
            ↓
        [Salmon Index] - Build Salmon index from transcripts
            ↓
        [Salmon Quant] - Pseudo-alignment quantification
            ↓
[MultiQC] - Consolidated QC Report

Features

Comprehensive Quality Control

  • FastQC before and after trimming
  • Samtools flagstat for alignment statistics
  • Adapter content and quality filtering metrics
  • MultiQC aggregated reporting

Optimized for Small RNA

  • Appropriate read length constraints (18-30 bp for miRNAs)
  • miRNA-specific alignment parameters
  • Strand-specific quantification support

Dual Quantification Methods

  • Alignment-based approach (featureCounts) for high confidence
  • Pseudo-alignment (Salmon) for improved speed and sensitivity
  • Cross-validation of results

Production-Ready

  • Conda environments for reproducibility
  • Automatic reference data downloading
  • Snakemake best practices
  • Comprehensive logging

Annotation Resources

  • Automatic GENCODE reference download (v41, GRCh38)
  • miRBase v22 annotation inclusion
  • GTF/GFF format reference files

Requirements

System Requirements

  • OS: Linux/macOS
  • Disk Space: Minimum 50 GB (including reference genomes)
  • RAM: 16-32 GB recommended
  • CPU Cores: 8+ cores recommended

Software Dependencies

All software is managed via Conda environments. Base requirements:

Tool Version Purpose
Snakemake 7.0+ Workflow orchestration
Conda/Mamba Latest Environment management
Python 3.8+ Scripting

Per-rule tools:

  • FastQC (0.11.8+) - Read quality assessment
  • Cutadapt (4.1+) - Adapter trimming
  • Bowtie (1.3.1) - Short-read alignment
  • Samtools (1.15+) - BAM file manipulation
  • Bedtools (2.30+) - Genomic interval operations
  • Subread/featureCounts (2.0.1+) - Read quantification
  • Salmon (1.9.0+) - Pseudo-alignment quantification
  • MultiQC (1.12+) - Report aggregation
  • umi_tools (latest) - UMI deduplication (optional)

Installation

1. Clone Repository

git clone https://github.com/masoodzaka/Snakemake_smallRNASeq.git
cd Snakemake_smallRNASeq

2. Install Conda/Mamba (if not already installed)

# Option A: Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Option B: Mamba (faster package resolution)
conda install -c conda-forge mamba

3. Install Snakemake

# Using mamba (recommended for speed)
mamba create -c conda-forge -c bioconda -n snakemake snakemake=7.0

# Or with conda
conda create -c conda-forge -c bioconda -n snakemake snakemake=7.0

# Activate environment
conda activate snakemake

4. Verify Installation

snakemake --version

Data Preparation

Input Directory Structure

Create the following directory structure in your project root:

Snakemake_smallRNASeq/
├── data/
│   └── Reads/
│       ├── Sample1/
│       │   └── Sample1.fq.gz
│       ├── Sample2/
│       │   └── Sample2.fq.gz
│       └── ...
├── Snakefile
├── envs/
├── config.yaml (optional)
└── ...

Input File Format

  • Format: Compressed FASTQ (.fq.gz or .fastq.gz)
  • Quality Score: Phred+33 (modern Illumina/Ion Torrent)
  • Expected Read Length: 15-35 bp (typically 18-30 bp for mature miRNAs)

Sample Naming

The pipeline uses sample names defined in the SAMPLES list in the Snakefile:

SAMPLES=["DFU01","DFU02","DFU03","DFU04","DM01","DM02","DM03","DM04"]

Each sample name corresponds to files at: data/Reads/{sample}/{sample}.fq.gz


Configuration

Current Setup

The pipeline currently uses hardcoded sample names and paths. Recommended improvement: Create a config.yaml file:

# config.yaml
samples:
  - DFU01
  - DFU02
  - DFU03
  - DFU04
  - DM01
  - DM02
  - DM03
  - DM04

data_dir: data/Reads
results_dir: results

# Reference configuration
reference:
  organism: human
  gencode_release: 41
  mirbase_version: 22
  assembly: GRCh38

# Trimming parameters
trimming:
  min_length: 18
  max_length: 30
  quality_cutoff: 20
  adapters: []  # Add specific adapters if needed

# Alignment parameters
alignment:
  max_mismatches: 1
  seed_length: 20
  max_hits: 1

# Quantification
quantification:
  stranded: true  # Set to true for strand-specific quantification
  overlap_fraction: 0.2

Parameter Tuning

Adapter Sequences

If using specific library kits, add 3' adapter sequences:

cutadapt \
  -a GGCCGGCCGG \      # Illumina small RNA kit
  -a AGATCGGAAGAGCACA \  # TruSeq
  ...

Alignment Stringency

Adjust bowtie parameters for sensitivity/specificity trade-off:

bowtie \
  -n 2 \        # Allow 2 mismatches in seed region (more sensitive)
  -e 70 \       # Set maximum allowed sum of qualities at mismatched positions
  --best \      # Report only best alignment
  -m 5 \        # Report up to 5 alignments per read
  ...

Usage

Basic Execution

# Activate Snakemake environment
conda activate snakemake

# Dry run (preview what will be executed)
snakemake --cores 8 --use-conda -n

# Execute pipeline
snakemake --cores 8 --use-conda

# With progress bar
snakemake --cores 8 --use-conda --progress

# Run specific rules only
snakemake bowtie_align --cores 8 --use-conda

Advanced Execution Options

# Run with verbose logging
snakemake --cores 8 --use-conda --verbose

# Create Directed Acyclic Graph (DAG) visualization
snakemake --dag | dot -Tpng > pipeline_dag.png

# Execute with custom resource allocation
snakemake --cores 8 --resources mem_mb=32000 --use-conda

# Run on cluster (example: SLURM)
snakemake --cores 128 --cluster "sbatch --cpus-per-task {threads} --mem {resources.mem_mb}" --use-conda

# Profile-guided optimization
snakemake --cores 8 --use-conda --profile=slurm

Expected Runtime

On a modern system (16 cores, 32GB RAM):

  • Whole pipeline (8 samples): 2-4 hours
  • Reference download: 15-30 minutes (first run only)
  • Trimming: 10-15 minutes
  • Alignment: 30-45 minutes
  • Quantification: 20-30 minutes

Output Structure

Directory Organization

Project_Root/
├── qc/
│   ├── fastqc/                 # Pre-trimming quality reports
│   ├── fastqc_trimmed/         # Post-trimming quality reports
│   ├── cutadapt/               # Trimming statistics (JSON)
│   ├── samtools/               # Alignment flagstat files
│   └── multiqc_report.html     # Aggregated QC report
├── cutadapt/
│   └── {sample}.trimmed.fq.gz  # Trimmed reads (intermediate)
├── bowtie_align/
│   ├── {sample}.sorted.bam     # Sorted BAM files
│   ├── {sample}.sorted.bam.bai # BAM indices
│   └── {sample}.log            # Alignment logs
├── filtered_bam/
│   ├── {sample}.bam            # miRNA-region filtered BAM
│   └── {sample}.bam.bai        # Index files
├── filtered_fastq/
│   └── {sample}.fastq.gz       # FASTQ from miRNA regions
├── annotation/
│   └── {sample}.bed            # miRNA annotations (BED format)
├── featureCounts/
│   ├── {sample}.featureCounts  # miRNA quantification matrix
│   └── {sample}.featureCounts.summary  # Quantification stats
├── salmon/
│   └── {sample}_quant/
│       ├── quant.sf            # Salmon quantification
│       ├── aux_info/           # Auxiliary information
│       └── ...
└── refs/
    ├── reference.fasta         # Genome reference
    ├── transcripts.fasta       # Transcript reference
    ├── reference.gtf           # Gene annotation
    ├── mirbase.gff             # miRNA coordinates
    ├── mirbase.bed             # miRNA regions (BED)
    ├── salmon_index/           # Salmon index
    └── gentrome.fasta          # Combined reference

Key Output Files

File Format Description
qc/multiqc_report.html HTML Interactive QC summary
featureCounts/{sample}.featureCounts TSV miRNA expression matrix
salmon/{sample}_quant/quant.sf TSV Transcript abundances
bowtie_align/{sample}.sorted.bam BAM Aligned reads
annotation/{sample}.bed BED Annotated miRNA alignments

Quality Control

MultiQC Report

The final QC report (qc/multiqc_report.html) aggregates:

  1. FastQC Results

    • Per-base quality scores
    • Adapter content
    • GC distribution
    • N content
  2. Cutadapt Statistics

    • Adapter removal efficiency
    • Read length distribution
    • Quality score filtering results
  3. Alignment Metrics

    • Mapping rates
    • Strand distribution (if applicable)
    • Insert size distribution (paired-end)
  4. Quantification Summary

    • featureCounts assignment rates
    • Unassigned read categories

Quality Checkpoints

Checkpoint Metric Expected Range
Raw reads (FastQC) Mean quality ≥ Q25 >95%
Trimmed reads Length 18-30 bp >80%
Mapping rate Uniquely mapped 40-70%
miRNA assignment Assigned to miRNA 30-60%

Common Issues & Solutions

Issue Possible Cause Solution
Low mapping rate (<30%) Wrong reference organism Verify reference selection
High adapter content Incomplete trimming Adjust cutadapt parameters
Low miRNA assignment Poor alignment or annotation Check miRNA region overlap
High quality drops Library prep issues Review input FASTQ quality

Advanced Features

UMI (Unique Molecular Identifier) Deduplication

If your library uses UMIs, add deduplication:

# After alignment and BAM sorting
rule umi_extract:
    input: "bowtie_align/{sample}.sorted.bam"
    output: "dedup/{sample}.umi_extracted.bam"
    conda: "envs/umitools.yaml"
    shell: "umi_tools extract -I {input} -S {output}"

rule umi_dedup:
    input: "dedup/{sample}.umi_extracted.bam"
    output: "dedup/{sample}.dedup.bam"
    conda: "envs/umitools.yaml"
    shell: "umi_tools dedup -I {input} -S {output}"

Custom Adapter Sequences

Modify the cutadapt rule to accept kit-specific adapters:

params:
    adapters="-a GGCCGGCCGG"  # Specify in Snakefile or config
    
shell:
    cutadapt {params.adapters} ...

Expression Normalization

Add post-processing for CPM/TPM normalization:

# normalize_expression.R
library(edgeR)

counts <- read.table("featureCounts/combined.txt", row.names=1)
cpm <- cpm(counts)
write.csv(cpm, file="results/expression_cpm.csv")

Differential Expression Analysis

Integrate with downstream analysis:

# deseq2_analysis.R
library(DESeq2)

# Create DESeqDataSet from featureCounts outputs
# ... differential analysis pipeline

Troubleshooting

Common Errors

1. "Command not found: bowtie"

# Solution: Activate conda environment
conda activate snakemake
# Or ensure the environment is properly created
snakemake --use-conda --cores 8

2. "No such file or directory: data/Reads/Sample1/Sample1.fq.gz"

# Solution: Verify input file structure
ls -la data/Reads/
# Ensure files match SAMPLES variable in Snakefile

3. Low mapping rates (<20%)

# Check sequence quality
zcat data/Reads/Sample1/Sample1.fq.gz | head -20

# Verify reference organism matches input data
# Consider adjusting bowtie parameters for more permissive alignment

4. "Out of memory" error

# Solution: Limit parallel execution
snakemake --cores 4 --use-conda

# Or allocate more system resources
snakemake --cores 8 --resources mem_mb=64000 --use-conda

Debug Mode

# Snakemake debug mode
snakemake --cores 8 --use-conda --debug

# Keep intermediate files for inspection
snakemake --cores 8 --use-conda --notemp

# Print all shell commands
snakemake --cores 8 --use-conda --quiet

Citation

If you use this pipeline in your research, please cite:

@software{snakemake_smallrnaseq,
  author = {Zaka, Masood},
  title = {Snakemake Small RNA-seq Pipeline},
  url = {https://github.com/masoodzaka/Snakemake_smallRNASeq},
  year = {2024}
}

Also cite the tools used:

  • Snakemake: Mölder, F., et al. (2021). Sustainable data analysis with Snakemake. F1000Research, 10, 33.
  • FastQC: Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data.
  • Cutadapt: Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.Journal, 17(1), 10-12.
  • Bowtie: Langmead, B., et al. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.
  • featureCounts: Liao, Y., et al. (2014). featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7), 923-930.
  • Salmon: Patro, R., et al. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4), 417-419.
  • miRBase: Kozomara, A., & Griffiths-Jones, S. (2014). miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Research, 42(D1), D68-D73.

Expert Recommendations for Pipeline Improvement

Priority 1: Critical Updates

  1. Upgrade Bowtie to Bowtie2

    • Better handling of mismatches and indels
    • More sensitive for SNP-containing miRNAs
    # envs/bowtie2.yaml
    channels:
      - conda-forge
      - bioconda
    dependencies:
      - bowtie2 >= 2.5.0
  2. Implement UMI Deduplication

    • Remove PCR duplicates for accurate quantification
    • Essential for small RNA where duplicates bias expression estimates
    rule markdup:
        conda: "envs/samtools.yaml"
        shell: "samtools markdup -r {input} {output}"
  3. Pin All Package Versions

    • Current umitools.yaml lacks version
    • Prevents reproducibility issues
    - umi_tools = 1.1.4
    - fastqc = 0.11.9
    - multiqc = 1.14
  4. Add Strand-Specific Quantification

    • Change featureCounts from -s 0 to -s 1 or -s 2
    • miRNAs are strand-specific; current setting may cause misassignment

Priority 2: Enhanced Features

  1. Create Config File

    • Enable parameter flexibility without editing Snakefile
    • Support multiple experiment types
  2. Add Collision Detection Rule

    • Identify PCR duplicates and sequencing artifacts
    rule collision_detection:
        # Identify reads with identical sequences
        # Flag potential artifacts vs genuine miRNA
  3. Implement Quality Filtering for Adapters

    • Add specific Illumina/TruSeq adapter sequences
    • Improves trimming accuracy
  4. Add Expression Normalization

    • CPM/TPM calculation
    • Library size adjustment

Priority 3: Performance & Reproducibility

  1. Enhanced Error Handling

    • Add set -euo pipefail to all shell commands
    • Implement checkpoints for missing references
  2. Optimize Salmon for Small RNA

    • Consider --trimStrict flag for better small RNA handling
    • Adjust k-mer size if needed

Priority 4: Downstream Analysis

  1. Add Differential Expression Framework

    • Integration with DESeq2 or edgeR
    • Statistical testing templates
  2. miRNA Target Prediction

    • Integrate miRNA target prediction tools (e.g., TargetScan, miRanda)

Additional Resources


License

This project is licensed under the MIT License - see the LICENSE file for details.


Support & Contribution

For issues, questions, or contributions:

  1. Check existing GitHub issues: https://github.com/masoodzaka/Snakemake_smallRNASeq/issues
  2. Review troubleshooting section above
  3. Submit detailed bug reports with:
    • Command used
    • Error message
    • Sample data information
    • System specifications

Contributions are welcome! Please create a pull request with improvements.

About

Small RNA-seq analysis using bowtie, featureCounts and Salmon tools.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors