A comprehensive bioinformatics workflow for processing and analyzing small RNA sequencing (sRNA-seq) data with a focus on microRNA (miRNA) discovery and quantification. This pipeline integrates quality control, sequence alignment, and expression quantification using industry-standard tools.
Author: Masood Zaka
License: MIT
Repository: https://github.com/masoodzaka/Snakemake_smallRNASeq
- Overview
- Pipeline Workflow
- Features
- Requirements
- Installation
- Data Preparation
- Configuration
- Usage
- Output Structure
- Quality Control
- Advanced Features
- Troubleshooting
- Citation
- Expert Recommendations
Small RNA sequencing (sRNA-seq) enables genome-wide discovery and quantification of small non-coding RNAs including microRNAs (miRNAs), siRNAs, piRNAs, and other regulatory RNAs. This pipeline is designed for:
- miRNA discovery and quantification using miRBase annotation
- End-to-end processing from raw FASTQ to expression matrices
- Quality assessment at multiple checkpoints
- Dual quantification methods combining alignment-based (featureCounts) and pseudo-alignment (Salmon) approaches
- Reproducible research with containerized environments and version-controlled dependencies
- Small RNA profiling in disease studies (e.g., cancer, diabetes)
- miRNA biomarker discovery
- Comparative expression analysis across conditions/tissues
- Investigation of circulating miRNAs in body fluids
Raw FASTQ reads
↓
[FastQC] - Initial Quality Assessment
↓
[Cutadapt] - 3' Adapter Removal & Length Filtering
↓
[FastQC] - Post-trimming Quality Control
↓
[Bowtie] - Alignment to Reference Genome
↓
[Samtools] - BAM Conversion, Sorting & Indexing
↓
├─→ [Samtools Filter] - Extract miRNA Regions (BED)
│ ↓
│ ├─→ [Bedtools] - Annotation to miRNA coordinates
│ │
│ └─→ [featureCounts] - miRNA Expression Quantification
│
└─→ [BAM to FastQ] - Convert for Salmon processing
↓
[Salmon Index] - Build Salmon index from transcripts
↓
[Salmon Quant] - Pseudo-alignment quantification
↓
[MultiQC] - Consolidated QC Report
✅ Comprehensive Quality Control
- FastQC before and after trimming
- Samtools flagstat for alignment statistics
- Adapter content and quality filtering metrics
- MultiQC aggregated reporting
✅ Optimized for Small RNA
- Appropriate read length constraints (18-30 bp for miRNAs)
- miRNA-specific alignment parameters
- Strand-specific quantification support
✅ Dual Quantification Methods
- Alignment-based approach (featureCounts) for high confidence
- Pseudo-alignment (Salmon) for improved speed and sensitivity
- Cross-validation of results
✅ Production-Ready
- Conda environments for reproducibility
- Automatic reference data downloading
- Snakemake best practices
- Comprehensive logging
✅ Annotation Resources
- Automatic GENCODE reference download (v41, GRCh38)
- miRBase v22 annotation inclusion
- GTF/GFF format reference files
- OS: Linux/macOS
- Disk Space: Minimum 50 GB (including reference genomes)
- RAM: 16-32 GB recommended
- CPU Cores: 8+ cores recommended
All software is managed via Conda environments. Base requirements:
| Tool | Version | Purpose |
|---|---|---|
| Snakemake | 7.0+ | Workflow orchestration |
| Conda/Mamba | Latest | Environment management |
| Python | 3.8+ | Scripting |
Per-rule tools:
- FastQC (0.11.8+) - Read quality assessment
- Cutadapt (4.1+) - Adapter trimming
- Bowtie (1.3.1) - Short-read alignment
- Samtools (1.15+) - BAM file manipulation
- Bedtools (2.30+) - Genomic interval operations
- Subread/featureCounts (2.0.1+) - Read quantification
- Salmon (1.9.0+) - Pseudo-alignment quantification
- MultiQC (1.12+) - Report aggregation
- umi_tools (latest) - UMI deduplication (optional)
git clone https://github.com/masoodzaka/Snakemake_smallRNASeq.git
cd Snakemake_smallRNASeq# Option A: Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Option B: Mamba (faster package resolution)
conda install -c conda-forge mamba# Using mamba (recommended for speed)
mamba create -c conda-forge -c bioconda -n snakemake snakemake=7.0
# Or with conda
conda create -c conda-forge -c bioconda -n snakemake snakemake=7.0
# Activate environment
conda activate snakemakesnakemake --versionCreate the following directory structure in your project root:
Snakemake_smallRNASeq/
├── data/
│ └── Reads/
│ ├── Sample1/
│ │ └── Sample1.fq.gz
│ ├── Sample2/
│ │ └── Sample2.fq.gz
│ └── ...
├── Snakefile
├── envs/
├── config.yaml (optional)
└── ...
- Format: Compressed FASTQ (
.fq.gzor.fastq.gz) - Quality Score: Phred+33 (modern Illumina/Ion Torrent)
- Expected Read Length: 15-35 bp (typically 18-30 bp for mature miRNAs)
The pipeline uses sample names defined in the SAMPLES list in the Snakefile:
SAMPLES=["DFU01","DFU02","DFU03","DFU04","DM01","DM02","DM03","DM04"]Each sample name corresponds to files at: data/Reads/{sample}/{sample}.fq.gz
The pipeline currently uses hardcoded sample names and paths. Recommended improvement: Create a config.yaml file:
# config.yaml
samples:
- DFU01
- DFU02
- DFU03
- DFU04
- DM01
- DM02
- DM03
- DM04
data_dir: data/Reads
results_dir: results
# Reference configuration
reference:
organism: human
gencode_release: 41
mirbase_version: 22
assembly: GRCh38
# Trimming parameters
trimming:
min_length: 18
max_length: 30
quality_cutoff: 20
adapters: [] # Add specific adapters if needed
# Alignment parameters
alignment:
max_mismatches: 1
seed_length: 20
max_hits: 1
# Quantification
quantification:
stranded: true # Set to true for strand-specific quantification
overlap_fraction: 0.2If using specific library kits, add 3' adapter sequences:
cutadapt \
-a GGCCGGCCGG \ # Illumina small RNA kit
-a AGATCGGAAGAGCACA \ # TruSeq
...Adjust bowtie parameters for sensitivity/specificity trade-off:
bowtie \
-n 2 \ # Allow 2 mismatches in seed region (more sensitive)
-e 70 \ # Set maximum allowed sum of qualities at mismatched positions
--best \ # Report only best alignment
-m 5 \ # Report up to 5 alignments per read
...# Activate Snakemake environment
conda activate snakemake
# Dry run (preview what will be executed)
snakemake --cores 8 --use-conda -n
# Execute pipeline
snakemake --cores 8 --use-conda
# With progress bar
snakemake --cores 8 --use-conda --progress
# Run specific rules only
snakemake bowtie_align --cores 8 --use-conda# Run with verbose logging
snakemake --cores 8 --use-conda --verbose
# Create Directed Acyclic Graph (DAG) visualization
snakemake --dag | dot -Tpng > pipeline_dag.png
# Execute with custom resource allocation
snakemake --cores 8 --resources mem_mb=32000 --use-conda
# Run on cluster (example: SLURM)
snakemake --cores 128 --cluster "sbatch --cpus-per-task {threads} --mem {resources.mem_mb}" --use-conda
# Profile-guided optimization
snakemake --cores 8 --use-conda --profile=slurmOn a modern system (16 cores, 32GB RAM):
- Whole pipeline (8 samples): 2-4 hours
- Reference download: 15-30 minutes (first run only)
- Trimming: 10-15 minutes
- Alignment: 30-45 minutes
- Quantification: 20-30 minutes
Project_Root/
├── qc/
│ ├── fastqc/ # Pre-trimming quality reports
│ ├── fastqc_trimmed/ # Post-trimming quality reports
│ ├── cutadapt/ # Trimming statistics (JSON)
│ ├── samtools/ # Alignment flagstat files
│ └── multiqc_report.html # Aggregated QC report
├── cutadapt/
│ └── {sample}.trimmed.fq.gz # Trimmed reads (intermediate)
├── bowtie_align/
│ ├── {sample}.sorted.bam # Sorted BAM files
│ ├── {sample}.sorted.bam.bai # BAM indices
│ └── {sample}.log # Alignment logs
├── filtered_bam/
│ ├── {sample}.bam # miRNA-region filtered BAM
│ └── {sample}.bam.bai # Index files
├── filtered_fastq/
│ └── {sample}.fastq.gz # FASTQ from miRNA regions
├── annotation/
│ └── {sample}.bed # miRNA annotations (BED format)
├── featureCounts/
│ ├── {sample}.featureCounts # miRNA quantification matrix
│ └── {sample}.featureCounts.summary # Quantification stats
├── salmon/
│ └── {sample}_quant/
│ ├── quant.sf # Salmon quantification
│ ├── aux_info/ # Auxiliary information
│ └── ...
└── refs/
├── reference.fasta # Genome reference
├── transcripts.fasta # Transcript reference
├── reference.gtf # Gene annotation
├── mirbase.gff # miRNA coordinates
├── mirbase.bed # miRNA regions (BED)
├── salmon_index/ # Salmon index
└── gentrome.fasta # Combined reference
| File | Format | Description |
|---|---|---|
qc/multiqc_report.html |
HTML | Interactive QC summary |
featureCounts/{sample}.featureCounts |
TSV | miRNA expression matrix |
salmon/{sample}_quant/quant.sf |
TSV | Transcript abundances |
bowtie_align/{sample}.sorted.bam |
BAM | Aligned reads |
annotation/{sample}.bed |
BED | Annotated miRNA alignments |
The final QC report (qc/multiqc_report.html) aggregates:
-
FastQC Results
- Per-base quality scores
- Adapter content
- GC distribution
- N content
-
Cutadapt Statistics
- Adapter removal efficiency
- Read length distribution
- Quality score filtering results
-
Alignment Metrics
- Mapping rates
- Strand distribution (if applicable)
- Insert size distribution (paired-end)
-
Quantification Summary
- featureCounts assignment rates
- Unassigned read categories
| Checkpoint | Metric | Expected Range |
|---|---|---|
| Raw reads (FastQC) | Mean quality ≥ Q25 | >95% |
| Trimmed reads | Length 18-30 bp | >80% |
| Mapping rate | Uniquely mapped | 40-70% |
| miRNA assignment | Assigned to miRNA | 30-60% |
| Issue | Possible Cause | Solution |
|---|---|---|
| Low mapping rate (<30%) | Wrong reference organism | Verify reference selection |
| High adapter content | Incomplete trimming | Adjust cutadapt parameters |
| Low miRNA assignment | Poor alignment or annotation | Check miRNA region overlap |
| High quality drops | Library prep issues | Review input FASTQ quality |
If your library uses UMIs, add deduplication:
# After alignment and BAM sorting
rule umi_extract:
input: "bowtie_align/{sample}.sorted.bam"
output: "dedup/{sample}.umi_extracted.bam"
conda: "envs/umitools.yaml"
shell: "umi_tools extract -I {input} -S {output}"
rule umi_dedup:
input: "dedup/{sample}.umi_extracted.bam"
output: "dedup/{sample}.dedup.bam"
conda: "envs/umitools.yaml"
shell: "umi_tools dedup -I {input} -S {output}"Modify the cutadapt rule to accept kit-specific adapters:
params:
adapters="-a GGCCGGCCGG" # Specify in Snakefile or config
shell:
cutadapt {params.adapters} ...Add post-processing for CPM/TPM normalization:
# normalize_expression.R
library(edgeR)
counts <- read.table("featureCounts/combined.txt", row.names=1)
cpm <- cpm(counts)
write.csv(cpm, file="results/expression_cpm.csv")Integrate with downstream analysis:
# deseq2_analysis.R
library(DESeq2)
# Create DESeqDataSet from featureCounts outputs
# ... differential analysis pipeline# Solution: Activate conda environment
conda activate snakemake
# Or ensure the environment is properly created
snakemake --use-conda --cores 8# Solution: Verify input file structure
ls -la data/Reads/
# Ensure files match SAMPLES variable in Snakefile# Check sequence quality
zcat data/Reads/Sample1/Sample1.fq.gz | head -20
# Verify reference organism matches input data
# Consider adjusting bowtie parameters for more permissive alignment# Solution: Limit parallel execution
snakemake --cores 4 --use-conda
# Or allocate more system resources
snakemake --cores 8 --resources mem_mb=64000 --use-conda# Snakemake debug mode
snakemake --cores 8 --use-conda --debug
# Keep intermediate files for inspection
snakemake --cores 8 --use-conda --notemp
# Print all shell commands
snakemake --cores 8 --use-conda --quietIf you use this pipeline in your research, please cite:
@software{snakemake_smallrnaseq,
author = {Zaka, Masood},
title = {Snakemake Small RNA-seq Pipeline},
url = {https://github.com/masoodzaka/Snakemake_smallRNASeq},
year = {2024}
}Also cite the tools used:
- Snakemake: Mölder, F., et al. (2021). Sustainable data analysis with Snakemake. F1000Research, 10, 33.
- FastQC: Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data.
- Cutadapt: Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.Journal, 17(1), 10-12.
- Bowtie: Langmead, B., et al. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.
- featureCounts: Liao, Y., et al. (2014). featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7), 923-930.
- Salmon: Patro, R., et al. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4), 417-419.
- miRBase: Kozomara, A., & Griffiths-Jones, S. (2014). miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Research, 42(D1), D68-D73.
-
Upgrade Bowtie to Bowtie2
- Better handling of mismatches and indels
- More sensitive for SNP-containing miRNAs
# envs/bowtie2.yaml channels: - conda-forge - bioconda dependencies: - bowtie2 >= 2.5.0
-
Implement UMI Deduplication
- Remove PCR duplicates for accurate quantification
- Essential for small RNA where duplicates bias expression estimates
rule markdup: conda: "envs/samtools.yaml" shell: "samtools markdup -r {input} {output}" -
Pin All Package Versions
- Current
umitools.yamllacks version - Prevents reproducibility issues
- umi_tools = 1.1.4 - fastqc = 0.11.9 - multiqc = 1.14
- Current
-
Add Strand-Specific Quantification
- Change featureCounts from
-s 0to-s 1or-s 2 - miRNAs are strand-specific; current setting may cause misassignment
- Change featureCounts from
-
Create Config File
- Enable parameter flexibility without editing Snakefile
- Support multiple experiment types
-
Add Collision Detection Rule
- Identify PCR duplicates and sequencing artifacts
rule collision_detection: # Identify reads with identical sequences # Flag potential artifacts vs genuine miRNA -
Implement Quality Filtering for Adapters
- Add specific Illumina/TruSeq adapter sequences
- Improves trimming accuracy
-
Add Expression Normalization
- CPM/TPM calculation
- Library size adjustment
-
Enhanced Error Handling
- Add
set -euo pipefailto all shell commands - Implement checkpoints for missing references
- Add
-
Optimize Salmon for Small RNA
- Consider
--trimStrictflag for better small RNA handling - Adjust k-mer size if needed
- Consider
-
Add Differential Expression Framework
- Integration with DESeq2 or edgeR
- Statistical testing templates
-
miRNA Target Prediction
- Integrate miRNA target prediction tools (e.g., TargetScan, miRanda)
- miRBase Database: https://www.mirbase.org/
- GENCODE Annotation: https://www.gencodegenes.org/
- Snakemake Documentation: https://snakemake.readthedocs.io/
- SEQanswers Community: https://seqanswers.com/
This project is licensed under the MIT License - see the LICENSE file for details.
For issues, questions, or contributions:
- Check existing GitHub issues: https://github.com/masoodzaka/Snakemake_smallRNASeq/issues
- Review troubleshooting section above
- Submit detailed bug reports with:
- Command used
- Error message
- Sample data information
- System specifications
Contributions are welcome! Please create a pull request with improvements.
