nf-core/rnaseqmeta

Introduction

nf-core/rnaseqmeta is a bioinformatics pipeline designed for bulk RNA-seq meta-analysis across multiple studies. The pipeline processes raw RNA-seq data through quality control, alignment/quantification, batch correction, differential expression analysis, and meta-analysis to identify robust gene expression changes across multiple datasets.

The pipeline supports three alignment/quantification strategies:

  • STAR - Genome-based alignment with featureCounts (most accurate, best for splice junction discovery)
  • Salmon - Fast transcript pseudo-alignment with tximport (excellent balance of speed and accuracy)
  • Kallisto - Ultra-fast k-mer-based quantification with tximport (ideal for exploratory analysis and large datasets)

NEW: BioJupies Integration

The pipeline now includes optional BioJupies analysis modules for enhanced visualization and functional interpretation:

  • 📊 Interactive Volcano Plots - Publication-ready visualizations with hover information
  • 🧬 Comprehensive Enrichment Analysis - GO, KEGG, Reactome, WikiPathways, TF, Kinase, miRNA
  • 📈 Interactive Bar Charts - Explore enrichment results with ease
  • 💾 Downloadable Results - All data in TSV format for further analysis

Enable these modules with the --enable_biojupies flag. See the BioJupies Usage Guide for details.

Pipeline Summary

The pipeline performs the following steps:

  1. Quality Control (FastQC, Fastp) - Read quality assessment and trimming
  2. Alignment/Quantification (one of STAR with featureCounts, Salmon with tximport, or Kallisto with tximport)
  3. Batch Correction - ComBat-seq or SVA for removing batch effects
  4. Differential Expression Analysis - edgeR/limma for identifying differentially expressed genes
  5. Enhanced Visualization (BioJupies) - Interactive volcano plots and visualizations (optional)
  6. Functional Enrichment (Enrichr) - GO, pathway, TF, kinase enrichment analysis (optional)
  7. Meta-Analysis - Random effects meta-analysis across multiple studies
  8. Quality Reporting (MultiQC) - Comprehensive QC report aggregation
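
Step 7 names a random-effects meta-analysis but does not say which estimator the pipeline uses. As an illustration only, the sketch below implements the DerSimonian-Laird estimator, one common choice, for a single gene whose effect size (for example a log2 fold change) and sampling variance are available from each study. The function name and its inputs are hypothetical and are not taken from the pipeline's code.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool per-study effect sizes with a DerSimonian-Laird random-effects model.

    effects   -- per-study effect estimates (e.g. log2 fold changes)
    variances -- per-study sampling variances (squared standard errors)
    Returns (pooled_effect, pooled_standard_error, tau_squared).
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    k = len(effects)

    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau^2 to each study's own variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


if __name__ == "__main__":
    # Three hypothetical studies reporting a log2 fold change for one gene.
    print(dersimonian_laird([1.2, 0.8, 1.5], [0.04, 0.09, 0.06]))
```

When the studies agree closely, the between-study variance is truncated to zero and the result matches a fixed-effect inverse-variance average. When they disagree, the pooled standard error widens to reflect the heterogeneity.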

Quick Start

Prerequisites

Before running the pipeline, ensure you have:

  1. Nextflow (version ≥24.04.2)

    curl -s https://get.nextflow.io | bash
  2. Container engine or environment manager (choose one): Docker, Singularity, or Conda

  3. Input Data:

    • Raw RNA-seq FASTQ files (single-end or paired-end)
    • Reference genome files (FASTA and GTF) or genome name (e.g., GRCh38)
    • Sample metadata for meta-analysis

Test the Pipeline

# Clone the repository
git clone https://github.com/hossainlab/nf-core-rnaseqmeta.git
cd nf-core-rnaseqmeta

# Run test to verify everything works
nextflow run main.nf -profile test --outdir test_results

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Preparing Input Files

1. Samplesheet (Required)

Create a CSV file with your FASTQ file paths:

samplesheet.csv:

sample,fastq_1,fastq_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
sample3,/path/to/sample3_R1.fastq.gz,

Column descriptions:

  • sample - Unique sample identifier (required)
  • fastq_1 - Path to read 1 FASTQ file (required)
  • fastq_2 - Path to read 2 FASTQ file (optional, leave empty for single-end data)
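
The column rules above can be checked before launching a run. The following is a minimal sketch, not the pipeline's own validation: `validate_samplesheet` is a hypothetical helper that enforces the three rules as stated (unique `sample`, non-empty `fastq_1`, optional `fastq_2`) and reports whether each sample looks single-end or paired-end. It does not check that the files exist.

```python
import csv
import io


def validate_samplesheet(handle):
    """Check a samplesheet and return {sample: "single" or "paired"}.

    handle -- an open text file (or any iterable of CSV lines).
    Raises ValueError on the first problem found.
    """
    reader = csv.DictReader(handle)
    missing = {"sample", "fastq_1"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")

    layouts = {}
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample = (row.get("sample") or "").strip()
        fastq_1 = (row.get("fastq_1") or "").strip()
        fastq_2 = (row.get("fastq_2") or "").strip()
        if not sample:
            raise ValueError(f"line {line_no}: empty sample name")
        if sample in layouts:
            raise ValueError(f"line {line_no}: duplicate sample '{sample}'")
        if not fastq_1:
            raise ValueError(f"line {line_no}: fastq_1 is required")
        # An empty fastq_2 field marks single-end data.
        layouts[sample] = "paired" if fastq_2 else "single"
    return layouts


if __name__ == "__main__":
    example = io.StringIO(
        "sample,fastq_1,fastq_2\n"
        "sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz\n"
        "sample3,/path/to/sample3_R1.fastq.gz,\n"
    )
    print(validate_samplesheet(example))
```

For a file on disk, pass an open handle, for example `validate_samplesheet(open("samplesheet.csv", newline=""))`.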

2. Sample Information File (Required)

Create a CSV file with sample metadata for meta-analysis:

sample_info.csv:

sample_id,condition,batch
sample1,treatment,batch1
sample2,control,batch1
sample3,treatment,batch2

Column descriptions:

  • sample_id - Must match sample names in samplesheet (required)
  • condition - Experimental condition, e.g., treatment, control, case (required)
  • batch - Batch or study identifier for batch correction (optional but recommended)
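
Because `sample_id` must match the `sample` column of the samplesheet, a quick cross-check of the two files can catch problems early. The sketch below is a hypothetical helper, not part of the pipeline. It compares IDs exactly as written and also lists rows that lack the required `condition` value.

```python
import csv
import io


def check_sample_info(samplesheet, sample_info):
    """Compare sample IDs across the two CSV inputs.

    Both arguments are open text handles. Returns three sorted lists:
    IDs missing from sample_info, IDs missing from the samplesheet,
    and IDs in sample_info that have no condition.
    """
    sheet_ids = {row["sample"] for row in csv.DictReader(samplesheet)}
    info_rows = list(csv.DictReader(sample_info))
    info_ids = {row["sample_id"] for row in info_rows}
    no_condition = sorted(
        row["sample_id"]
        for row in info_rows
        if not (row.get("condition") or "").strip()
    )
    return sorted(sheet_ids - info_ids), sorted(info_ids - sheet_ids), no_condition


if __name__ == "__main__":
    sheet = io.StringIO("sample,fastq_1,fastq_2\nsample1,a.fq.gz,b.fq.gz\n")
    info = io.StringIO("sample_id,condition,batch\nsample1,treatment,batch1\n")
    print(check_sample_info(sheet, info))
```

Three empty lists mean the two files agree and every sample has a condition.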

Running the Pipeline

Option 1: STAR Alignment (Default - Most Accurate)

Best for: Genome-based alignment, splice junction discovery, highest accuracy

nextflow run main.nf \
    -profile docker \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --genome GRCh38 \
    --outdir results

Option 2: Salmon Quantification (Fast & Accurate)

Best for: Balanced speed and accuracy, transcript-level analysis

nextflow run main.nf \
    -profile docker \
    --aligner salmon \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --transcript_fasta transcripts.fa \
    --gtf genes.gtf \
    --outdir results

Option 3: Kallisto Quantification (Fastest)

Best for: Exploratory analysis, large datasets, rapid iteration

nextflow run main.nf \
    -profile docker \
    --aligner kallisto \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --transcript_fasta transcripts.fa \
    --gtf genes.gtf \
    --outdir results

Using Custom Reference Files

nextflow run main.nf \
    -profile docker \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --fasta /path/to/genome.fa \
    --gtf /path/to/annotation.gtf \
    --outdir results

Advanced Usage with Execution Reports

nextflow run main.nf \
    -profile docker \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --genome GRCh38 \
    --outdir results \
    --perform_batch_correction true \
    --perform_meta_analysis true \
    -with-report execution_report.html \
    -with-timeline timeline.html \
    -with-trace trace.txt

With BioJupies Enhanced Analysis

nextflow run main.nf \
    -profile docker \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --genome GRCh38 \
    --enable_biojupies \
    --biojupies_enrichment_types 'go,pathway,tf' \
    --outdir results

This will generate:

  • Interactive volcano plots for each comparison
  • GO enrichment analysis (Biological Process, Molecular Function, Cellular Component)
  • Pathway enrichment (KEGG, WikiPathways, Reactome)
  • Transcription factor target enrichment
  • All results in interactive HTML and downloadable TSV formats

See the BioJupies Usage Guide for more options and details.

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
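
Nextflow's -params-file option accepts a JSON or YAML file whose keys are parameter names without the leading `--`. As a small sketch, the hypothetical helper below writes such a JSON file from Python. The parameter names in the example are the ones documented in this README, and the values are placeholders.

```python
import json
import tempfile
from pathlib import Path


def write_params_file(path, **params):
    """Write pipeline parameters to a JSON file suitable for -params-file.

    Keyword names are parameter names without the leading `--`.
    """
    path = Path(path)
    path.write_text(json.dumps(params, indent=2) + "\n")
    return path


if __name__ == "__main__":
    out = write_params_file(
        Path(tempfile.gettempdir()) / "params.json",
        input="samplesheet.csv",
        sample_info="sample_info.csv",
        genome="GRCh38",
        outdir="results",
        perform_batch_correction=True,
    )
    print(out.read_text())
```

The resulting file can then be passed on the command line, for example `nextflow run main.nf -profile docker -params-file params.json`.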

Parameters

Required Parameters

| Parameter | Description |
| --- | --- |
| --input | Path to samplesheet CSV file |
| --sample_info | Path to sample information CSV file |
| --outdir | Output directory for results |

Reference Genome Options

| Parameter | Description | Default |
| --- | --- | --- |
| --genome | Name of iGenomes reference (e.g., GRCh38, GRCm38) | null |
| --fasta | Path to genome FASTA file | null |
| --gtf | Path to genome annotation GTF file | null |
| --transcript_fasta | Path to transcript FASTA file (required for Salmon/Kallisto) | null |
| --igenomes_base | Base path to iGenomes references | s3://ngi-igenomes/igenomes/ |
| --igenomes_ignore | Ignore iGenomes configuration | false |

Analysis Options

| Parameter | Description | Default |
| --- | --- | --- |
| --aligner | Alignment/quantification tool: star, salmon, or kallisto | star |
| --lib_type | Salmon library type (A for auto-detection) | A |
| --perform_batch_correction | Enable batch effect correction | true |
| --perform_meta_analysis | Enable meta-analysis | true |

Execution Profiles

| Profile | Description |
| --- | --- |
| -profile test | Run with test data |
| -profile docker | Use Docker containers |
| -profile singularity | Use Singularity containers |
| -profile conda | Use Conda environments |

Pipeline Output

Output Directory Structure

results/
├── fastqc/                     # FastQC reports for raw reads
├── fastp/                      # Fastp trimming reports and trimmed reads
├── star/                       # STAR alignment results (if using --aligner star)
│   ├── *.bam                   # Aligned BAM files
│   └── *Log.final.out          # Alignment statistics
├── salmon/                     # Salmon quantification (if using --aligner salmon)
│   └── */quant.sf              # Transcript quantification files
├── kallisto/                   # Kallisto quantification (if using --aligner kallisto)
│   └── */abundance.tsv         # Transcript quantification files
├── featurecounts/              # Gene count matrices (STAR mode)
│   └── *.featureCounts.txt     # Gene-level counts
├── batch_correction/           # Batch-corrected count matrices
│   └── *_batch_corrected.txt
├── differential_expression/    # DE analysis results
│   └── *_de_results.txt
├── biojupies/                  # BioJupies enhanced analysis (if --enable_biojupies)
│   ├── volcano_plots/          # Interactive and static volcano plots
│   │   ├── *_volcano.html      # Interactive plots
│   │   └── *_volcano.png       # Publication-ready images
│   └── enrichment/             # Functional enrichment results
│       ├── *_enrichment_results.tsv        # All enrichment data
│       ├── *_enrichment_summary.txt        # Summary statistics
│       └── *_enrichment_*.html             # Interactive enrichment plots
├── meta_analysis/              # Meta-analysis results
│   └── *_meta_analysis_results.txt
├── multiqc/                    # Comprehensive QC report
│   └── multiqc_report.html     # Main quality control report
└── pipeline_info/              # Pipeline execution information
    ├── execution_report.html
    ├── execution_timeline.html
    └── execution_trace.txt

Key Output Files

Quality Control

  • multiqc/multiqc_report.html - Comprehensive quality control report with all QC metrics
  • fastqc/*_fastqc.html - Individual FastQC reports for each sample
  • fastp/*.fastp.json - Trimming statistics and quality metrics

Quantification

  • featurecounts/*.featureCounts.txt - Gene count matrices (STAR mode)
  • salmon/*/quant.sf - Transcript abundance estimates (Salmon mode)
  • kallisto/*/abundance.tsv - Transcript abundance estimates (Kallisto mode)

Analysis Results

  • batch_correction/*_batch_corrected.txt - Batch-corrected count matrices
  • differential_expression/*_de_results.txt - Differentially expressed genes with statistics (log2FC, p-value, FDR)
  • meta_analysis/*_meta_analysis_results.txt - Meta-analysis results with effect sizes and combined p-values
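
The differential expression results are described as carrying log2FC, p-value, and FDR statistics, but this README does not specify the exact file layout. The sketch below therefore rests on assumptions: a tab-delimited file with a header row and columns named `gene`, `log2FC`, and `FDR`. All three names and the delimiter are hypothetical defaults that can be overridden to match the real files.

```python
import csv
import io


def significant_genes(handle, gene_col="gene", lfc_col="log2FC", fdr_col="FDR",
                      max_fdr=0.05, min_abs_lfc=1.0, delimiter="\t"):
    """Return (gene, log2FC, FDR) tuples passing both thresholds, sorted by FDR.

    Column names and delimiter are assumptions; adjust them to the actual file.
    """
    hits = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        try:
            lfc = float(row[lfc_col])
            fdr = float(row[fdr_col])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values such as "NA"
        if fdr <= max_fdr and abs(lfc) >= min_abs_lfc:
            hits.append((row[gene_col], lfc, fdr))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    example = io.StringIO(
        "gene\tlog2FC\tFDR\n"
        "GENE_A\t2.5\t0.001\n"
        "GENE_B\t0.2\t0.0001\n"
    )
    print(significant_genes(example))
```

The thresholds (FDR at most 0.05 and absolute log2 fold change at least 1) are common conventions rather than values taken from the pipeline.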

Choosing an Aligner

All three aligners produce gene-level counts compatible with downstream analysis:

| Aligner | Speed | Accuracy | Best For | Requirements |
| --- | --- | --- | --- | --- |
| STAR | Slow | Highest | Splice junction discovery, most accurate quantification | Genome FASTA + GTF |
| Salmon | Fast | High | Balanced workflow, transcript-level analysis | Transcript FASTA + GTF |
| Kallisto | Fastest | High | Large datasets, exploratory analysis, quick iterations | Transcript FASTA + GTF |
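
The Requirements column can be turned into a pre-flight check. The following sketch uses a hypothetical helper that reports which reference parameters are still missing for a chosen aligner. It assumes, based on the STAR example that passes only --genome, that an iGenomes name stands in for both the genome FASTA and the GTF; whether that also holds for Salmon and Kallisto runs is not stated here.

```python
# Reference parameters each aligner needs, following the table above.
REQUIRED_REFERENCES = {
    "star": ("fasta", "gtf"),
    "salmon": ("transcript_fasta", "gtf"),
    "kallisto": ("transcript_fasta", "gtf"),
}


def missing_references(aligner, params):
    """List reference parameters that are required but not set.

    params -- dict of parameter names (without `--`) to values.
    """
    if aligner not in REQUIRED_REFERENCES:
        raise ValueError(f"unknown aligner: {aligner}")
    provided = {name for name, value in params.items() if value}
    # Assumption: an iGenomes name supplies the genome FASTA and the GTF.
    if "genome" in provided:
        provided |= {"fasta", "gtf"}
    return [name for name in REQUIRED_REFERENCES[aligner] if name not in provided]


if __name__ == "__main__":
    print(missing_references("salmon", {"gtf": "genes.gtf"}))
```

An empty list means the documented requirements for that aligner are covered.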

Resource Requirements

Minimum Requirements

  • CPU: 4 cores
  • Memory: 8 GB RAM
  • Storage: ~10x input data size for intermediate files

Recommended Requirements

  • CPU: 8+ cores
  • Memory: 32 GB RAM
  • Storage: Sufficient space for outputs (typically 5-10x input size)
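
The storage guidance is a multiple of the input size, so it can be estimated from the FASTQ files themselves. The sketch below uses a hypothetical helper that sums file sizes and applies a factor. The default of 10 follows the upper end of the 5-10x range quoted above, and actual usage will vary with the aligner and the data.

```python
import tempfile
from pathlib import Path


def estimate_storage_gb(fastq_paths, factor=10):
    """Estimate required disk space in GB as factor x total input size."""
    total_bytes = sum(Path(p).stat().st_size for p in fastq_paths)
    return total_bytes * factor / 1e9


if __name__ == "__main__":
    # Demonstrate with a throwaway file; pass real FASTQ paths in practice.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.fastq.gz"
        demo.write_bytes(b"\0" * 2_000_000)
        print(f"{estimate_storage_gb([demo]):.3f} GB")
```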

Typical Runtime

  • Small dataset (10 samples, 20M reads each): 2-4 hours (Kallisto), 4-8 hours (STAR)
  • Medium dataset (50 samples, 30M reads each): 8-16 hours (Kallisto), 16-24 hours (STAR)

Troubleshooting

Common Issues

  1. Memory Issues

    • Increase memory allocation in your execution profile
    • Use Kallisto for faster analysis with lower memory footprint
    • Process fewer samples at a time
  2. Container/Environment Issues

    • Ensure Docker/Singularity is properly installed and running
    • Try alternative container engine: -profile singularity or -profile conda
    • On Mac M1/M2: Use -profile docker,arm
  3. Input File Issues

    • Use absolute paths for all input files
    • Verify FASTQ files are readable and not corrupted
    • Ensure sample names match between samplesheet and sample_info
  4. Sample Name Mismatch

    • Sample IDs in samplesheet.csv must exactly match those in sample_info.csv
    • Check for extra spaces, different capitalization, or special characters
  5. Reference Genome Issues

    • For Salmon/Kallisto: Ensure transcript FASTA matches the GTF annotation
    • For STAR: Ensure genome FASTA and GTF are from the same release
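
For the sample name mismatches described in item 4, it can help to see which IDs would match if surrounding spaces and capitalization were ignored. The sketch below uses a hypothetical helper that pairs up such near-misses. It does not attempt to detect other differences such as special characters.

```python
def explain_mismatches(sheet_ids, info_ids):
    """Pair samplesheet IDs with sample_info IDs that differ only by
    surrounding whitespace or letter case.

    Returns a sorted list of (samplesheet_id, sample_info_id) tuples.
    """
    def normalise(name):
        return name.strip().casefold()

    exact = set(info_ids)
    info_by_norm = {normalise(name): name for name in info_ids}
    hints = []
    for sid in sheet_ids:
        if sid in exact:
            continue  # exact match, nothing to explain
        near = info_by_norm.get(normalise(sid))
        if near is not None:
            hints.append((sid, near))
    return sorted(hints)


if __name__ == "__main__":
    for sheet_id, info_id in explain_mismatches(
        ["Sample1", "sample2", "sample3 "], ["sample1", "sample2", "sample3"]
    ):
        print(f"{sheet_id!r} in samplesheet vs {info_id!r} in sample_info")
```

Each reported pair points at an ID that should be edited in one of the two files so that both are identical.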

Resuming Failed Runs

Nextflow can resume a failed run from its cached results. Re-run the same command with the -resume flag added:

nextflow run main.nf -profile docker --input samplesheet.csv --sample_info sample_info.csv --outdir results -resume

Getting Help

  1. Check the MultiQC report (results/multiqc/multiqc_report.html) for quality issues
  2. Review .nextflow.log for detailed error messages
  3. Check pipeline execution reports in results/pipeline_info/
  4. Use --help to see all available parameters:
    nextflow run main.nf --help

What Files Do What

Input Files

  • samplesheet.csv - Lists all FASTQ files to be processed
  • sample_info.csv - Provides sample metadata for grouping and meta-analysis
  • Reference genome files - FASTA and GTF files for alignment and quantification

Configuration Files

  • nextflow.config - Main pipeline configuration
  • conf/base.config - Resource allocation settings
  • conf/modules.config - Module-specific parameters
  • conf/test.config - Test data configuration

Workflow Files

  • main.nf - Pipeline entry point
  • workflows/rnaseqmeta.nf - Main workflow logic
  • modules/ - Individual tool wrappers (STAR, Salmon, Kallisto, etc.)
  • subworkflows/ - Reusable workflow components

Credits

nf-core/rnaseqmeta was originally written by Md. Jubayer Hossain and Seqera AI.

We thank the following people for their extensive assistance in the development of this pipeline:

  • The nf-core community for providing the framework and best practices
  • Contributors to all the open-source tools integrated in this pipeline

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #rnaseqmeta channel (you can join with this invite).

Citations

If you use nf-core/rnaseqmeta for your analysis, please cite it using the following DOI: 10.5281/zenodo.XXXXXX

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Key Tools

  • FastQC - Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data.
  • STAR - Dobin A, et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 29(1):15-21.
  • Salmon - Patro R, et al. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 14(4):417-419.
  • Kallisto - Bray NL, et al. (2016). Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 34(5):525-7.
  • featureCounts - Liao Y, et al. (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 30(7):923-30.
  • MultiQC - Ewels P, et al. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 32(19):3047-8.

nf-core Framework

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
