hossainlab/nf-core-rnaseqmeta: RNA-seq Meta-analysis Pipeline

Introduction

nf-core/rnaseqmeta is a bioinformatics pipeline designed for bulk RNA-seq meta-analysis across multiple studies. The pipeline processes raw RNA-seq data through quality control, alignment/quantification, batch correction, differential expression analysis, and meta-analysis to identify robust gene expression changes across multiple datasets.

The pipeline supports three alignment/quantification strategies:

STAR - Genome-based alignment with featureCounts (most accurate, best for splice junction discovery)
Salmon - Fast transcript pseudo-alignment with tximport (excellent balance of speed and accuracy)
Kallisto - Ultra-fast k-mer-based quantification with tximport (ideal for exploratory analysis and large datasets)

NEW: BioJupies Integration

The pipeline now includes optional BioJupies analysis modules for enhanced visualization and functional interpretation:

📊 Interactive Volcano Plots - Publication-ready visualizations with hover information
🧬 Comprehensive Enrichment Analysis - GO, KEGG, Reactome, WikiPathways, TF, Kinase, miRNA
📈 Interactive Bar Charts - Explore enrichment results with ease
💾 Downloadable Results - All data in TSV format for further analysis

Enable these modules with the --enable_biojupies flag. See the BioJupies Usage Guide for details.

Pipeline Summary

The pipeline performs the following steps:

1. Quality Control (FastQC, Fastp) - Read quality assessment and trimming
2. Alignment/Quantification (choice of):
   - STAR + featureCounts - Genome-based alignment and counting
   - Salmon + tximport - Transcript quantification
   - Kallisto + tximport - K-mer-based quantification
3. Batch Correction - ComBat-seq or SVA for removing batch effects
4. Differential Expression Analysis - edgeR/limma for identifying differentially expressed genes
5. Enhanced Visualization (BioJupies) - Interactive volcano plots and visualizations (optional)
6. Functional Enrichment (Enrichr) - GO, pathway, TF, kinase enrichment analysis (optional)
7. Meta-Analysis - Random effects meta-analysis across multiple studies
8. Quality Reporting (MultiQC) - Comprehensive QC report aggregation
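
The README does not say which estimator the meta-analysis step uses. As a minimal sketch of what a random-effects model computes for a single gene, the Python below pools per-study log2 fold changes with the DerSimonian-Laird estimator, which is one common choice and is assumed here only for illustration. The function name `random_effects_meta` and the input numbers are hypothetical and are not taken from the pipeline.

```python
import math


def random_effects_meta(effects, std_errors):
    """Pool per-study effects with a DerSimonian-Laird random-effects model.

    effects:    per-study estimates for one gene (e.g. log2 fold changes)
    std_errors: matching standard errors
    Returns (pooled_effect, pooled_se, tau2).
    """
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need matching effects and standard errors from >= 2 studies")

    variances = [se ** 2 for se in std_errors]
    weights = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    total_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total_w

    # Cochran's Q measures how much the studies disagree with the fixed estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    k = len(effects)
    c = total_w - sum(w ** 2 for w in weights) / total_w

    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Re-weight each study by its total (within + between) variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    pooled_se = math.sqrt(1.0 / sum(re_weights))
    return pooled, pooled_se, tau2


# Hypothetical log2 fold changes for one gene in three studies.
log2fc = [1.2, 0.8, 1.5]
se = [0.30, 0.25, 0.40]
pooled, pooled_se, tau2 = random_effects_meta(log2fc, se)
z = pooled / pooled_se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under a normal approximation
print(f"pooled log2FC={pooled:.3f} SE={pooled_se:.3f} tau2={tau2:.3f} p={p_value:.3g}")
```

When the estimated between-study variance (tau2) is zero, the result reduces to the fixed-effect inverse-variance estimate; larger values pull the study weights toward equality and widen the pooled standard error.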

Quick Start

Prerequisites

Before running the pipeline, ensure you have:

Nextflow (version ≥24.04.2)
```
curl -s https://get.nextflow.io | bash
```
Container engine or environment manager (choose one):
- Docker
- Singularity
- Conda
Input Data:
- Raw RNA-seq FASTQ files (single-end or paired-end)
- Reference genome files (FASTA and GTF) or genome name (e.g., GRCh38)
- Sample metadata for meta-analysis

Test the Pipeline

# Clone the repository
git clone https://github.com/hossainlab/nf-core-rnaseqmeta.git
cd nf-core-rnaseqmeta

# Run test to verify everything works
nextflow run main.nf -profile test --outdir test_results

Usage

Note

If you are new to Nextflow and nf-core, please refer to the nf-core installation documentation for how to set up Nextflow. Test your setup with -profile test before running the workflow on actual data.

Preparing Input Files

1. Samplesheet (Required)

Create a CSV file with your FASTQ file paths:

samplesheet.csv:

sample,fastq_1,fastq_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
sample3,/path/to/sample3_R1.fastq.gz,

Column descriptions:

sample - Unique sample identifier (required)
fastq_1 - Path to read 1 FASTQ file (required)
fastq_2 - Path to read 2 FASTQ file (optional, leave empty for single-end data)
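
Checking the samplesheet before launching a run can save a failed start. The sketch below applies only the rules stated in the column descriptions above (a unique, non-empty `sample` and a non-empty `fastq_1`); the function name `check_samplesheet` is hypothetical, and the pipeline's own validation may be stricter.

```python
import csv
import io


def check_samplesheet(handle):
    """Return a list of problems found in a samplesheet; an empty list means none."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in ("sample", "fastq_1") if col not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    for row in reader:
        line_no = reader.line_num  # line number of the row just read
        sample = (row.get("sample") or "").strip()
        fastq_1 = (row.get("fastq_1") or "").strip()
        fastq_2 = (row.get("fastq_2") or "").strip()  # empty means single-end

        if not sample:
            problems.append(f"line {line_no}: empty sample name")
        elif sample in seen:
            problems.append(f"line {line_no}: duplicate sample '{sample}'")
        seen.add(sample)

        if not fastq_1:
            problems.append(f"line {line_no}: fastq_1 is required")
        elif fastq_1 == fastq_2:
            problems.append(f"line {line_no}: fastq_1 and fastq_2 are the same file")
    return problems


example = """sample,fastq_1,fastq_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample3,/path/to/sample3_R1.fastq.gz,
"""
print(check_samplesheet(io.StringIO(example)))  # prints []
```

To check a real file, pass an open file object instead of the `io.StringIO` wrapper, for example `check_samplesheet(open("samplesheet.csv", newline=""))`.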

2. Sample Information File (Required)

Create a CSV file with sample metadata for meta-analysis:

sample_info.csv:

sample_id,condition,batch
sample1,treatment,batch1
sample2,control,batch1
sample3,treatment,batch2

Column descriptions:

sample_id - Must match sample names in samplesheet (required)
condition - Experimental condition, e.g., treatment, control, case (required)
batch - Batch or study identifier for batch correction (optional but recommended)
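
Because `sample_id` must match the `sample` column of the samplesheet, a quick cross-check of the two files is useful. The following is a sketch only: the helper names `read_column` and `compare_sample_ids` are hypothetical, and the inline data is made up to show a capitalization mismatch.

```python
import csv
import io


def read_column(handle, column):
    """Read one column from a CSV file object, keeping values exactly as written."""
    reader = csv.DictReader(handle)
    if column not in (reader.fieldnames or []):
        raise ValueError(f"column '{column}' not found")
    return [row[column] or "" for row in reader]


def compare_sample_ids(samplesheet, sample_info):
    """Return (only_in_samplesheet, only_in_sample_info, near_misses)."""
    sheet_ids = set(read_column(samplesheet, "sample"))
    info_ids = set(read_column(sample_info, "sample_id"))
    only_sheet = sorted(sheet_ids - info_ids)
    only_info = sorted(info_ids - sheet_ids)

    # Pair up IDs that differ only by surrounding spaces or letter case.
    normalised = {name.strip().lower(): name for name in only_info}
    near_misses = [
        (name, normalised[name.strip().lower()])
        for name in only_sheet
        if name.strip().lower() in normalised
    ]
    return only_sheet, only_info, near_misses


sheet = io.StringIO("sample,fastq_1,fastq_2\nsample1,a_R1.fastq.gz,\nSample2,b_R1.fastq.gz,\n")
info = io.StringIO("sample_id,condition,batch\nsample1,treatment,batch1\nsample2,control,batch1\n")
only_sheet, only_info, near_misses = compare_sample_ids(sheet, info)
print("only in samplesheet:", only_sheet)
print("only in sample_info:", only_info)
print("probable typos:", near_misses)
```

Values are deliberately not trimmed when read, so that stray spaces show up as mismatches rather than being silently hidden.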

Running the Pipeline

Option 1: STAR Alignment (Default - Most Accurate)

Best for: Genome-based alignment, splice junction discovery, highest accuracy

nextflow run main.nf \
    -profile docker \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --genome GRCh38 \
    --outdir results

Option 2: Salmon Quantification (Fast & Accurate)

Best for: Balanced speed and accuracy, transcript-level analysis

nextflow run main.nf \
    -profile docker \
    --aligner salmon \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --transcript_fasta transcripts.fa \
    --gtf genes.gtf \
    --outdir results

Option 3: Kallisto Quantification (Fastest)

Best for: Exploratory analysis, large datasets, rapid iteration

nextflow run main.nf \
    -profile docker \
    --aligner kallisto \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --transcript_fasta transcripts.fa \
    --gtf genes.gtf \
    --outdir results

Using Custom Reference Files

nextflow run main.nf \
    -profile docker \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --fasta /path/to/genome.fa \
    --gtf /path/to/annotation.gtf \
    --outdir results

Advanced Usage with Execution Reports

nextflow run main.nf \
    -profile docker \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --genome GRCh38 \
    --outdir results \
    --perform_batch_correction true \
    --perform_meta_analysis true \
    -with-report execution_report.html \
    -with-timeline timeline.html \
    -with-trace trace.txt

With BioJupies Enhanced Analysis

nextflow run main.nf \
    -profile docker \
    --input samplesheet.csv \
    --sample_info sample_info.csv \
    --genome GRCh38 \
    --enable_biojupies \
    --biojupies_enrichment_types 'go,pathway,tf' \
    --outdir results

This will generate:

Interactive volcano plots for each comparison
GO enrichment analysis (Biological Process, Molecular Function, Cellular Component)
Pathway enrichment (KEGG, WikiPathways, Reactome)
Transcription factor target enrichment
All results in interactive HTML and downloadable TSV formats

See the BioJupies Usage Guide for more options and details.

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
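
Nextflow's -params-file option accepts a JSON or YAML file whose keys are parameter names without the leading dashes. As a small sketch, the Python below renders such a file from the parameters used in the examples above; the helper names `build_params` and `render_params` and the file name `params.json` are assumptions made for illustration.

```python
import json


def build_params(input_csv, sample_info_csv, outdir, **options):
    """Collect pipeline parameters in a dict keyed by parameter name (no leading dashes)."""
    params = {"input": input_csv, "sample_info": sample_info_csv, "outdir": outdir}
    params.update(options)
    return params


def render_params(params):
    """Serialize the parameters as JSON text suitable for -params-file."""
    return json.dumps(params, indent=2, sort_keys=True)


params = build_params(
    "samplesheet.csv",
    "sample_info.csv",
    "results",
    genome="GRCh38",
    aligner="star",
    perform_batch_correction=True,
    perform_meta_analysis=True,
)
print(render_params(params))
```

Saving the printed text as params.json lets the run be started with `nextflow run main.nf -profile docker -params-file params.json`, which keeps parameters out of custom config files as the warning above requires.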

Parameters

Required Parameters

| Parameter | Description |
| --- | --- |
| `--input` | Path to samplesheet CSV file |
| `--sample_info` | Path to sample information CSV file |
| `--outdir` | Output directory for results |

Reference Genome Options

| Parameter | Description | Default |
| --- | --- | --- |
| `--genome` | Name of iGenomes reference (e.g., GRCh38, GRCm38) | `null` |
| `--fasta` | Path to genome FASTA file | `null` |
| `--gtf` | Path to genome annotation GTF file | `null` |
| `--transcript_fasta` | Path to transcript FASTA file (required for Salmon/Kallisto) | `null` |
| `--igenomes_base` | Base path to iGenomes references | `s3://ngi-igenomes/igenomes/` |
| `--igenomes_ignore` | Ignore iGenomes configuration | `false` |

Analysis Options

| Parameter | Description | Default |
| --- | --- | --- |
| `--aligner` | Alignment/quantification tool: `star`, `salmon`, or `kallisto` | `star` |
| `--lib_type` | Salmon library type (A for auto-detection) | `A` |
| `--perform_batch_correction` | Enable batch effect correction | `true` |
| `--perform_meta_analysis` | Enable meta-analysis | `true` |

Execution Profiles

| Profile | Description |
| --- | --- |
| `-profile test` | Run with test data |
| `-profile docker` | Use Docker containers |
| `-profile singularity` | Use Singularity containers |
| `-profile conda` | Use Conda environments |

Pipeline Output

Output Directory Structure

results/
├── fastqc/                     # FastQC reports for raw reads
├── fastp/                      # Fastp trimming reports and trimmed reads
├── star/                       # STAR alignment results (if using --aligner star)
│   ├── *.bam                   # Aligned BAM files
│   └── *Log.final.out          # Alignment statistics
├── salmon/                     # Salmon quantification (if using --aligner salmon)
│   └── */quant.sf              # Transcript quantification files
├── kallisto/                   # Kallisto quantification (if using --aligner kallisto)
│   └── */abundance.tsv         # Transcript quantification files
├── featurecounts/              # Gene count matrices (STAR mode)
│   └── *.featureCounts.txt     # Gene-level counts
├── batch_correction/           # Batch-corrected count matrices
│   └── *_batch_corrected.txt
├── differential_expression/    # DE analysis results
│   └── *_de_results.txt
├── biojupies/                  # BioJupies enhanced analysis (if --enable_biojupies)
│   ├── volcano_plots/          # Interactive and static volcano plots
│   │   ├── *_volcano.html      # Interactive plots
│   │   └── *_volcano.png       # Publication-ready images
│   └── enrichment/             # Functional enrichment results
│       ├── *_enrichment_results.tsv        # All enrichment data
│       ├── *_enrichment_summary.txt        # Summary statistics
│       └── *_enrichment_*.html             # Interactive enrichment plots
├── meta_analysis/              # Meta-analysis results
│   └── *_meta_analysis_results.txt
├── multiqc/                    # Comprehensive QC report
│   └── multiqc_report.html     # Main quality control report
└── pipeline_info/              # Pipeline execution information
    ├── execution_report.html
    ├── execution_timeline.html
    └── execution_trace.txt

Key Output Files

Quality Control

multiqc/multiqc_report.html - Comprehensive quality control report with all QC metrics
fastqc/*_fastqc.html - Individual FastQC reports for each sample
fastp/*.fastp.json - Trimming statistics and quality metrics

Quantification

featurecounts/*.featureCounts.txt - Gene count matrices (STAR mode)
salmon/*/quant.sf - Transcript abundance estimates (Salmon mode)
kallisto/*/abundance.tsv - Transcript abundance estimates (Kallisto mode)
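
Salmon's quant.sf and Kallisto's abundance.tsv are both tab-separated tables with one row per transcript, but they use different column names. The sketch below reads either into the same structure; the function name `read_abundance` and the transcript ID `tx1` are hypothetical. These values are transcript-level, so they are not a substitute for the gene-level summaries that the pipeline produces with tximport.

```python
import csv
import io

# Column names in each tool's tab-separated output table.
COLUMNS = {
    "salmon": {"id": "Name", "tpm": "TPM", "count": "NumReads"},
    "kallisto": {"id": "target_id", "tpm": "tpm", "count": "est_counts"},
}


def read_abundance(handle, tool):
    """Map transcript ID -> (TPM, estimated read count) for a quantification table."""
    cols = COLUMNS[tool]
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row[cols["id"]]: (float(row[cols["tpm"]]), float(row[cols["count"]]))
        for row in reader
    }


# Tiny made-up tables in each format.
salmon_text = "Name\tLength\tEffectiveLength\tTPM\tNumReads\ntx1\t1000\t850.0\t12.5\t40.0\n"
kallisto_text = "target_id\tlength\teff_length\test_counts\ttpm\ntx1\t1000\t850.0\t40.0\t12.5\n"

print(read_abundance(io.StringIO(salmon_text), "salmon"))
print(read_abundance(io.StringIO(kallisto_text), "kallisto"))
```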

Analysis Results

batch_correction/*_batch_corrected.txt - Batch-corrected count matrices
differential_expression/*_de_results.txt - Differentially expressed genes with statistics (log2FC, p-value, FDR)
meta_analysis/*_meta_analysis_results.txt - Meta-analysis results with effect sizes and combined p-values

Choosing an Aligner

All three options produce gene-level counts compatible with the downstream analysis (STAR via featureCounts, Salmon and Kallisto via tximport):

| Aligner | Speed | Accuracy | Best For | Requirements |
| --- | --- | --- | --- | --- |
| STAR | Slow | Highest | Splice junction discovery, most accurate quantification | Genome FASTA + GTF |
| Salmon | Fast | High | Balanced workflow, transcript-level analysis | Transcript FASTA + GTF |
| Kallisto | Fastest | High | Large datasets, exploratory analysis, quick iterations | Transcript FASTA + GTF |

Resource Requirements

Minimum Requirements

CPU: 4 cores
Memory: 8 GB RAM
Storage: ~10x input data size for intermediate files

Recommended Requirements

CPU: 8+ cores
Memory: 32 GB RAM
Storage: Sufficient space for outputs (typically 5-10x input size)

Typical Runtime

Small dataset (10 samples, 20M reads each): 2-4 hours (Kallisto), 4-8 hours (STAR)
Medium dataset (50 samples, 30M reads each): 8-16 hours (Kallisto), 16-24 hours (STAR)

Troubleshooting

Common Issues

Memory Issues
- Increase memory allocation in your execution profile
- Use Kallisto for faster analysis with lower memory footprint
- Process fewer samples at a time
Container/Environment Issues
- Ensure Docker/Singularity is properly installed and running
- Try alternative container engine: -profile singularity or -profile conda
- On Mac M1/M2: Use -profile docker,arm
Input File Issues
- Use absolute paths for all input files
- Verify FASTQ files are readable and not corrupted
- Ensure sample names match between samplesheet and sample_info
Sample Name Mismatch
- Sample IDs in samplesheet.csv must exactly match those in sample_info.csv
- Check for extra spaces, different capitalization, or special characters
Reference Genome Issues
- For Salmon/Kallisto: Ensure transcript FASTA matches the GTF annotation
- For STAR: Ensure genome FASTA and GTF are from the same release

Resuming Failed Runs

Nextflow can resume a failed run from its cached task results. Re-run the same command with -resume added:

nextflow run main.nf -profile docker --input samplesheet.csv --sample_info sample_info.csv --outdir results -resume

Getting Help

Check the MultiQC report (results/multiqc/multiqc_report.html) for quality issues
Review .nextflow.log for detailed error messages
Check pipeline execution reports in results/pipeline_info/
Use --help to see all available parameters:
```
nextflow run main.nf --help
```

What Files Do What

Input Files

samplesheet.csv - Lists all FASTQ files to be processed
sample_info.csv - Provides sample metadata for grouping and meta-analysis
Reference genome files - FASTA and GTF files for alignment and quantification

Configuration Files

nextflow.config - Main pipeline configuration
conf/base.config - Resource allocation settings
conf/modules.config - Module-specific parameters
conf/test.config - Test data configuration

Workflow Files

main.nf - Pipeline entry point
workflows/rnaseqmeta.nf - Main workflow logic
modules/ - Individual tool wrappers (STAR, Salmon, Kallisto, etc.)
subworkflows/ - Reusable workflow components

Credits

nf-core/rnaseqmeta was originally written by Md. Jubayer Hossain and Seqera AI.

We thank the following people for their extensive assistance in the development of this pipeline:

The nf-core community for providing the framework and best practices
Contributors to all the open-source tools integrated in this pipeline

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #rnaseqmeta channel (you can join with this invite).

Citations

If you use nf-core/rnaseqmeta for your analysis, please cite it using the following DOI: 10.5281/zenodo.XXXXXX

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Key Tools

FastQC - Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data.
STAR - Dobin A, et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 29(1):15-21.
Salmon - Patro R, et al. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 14(4):417-419.
Kallisto - Bray NL, et al. (2016). Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 34(5):525-7.
featureCounts - Liao Y, et al. (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 30(7):923-30.
MultiQC - Ewels P, et al. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 32(19):3047-8.

nf-core Framework

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
