nf-core/rnaseqmeta is a bioinformatics pipeline designed for bulk RNA-seq meta-analysis across multiple studies. The pipeline processes raw RNA-seq data through quality control, alignment/quantification, batch correction, differential expression analysis, and meta-analysis to identify robust gene expression changes across multiple datasets.
The pipeline supports three alignment/quantification strategies:
- STAR - Genome-based alignment with featureCounts (most accurate, best for splice junction discovery)
- Salmon - Fast transcript pseudo-alignment with tximport (excellent balance of speed and accuracy)
- Kallisto - Ultra-fast k-mer-based quantification with tximport (ideal for exploratory analysis and large datasets)
The pipeline now includes optional BioJupies analysis modules for enhanced visualization and functional interpretation:
- 📊 Interactive Volcano Plots - Publication-ready visualizations with hover information
- 🧬 Comprehensive Enrichment Analysis - GO, KEGG, Reactome, WikiPathways, TF, Kinase, miRNA
- 📈 Interactive Bar Charts - Explore enrichment results with ease
- 💾 Downloadable Results - All data in TSV format for further analysis
Enable these modules with the --enable_biojupies flag. See the BioJupies Usage Guide for details.
The pipeline performs the following steps:
- Quality Control (FastQC, Fastp) - Read quality assessment and trimming
- Alignment/Quantification (choice of):
  - STAR with featureCounts
  - Salmon with tximport
  - Kallisto with tximport
- Batch Correction - ComBat-seq or SVA for removing batch effects
- Differential Expression Analysis - edgeR/limma for identifying differentially expressed genes
- Enhanced Visualization (BioJupies) - Interactive volcano plots and visualizations (optional)
- Functional Enrichment (Enrichr) - GO, pathway, TF, kinase enrichment analysis (optional)
- Meta-Analysis - Random effects meta-analysis across multiple studies
- Quality Reporting (MultiQC) - Comprehensive QC report aggregation
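The meta-analysis step pools per-study effect sizes under a random-effects model. This README does not say which estimator the pipeline uses, so the sketch below assumes the DerSimonian-Laird estimator, a common choice. The function name `random_effects_pool` and the two-column input layout are hypothetical. It only illustrates the idea for a single gene.

```shell
# DerSimonian-Laird random-effects pooling for one gene (illustrative only;
# the estimator is an assumption, not taken from the pipeline source).
# Input: one study per line as "effect variance" (e.g. log2FC and squared SE).
# Output: pooled effect, its standard error, and between-study variance tau^2.
random_effects_pool() {
    awk '
        NF >= 2 { y[++k] = $1; v[k] = $2 }
        END {
            if (k < 2) { print "need at least two studies" > "/dev/stderr"; exit 1 }
            # Fixed-effect (inverse-variance) weights and pooled estimate
            for (i = 1; i <= k; i++) { w = 1 / v[i]; sw += w; sw2 += w * w; swy += w * y[i] }
            fixed = swy / sw
            # Cochran Q statistic and the moment estimate of tau^2, truncated at 0
            for (i = 1; i <= k; i++) q += (y[i] - fixed) ^ 2 / v[i]
            c = sw - sw2 / sw
            tau2 = (q - (k - 1)) / c
            if (tau2 < 0) tau2 = 0
            # Random-effects weights add tau^2 to each study variance
            for (i = 1; i <= k; i++) { w = 1 / (v[i] + tau2); rsw += w; rswy += w * y[i] }
            printf "%.4f %.4f %.4f\n", rswy / rsw, sqrt(1 / rsw), tau2
        }
    ' "$@"
}
```

For example, `printf '0 0.5\n2 0.5\n' | random_effects_pool` prints `1.0000 1.0000 1.5000`: the two studies disagree, so tau^2 is positive and the pooled standard error is wider than the fixed-effect value of 0.5.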
Before running the pipeline, ensure you have:
- Nextflow (version ≥24.04.2)
  curl -s https://get.nextflow.io | bash
- Container Engine (choose one): Docker or Singularity (Conda environments are also supported through -profile conda)
- Input Data:
  - Raw RNA-seq FASTQ files (single-end or paired-end)
  - Reference genome files (FASTA and GTF) or genome name (e.g., GRCh38)
  - Sample metadata for meta-analysis
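To confirm that an installed Nextflow meets the minimum version stated above, a small comparison helper can be used. This is a sketch: `version_ge` is a hypothetical helper name, and it relies on `sort -V`, which is available in GNU coreutils and recent BSD/macOS `sort`.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# The smaller of the two versions sorts first; if that is $2, then $1 >= $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (assumes `nextflow` is on PATH; not executed here):
#   installed=$(nextflow -version | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
#   version_ge "$installed" 24.04.2 || echo "Nextflow $installed is older than 24.04.2"
```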
# Clone the repository
git clone https://github.com/hossainlab/nf-core-rnaseqmeta.git
cd nf-core-rnaseqmeta
# Run test to verify everything works
nextflow run main.nf -profile test --outdir test_results
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
Create a CSV file with your FASTQ file paths:
samplesheet.csv:
sample,fastq_1,fastq_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
sample3,/path/to/sample3_R1.fastq.gz,

Column descriptions:
- sample - Unique sample identifier (required)
- fastq_1 - Path to read 1 FASTQ file (required)
- fastq_2 - Path to read 2 FASTQ file (optional, leave empty for single-end data)
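Before launching a run, the samplesheet can be sanity-checked against the column rules above. The following is a minimal sketch, not the pipeline's own validation; `check_samplesheet` is a hypothetical helper, and it assumes a plain comma-separated file without quoted fields.

```shell
# Minimal samplesheet check: header must be sample,fastq_1,fastq_2; every row
# needs a sample name and a fastq_1 path; sample names must be unique.
# Prints one message per problem and returns non-zero if any were found.
check_samplesheet() {
    awk -F',' '
        NR == 1 {
            sub(/\r$/, "")                      # tolerate Windows line endings
            if ($0 != "sample,fastq_1,fastq_2") {
                print "unexpected header: " $0
                bad = 1
            }
            next
        }
        /^[ \t]*$/ { next }                     # skip blank lines
        $1 == ""   { print "line " NR ": missing sample";  bad = 1 }
        $2 == ""   { print "line " NR ": missing fastq_1"; bad = 1 }
        seen[$1]++ { print "line " NR ": duplicate sample " $1; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Usage: check_samplesheet samplesheet.csv && echo "samplesheet looks OK"
```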
Create a CSV file with sample metadata for meta-analysis:
sample_info.csv:
sample_id,condition,batch
sample1,treatment,batch1
sample2,control,batch1
sample3,treatment,batch2

Column descriptions:
- sample_id - Must match sample names in samplesheet (required)
- condition - Experimental condition, e.g., treatment, control, case (required)
- batch - Batch or study identifier for batch correction (optional but recommended)
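Because sample_id must match the sample names in the samplesheet, it can help to compare the two files before a run. The sketch below assumes the ID is the first column of both CSVs, as in the examples above; `compare_sample_ids` is a hypothetical helper name. The comparison is exact, so stray spaces or capitalization differences show up as mismatches.

```shell
# List sample IDs that appear in only one of the two files.
# Returns 0 when the ID sets are identical, 1 otherwise.
compare_sample_ids() {
    local sheet="$1" info="$2" a b status=0
    a=$(mktemp)
    b=$(mktemp)
    # Drop the header, keep column 1, remove empty lines, sort uniquely
    tail -n +2 "$sheet" | cut -d',' -f1 | sed '/^$/d' | LC_ALL=C sort -u > "$a"
    tail -n +2 "$info"  | cut -d',' -f1 | sed '/^$/d' | LC_ALL=C sort -u > "$b"
    LC_ALL=C comm -23 "$a" "$b" | sed 's/^/only in samplesheet: /'
    LC_ALL=C comm -13 "$a" "$b" | sed 's/^/only in sample_info: /'
    if ! cmp -s "$a" "$b"; then status=1; fi
    rm -f "$a" "$b"
    return "$status"
}

# Usage: compare_sample_ids samplesheet.csv sample_info.csv
```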
Best for: Genome-based alignment, splice junction discovery, highest accuracy
nextflow run main.nf \
-profile docker \
--input samplesheet.csv \
--sample_info sample_info.csv \
--genome GRCh38 \
--outdir results

Best for: Balanced speed and accuracy, transcript-level analysis
nextflow run main.nf \
-profile docker \
--aligner salmon \
--input samplesheet.csv \
--sample_info sample_info.csv \
--transcript_fasta transcripts.fa \
--gtf genes.gtf \
--outdir results

Best for: Exploratory analysis, large datasets, rapid iteration
nextflow run main.nf \
-profile docker \
--aligner kallisto \
--input samplesheet.csv \
--sample_info sample_info.csv \
--transcript_fasta transcripts.fa \
--gtf genes.gtf \
--outdir results

To use custom reference files instead of an iGenomes genome name:
nextflow run main.nf \
-profile docker \
--input samplesheet.csv \
--sample_info sample_info.csv \
--fasta /path/to/genome.fa \
--gtf /path/to/annotation.gtf \
--outdir results

To run with batch correction, meta-analysis, and Nextflow execution reports enabled:
nextflow run main.nf \
-profile docker \
--input samplesheet.csv \
--sample_info sample_info.csv \
--genome GRCh38 \
--outdir results \
--perform_batch_correction true \
--perform_meta_analysis true \
-with-report execution_report.html \
-with-timeline timeline.html \
-with-trace trace.txt

To enable the BioJupies modules:
nextflow run main.nf \
-profile docker \
--input samplesheet.csv \
--sample_info sample_info.csv \
--genome GRCh38 \
--enable_biojupies \
--biojupies_enrichment_types 'go,pathway,tf' \
--outdir results

This will generate:
- Interactive volcano plots for each comparison
- GO enrichment analysis (Biological Process, Molecular Function, Cellular Component)
- Pathway enrichment (KEGG, WikiPathways, Reactome)
- Transcription factor target enrichment
- All results in interactive HTML and downloadable TSV formats
See the BioJupies Usage Guide for more options and details.
Warning
Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
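As an illustration of the -params-file route, the snippet below writes the parameters from the Salmon example into a YAML file. Only parameter names documented in this README are used; the file name `params.yaml` and the values are placeholders to adapt.

```shell
# Write pipeline parameters to a YAML params file (values are placeholders).
cat > params.yaml <<'EOF'
input: samplesheet.csv
sample_info: sample_info.csv
aligner: salmon
transcript_fasta: transcripts.fa
gtf: genes.gtf
outdir: results
EOF

# Then launch with (not executed here):
#   nextflow run main.nf -profile docker -params-file params.yaml
```

Keeping parameters in a file makes a run easier to reproduce and review than a long command line.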
| Parameter | Description |
|---|---|
| --input | Path to samplesheet CSV file |
| --sample_info | Path to sample information CSV file |
| --outdir | Output directory for results |
| Parameter | Description | Default |
|---|---|---|
| --genome | Name of iGenomes reference (e.g., GRCh38, GRCm38) | null |
| --fasta | Path to genome FASTA file | null |
| --gtf | Path to genome annotation GTF file | null |
| --transcript_fasta | Path to transcript FASTA file (required for Salmon/Kallisto) | null |
| --igenomes_base | Base path to iGenomes references | s3://ngi-igenomes/igenomes/ |
| --igenomes_ignore | Ignore iGenomes configuration | false |
| Parameter | Description | Default |
|---|---|---|
| --aligner | Alignment/quantification tool: star, salmon, or kallisto | star |
| --lib_type | Salmon library type (A for auto-detection) | A |
| --perform_batch_correction | Enable batch effect correction | true |
| --perform_meta_analysis | Enable meta-analysis | true |
| Profile | Description |
|---|---|
| -profile test | Run with test data |
| -profile docker | Use Docker containers |
| -profile singularity | Use Singularity containers |
| -profile conda | Use Conda environments |
results/
├── fastqc/ # FastQC reports for raw reads
├── fastp/ # Fastp trimming reports and trimmed reads
├── star/ # STAR alignment results (if using --aligner star)
│ ├── *.bam # Aligned BAM files
│ └── *Log.final.out # Alignment statistics
├── salmon/ # Salmon quantification (if using --aligner salmon)
│ └── */quant.sf # Transcript quantification files
├── kallisto/ # Kallisto quantification (if using --aligner kallisto)
│ └── */abundance.tsv # Transcript quantification files
├── featurecounts/ # Gene count matrices (STAR mode)
│ └── *.featureCounts.txt # Gene-level counts
├── batch_correction/ # Batch-corrected count matrices
│ └── *_batch_corrected.txt
├── differential_expression/ # DE analysis results
│ └── *_de_results.txt
├── biojupies/ # BioJupies enhanced analysis (if --enable_biojupies)
│ ├── volcano_plots/ # Interactive and static volcano plots
│ │ ├── *_volcano.html # Interactive plots
│ │ └── *_volcano.png # Publication-ready images
│ └── enrichment/ # Functional enrichment results
│   ├── *_enrichment_results.tsv # All enrichment data
│   ├── *_enrichment_summary.txt # Summary statistics
│   └── *_enrichment_*.html # Interactive enrichment plots
├── meta_analysis/ # Meta-analysis results
│ └── *_meta_analysis_results.txt
├── multiqc/ # Comprehensive QC report
│ └── multiqc_report.html # Main quality control report
└── pipeline_info/ # Pipeline execution information
  ├── execution_report.html
  ├── execution_timeline.html
  └── execution_trace.txt
- multiqc/multiqc_report.html - Comprehensive quality control report with all QC metrics
- fastqc/*_fastqc.html - Individual FastQC reports for each sample
- fastp/*.fastp.json - Trimming statistics and quality metrics

- featurecounts/*.featureCounts.txt - Gene count matrices (STAR mode)
- salmon/*/quant.sf - Transcript abundance estimates (Salmon mode)
- kallisto/*/abundance.tsv - Transcript abundance estimates (Kallisto mode)

- batch_correction/*_batch_corrected.txt - Batch-corrected count matrices
- differential_expression/*_de_results.txt - Differentially expressed genes with statistics (log2FC, p-value, FDR)
- meta_analysis/*_meta_analysis_results.txt - Meta-analysis results with effect sizes and combined p-values
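A common next step with the differential expression results is to keep only genes below an FDR cutoff. The exact header of *_de_results.txt is not documented here, so the sketch below takes the column name as an argument and assumes a tab-delimited file with a header row. `filter_below` is a hypothetical helper, and the column name `FDR` in the usage line is an assumption to check against your own output.

```shell
# Print the header plus rows whose value in the named column is numerically
# below the threshold. Non-numeric entries such as NA are skipped.
filter_below() {
    local file="$1" column="$2" threshold="$3"
    awk -F'\t' -v col="$column" -v max="$threshold" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 2 }
            print
            next
        }
        $idx ~ /^[0-9.eE+-]+$/ && ($idx + 0) < (max + 0)
    ' "$file"
}

# Usage (column name assumed):
#   filter_below results/differential_expression/my_de_results.txt FDR 0.05
```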
All three aligners produce gene-level counts compatible with downstream analysis:
| Aligner | Speed | Accuracy | Best For | Requirements |
|---|---|---|---|---|
| STAR | Slow | Highest | Splice junction discovery, most accurate quantification | Genome FASTA + GTF |
| Salmon | Fast | High | Balanced workflow, transcript-level analysis | Transcript FASTA + GTF |
| Kallisto | Fastest | High | Large datasets, exploratory analysis, quick iterations | Transcript FASTA + GTF |
Minimum:
- CPU: 4 cores
- Memory: 8 GB RAM
- Storage: ~10x input data size for intermediate files

Recommended:
- CPU: 8+ cores
- Memory: 32 GB RAM
- Storage: Sufficient space for outputs (typically 5-10x input size)
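The ~10x rule of thumb above can be turned into a quick estimate from the samplesheet. This is a rough sketch: `estimate_storage_bytes` is a hypothetical helper, it reads the FASTQ paths from columns 2 and 3, and it silently skips paths that do not exist on the local filesystem.

```shell
# Sum the sizes of the FASTQ files named in a samplesheet and multiply by 10.
# Prints the estimate in bytes.
estimate_storage_bytes() {
    tail -n +2 "$1" | cut -d',' -f2,3 | tr ',' '\n' | sed '/^$/d' | {
        total=0
        while IFS= read -r path || [ -n "$path" ]; do
            [ -f "$path" ] || continue
            size=$(wc -c < "$path")
            total=$((total + size))
        done
        echo $((total * 10))
    }
}

# Usage: estimate_storage_bytes samplesheet.csv
```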
- Small dataset (10 samples, 20M reads each): 2-4 hours (Kallisto), 4-8 hours (STAR)
- Medium dataset (50 samples, 30M reads each): 8-16 hours (Kallisto), 16-24 hours (STAR)
- Memory Issues
  - Increase memory allocation in your execution profile
  - Use Kallisto for faster analysis with lower memory footprint
  - Process fewer samples at a time
- Container/Environment Issues
  - Ensure Docker/Singularity is properly installed and running
  - Try an alternative container engine: -profile singularity or -profile conda
  - On Mac M1/M2: Use -profile docker,arm
- Input File Issues
  - Use absolute paths for all input files
  - Verify FASTQ files are readable and not corrupted
  - Ensure sample names match between samplesheet and sample_info
- Sample Name Mismatch
  - Sample IDs in samplesheet.csv must exactly match those in sample_info.csv
  - Check for extra spaces, different capitalization, or special characters
- Reference Genome Issues
  - For Salmon/Kallisto: Ensure transcript FASTA matches the GTF annotation
  - For STAR: Ensure genome FASTA and GTF are from the same release
Nextflow supports automatic resume of failed runs:
nextflow run main.nf -profile docker --input samplesheet.csv --sample_info sample_info.csv --outdir results -resume

- Check the MultiQC report (results/multiqc/multiqc_report.html) for quality issues
- Review .nextflow.log for detailed error messages
- Check pipeline execution reports in results/pipeline_info/
- Use --help to see all available parameters: nextflow run main.nf --help
- samplesheet.csv - Lists all FASTQ files to be processed
- sample_info.csv - Provides sample metadata for grouping and meta-analysis
- Reference genome files - FASTA and GTF files for alignment and quantification
- nextflow.config - Main pipeline configuration
- conf/base.config - Resource allocation settings
- conf/modules.config - Module-specific parameters
- conf/test.config - Test data configuration
- main.nf - Pipeline entry point
- workflows/rnaseqmeta.nf - Main workflow logic
- modules/ - Individual tool wrappers (STAR, Salmon, Kallisto, etc.)
- subworkflows/ - Reusable workflow components
nf-core/rnaseqmeta was originally written by Md. Jubayer Hossain and Seqera AI.
We thank the following people for their extensive assistance in the development of this pipeline:
- The nf-core community for providing the framework and best practices
- Contributors to all the open-source tools integrated in this pipeline
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on the Slack #rnaseqmeta channel (you can join with this invite).
If you use nf-core/rnaseqmeta for your analysis, please cite it using the following DOI: 10.5281/zenodo.XXXXXX
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
- FastQC - Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data.
- STAR - Dobin A, et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 29(1):15-21.
- Salmon - Patro R, et al. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 14(4):417-419.
- Kallisto - Bray NL, et al. (2016). Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 34(5):525-7.
- featureCounts - Liao Y, et al. (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 30(7):923-30.
- MultiQC - Ewels P, et al. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 32(19):3047-8.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.