tnmquann/metaflow

metaflow

Badges: Nextflow | run with conda | run with docker | run with singularity | run with slurm

metaflow is a robust, modular pipeline built on Nextflow for comprehensive analysis of Illumina short-read metagenomic data. Designed with practical experience in mind, this workflow streamlines both rapid, read-level classification and high-resolution genome assembly, making it accessible to users with minimal computational background. It delivers fast, actionable results for professionals in molecular epidemiology, enabling timely insights for outbreak investigation and surveillance.

This pipeline was developed based on the concepts and design principles of nf-core/mag and SOMA. Many modifications have been made to the original pipelines, reflecting both practical experience and methodological preferences. Most importantly, metaflow is optimized to run efficiently on resource-limited devices, ensuring broad accessibility in both laboratory and field settings.

[Figure: metaflow workflow diagram]

Overview

This pipeline performs taxonomic profiling of shotgun metagenomics data through the following stages:

  • Read-based Analysis

    • Taxonomic assignment for each sequencing read is performed using an optimized combination of sourmash and YACHT. This approach ensures fast, scalable processing, even for large datasets. Reference databases can be customized to fit your specific research needs.
    • Detection of antimicrobial resistance genes (ARGs) is integrated via rgi_bwt, providing actionable insights into resistance profiles.
  • Metagenome Assembly & Binning

    • The pipeline assembles contigs and performs binning to recover Metagenome Assembled Genomes (MAGs), enabling in silico characterization of microbial communities.

For step-by-step installation instructions, parameter explanations, and advanced usage tutorials, please refer to the wiki page.

Pipeline requirements and Installation overview

Software environment

  • Nextflow version 24 or later is required as the workflow engine; it in turn needs Java (versions 16–22) for stable execution.
  • Conda or Mamba is used for reproducible package and environment management, with Mamba preferred for faster dependency resolution.
  • Container runtimes such as Docker or Apptainer are optional but improve portability and reproducibility across heterogeneous systems.

Hardware and System requirements

  • A POSIX-compatible operating system is required, with Linux or macOS preferred; Windows is supported via WSL for compatibility.
  • Memory requirements scale with the assembler: a minimum of 32 GB RAM for MEGAHIT and up to 128 GB for metaSPAdes.
  • Disk usage is substantial, requiring at least 256 GB to accommodate databases, environments, intermediate files, and per-sample outputs; GPU acceleration is optional for select assembly-based steps.
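
The figures above can be compared against a machine before a run. The following is a minimal sketch, assuming a Linux host with /proc/meminfo and a POSIX df; the function names ram_gib and free_disk_gib are hypothetical and not part of metaflow. Values are reported in GiB, so a nominal "32 GB" machine will read slightly below 32 and the comparison should be treated as approximate.

```shell
# Hypothetical helpers, not part of metaflow; Linux-oriented.

# Total RAM in GiB, read from a /proc/meminfo-style file (MemTotal is in kB).
ram_gib() {
    awk '/^MemTotal:/ { printf "%.1f\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Free space in GiB on the filesystem that holds the given path.
free_disk_gib() {
    df -Pk "$1" | awk 'NR == 2 { printf "%.1f\n", $4 / 1024 / 1024 }'
}

# Example use (thresholds are the ones quoted in the list above):
#   echo "RAM:  $(ram_gib) GiB (32 GB for MEGAHIT, up to 128 GB for metaSPAdes)"
#   echo "Disk: $(free_disk_gib /path/to/output/directory) GiB free (at least 256 GB)"
```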

Installation workflow

  • The pipeline can be obtained via Git cloning, downloading a release archive, or pulling directly with Nextflow for version-controlled deployment.
  • Nextflow installation involves validating Java, downloading the executable, optionally adding it to the system PATH, and verifying functionality.
  • Conda or Mamba installation finalizes the environment setup, with optional container runtime installation to enhance execution consistency.
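
The "validating Java" step can be scripted. The sketch below is one possible approach, not the project's own installer: it parses the text printed by java -version and checks the major version against the 16–22 range stated under Software environment. The function names java_major and java_supported are hypothetical.

```shell
# Hypothetical preflight helpers; not part of metaflow.

# Extract the major version from a `java -version` style string,
# e.g. 'openjdk version "17.0.9" 2023-10-17' -> 17, 'java version "1.8.0_392"' -> 8.
java_major() {
    local v
    v=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    case "$v" in
        1.*) v=${v#1.}; printf '%s\n' "${v%%.*}" ;;   # legacy 1.x numbering
        *)   printf '%s\n' "${v%%[.-]*}" ;;           # modern numbering
    esac
}

# Succeed when the major version lies in the 16-22 range quoted in this README.
java_supported() {
    local major
    major=$(java_major "$1")
    [ -n "$major" ] && [ "$major" -ge 16 ] && [ "$major" -le 22 ]
}

# Example use (java prints its version on stderr):
#   java_supported "$(java -version 2>&1)" || echo "Java version outside 16-22" >&2
#   command -v nextflow >/dev/null || echo "nextflow is not on PATH" >&2
#   nextflow -version
```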

Note

metaflow relies on sourmash and YACHT for robust taxonomic classification. For detailed setup instructions, please refer to the Manual Database Setup section.

Input specifications

  • Sample CSV format:
    sample_id,run_id,group,short_reads_1,short_reads_2,long_reads
    sample1,1,0,data/sample1_1.fq.gz,data/sample1_2.fq.gz,data/sample1.fastq.gz
    sample2,0,0,data/sample2_1.fastq.gz,data/sample2_2.fastq.gz,
  • Directory: A directory containing paired-end FASTQ files.
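
A samplesheet can be checked before launching the pipeline. The sketch below is illustrative only: the rules it enforces (exact header, six comma-separated fields per row, non-empty sample_id and short-read paths, optional long_reads) are assumptions inferred from the example above, and check_samplesheet is a hypothetical helper rather than something the pipeline ships.

```shell
# Hypothetical validator; the rules are inferred from the example samplesheet.
check_samplesheet() {
    awk -F',' '
        { sub(/\r$/, "") }                    # tolerate Windows line endings
        NR == 1 {
            if ($0 != "sample_id,run_id,group,short_reads_1,short_reads_2,long_reads") {
                print "unexpected header: " $0
                bad = 1
            }
            next
        }
        NF == 0 { next }                      # ignore blank lines
        NF != 6 {
            print "line " NR ": expected 6 fields, found " NF
            bad = 1
            next
        }
        $1 == "" || $4 == "" || $5 == "" {
            print "line " NR ": sample_id and both short-read paths are required"
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example use:
#   check_samplesheet /path/to/your/samples.csv && echo "samplesheet looks well-formed"
```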

Quick start

Read-based analysis

Once your sourmash and YACHT databases are ready, you can launch the metaflow pipeline. Below are examples for different input formats.

# With csv file
nextflow run main.nf \
    --input /path/to/your/samples.csv \
    --input_format csv \
    -profile conda \
    --outdir /path/to/output/directory \
    --sourmash_database /path/to/your/sourmash_database \
    --yacht_database /path/to/your/yacht_database.json \
    --enable_readbase

# With directory of FASTQ files
nextflow run main.nf \
    --input /path/to/your/fastq_directory \
    --input_format directory \
    -profile conda \
    --outdir /path/to/output/directory \
    --sourmash_database /path/to/your/sourmash_database \
    --yacht_database /path/to/your/yacht_database.json \
    --enable_readbase
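
If the reads sit in a directory but a CSV is preferred (for example, to edit run or group values later), a samplesheet can be generated from the file names. This is a rough sketch under stated assumptions: files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz as in the sample2 row of the example CSV, and run_id and group are filled with the placeholder 0. The helper make_samplesheet is hypothetical.

```shell
# Hypothetical helper: print a samplesheet row for every
# <sample>_1.fastq.gz / <sample>_2.fastq.gz pair found in a directory.
make_samplesheet() {
    local dir r1 r2 sample
    dir=${1%/}
    printf 'sample_id,run_id,group,short_reads_1,short_reads_2,long_reads\n'
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue              # no match: the glob stays literal
        r2=${r1%_1.fastq.gz}_2.fastq.gz
        if [ ! -e "$r2" ]; then
            printf 'skipping %s: no mate file\n' "$r1" >&2
            continue
        fi
        sample=$(basename "$r1" _1.fastq.gz)
        # run_id and group are placeholders (0); long_reads is left empty.
        printf '%s,0,0,%s,%s,\n' "$sample" "$r1" "$r2"
    done
}

# Example use:
#   make_samplesheet /path/to/your/fastq_directory > samples.csv
```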

Assembly-based analysis

For assembly-based workflows, ensure your sourmash database is prepared.

nextflow run main.nf \
    --input /path/to/your/samples.csv \
    --input_format csv \
    -profile conda \
    --outdir /path/to/output/directory \
    --sourmash_database /path/to/your/sourmash_database

Output structure

Read-based subworkflow

The following tree illustrates the organization of output files generated by the read-based subworkflow:

outdir/
├── Databases/
│   └── RGI/
├── Preprocess/
│   ├── Merged_sequences/
│   └── QC/
│       ├── Raw_reads/
│       ├── Remove_HostGenome/
│       └── Trimming/
├── Remove_HostGenome/
├── Trimming/
└── Readbased_Analysis/
    ├── rgi_bwt/
    └── Sourmash-YACHT/
        ├── final_results/
        ├── fastmultigather/
        ├── reads_collected/
        ├── sketches/
        │   ├── single_sketches/
        │   └── batch.batch.manysketch.zip
        ├── taxannotate/
        ├── taxmetagenome/
        └── yacht_results/

Each directory is created automatically by the pipeline and holds the outputs of the corresponding analysis step. Within the output directory, ARG results from the read-based analysis are written to ./Readbased_Analysis/rgi_bwt/, and the read-based taxonomic classification results to ./Readbased_Analysis/Sourmash-YACHT/final_results/.
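
After a run, the two result locations named above can be checked quickly. The sketch below is a convenience under the assumption that the directory layout matches the tree shown; check_readbased_outputs is a hypothetical helper, and the names of the files inside each directory are not assumed.

```shell
# Hypothetical helper: report whether the two read-based result
# directories exist under the given output directory.
check_readbased_outputs() {
    local outdir d missing
    outdir=${1%/}
    missing=0
    for d in Readbased_Analysis/rgi_bwt \
             Readbased_Analysis/Sourmash-YACHT/final_results; do
        if [ -d "$outdir/$d" ]; then
            printf 'found    %s\n' "$outdir/$d"
        else
            printf 'missing  %s\n' "$outdir/$d"
            missing=1
        fi
    done
    return "$missing"
}

# Example use:
#   check_readbased_outputs /path/to/output/directory
```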

Assembly-based subworkflow

The following tree illustrates the organization of output files generated by the assembly-based subworkflow:

outdir/
├── Databases/
│   ├── genomad/
│   ├── bakta/
│   ├── CheckM/
│   ├── CheckM2/
│   ├── GUNC/
│   └── BUSCO/
├── Preprocess/
│   ├── Merged_sequences/
│   └── QC/
│       ├── Raw_reads/
│       ├── Remove_HostGenome/
│       └── Trimming/
├── Remove_HostGenome/
├── Trimming/
├── Assembly/
│   ├── MEGAHIT/
│   │   └── QC/
│   └── SPAdes/
│       └── QC/
├── Binning/
│   ├── Annotation/
│   │   ├── Prokka/
│   │   └── Bakta/
│   ├── Binette/
│   ├── DASTool/
│   ├── COMEBin/
│   ├── MaxBin2/
│   ├── MetaBAT2/
│   ├── SemiBin2/
│   ├── CONCOCT/
│   ├── VAMB/
│   ├── depths/
│   ├── mapping/
│   └── QC/
│       ├── AssemblyStats/
│       ├── CheckM/
│       ├── CheckM2/
│       ├── QUAST/
│       ├── GUNC/
│       └── BUSCO/
├── ContigsAnalysis/
│   ├── annotation/
│   │   ├── Prokka/
│   │   └── pyrodigal/
│   ├── contig_stats/
│   ├── coverage/
│   ├── genomad/
│   ├── skani/
│   └── mapped_reads/
└── Taxonomy/
    ├── sourmash/
    └── Tiara/

Error Handling

The pipeline implements an automatic error recovery strategy:

  • Maximum of 3 retries per task.
  • Automatic resource scaling on retry.
  • Configurable error strategy.
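
The pipeline's own configuration is not reproduced in this README, so the following is only a sketch of how such a policy is commonly expressed in a Nextflow configuration file. It uses the standard errorStrategy, maxRetries, and memory process directives together with task.attempt; the 8 GB base value, the file name, and the helper write_retry_config are illustrative assumptions, not metaflow's shipped defaults.

```shell
# Hypothetical helper: write a small Nextflow config that retries failed
# tasks up to 3 times and scales the memory request with each attempt.
write_retry_config() {
    cat > "$1" <<'EOF'
// Illustrative values only; not metaflow's shipped defaults.
process {
    errorStrategy = 'retry'
    maxRetries    = 3
    memory        = { 8.GB * task.attempt }   // 8 GB, then 16 GB, 24 GB, ...
}
EOF
}

# Example use (Nextflow's -c option layers an extra config file on top):
#   write_retry_config retry.config
#   nextflow run main.nf -c retry.config --input /path/to/your/samples.csv ...
```

Scaling the request by task.attempt is what lets a task that ran out of memory succeed on a later attempt instead of failing the same way again.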

Authors

  • Ton Ngoc Minh Quan
  • Mai Thu Si Nguyen

Citation

If you use this pipeline, please cite:

Minh-Quan T.N., Nguyen M.T.S. (2025). metaflow: A Nextflow pipeline for comprehensive analysis of short-read metagenomic data using sourmash and YACHT.

Acknowledgements

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license. We gratefully acknowledge the nf-core community for their tooling, framework, and support, which have been essential to the development of this pipeline.

The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Empowering bioinformatics communities with Nextflow and nf-core.
Langer, B.E., Amaral, A., Baudement, M.O., Bonath, F. et al.
Genome Biol. 26, 228 (2025). doi: 10.1186/s13059-025-03673-9
