Genomes Generation Pipeline

The MGnify genomes generation pipeline (GGP) produces prokaryotic and eukaryotic metagenome-assembled genomes (MAGs) from raw reads and their corresponding assemblies.

This pipeline does not support co-binning.

Pipeline summary

The pipeline performs the following tasks:

Accepts short reads as input.
Changes read headers to their corresponding assembly accessions (in the ERZ namespace).
Quality-trims the reads and removes adapters using fastp.

Afterward, the pipeline:

Runs a decontamination step using BWA to remove any host reads. By default, it uses the hg38 human genome reference.
Bins the contigs using Concoct, MetaBAT2 and MaxBin2.

For prokaryotes:

Refines the bins using Binette.
Conducts bin quality control with CAT, GUNC, and CheckM2.
Performs dereplication with dRep.
Calculates coverage from the depths computed by MetaBAT2.
Detects rRNA and tRNA using cmsearch.
Assigns taxonomy with GTDBtk.

For eukaryotes:

Estimates quality and merges bins using EukCC.
Dereplicates MAGs using dRep.
Calculates coverage from the depths computed by MetaBAT2.
Assesses quality with BUSCO and EukCC.
Assigns taxonomy with BAT.

Final steps:

Tool versions are recorded in software_versions.yml.
The pipeline generates a TSV table for the public MAG uploader.
TODO: finish MultiQC

Usage

If this is the first time you are running Nextflow, please refer to this page.

Required reference databases

Download the databases listed below and pass their locations to the pipeline through the corresponding parameters.

Tools and databases used in the pipeline

Tool/Database	Version	Purpose
BUSCO	5.4.7 (DB v2024-01-08)	Assign genome quality
CAT	5.2.3 (DB v2021-01-07)	Taxonomic classification
CheckM2	1.0.1	Determining genome quality
EukCC	2.1.3 (DB v1.2)	Completeness and contamination of eukaryotic genomes
GUNC	4 (DB v2.0.4)	Quality control
GTDB-Tk + ar53_metadata_r.tsv, bac120_metadata_r.tsv from here	2.3.0 (DB release214)	Assigning taxonomy; generating alignments
Rfam	14.9	Database for identification of SSU/LSU rRNA and other ncRNAs
Human reference genome hg38	hg38	Reference genome of your choice for decontamination (for example, human), including its bwa-mem2 index. Expected layout: `genome.fna` plus a `bwa-mem2/` folder containing `*.fna.0123`, `*.fna.amb`, `*.fna.ann`, `*.fna.bwt.2bit.64` and `*.fna.pac`
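
The reference layout in the last table row can be checked before launching a run. The sketch below is a hypothetical helper, not part of the pipeline: the function name `missing_index_files` and the idea of passing one reference directory are assumptions. It only tests that `genome.fna` exists and that the `bwa-mem2/` folder holds one file for each listed suffix.

```python
from pathlib import Path

# Index suffixes listed in the table row above.
INDEX_SUFFIXES = (".fna.0123", ".fna.amb", ".fna.ann", ".fna.bwt.2bit.64", ".fna.pac")


def missing_index_files(ref_dir):
    """Return the expected reference files that are absent from ref_dir.

    Hypothetical helper: assumes ref_dir holds genome.fna and a bwa-mem2/ folder.
    """
    ref_dir = Path(ref_dir)
    missing = []
    if not (ref_dir / "genome.fna").is_file():
        missing.append("genome.fna")
    index_dir = ref_dir / "bwa-mem2"
    names = [p.name for p in index_dir.iterdir()] if index_dir.is_dir() else []
    for suffix in INDEX_SUFFIXES:
        # Any file name ending in the suffix counts, whatever its prefix.
        if not any(name.endswith(suffix) for name in names):
            missing.append("bwa-mem2/*" + suffix)
    return missing
```

An empty list means every expected file was found; anything else names what still needs to be downloaded or indexed.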

Pipeline inputs

If you use ENA data, follow the instructions. Otherwise, download your data and keep the format recommended in the input descriptions below.

samplesheet.csv

Each row describes one dataset: an identifier for the row (id), the path to the assembly file (assembly), and the paths to the raw read files (fastq_1 and fastq_2). The assembly_accession column holds the accession of the associated assembly.

id	assembly	fastq_1	fastq_2	assembly_accession
SRR1631112	/path/to/ERZ1031893.fasta	/path/to/SRR1631112_1.fastq.gz	/path/to/SRR1631112_2.fastq.gz	ERZ1031893

There is an example here.
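
A quick sanity check of the samplesheet can catch header typos and empty cells before the pipeline starts. The sketch below is illustrative only: `check_samplesheet` is a hypothetical name, the file is assumed to be comma-separated (as the `.csv` extension suggests), and `fastq_2` is deliberately left unchecked because this document does not say whether it may be empty.

```python
import csv
import io

# Column names taken from the example row above.
SAMPLESHEET_COLUMNS = ["id", "assembly", "fastq_1", "fastq_2", "assembly_accession"]
# Assumption: these columns must be filled; fastq_2 is not checked.
CHECKED_COLUMNS = ["id", "assembly", "fastq_1", "assembly_accession"]


def check_samplesheet(text):
    """Parse comma-separated samplesheet text and return a list of problems."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != SAMPLESHEET_COLUMNS:
        return ["unexpected header: %r" % (reader.fieldnames,)]
    problems = []
    # Data rows start on line 2, after the header.
    for line_no, row in enumerate(reader, start=2):
        for column in CHECKED_COLUMNS:
            if not (row[column] or "").strip():
                problems.append("line %d: empty %s" % (line_no, column))
    return problems
```

Calling `check_samplesheet(open("samplesheet.csv").read())` returns an empty list when the header matches and the checked cells are filled.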

assembly_software.tsv

id: run accession.
assembly_software: the tool that was used to assemble the run.

id	assembly_software
SRR1631112	Assembler_vVersion
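
When the assembler is already known for each run, the file can be generated rather than typed by hand. This is a minimal sketch; `assembly_software_tsv` is a hypothetical helper, and the only things taken from this document are the two column names and the tab-separated layout.

```python
import csv
import io


def assembly_software_tsv(software_by_run):
    """Render a {run accession: assembler} mapping as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "assembly_software"])
    # Sort by run accession so the output is stable between calls.
    for run_id, software in sorted(software_by_run.items()):
        writer.writerow([run_id, software])
    return buffer.getvalue()
```

Writing the returned string to `assembly_software.tsv` gives a file in the format shown above.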

Metagenome

Manually choose the most appropriate metagenome from https://www.ebi.ac.uk/ena/browser/view/408169?show=tax-tree.
For example, marine metagenome

Environment information

Comma-separated environment parameters in the format: "environment_biome,environment_feature,environment_material"
For example, marine sediments,subtropical gyre,sinking marine particle
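
Because the three fields are joined by commas, a stray comma inside one of them would shift the others. The sketch below builds the value from its parts; `biomes_argument` is a hypothetical helper, and rejecting commas inside a field is an assumption inferred from the format, not a documented rule.

```python
def biomes_argument(biome, feature, material):
    """Join the three environment fields into one comma-separated value."""
    parts = [biome.strip(), feature.strip(), material.strip()]
    for part in parts:
        # Assumption: a field may not be empty or contain the separator.
        if not part or "," in part:
            raise ValueError("field must be non-empty and contain no comma: %r" % part)
    return ",".join(parts)
```

For the example above, `biomes_argument("marine sediments", "subtropical gyre", "sinking marine particle")` reproduces the string as written.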

Run pipeline

nextflow run ebi-metagenomics/genomes-generation \
-profile <profile(s)> \
--input samplesheet.csv \
--assembly_software_file assembly_software.tsv \
--metagenome "chosen metagenome" \
--biomes "chosen biome,chosen feature,chosen material" \
--outdir <full path to output directory>
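
When the pipeline is launched from a script, building the command as an argument list avoids quoting mistakes in values that contain spaces, such as the metagenome and biome strings. The sketch below only assembles and prints the command; `build_command` is a hypothetical helper, and `my_profile` and `/path/to/results` are placeholders, not values defined by the pipeline.

```python
import shlex


def build_command(profile, samplesheet, software_tsv, metagenome, biomes, outdir):
    """Assemble the nextflow invocation shown above as an argument list."""
    return [
        "nextflow", "run", "ebi-metagenomics/genomes-generation",
        "-profile", profile,
        "--input", samplesheet,
        "--assembly_software_file", software_tsv,
        "--metagenome", metagenome,
        "--biomes", biomes,
        "--outdir", outdir,
    ]


command = build_command(
    profile="my_profile",        # placeholder: use the profile(s) for your site
    samplesheet="samplesheet.csv",
    software_tsv="assembly_software.tsv",
    metagenome="marine metagenome",
    biomes="marine sediments,subtropical gyre,sinking marine particle",
    outdir="/path/to/results",   # placeholder: full path to the output directory
)
# shlex.join quotes arguments that contain spaces, so the line can be pasted into a shell.
print(shlex.join(command))
```

The same list can be passed to `subprocess.run(command, check=True)` to start the run directly.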

Optional arguments

--xlarge (default=false): use the high-memory config for big studies. A study may be considered big if it has more than 300 runs. The option is also worth trying for studies with fewer runs that are very deeply sequenced.
--skip_preprocessing_input (default=false): skip the input pre-processing step that renames ERZ-fasta files to ERR-run accessions. Useful when your data does not come from ENA.
--skip_decontamination (default=false): skip decontamination against the reference genome
--skip_prok (default=false): do not generate prokaryotic MAGs
--skip_euk (default=false): do not generate eukaryotic MAGs
--merge_pairs (default=false): merge paired-end reads with fastp during the QC step
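
The guidance above can be turned into a small rule of thumb for choosing flags. The sketch below is an illustration under stated assumptions: `suggested_flags` is a hypothetical helper, the 300-run threshold is the only numeric rule taken from this document, and sequencing depth is not considered, so `--xlarge` may still be worth adding by hand for deeply sequenced studies. The flags are emitted without a value, which Nextflow interprets as true.

```python
def suggested_flags(n_runs, from_ena=True, want_prok=True, want_euk=True):
    """Suggest optional pipeline flags from a few facts about the study."""
    flags = []
    if n_runs > 300:
        # More than 300 runs: the study may be considered big.
        flags.append("--xlarge")
    if not from_ena:
        # Renaming of ERZ/ERR accessions only makes sense for ENA data.
        flags.append("--skip_preprocessing_input")
    if not want_prok:
        flags.append("--skip_prok")
    if not want_euk:
        flags.append("--skip_euk")
    return flags
```

The returned list can be appended to the arguments of the `nextflow run` command.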

Pipeline results

Upload

Use final_table_for_uploader.tsv to upload your MAGs to ENA with the uploader.

Example of final_table_for_uploader.tsv.

! Do not modify the existing output structure, because that TSV file contains full paths to your genomes.

Outputs

For a more detailed description of the different output files, see the outputs file.

Citation

If you use this pipeline, please make sure to cite all the software it uses.
