Mobilome Annotation Pipeline (former MoMofy)

Bacteria can acquire genetic material through horizontal gene transfer, allowing them to rapidly adapt to changing environmental conditions. These mobile genetic elements can be classified into three main categories: plasmids, phages, and integrative elements. Plasmids are mostly extrachromosomal; phages can be extrachromosomal or integrated as temperate phages (prophages); whereas integrons are stably inserted in the chromosome. Autonomous elements are those integrative elements capable of excising themselves from the chromosome and reintegrating elsewhere. They can use a transposase (like insertion sequences and transposons) or an integrase/excisionase (like ICEs and IMEs).

The Mobilome Annotation Pipeline is a wrapper that integrates the output of different tools designed for the prediction of plasmids, phages, insertion sequences, integrative mobile genetic elements such as ICEs, IMEs, integrons, and non-autonomous mobile genetic elements in prokaryotic genomes and metagenomes. The output is a GFF3 file with the mobilome annotation.

Note

This is an intermediate release of the MAP, positioned between two major development milestones.
We are currently developing a new subworkflow for gene-level annotation that will integrate multiple tools for antimicrobial resistance detection, virulence factor identification, and other ecologically relevant functions.

In the meantime, this version does not:

Run PROKKA
Run AMRFinderPlus

You can still provide a GFF file generated with your favourite annotation tool as input; MAP will append the mobilome results to it.

Workflow

This workflow has the following main blocks of analysis:

Preprocessing: Rename and filter contigs, then annotate CDSs with Prodigal.
Prediction: Run geNomad, ICEfinder2-lite, IntegronFinder, ISEScan, and the compositional outliers detection subworkflow (on contigs > 100kb).
Integration: Parse and integrate the predictions. In this step, optional results of VIRify v3.0.0 can be incorporated. MGEs shorter than 500 bp and predictions with no genes are discarded.
Postprocessing: Write the compressed mobilome GFF file (mobilome.gff.gz) and the mobilome FASTA file (mobilome.fasta). The output mobilome.gff.gz is validated as part of the postprocessing.

When user_proteins_gff is provided, three more GFF files will be generated:

user_mobilome_clean.gff.gz: mobilome + its associated CDSs
user_mobilome_extra.gff.gz: mobilome + ViPhOGs annotated genes (note that ViPhOG annotation is generated by VIRify)
user_mobilome_full.gff.gz: mobilome + any other annotation on user GFF

Install and downloading dependencies

The only prerequisites for running it are Nextflow and a container tool such as Docker or Singularity, since all tools use pre-built containers.

To get a copy of the Mobilome Annotation Pipeline, clone this repository:

$ git clone https://github.com/EBI-Metagenomics/mobilome-annotation-pipeline.git

The first time you run the pipeline you will need to set up the following databases:

Download and extract the geNomad database

wget https://zenodo.org/records/14886553/files/genomad_db_v1.9.tar.gz
tar -xvf genomad_db_v1.9.tar.gz

Download and extract the databases to run ICEfinder2-lite

wget https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/icefinder2lite/icf2_dbs.tar.gz
tar -xvf icf2_dbs.tar.gz

Once the download is complete, you can move the files to any suitable location. Then pass the paths to all your databases to the pipeline as parameters during execution, using the corresponding flags. For instance:

nextflow run /PATH/mobilome-annotation-pipeline/main.nf --input samplesheet.csv --genomad_db /FULL/PATH/TO/genomad_db_v1.9

Alternatively, we recommend creating a config file with the following paths and passing it to the pipeline during execution using -c my_paths.config

my_paths.config

/*
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     Config to store my DB paths and names
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

params {
    genomad_db                   = "/FULL/PATH/TO/genomad_db_v1.9"
    icefinder_macsyfinder_models = "/FULL/PATH/TO/icf2_dbs/macsydata/"
    icefinder_hmm_models         = "/FULL/PATH/TO/icf2_dbs/icehmm/icescan.hmm"
    icefinder_prokka_uniprot_db  = "/FULL/PATH/TO/icf2_dbs/icefinder_prokka_uniprot/"
}
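As a convenience, the params block above could be generated from the two extracted database directories. The following is a minimal sketch, not part of the pipeline: the function name `render_paths_config` is hypothetical, and it assumes the `icf2_dbs` archive extracts to the sub-paths shown in the example config (`macsydata/`, `icehmm/icescan.hmm`, `icefinder_prokka_uniprot/`).

```python
from pathlib import PurePosixPath


def render_paths_config(genomad_db: str, icf2_dbs: str) -> str:
    """Render a Nextflow params block holding the four database paths.

    Hypothetical helper: the sub-directory layout under icf2_dbs is
    assumed to match the example my_paths.config in this README.
    """
    icf2 = PurePosixPath(icf2_dbs)
    params = {
        "genomad_db": str(PurePosixPath(genomad_db)),
        "icefinder_macsyfinder_models": f"{icf2 / 'macsydata'}/",
        "icefinder_hmm_models": str(icf2 / "icehmm" / "icescan.hmm"),
        "icefinder_prokka_uniprot_db": f"{icf2 / 'icefinder_prokka_uniprot'}/",
    }
    # Align the '=' signs the same way the hand-written example does.
    width = max(len(key) for key in params)
    lines = ["params {"]
    lines += [f'    {key.ljust(width)} = "{value}"' for key, value in params.items()]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_paths_config("/FULL/PATH/TO/genomad_db_v1.9", "/FULL/PATH/TO/icf2_dbs"))
```

The printed text can be saved as my_paths.config and passed with -c as described above.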

Inputs

To run the Mobilome Annotation Pipeline on multiple samples, prepare a samplesheet with your input data that looks like the following example.

samplesheet.csv:

sample,assembly,user_proteins_gff,virify_gff
assembly,/PATH/assembly.fasta,,
assembly_proteins,/PATH/assembly.fasta,/PATH/proteins.gff,
assembly_proteins_virify,/PATH/assembly.fasta,/PATH/proteins.gff,/PATH/virify_out.gff

Each row represents a sample, and sample names must be unique. The minimal input is the (meta)genome assembly in FASTA format, which can be compressed (.gz). The user proteins GFF file can also be compressed (.gz).
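Before launching a run, it can help to check the samplesheet against the rules above (expected header, unique sample names, an assembly on every row). The sketch below is illustrative only: `check_samplesheet` is a hypothetical helper, not a function shipped with the pipeline, and the pipeline's own validation may be stricter or differ in detail.

```python
import csv
import io

# Column names taken from the example samplesheet in this README.
EXPECTED_COLUMNS = ["sample", "assembly", "user_proteins_gff", "virify_gff"]


def check_samplesheet(text: str) -> list[dict]:
    """Return the parsed rows, or raise ValueError on the first problem found."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample = (row["sample"] or "").strip()
        if not sample:
            raise ValueError(f"line {line_no}: empty sample name")
        if sample in seen:
            raise ValueError(f"line {line_no}: duplicated sample name {sample!r}")
        if not (row["assembly"] or "").strip():
            raise ValueError(f"line {line_no}: assembly path is required")
        seen.add(sample)
        rows.append(row)
    return rows


if __name__ == "__main__":
    with open("samplesheet.csv", newline="") as handle:
        print(f"{len(check_samplesheet(handle.read()))} samples look fine")
```

The optional columns (user_proteins_gff and virify_gff) are left unchecked here because they may legitimately be empty.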

Basic run:

$ nextflow run /PATH/mobilome-annotation-pipeline/main.nf --input samplesheet.csv -c my_paths.config -profile docker,singularity

Note that the final output in GFF format is created by integrating the predicted MGEs. If you have your own protein prediction file and want to append the mobilome annotation to it, provide the path to the uncompressed gff file in the samplesheet.csv. This file will be used to generate a user_mobilome_full.gff file containing the mobilome plus any feature existing in your original input.

Note that the VIRify output provided in the samplesheet under virify_gff must be generated independently with VIRify v3.0.0 or later.

A summary of the mobilome annotation pipeline boolean flags:

gff_validation <True>:	Validation step for the final GFF file containing only the mobile genetic elements (mobilome_nogenes.gff).
publish_all    <True>:	Write the preprocessing and prediction results in the output folder. When set to `False`, only the final outputs after integration will be written.

Outputs

Results will be written by default in the results directory unless the --outdir option is used. There, you will find the following outputs:

sample/
├── sample_discarded_mge.txt
├── sample_mobilome.fasta
├── sample_overlap_report.txt
├── gff
│   ├── sample_mobilome.gff.gz
│   ├── sample_[user]_mobilome_clean.gff.gz
│   ├── sample_[user]_mobilome_extra.gff.gz
│   └── sample_[user]_mobilome_full.gff.gz
├── prediction
│   ├── genomad_results
│   │   ├── 5kb_contigs_plasmid_summary.tsv
│   │   └── 5kb_contigs_virus_summary.tsv
│   ├── icefinder_results
│   │   ├── sample_refined.tsv
│   │   └── sample_rejected.tsv
│   ├── integronfinder_results
│   │   ├── 5kb_contigs.summary
│   │   └── contig_1.gbk
│   ├── isescan_results
│   │   └── sample_1kb_contigs.fasta.tsv
│   ├── virify_filter
│   │   └── sample_virify_hq.gff
│   └── compositional_outliers_results
│       └── sample.merged.bed
└── preprocessing
    ├── sample_1kb_contigs.fasta
    ├── sample_5kb_contigs.fasta
    ├── sample_100kb_contigs.fasta
    └── sample_contigID.map

If the --publish_all option is set to False (default is True), the output directory structure will look like:

sample/
├── sample_discarded_mge.txt
├── sample_mobilome.fasta
├── sample_overlap_report.txt
└── gff
    ├── sample_mobilome.gff.gz
    ├── sample_[user]_mobilome_clean.gff.gz
    ├── sample_[user]_mobilome_extra.gff.gz
    └── sample_[user]_mobilome_full.gff.gz

The file discarded_mge.txt contains a list of predictions that were discarded, along with the reason for their exclusion. Possible reasons include:

'mge < 500bp' Discarded by length.
'no_cds' If there are no genes encoded in the prediction.
'tRNAs_in_window' There are tRNA genes in a compositional outlier
'CO_overlap_with_MGE' Compositional outlier overlapping other MGE
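The first two reasons can be illustrated with a small filter. This is a sketch under stated assumptions, not the pipeline's code: the `Prediction` class and `discard_reason` function are hypothetical, coordinates are assumed to be 1-based and inclusive (the GFF convention), and the order in which the checks are applied is a guess.

```python
from dataclasses import dataclass
from typing import Optional

MIN_MGE_LENGTH = 500  # bp, threshold stated in the Integration step


@dataclass
class Prediction:
    contig: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive
    n_cds: int  # number of genes encoded in the prediction


def discard_reason(pred: Prediction) -> Optional[str]:
    """Return the discard label for a prediction, or None if it is kept."""
    length = pred.end - pred.start + 1
    if length < MIN_MGE_LENGTH:
        return "mge < 500bp"
    if pred.n_cds == 0:
        return "no_cds"
    return None


if __name__ == "__main__":
    for pred in (Prediction("contig_1", 1, 499, 3), Prediction("contig_1", 1, 500, 0)):
        print(pred, "->", discard_reason(pred))
```

The two compositional-outlier reasons depend on tRNA and overlap information that is not described in enough detail here to sketch.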

The file overlap_report.txt is a report of long MGEs with overlapping coordinates. No predictions are discarded in this case.

The mobilome prediction IDs are built as follows:

Contig ID
MGE type: flanking_site recombination_site prophage viral_sequence plasmid phage_plasmid integron conjugative_integron insertion_sequence compositional_outlier
Start and end coordinates separated by ':'

Example:

>contig_id|mge_type-start:end
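For downstream scripts, the ID layout can be built and parsed with a few string operations. This is a minimal sketch with hypothetical function names; it assumes that MGE type labels never contain '-' or '|', which holds for the types listed above, and it splits from the right so that contig IDs containing those characters are still handled.

```python
def build_mge_id(contig_id: str, mge_type: str, start: int, end: int) -> str:
    """Compose an ID in the contig_id|mge_type-start:end layout."""
    return f"{contig_id}|{mge_type}-{start}:{end}"


def parse_mge_id(mge_id: str) -> tuple[str, str, int, int]:
    """Split an ID (with or without the leading '>') into its four parts."""
    # Split from the right: the MGE type and coordinates are assumed
    # to be free of '|' and '-', while the contig ID may not be.
    contig_id, _, rest = mge_id.lstrip(">").rpartition("|")
    mge_type, _, coords = rest.rpartition("-")
    start, _, end = coords.partition(":")
    return contig_id, mge_type, int(start), int(end)


if __name__ == "__main__":
    header = ">" + build_mge_id("contig_1", "prophage", 100, 2500)
    print(header)
    print(parse_mge_id(header))
```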

When user_proteins_gff is provided, a CDS with a coverage >= 0.9 within the boundaries of a predicted MGE is considered part of the mobilome and is labelled accordingly in the attributes field, under the key location, in the mobilome_full.gff output file.
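One plausible reading of this rule is that coverage is the fraction of the CDS length that falls inside the MGE boundaries. The sketch below implements that reading with hypothetical function names and 1-based inclusive coordinates; the pipeline's actual definition of coverage may differ.

```python
def cds_coverage(cds_start: int, cds_end: int, mge_start: int, mge_end: int) -> float:
    """Fraction of the CDS covered by the MGE interval (assumed definition)."""
    overlap = min(cds_end, mge_end) - max(cds_start, mge_start) + 1
    if overlap <= 0:
        return 0.0  # the two intervals do not overlap
    return overlap / (cds_end - cds_start + 1)


def in_mobilome(cds_start: int, cds_end: int, mge_start: int, mge_end: int,
                threshold: float = 0.9) -> bool:
    """True when the CDS meets the coverage threshold for this MGE."""
    return cds_coverage(cds_start, cds_end, mge_start, mge_end) >= threshold


if __name__ == "__main__":
    # A 100 bp CDS with 90 bp inside the MGE sits exactly at the threshold.
    print(in_mobilome(101, 200, 111, 1000))
```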

The labels used in the Type column of the GFF file correspond to the following nomenclature according to the Sequence Ontology resource when possible:

Type in gff file	Sequence ontology ID	Element description	Reporting tool
insertion_sequence	SO:0000973	Insertion sequence	ISEScan
inverted_repeat_element	SO:0000481	Inverted Repeat (IR) flanking insertion sequences or compositional outliers	ISEScan, MAP
integron	SO:0000365	Integrative mobilizable element	IntegronFinder, ICEfinder
attC_site	SO:0000950	Integration site of DNA integron	IntegronFinder
conjugative_integron	SO:0000371	Integrative Conjugative Element	ICEfinder
direct_repeat	SO:0000314	Flanking regions on mobilizable elements	ICEfinder, MAP
prophage	SO:0001006	Temperate phage	geNomad, VIRify
viral_sequence	SO:0001041	Viral genome fragment	geNomad, VIRify
plasmid	SO:0000155	Plasmid	geNomad
compositional_outlier		Non-autonomous elements detected in contigs > 100 kb long	MAP
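To get a quick overview of which of these types appear in a result, the Type column (the third GFF column) can be tallied directly from the compressed output. This is a minimal sketch using only the Python standard library; `count_feature_types` is a hypothetical helper and the file path in the usage line is an example.

```python
import gzip
from collections import Counter


def count_feature_types(gff_gz_path: str) -> Counter:
    """Count occurrences of each value in the Type column of a gzipped GFF3."""
    counts = Counter()
    with gzip.open(gff_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # an embedded sequence section, if present, ends the features
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    for feature_type, n in count_feature_types("sample/gff/sample_mobilome.gff.gz").most_common():
        print(f"{feature_type}\t{n}")
```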

Tests

Nextflow tests are executed with nf-test. It takes around 9 minutes to run.

Run:

$ cd mobilome-annotation-pipeline/
$ nf-test test --profile test,singularity

Development Tasks

This project uses Task to manage common development tasks and uv for Python virtual environment management. The available tasks are:

# Setup and dependencies
$ task setup-venv          # Bootstrap Python virtual environment with uv and install dependencies

# Testing
$ task test                # Run all tests with pytest
$ task test-verbose        # Run tests with verbose output
$ task test-coverage       # Run tests with coverage report (generates HTML report)
$ task test-specific -- <pattern>  # Run specific test file or pattern

# Utilities
$ task clean               # Clean up virtual environment

To see all available tasks, run:

$ task --list

License and Attribution

ICEfinder2 Attribution

This pipeline includes scripts derived from or inspired by ICEfinder2 algorithms and methods. The following scripts contain code adapted from ICEfinder2:

bin/ice_boundary_refinement.py - ICE boundary refinement and direct repeat processing
bin/map_tools/icefinder_process.py - ICE result processing and data formatting
bin/prescan_to_fasta.py - ICE prescanning and candidate detection methods

ICEfinder2 License: These algorithms are used under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Original work: ICEfinder2 (http://creativecommons.org/licenses/by-nc-sa/4.0/)
Modifications: Licensed under Apache 2.0 by EMBL-EBI

Citation

The Mobilome Annotation Pipeline parses and integrates the output of the following tools and DBs sorted alphabetically:

geNomad v1.11.1 Camargo et al., Nature Biotechnology, 2023
ICEfinder v2.0 Wang et al., Nucleic Acids Res, 2024
IntegronFinder2 v2.0.6 Néron et al., Microorganisms, 2022
ISEScan v1.7.3 Xie et al., Bioinformatics, 2017
Prodigal v2.6.3 Hyatt et al., Bioinformatics, 2010
VIRify v3.0.0 Rangel-Pineros et al., PLoS Comput Biol, 2023
