Bacteria can acquire genetic material through horizontal gene transfer, allowing them to rapidly adapt to changing environmental conditions. The mobile genetic elements (MGEs) involved can be classified into three main categories: plasmids, phages, and integrative elements. Plasmids are mostly extrachromosomal; phages can be extrachromosomal or integrated as temperate phages (prophages); integrons are stably inserted in the chromosome. Autonomous elements are integrative elements capable of excising themselves from the chromosome and reintegrating elsewhere. They can use a transposase (like insertion sequences and transposons) or an integrase/excisionase (like ICEs and IMEs).
The Mobilome Annotation Pipeline is a wrapper that integrates the output of different tools designed for the prediction of plasmids, phages, insertion sequences, integrative mobile genetic elements such as ICEs, IMEs, integrons, and non-autonomous mobile genetic elements in prokaryotic genomes and metagenomes. The output is a GFF3 file with the mobilome annotation.
Note
This is an intermediate release of the MAP, positioned between two major development milestones.
We are currently developing a new subworkflow for gene-level annotation that will integrate multiple tools for antimicrobial resistance detection, virulence factor identification, and other ecologically relevant functions.
In the meantime, this version does not:
- Run PROKKA
- Run AMRFinderPlus
You can still provide as input a GFF file generated with your favourite annotation tool, which MAP will use to append the mobilome results.
This workflow has the following main blocks of analysis:
- Preprocessing: Rename and filter contigs, and annotate CDSs with Prodigal.
- Prediction: Run geNomad, ICEfinder2-lite, IntegronFinder, ISEScan, and the compositional outliers detection subworkflow (on contigs > 100kb).
- Integration: Parse and integrate the predictions. In this step, optional results of VIRify v3.0.0 can be incorporated. MGEs shorter than 500 bp and predictions with no genes are discarded.
- Postprocessing: Write the compressed mobilome GFF file (mobilome.gff.gz) and the mobilome FASTA file (mobilome.fasta). The output mobilome.gff.gz is validated as part of the postprocessing.
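The discard rule of the Integration step can be illustrated with a short Python sketch. This is not the pipeline's own code: the `Prediction` record, the function names, the 1-based inclusive coordinates, and the order in which the two checks run are all assumptions made for illustration. Only the 500 bp threshold and the reason labels come from this README.

```python
from dataclasses import dataclass

MIN_MGE_LENGTH = 500  # bp, threshold stated in the Integration step


@dataclass
class Prediction:
    """Hypothetical record for one predicted MGE (1-based, inclusive coordinates)."""
    contig: str
    mge_type: str
    start: int
    end: int
    n_genes: int


def discard_reason(pred):
    """Return a reason label if the prediction would be discarded, else None."""
    length = pred.end - pred.start + 1
    if length < MIN_MGE_LENGTH:
        return "mge < 500bp"
    if pred.n_genes == 0:
        return "no_cds"
    return None


def split_predictions(preds):
    """Separate kept predictions from (prediction, reason) pairs that are discarded."""
    kept, discarded = [], []
    for pred in preds:
        reason = discard_reason(pred)
        if reason is None:
            kept.append(pred)
        else:
            discarded.append((pred, reason))
    return kept, discarded
```

A prediction spanning 1 to 499 would be dropped by length, while one spanning 1 to 500 with no genes would be dropped as `no_cds`.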
When user_proteins_gff is provided, three more GFF files will be generated:
- user_mobilome_clean.gff.gz: mobilome + its associated CDSs
- user_mobilome_extra.gff.gz: mobilome + ViPhOGs annotated genes (note that ViPhOG annotation is generated by VIRify)
- user_mobilome_full.gff.gz: mobilome + any other annotation on user GFF
The only prerequisites for running it are Nextflow and a container tool such as Docker or Singularity, since all tools use pre-built containers.
To get a copy of the Mobilome Annotation Pipeline, clone this repo:
$ git clone https://github.com/EBI-Metagenomics/mobilome-annotation-pipeline.git
The first time you run the pipeline you will need to set up the following databases:
- Download and extract the geNomad database
wget https://zenodo.org/records/14886553/files/genomad_db_v1.9.tar.gz
tar -xvf genomad_db_v1.9.tar.gz
- Download and extract the databases to run ICEfinder2-lite
wget https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/icefinder2lite/icf2_dbs.tar.gz
tar -xvf icf2_dbs.tar.gz
Once the download is complete, you can move the files to any suitable location. Then pass the paths to your databases to the pipeline as parameters during execution, using the corresponding flags. For instance:
nextflow run /PATH/mobilome-annotation-pipeline/main.nf --input samplesheet.csv --genomad_db /FULL/PATH/TO/genomad_db_v1.9
Alternatively, we recommend creating a config file with the following paths and passing it to the pipeline during execution using -c my_paths.config
my_paths.config
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Config to store my DB paths and names
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/
params {
genomad_db = "/FULL/PATH/TO/genomad_db_v1.9"
icefinder_macsyfinder_models = "/FULL/PATH/TO/icf2_dbs/macsydata/"
icefinder_hmm_models = "/FULL/PATH/TO/icf2_dbs/icehmm/icescan.hmm"
icefinder_prokka_uniprot_db = "/FULL/PATH/TO/icf2_dbs/icefinder_prokka_uniprot/"
}
To run the Mobilome Annotation Pipeline on multiple samples, prepare a samplesheet with your input data that looks like the following example.
samplesheet.csv:
sample,assembly,user_proteins_gff,virify_gff
assembly,/PATH/assembly.fasta,,
assembly_proteins,/PATH/assembly.fasta,/PATH/proteins.gff,
assembly_proteins_virify,/PATH/assembly.fasta,/PATH/proteins.gff,/PATH/virify_out.gff
Each row represents a sample. Note that sample names have to be unique. The minimal input is the (meta)genome assembly in FASTA format, which can be compressed (.gz). The user proteins GFF file can also be compressed (.gz).
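Before launching a run, the two samplesheet constraints stated above (unique sample names, an assembly for every row) can be checked with a few lines of Python. This is a minimal sketch and not part of the pipeline: the function name `check_samplesheet` and the wording of the messages are hypothetical, and the pipeline may apply further validation of its own.

```python
import csv
import io

# Columns the minimal input depends on; the other two are optional per row.
REQUIRED_COLUMNS = ["sample", "assembly"]


def check_samplesheet(handle):
    """Return a list of problems found in a samplesheet; an empty list means none."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        name = (row["sample"] or "").strip()
        if not name:
            problems.append(f"line {line_no}: empty sample name")
        elif name in seen:
            problems.append(f"line {line_no}: duplicate sample name '{name}'")
        seen.add(name)
        if not (row["assembly"] or "").strip():
            problems.append(f"line {line_no}: assembly path is required")
    return problems


if __name__ == "__main__":
    example = io.StringIO(
        "sample,assembly,user_proteins_gff,virify_gff\n"
        "assembly,/PATH/assembly.fasta,,\n"
    )
    print(check_samplesheet(example))
```

For a real file, pass an open handle instead of the in-memory example, for instance `check_samplesheet(open("samplesheet.csv", newline=""))`.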
Basic run:
$ nextflow run /PATH/mobilome-annotation-pipeline/main.nf --input samplesheet.csv -c my_paths.config -profile docker,singularity
Note that the final output in GFF format is created by integrating the predicted MGEs. If you have your own protein prediction file and want to append the mobilome annotation to it, provide the path to the uncompressed gff file in the samplesheet.csv. This file will be used to generate a user_mobilome_full.gff file containing the mobilome plus any feature existing in your original input.
Note that the VIRify output provided in the samplesheet under virify_gff has to be generated independently with VIRify version >=3.0.0.
A summary of the mobilome annotation pipeline boolean flags:
gff_validation <True>: Validate the final GFF file containing only the mobile genetic elements (mobilome_nogenes.gff).
publish_all <True>: Write the preprocessing and prediction results in the output folder. When set to `False`, only final outputs after integration will be written.
Results will be written by default in the results directory unless the --outdir option is used. There, you will find the following outputs:
sample/
├── sample_discarded_mge.txt
├── sample_mobilome.fasta
├── sample_overlap_report.txt
├── gff
│ ├── sample_mobilome.gff.gz
│ ├── sample_[user]_mobilome_clean.gff.gz
│ ├── sample_[user]_mobilome_extra.gff.gz
│ └── sample_[user]_mobilome_full.gff.gz
├── prediction
│ ├── genomad_results
│ │ ├── 5kb_contigs_plasmid_summary.tsv
│ │ └── 5kb_contigs_virus_summary.tsv
│ ├── icefinder_results
│ │ ├── sample_refined.tsv
│ │ └── sample_rejected.tsv
│ ├── integronfinder_results
│ │ ├── 5kb_contigs.summary
│ │ └── contig_1.gbk
│ ├── isescan_results
│ │ └── sample_1kb_contigs.fasta.tsv
│ ├── virify_filter
│ │ └── sample_virify_hq.gff
│ └── compositional_outliers_results
│ └── sample.merged.bed
└── preprocessing
  ├── sample_1kb_contigs.fasta
  ├── sample_5kb_contigs.fasta
  ├── sample_100kb_contigs.fasta
  └── sample_contigID.map
If the --publish_all option is set to FALSE (default is TRUE), the output directory structure will look like:
sample/
├── sample_discarded_mge.txt
├── sample_mobilome.fasta
├── sample_overlap_report.txt
└── gff
  ├── sample_mobilome.gff.gz
  ├── sample_[user]_mobilome_clean.gff.gz
  ├── sample_[user]_mobilome_extra.gff.gz
  └── sample_[user]_mobilome_full.gff.gz
The file discarded_mge.txt contains a list of predictions that were discarded, along with the reason for their exclusion. Possible reasons include:
- 'mge < 500bp' Discarded by length.
- 'no_cds' If there are no genes encoded in the prediction.
- 'tRNAs_in_window' There are tRNA genes in a compositional outlier
- 'CO_overlap_with_MGE' Compositional outlier overlapping other MGE
The file overlapping_integrons.txt is a report of long-MGEs with overlapping coordinates. No predictions are discarded in this case.
The mobilome prediction IDs are built as follows:
- Contig ID
- MGE type: flanking_site, recombination_site, prophage, viral_sequence, plasmid, phage_plasmid, integron, conjugative_integron, insertion_sequence, compositional_outlier
- Start and end coordinates separated by ':'
Example:
>contig_id|mge_type-start:end
When user_gff or -run_prokka is used, a CDS with a coverage >= 0.9 within the boundaries of a predicted MGE is considered part of the mobilome and labelled accordingly in the attributes field under the key location in the mobilome_full.gff output file.
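The ID layout and the coverage rule above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and coverage is assumed to mean the fraction of the CDS length that falls inside the MGE boundaries, which this README does not define explicitly.

```python
def parse_mobilome_id(header):
    """Split '>contig_id|mge_type-start:end' into (contig, mge_type, start, end)."""
    header = header.lstrip(">").strip()
    # Split from the right so contig IDs containing '|' or '-' stay intact.
    contig, rest = header.rsplit("|", 1)
    mge_type, coords = rest.rsplit("-", 1)
    start, end = coords.split(":")
    return contig, mge_type, int(start), int(end)


def cds_coverage(cds_start, cds_end, mge_start, mge_end):
    """Fraction of the CDS that lies within the MGE (assumed definition)."""
    overlap = min(cds_end, mge_end) - max(cds_start, mge_start) + 1
    if overlap <= 0:
        return 0.0
    return overlap / (cds_end - cds_start + 1)


def in_mobilome(cds_start, cds_end, mge_start, mge_end, threshold=0.9):
    """Apply the >= 0.9 coverage rule stated in the text."""
    return cds_coverage(cds_start, cds_end, mge_start, mge_end) >= threshold


if __name__ == "__main__":
    contig, mge_type, start, end = parse_mobilome_id(">contig_1|prophage-100:5000")
    print(contig, mge_type, start, end)
    print(in_mobilome(200, 300, start, end))
```

Splitting from the right is a deliberate choice here: the MGE type labels listed above use underscores, so the last hyphen in the ID always separates the type from the coordinates.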
The labels used in the Type column of the GFF file correspond to the following nomenclature according to the Sequence Ontology resource when possible:
| Type in gff file | Sequence ontology ID | Element description | Reporting tool |
|---|---|---|---|
| insertion_sequence | SO:0000973 | Insertion sequence | ISEScan |
| inverted_repeat_element | SO:0000481 | Inverted Repeat (IR) flanking insertion sequences or compositional outliers | ISEScan, MAP |
| integron | SO:0000365 | Integrative mobilizable element | IntegronFinder, ICEfinder |
| attC_site | SO:0000950 | Integration site of DNA integron | IntegronFinder |
| conjugative_integron | SO:0000371 | Integrative Conjugative Element | ICEfinder |
| direct_repeat | SO:0000314 | Flanking regions on mobilizable elements | ICEfinder, MAP |
| prophage | SO:0001006 | Temperate phage | geNomad, VIRify |
| viral_sequence | SO:0001041 | Viral genome fragment | geNomad, VIRify |
| plasmid | SO:0000155 | Plasmid | geNomad |
| compositional_outlier | | Non-autonomous elements detected in contigs > 100 kb long | MAP |
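To get a quick overview of a mobilome GFF file, the labels in the Type column can be tallied with a short Python script. This is a minimal sketch that only assumes the standard nine-column, tab-separated GFF3 layout; the function names are hypothetical, and the `SO_IDS` dictionary simply restates the table above.

```python
import gzip
from collections import Counter

# Sequence Ontology IDs copied from the table above.
SO_IDS = {
    "insertion_sequence": "SO:0000973",
    "inverted_repeat_element": "SO:0000481",
    "integron": "SO:0000365",
    "attC_site": "SO:0000950",
    "conjugative_integron": "SO:0000371",
    "direct_repeat": "SO:0000314",
    "prophage": "SO:0001006",
    "viral_sequence": "SO:0001041",
    "plasmid": "SO:0000155",
}


def count_feature_types(lines):
    """Count values of the GFF3 type column (column 3) over an iterable of lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section, if present, ends the features
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


def count_feature_types_in_file(path):
    """Open a plain or gzip-compressed GFF3 file and count its feature types."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_feature_types(handle)
```

For example, `count_feature_types_in_file("sample_mobilome.gff.gz")` would return a `Counter` keyed by type label, which can then be looked up in `SO_IDS`.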
Nextflow tests are executed with nf-test. It takes around 9 minutes to run.
Run:
$ cd mobilome-annotation-pipeline/
$ nf-test test --profile test,singularity
This project uses Task to manage common development tasks and uv for Python virtual environment management. The available tasks are:
# Setup and dependencies
$ task setup-venv # Bootstrap Python virtual environment with uv and install dependencies
# Testing
$ task test # Run all tests with pytest
$ task test-verbose # Run tests with verbose output
$ task test-coverage # Run tests with coverage report (generates HTML report)
$ task test-specific -- <pattern> # Run specific test file or pattern
# Utilities
$ task clean # Clean up virtual environment
To see all available tasks, run:
$ task --list
This pipeline includes scripts derived from or inspired by ICEfinder2 algorithms and methods. The following scripts contain code adapted from ICEfinder2:
- bin/ice_boundary_refinement.py - ICE boundary refinement and direct repeat processing
- bin/map_tools/icefinder_process.py - ICE result processing and data formatting
- bin/prescan_to_fasta.py - ICE prescanning and candidate detection methods
ICEfinder2 License: These algorithms are used under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
- Original work: ICEfinder2 (http://creativecommons.org/licenses/by-nc-sa/4.0/)
- Modifications: Licensed under Apache 2.0 by EMBL-EBI
The Mobilome Annotation Pipeline parses and integrates the output of the following tools and DBs sorted alphabetically:
- geNomad v1.11.1 Camargo et al., Nature Biotechnology, 2023
- ICEfinder v2.0 Wang et al., Nucleic Acids Res, 2024
- IntegronFinder2 v2.0.6 Néron et al., Microorganisms, 2022
- ISEScan v1.7.3 Xie et al., Bioinformatics, 2017
- Prodigal v2.6.3 Hyatt et al., Bioinformatics, 2010
- VIRify v3.0.0 Rangel-Pineros et al., PLoS Comput Biol, 2023

