Mobilome Annotation Pipeline (former MoMofy)

Bacteria can acquire genetic material through horizontal gene transfer, allowing them to rapidly adapt to changing environmental conditions. The mobile genetic elements (MGEs) involved can be classified into three main categories: plasmids, phages, and integrative elements. Plasmids are mostly extrachromosomal; phages can be extrachromosomal or integrated as temperate phages (prophages); whereas integrons are stably inserted in the chromosome. Autonomous elements are those integrative elements capable of excising themselves from the chromosome and reintegrating elsewhere. They can use a transposase (like insertion sequences and transposons) or an integrase/excisionase (like ICEs and IMEs).

The Mobilome Annotation Pipeline is a wrapper that integrates the output of different tools designed for the prediction of plasmids, phages, insertion sequences, integrative mobile genetic elements such as ICEs, IMEs, integrons, and non-autonomous mobile genetic elements in prokaryotic genomes and metagenomes. The output is a GFF3 file with the mobilome annotation.

Note

This is an intermediate release of the Mobilome Annotation Pipeline (MAP), positioned between two major development milestones.
We are currently developing a new subworkflow for gene-level annotation that will integrate multiple tools for antimicrobial resistance detection, virulence factor identification, and other ecologically relevant functions.

In the meantime, this version does not:

  1. Run PROKKA
  2. Run AMRFinderPlus

You can still provide a GFF file generated with your favourite annotation tool as input; MAP will append the mobilome results to it.

Workflow

This workflow has the following main blocks of analysis:

  • Preprocessing: Rename and filter contigs, and annotate CDSs with Prodigal.
  • Prediction: Run geNomad, ICEfinder2-lite, IntegronFinder, ISEScan, and the compositional outliers detection subworkflow (on contigs > 100kb).
  • Integration: Parse and integrate the predictions. Optional results from VIRify v3.0.0 can be incorporated in this step. MGEs shorter than 500 bp and predictions with no genes are discarded.
  • Postprocessing: Write the compressed mobilome GFF file (mobilome.gff.gz) and the mobilome FASTA file (mobilome.fasta). The output mobilome.gff.gz is validated as part of the postprocessing.
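
The discard rule applied during integration can be sketched in Python as follows. This is only an illustration, not the pipeline's own code: the Prediction record, its fields, and keep_prediction are hypothetical names, and the coordinates are assumed to be 1-based and inclusive as in GFF3. The 500 bp threshold and the no-genes rule come from the description above.

```python
from dataclasses import dataclass

MIN_MGE_LENGTH = 500  # bp, threshold taken from the Integration step description


@dataclass
class Prediction:
    """Hypothetical record for one predicted MGE (not a pipeline class)."""
    contig: str
    start: int    # assumed 1-based, inclusive
    end: int      # assumed 1-based, inclusive
    n_genes: int  # number of genes encoded inside the prediction

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def keep_prediction(pred: Prediction) -> bool:
    """Return True if the prediction is long enough and encodes at least one gene."""
    return pred.length >= MIN_MGE_LENGTH and pred.n_genes > 0


# Made-up predictions, for demonstration only
predictions = [
    Prediction("contig_1", 100, 450, 1),    # 351 bp: too short
    Prediction("contig_1", 1000, 2500, 0),  # long enough, but no genes
    Prediction("contig_2", 1, 5000, 4),     # passes both filters
]
kept = [p for p in predictions if keep_prediction(p)]
print([(p.contig, p.start, p.end) for p in kept])
```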

When user_proteins_gff is provided, three more GFF files will be generated:

  1. user_mobilome_clean.gff.gz: mobilome + its associated CDSs
  2. user_mobilome_extra.gff.gz: mobilome + ViPhOGs annotated genes (note that ViPhOG annotation is generated by VIRify)
  3. user_mobilome_full.gff.gz: mobilome + any other annotation on user GFF

Installing and downloading dependencies

The only prerequisites for running the pipeline are Nextflow and a container tool such as Docker or Singularity, since all tools use pre-built containers.

To get a copy of the Mobilome Annotation Pipeline, clone this repository:

$ git clone https://github.com/EBI-Metagenomics/mobilome-annotation-pipeline.git

The first time you run the pipeline you will need to set up the following databases:

  1. Download and extract the geNomad database:
wget https://zenodo.org/records/14886553/files/genomad_db_v1.9.tar.gz
tar -xvf genomad_db_v1.9.tar.gz
  2. Download and extract the databases needed to run ICEfinder2-lite:
wget https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/icefinder2lite/icf2_dbs.tar.gz
tar -xvf icf2_dbs.tar.gz

Once the downloads are complete, you can move the files to any suitable location. Then pass the path to each database to the pipeline as a parameter during execution, using the corresponding flag. For instance:

nextflow run /PATH/mobilome-annotation-pipeline/main.nf --input samplesheet.csv --genomad_db /FULL/PATH/TO/genomad_db_v1.9

Alternatively, we recommend creating a config file with the following paths and passing it to the pipeline during execution using -c my_paths.config

my_paths.config

/*
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     Config to store my DB paths and names
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

params {
    genomad_db                   = "/FULL/PATH/TO/genomad_db_v1.9"
    icefinder_macsyfinder_models = "/FULL/PATH/TO/icf2_dbs/macsydata/"
    icefinder_hmm_models         = "/FULL/PATH/TO/icf2_dbs/icehmm/icescan.hmm"
    icefinder_prokka_uniprot_db  = "/FULL/PATH/TO/icf2_dbs/icefinder_prokka_uniprot/"
}

Inputs

To run the Mobilome Annotation Pipeline on multiple samples, prepare a samplesheet with your input data that looks like the following example.

samplesheet.csv:

sample,assembly,user_proteins_gff,virify_gff
assembly,/PATH/assembly.fasta,,
assembly_proteins,/PATH/assembly.fasta,/PATH/proteins.gff,
assembly_proteins_virify,/PATH/assembly.fasta,/PATH/proteins.gff,/PATH/virify_out.gff

Each row represents a sample. Note that sample names have to be unique. The minimal input is the (meta)genome assembly in FASTA format, which can be compressed (.gz). The user proteins GFF file can also be compressed (.gz).
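
Before launching a run, it can help to check the samplesheet against the two rules stated above (unique sample names, an assembly for every row). The following is a minimal sketch, not part of the pipeline: check_samplesheet is a hypothetical helper, and treating only the sample and assembly columns as mandatory is an assumption based on the example.

```python
import csv
import io

# Assumption: only these two columns must be filled in for every row
REQUIRED_COLUMNS = ["sample", "assembly"]


def check_samplesheet(handle):
    """Return a list of problems found in a samplesheet (empty list if none)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample = (row["sample"] or "").strip()
        if not sample:
            problems.append(f"line {line_no}: empty sample name")
        elif sample in seen:
            problems.append(f"line {line_no}: duplicate sample name '{sample}'")
        seen.add(sample)
        if not (row["assembly"] or "").strip():
            problems.append(f"line {line_no}: assembly path is required")
    return problems


# Made-up samplesheet with a duplicated sample name and a missing assembly
demo_sheet = io.StringIO(
    "sample,assembly,user_proteins_gff,virify_gff\n"
    "assembly,/PATH/assembly.fasta,,\n"
    "assembly,,,\n"
)
for problem in check_samplesheet(demo_sheet):
    print(problem)
```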

Basic run:

$ nextflow run /PATH/mobilome-annotation-pipeline/main.nf --input samplesheet.csv -c my_paths.config -profile docker

Use -profile singularity instead if Singularity is your container tool.

Note that the final output in GFF format is created by integrating the predicted MGEs. If you have your own protein prediction file and want to append the mobilome annotation to it, provide the path to that GFF file in the user_proteins_gff column of samplesheet.csv. This file will be used to generate a user_mobilome_full.gff.gz file containing the mobilome plus any feature existing in your original input.

Note that the VIRify output provided in the samplesheet under virify_gff has to be generated independently with VIRify version 3.0.0 or later.

A summary of the pipeline's boolean flags (default values in angle brackets):

gff_validation <True>:	Validate the final GFF file containing only the mobile genetic elements (mobilome_nogenes.gff).
publish_all    <True>:	Write the preprocessing and prediction results to the output folder. When set to `False`, only the final outputs produced after integration are written.

Outputs

Results will be written by default in the results directory unless the --outdir option is used. There, you will find the following outputs:

sample/
├── sample_discarded_mge.txt
├── sample_mobilome.fasta
├── sample_overlap_report.txt
├── gff
│   ├── sample_mobilome.gff.gz
│   ├── sample_[user]_mobilome_clean.gff.gz
│   ├── sample_[user]_mobilome_extra.gff.gz
│   └── sample_[user]_mobilome_full.gff.gz
├── prediction
│   ├── genomad_results
│   │   ├── 5kb_contigs_plasmid_summary.tsv
│   │   └── 5kb_contigs_virus_summary.tsv
│   ├── icefinder_results
│   │   ├── sample_refined.tsv
│   │   └── sample_rejected.tsv
│   ├── integronfinder_results
│   │   ├── 5kb_contigs.summary
│   │   └── contig_1.gbk
│   ├── isescan_results
│   │   └── sample_1kb_contigs.fasta.tsv
│   ├── virify_filter
│   │   └── sample_virify_hq.gff
│   └── compositional_outliers_results
│       └── sample.merged.bed
└── preprocessing
    ├── sample_1kb_contigs.fasta
    ├── sample_5kb_contigs.fasta
    ├── sample_100kb_contigs.fasta
    └── sample_contigID.map

If the --publish_all option is set to false (the default is true), the output directory contains only the final outputs:

sample/
├── sample_discarded_mge.txt
├── sample_mobilome.fasta
├── sample_overlap_report.txt
└── gff
    ├── sample_mobilome.gff.gz
    ├── sample_[user]_mobilome_clean.gff.gz
    ├── sample_[user]_mobilome_extra.gff.gz
    └── sample_[user]_mobilome_full.gff.gz

The file sample_discarded_mge.txt lists the predictions that were discarded, along with the reason for their exclusion. Possible reasons include:

  1. 'mge < 500bp': the prediction was discarded because of its length.
  2. 'no_cds': no genes are encoded in the prediction.
  3. 'tRNAs_in_window': there are tRNA genes in a compositional outlier.
  4. 'CO_overlap_with_MGE': the compositional outlier overlaps another MGE.

The file sample_overlap_report.txt is a report of long MGEs with overlapping coordinates. No predictions are discarded in this case.

The mobilome prediction IDs are built from the following parts:

  1. Contig ID
  2. MGE type, one of: flanking_site, recombination_site, prophage, viral_sequence, plasmid, phage_plasmid, integron, conjugative_integron, insertion_sequence, compositional_outlier
  3. Start and end coordinates separated by ':'

Example:

>contig_id|mge_type-start:end
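
If you need to split these IDs back into their parts, for example when reading the headers of mobilome.fasta, something like the following sketch should work. The names MgeId and parse_mge_id are hypothetical, and splitting from the right is an assumption made so that contig IDs containing '|' or '-' are still handled.

```python
from typing import NamedTuple


class MgeId(NamedTuple):
    """Parts of a mobilome prediction ID (hypothetical helper type)."""
    contig: str
    mge_type: str
    start: int
    end: int


def parse_mge_id(header: str) -> MgeId:
    """Parse 'contig_id|mge_type-start:end', with or without a leading '>'."""
    text = header.strip().lstrip(">")
    # Split from the right: the contig ID itself may contain '|' or '-'
    contig, _, rest = text.rpartition("|")
    mge_type, _, coords = rest.rpartition("-")
    start, _, end = coords.partition(":")
    if not (contig and mge_type and start.isdigit() and end.isdigit()):
        raise ValueError(f"not a mobilome prediction ID: {header!r}")
    return MgeId(contig, mge_type, int(start), int(end))


# Made-up ID, for demonstration only
print(parse_mge_id(">contig_1|insertion_sequence-1200:2650"))
```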

When a user GFF is provided (user_proteins_gff), any CDS with a coverage >= 0.9 within the boundaries of a predicted MGE is considered part of the mobilome and is labelled accordingly in the attributes field, under the key location, in the user_mobilome_full.gff.gz output file.
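
One way to read the 0.9 coverage rule is as the fraction of the CDS length that falls inside the MGE boundaries. The sketch below illustrates that interpretation; it is an assumption about the definition, not the pipeline's code, and the function names are hypothetical. Coordinates are taken to be 1-based and inclusive, as in GFF3.

```python
def cds_coverage(cds_start, cds_end, mge_start, mge_end):
    """Fraction of the CDS (1-based, inclusive coordinates) lying inside the MGE."""
    overlap = min(cds_end, mge_end) - max(cds_start, mge_start) + 1
    if overlap <= 0:
        return 0.0  # the CDS and the MGE do not overlap
    return overlap / (cds_end - cds_start + 1)


def in_mobilome(cds_start, cds_end, mge_start, mge_end, threshold=0.9):
    """Return True if the CDS coverage reaches the threshold (0.9 in the text above)."""
    return cds_coverage(cds_start, cds_end, mge_start, mge_end) >= threshold


# A 100 bp CDS (100..199) against an MGE starting at 110: 90 of 100 bases overlap
print(cds_coverage(100, 199, 110, 500), in_mobilome(100, 199, 110, 500))
```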

The labels used in the Type column of the GFF file correspond to the following nomenclature according to the Sequence Ontology resource when possible:

Type in gff file         Sequence ontology ID  Element description                                                          Reporting tool
insertion_sequence       SO:0000973            Insertion sequence                                                           ISEScan
inverted_repeat_element  SO:0000481            Inverted Repeat (IR) flanking insertion sequences or compositional outliers  ISEScan, MAP
integron                 SO:0000365            Integrative mobilizable element                                              IntegronFinder, ICEfinder
attC_site                SO:0000950            Integration site of DNA integron                                             IntegronFinder
conjugative_integron     SO:0000371            Integrative Conjugative Element                                              ICEfinder
direct_repeat            SO:0000314            Flanking regions on mobilizable elements                                     ICEfinder, MAP
prophage                 SO:0001006            Temperate phage                                                              geNomad, VIRify
viral_sequence           SO:0001041            Viral genome fragment                                                        geNomad, VIRify
plasmid                  SO:0000155            Plasmid                                                                      geNomad
compositional_outlier    -                     Non-autonomous elements detected in contigs > 100 kb long                    MAP
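
To get a quick overview of which of these types appear in an output file such as sample_mobilome.gff.gz, you can count the values of the third GFF column. The following is a minimal sketch using only the Python standard library; count_feature_types is a hypothetical helper, and the demo lines are made up to keep the example self-contained.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_feature_types(gff_gz_path):
    """Count the values of the GFF3 type column (column 3) in a gzipped file."""
    counts = Counter()
    with gzip.open(gff_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # an embedded FASTA section, if present, ends the features
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 9:
                counts[fields[2]] += 1
    return counts


# Made-up GFF3 lines, written to a temporary file for demonstration only
demo = (
    "##gff-version 3\n"
    "contig_1\tISEScan\tinsertion_sequence\t100\t1500\t.\t+\t.\tID=is1\n"
    "contig_1\tgeNomad\tplasmid\t1\t9000\t.\t.\t.\tID=p1\n"
    "contig_2\tISEScan\tinsertion_sequence\t10\t1200\t.\t-\t.\tID=is2\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_mobilome.gff.gz")
    with gzip.open(path, "wt") as out:
        out.write(demo)
    demo_counts = count_feature_types(path)

for feature_type, n in sorted(demo_counts.items()):
    print(feature_type, n)
```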

Tests

Nextflow tests are executed with nf-test. The test suite takes around 9 minutes to run.

Run:

$ cd mobilome-annotation-pipeline/
$ nf-test test --profile test,singularity

Development Tasks

This project uses Task to manage common development tasks and uv for Python virtual environment management. The available tasks are:

# Setup and dependencies
$ task setup-venv          # Bootstrap Python virtual environment with uv and install dependencies

# Testing
$ task test                # Run all tests with pytest
$ task test-verbose        # Run tests with verbose output
$ task test-coverage       # Run tests with coverage report (generates HTML report)
$ task test-specific -- <pattern>  # Run specific test file or pattern

# Utilities
$ task clean               # Clean up virtual environment

To see all available tasks, run:

$ task --list

License and Attribution

ICEfinder2 Attribution

This pipeline includes scripts derived from or inspired by ICEfinder2 algorithms and methods. The following scripts contain code adapted from ICEfinder2:

  • bin/ice_boundary_refinement.py - ICE boundary refinement and direct repeat processing
  • bin/map_tools/icefinder_process.py - ICE result processing and data formatting
  • bin/prescan_to_fasta.py - ICE prescanning and candidate detection methods

ICEfinder2 License: These algorithms are used under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Citation

The Mobilome Annotation Pipeline parses and integrates the output of the following tools and DBs sorted alphabetically:
