- The VIRify pipeline
- Nextflow execution
- Pipeline overview
- Detour: Metatranscriptomics
- Frequently Asked Questions (FAQ)
- Resources
- Technical Details
- Citations
VIRify is a pipeline for the detection, annotation, and taxonomic classification of viral contigs in metagenomic and metatranscriptomic assemblies. The pipeline is part of the repertoire of analysis services offered by MGnify. VIRify's taxonomic classification relies on the detection of taxon-specific profile hidden Markov models (HMMs), built upon a set of 22,013 orthologous protein domains and referred to as ViPhOGs.
The pipeline is implemented in Nextflow; besides Nextflow, only Docker or Singularity is needed to run VIRify. Details about installation and usage are given below.
A Nextflow implementation of the VIRify pipeline.
This implementation of the pipeline runs with the workflow manager Nextflow and needs either Docker or Singularity as its only other dependency. Conda support is planned but currently blocked because we use PPR-Meta. In any case, we highly recommend using the stable containers. All other programs and databases are downloaded automatically by Nextflow.
Attention: the first time it is executed, the workflow will download containers and databases with a size of roughly 19 GB (49 GB with --hmmextend and --blastextend)!
curl -s https://get.nextflow.io | bash
- for troubleshooting, see more instructions about Nextflow
If you don't have experience with bioinformatics tools and their installation, just copy the following commands into your terminal to set everything up (local machine with full permissions required!):
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo usermod -a -G docker $USER
- restart your computer
- for troubleshooting, see more instructions about Docker
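To verify that Docker works for your user after the restart, you can run the standard test image (a generic Docker check, not specific to VIRify):
docker run hello-world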
While Singularity can be installed via Conda, we recommend setting up a native Singularity installation. For HPCs, ask a system administrator you trust. Here is also a good manual to get you started. Please note: you only need Docker or Singularity, not both. However, due to security concerns it might not be possible to use Docker on your shared machine or HPC.
While it is possible to clone this repository and execute virify.nf directly, we recommend letting Nextflow handle the installation.
Get the pipeline code via:
nextflow pull EBI-Metagenomics/emg-viral-pipeline
Test the installation and get help:
nextflow run EBI-Metagenomics/emg-viral-pipeline --help
We highly recommend always running from a release:
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v3.0.0 --help
Check the release page to find the newest version of the pipeline, or run:
nextflow info EBI-Metagenomics/emg-viral-pipeline
The pipeline accepts either a single assembly via the --fasta parameter or a .csv samplesheet listing multiple assemblies via the --samplesheet parameter. The latter allows for greater parallelization, as one head Nextflow job can run many assemblies at once.
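For example, both invocation styles look like this (my_assembly.fasta and my_samples.csv are placeholder file names):
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v3.0.0 --fasta my_assembly.fasta -profile local,docker
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v3.0.0 --samplesheet my_samples.csv -profile local,docker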
Samplesheet
The samplesheet must be a .csv file that contains the following columns:
- id - Sample identifier (mandatory)
- assembly - Assembly file in FASTA format (optional)
- fastq_1 - FastQ file for reads 1 in '.fq.gz' or '.fastq.gz' format (optional)
- fastq_2 - FastQ file for reads 2 in '.fq.gz' or '.fastq.gz' format (optional)
- protein - Proteins file in FASTA format (optional)
The fastq_1 and fastq_2 files are optional and can be provided if the user wants the reads to be assembled. The protein file is also optional and can be provided to avoid running the protein caller again; see the examples below.
id,assembly,fastq_1,fastq_2,proteins
ERZ123,ERZ123.fasta,,,
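For illustration, a samplesheet mixing the different input types could look like this; the file paths are hypothetical and the column header follows the example above:
id,assembly,fastq_1,fastq_2,proteins
ERZ123,ERZ123.fasta,,,
ERZ456,,ERZ456_R1.fastq.gz,ERZ456_R2.fastq.gz,
ERZ789,ERZ789.fasta,,,ERZ789_proteins.faa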
Run the annotation for a small assembly file (10 contigs, 0.78 Mbp) on your local Linux machine using Docker containers (per default --cores 4; takes approximately 10 min on an 8-core i7 laptop, plus time for the ~19 GB database download):
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v3.0.0 \
    --fasta "/home/$USER/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline/nextflow/test/assembly.fasta" \
    --cores 4 -profile local,docker
Please note that, in particular, the following parameters are important for controlling where Nextflow writes files:
- --workdir or -w (here your work directories with intermediate data will be saved)
- --databases (here your databases will be saved, and the workflow checks whether they are already available under this path)
- --singularity_cachedir (here Singularity containers will be cached; not needed for Docker; default path: ./singularity)
Please clean up your work directory from time to time to save disk space!
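One option is Nextflow's built-in clean command, run from the directory you launched the pipeline from; this is a general Nextflow feature rather than anything VIRify-specific (the -f flag actually deletes the files):
nextflow clean -f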
Nextflow supports combining configuration profiles, so you have to define both an executor (e.g., local, lsf, slurm) and an engine (docker, singularity) to run the pipeline according to your needs and infrastructure.
Per default, the workflow runs locally (e.g., on your laptop) with Docker. When you execute the workflow on an HPC, you can switch to a specific job scheduler and Singularity instead of Docker, for example:
- SLURM (-profile slurm,singularity)
- LSF (-profile lsf,singularity)
Don't forget, especially on an HPC, to define further important parameters such as -w, --databases, and --singularity_cachedir as mentioned above.
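Putting this together, a SLURM/Singularity run could look like the following; all paths are placeholders and should be adapted to your cluster:
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v3.0.0 \
    --fasta my_assembly.fasta \
    -profile slurm,singularity \
    -w /scratch/$USER/virify_work \
    --databases /scratch/$USER/virify_databases \
    --singularity_cachedir /scratch/$USER/singularity_cache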
The conda engine does not work at the moment and will not until there is a Conda recipe for PPR-Meta or we switch the tool. Sorry. Please use Docker or Singularity. Alternatively, install PPR-Meta yourself and then use the conda profile (not recommended).
To monitor your Nextflow computations, VIRify can be connected to Nextflow Tower. You need a user access token to connect your Tower account with the pipeline. Simply generate a login using your email and then click the link sent to this address.
Once logged in, click on your avatar in the top right corner and select "Your tokens." Generate a token or copy the default one and set the following environment variable:
export TOWER_ACCESS_TOKEN=<YOUR_COPIED_TOKEN>
You can save this variable in your .bashrc or .profile so you do not need to enter it again. Then refresh your terminal.
Now run:
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v0.4.0 \
    --fasta "/home/$USER/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline/nextflow/test/assembly.fasta" \
    --cores 4 -profile local,docker \
    -with-tower
Alternatively, you can pull the code from this repository and activate the Tower connection within the nextflow.config file located in the root GitHub directory:
tower {
accessToken = ''
enabled = true
}
You can also directly enter your access token here instead of setting the environment variable mentioned above.
The outputs generated by the viral prediction tools, ViPhOG annotation, taxonomy assignment, and CheckV quality assessment are integrated and summarized in a validated GFF file.
By default, the pipeline produces an 08-final folder with the following structure:
Structure example per-assembly
08-final
├── annotation
│   ├── hmmer
│   │   ├── high_confidence_viral_contigs_prodigal_annotation.tsv
│   │   ├── low_confidence_viral_contigs_prodigal_annotation.tsv
│   │   └── prophages_prodigal_annotation.tsv
│   └── plot_contig_map
│       ├── high_confidence_viral_contigs_mapping_results
│       │   ├── high_confidence_viral_contigs_prot_ann_table_filtered.tsv
│       │   └── plot_pdfs.tar.gz
│       ├── low_confidence_viral_contigs_mapping_results
│       │   ├── low_confidence_viral_contigs_prot_ann_table_filtered.tsv
│       │   └── plot_pdfs.tar.gz
│       └── prophages_mapping_results
│           ├── plot_pdfs.tar.gz
│           └── prophages_prot_ann_table_filtered.tsv
├── contigs
│   ├── high_confidence_viral_contigs_original.fasta
│   ├── low_confidence_viral_contigs_original.fasta
│   └── prophages_original.fasta
├── chromomap [optional step]
├── gff
│   └── ACCESSION_virify.gff
├── krona
│   ├── ACCESSION.krona.html
│   ├── high_confidence_viral_contigs.krona.html
│   ├── low_confidence_viral_contigs.krona.html
│   └── prophages.krona.html
└── sankey
    ├── ACCESSION.sankey.html
    ├── high_confidence_viral_contigs.sankey.html
    ├── low_confidence_viral_contigs.sankey.html
    └── prophages.sankey.html
To get an expanded output with more files, use the --publish_all option when executing the pipeline.
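For example, the quick start command above becomes:
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v3.0.0 \
    --fasta "/home/$USER/.nextflow/assets/EBI-Metagenomics/emg-viral-pipeline/nextflow/test/assembly.fasta" \
    --cores 4 -profile local,docker --publish_all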
Expanded structure example per-assembly
├── 01-predictions
│   ├── ACCESSION_virus_predictions.stats
│   ├── pprmeta
│   │   └── ACCESSION_pprmeta.csv
│   ├── virfinder
│   │   └── ACCESSION.txt
│   └── virsorter2
│       ├── final-viral-boundary.tsv
│       ├── final-viral-combined.fa
│       ├── final-viral-score.tsv
│       └── virsorter_metadata.tsv
├── 02-prodigal
│   ├── high_confidence_viral_contigs_prodigal.faa
│   ├── low_confidence_viral_contigs_prodigal.faa
│   └── prophages_prodigal.faa
├── 03-hmmer
│   ├── high_confidence_viral_contigs_modified.tsv
│   ├── low_confidence_viral_contigs_modified.tsv
│   ├── prophages_modified.tsv
│   ├── ratio_evalue_tables
│   │   ├── high_confidence_viral_contigs_modified_informative.tsv
│   │   ├── low_confidence_viral_contigs_modified_informative.tsv
│   │   └── prophages_modified_informative.tsv
│   ├── vpHMM_database_v3
│   │   ├── high_confidence_viral_contigs_vpHMM_database_v3_hmmsearch.tbl
│   │   ├── low_confidence_viral_contigs_vpHMM_database_v3_hmmsearch.tbl
│   │   └── prophages_vpHMM_database_v3_hmmsearch.tbl
│   └── [other chosen optional HMM DBs]
├── 04-blast [optional step]
├── 05-plots
│   ├── krona
│   │   ├── ACCESSION.krona.tsv
│   │   ├── high_confidence_viral_contigs.krona.tsv
│   │   ├── low_confidence_viral_contigs.krona.tsv
│   │   └── prophages.krona.tsv
│   └── sankey
│       ├── all.sankey.filtered-25.json
│       ├── all.sankey.tsv
│       ├── high_confidence_viral_contigs.sankey.filtered-25.json
│       ├── high_confidence_viral_contigs.sankey.tsv
│       ├── low_confidence_viral_contigs.sankey.filtered-25.json
│       ├── low_confidence_viral_contigs.sankey.tsv
│       ├── prophages.sankey.filtered-25.json
│       └── prophages.sankey.tsv
├── 06-taxonomy
│   ├── high_confidence_viral_contigs_prodigal_annotation_taxonomy.tsv
│   ├── low_confidence_viral_contigs_prodigal_annotation_taxonomy.tsv
│   └── prophages_prodigal_annotation_taxonomy.tsv
├── 07-checkv
│   ├── high_confidence_viral_contigs_quality_summary.tsv
│   ├── low_confidence_viral_contigs_quality_summary.tsv
│   └── prophages_quality_summary.tsv
└── 08-final
    ├── annotation
    │   ├── hmmer
    │   │   ├── high_confidence_viral_contigs_prodigal_annotation.tsv
    │   │   ├── low_confidence_viral_contigs_prodigal_annotation.tsv
    │   │   └── prophages_prodigal_annotation.tsv
    │   └── plot_contig_map
    │       ├── high_confidence_viral_contigs_mapping_results
    │       │   ├── high_confidence_viral_contigs_prot_ann_table_filtered.tsv
    │       │   └── plot_pdfs.tar.gz
    │       ├── low_confidence_viral_contigs_mapping_results
    │       │   ├── low_confidence_viral_contigs_prot_ann_table_filtered.tsv
    │       │   └── plot_pdfs.tar.gz
    │       └── prophages_mapping_results
    │           ├── plot_pdfs.tar.gz
    │           └── prophages_prot_ann_table_filtered.tsv
    ├── contigs
    │   ├── high_confidence_viral_contigs_original.fasta
    │   ├── low_confidence_viral_contigs_original.fasta
    │   └── prophages_original.fasta
    ├── chromomap [optional step]
    ├── gff
    │   └── ACCESSION_virify.gff
    ├── krona
    │   ├── ACCESSION.krona.html
    │   ├── high_confidence_viral_contigs.krona.html
    │   ├── low_confidence_viral_contigs.krona.html
    │   └── prophages.krona.html
    └── sankey
        ├── ACCESSION.sankey.html
        ├── high_confidence_viral_contigs.sankey.html
        ├── low_confidence_viral_contigs.sankey.html
        └── prophages.sankey.html
You can find the validated GFF output in the 08-final/gff/ folder.
The labels used in the Type column of the gff file correspond to the following nomenclature according to the Sequence Ontology resource:
| Type in gff file | Sequence ontology ID |
|---|---|
| viral_sequence | SO:0001041 |
| prophage | SO:0001006 |
| CDS | SO:0000316 |
Note that CDS are reported only when a ViPhOG match has been found.
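For orientation only, a schematic GFF3 excerpt is shown below; the contig names, coordinates, source column, and attributes are invented for illustration and do not come from a real VIRify output (check 08-final/gff/ for the actual format):
##gff-version 3
contig_1	VIRify	viral_sequence	1	35000	.	.	.	ID=contig_1
contig_2	VIRify	prophage	12000	48000	.	.	.	ID=contig_2_prophage_1
contig_2	VIRify	CDS	12500	13800	.	+	0	ID=contig_2_CDS_1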
For further details please check: doi.org/10.1101/2022.08.22.504484
Although VIRify has been benchmarked and validated with metagenomic data in mind, it is also possible to use this tool to detect RNA viruses in metatranscriptome assemblies (e.g. SARS-CoV-2). However, some additional considerations for this purpose are outlined below:
1. Quality control: As for metagenomic data, a thorough quality control of the FASTQ sequence reads to remove low-quality bases, adapters and host contamination (if appropriate) is required prior to assembly. This is especially important for metatranscriptomes as small errors can further decrease the quality and contiguity of the assembly obtained. We have used TrimGalore for this purpose.
2. Assembly: There are many assemblers available that are appropriate for either metagenomic or single-species transcriptomic data. However, to our knowledge, there is no assembler currently available specifically for metatranscriptomic data. From our preliminary investigations, we have found that transcriptome-specific assemblers (e.g. rnaSPAdes) generate more contiguous and complete metatranscriptome assemblies compared to metagenomic alternatives (e.g. MEGAHIT and metaSPAdes).
3. Post-processing: Metatranscriptomes generate highly fragmented assemblies. Therefore, filtering contigs based on a set minimum length has a substantial impact on the number of contigs processed in VIRify. It has also been observed that the number of false-positive detections by VirFinder (one of the tools included in VIRify) is lower among larger contigs. The choice of a length threshold will depend on the complexity of the sample and the sequencing technology used, but in our experience any contigs <2 kb should be analysed with caution (see the example command after this list).
4. Classification: The classification module of VIRify depends on the presence of a minimum number and proportion of phylogenetically-informative genes within each contig in order to confidently assign a taxonomic lineage. Therefore, short contigs typically obtained from metatranscriptome assemblies remain generally unclassified. For targeted classification of RNA viruses (for instance, to search for Coronavirus-related sequences), alternative DNA- or protein-based classification methods can be used. Two of the possible options are: (i) using MashMap to screen the VIRify contigs against a database of RNA viruses (e.g. Coronaviridae) or (ii) using hmmsearch to screen the proteins obtained in the VIRify contigs against marker genes of the taxon of interest.
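As a sketch of the length filtering mentioned in point 3, contigs shorter than 2 kb could be removed before running VIRify, for example with seqkit (seqkit is not part of the pipeline and is only one possible tool; adapt the threshold to your data):
seqkit seq -m 2000 metatranscriptome_assembly.fasta > metatranscriptome_assembly.min2kb.fasta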
gt gff3validator: error: token "ID" on line XXXX in file "SAMPLE_virify.gff" does not contain exactly one '='
Cause: This error typically occurs when FASTA headers contain special characters that interfere with GFF3 format requirements. Characters like hyphens (-), periods (.), and equals signs (=) in sequence identifiers can cause issues during the GFF validation step.
Example of problematic FASTA headers:
>k141_1615808-flag=1-multi=1.0000-len=1122
>contig-1.2=scaffold_01
Solution: Clean your FASTA headers before running VIRify by replacing problematic characters with underscores:
# Replace hyphens, periods, and equals signs with underscores
sed '/^>/ s/[-.=]/_/g' original.fasta > cleaned.fasta
Additional material (assemblies used for benchmarking in the paper, ...) as well as the ViPhOG HMMs with model-specific bit score thresholds used in VIRify are available at osf.io/fbrxy.
Here, we also list the databases used and automatically downloaded by the pipeline (in v2.0.0) when it is first run. We deposited the database files on a separate FTP to ensure their accessibility. The files can also be downloaded manually and then used as an input for the pipeline to prevent the auto-download (see --help in the Nextflow pipeline and the example after the list below).
- ViPhOGs (mandatory, used for taxonomy assignment)
wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/vpHMM_database_v3.tar.gz
- Additional metadata file for filtering the ViPhOGs (according to taxonomy updates by the ICTV)
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/additional_data_vpHMMs_v4.tsv
- Publication
- pVOGs (optional)
wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/pvogs.tar.gz
- Publication
- RVDB (optional)
wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/rvdb.tar.gz
- Publication
- VOGDB (optional)
wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/vogdb.tar.gz
- Publication
- VPF (optional)
wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/vpf.tar.gz
- Publication
- VirSorter HMMs
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/virsorter-data-v2.tar.gz
- Publication
- Virfinder model
wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/virfinder/VF.modEPV_k8.rda
- Publication
- CheckV
wget https://portal.nersc.gov/CheckV/checkv-db-v1.0.tar.gz
- Publication
- NCBI taxonomy
wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/2022-11-01_ete3_ncbi_tax.sqlite.gz
- IMG/VR
wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/IMG_VR_2018-07-01_4.tar.gz
- Publication
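As mentioned above, the databases can be prepared manually and passed to the pipeline to skip the auto-download. A rough sketch, assuming the downloaded archives are extracted into a single directory whose exact expected layout you should confirm via --help:
mkdir -p virify_databases && cd virify_databases
wget -nH ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/viral-pipeline/hmmer_databases/vpHMM_database_v3.tar.gz
tar -xzf vpHMM_database_v3.tar.gz
cd ..
nextflow run EBI-Metagenomics/emg-viral-pipeline -r v3.0.0 --fasta my_assembly.fasta -profile local,docker --databases ./virify_databases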
VIRify includes special handling for circular genome artifacts produced by VirSorter2 (VS2). When processing circular genomes, VS2 extends the annotation beyond the end of the contig to avoid truncating the gene annotation. This can result in prophage coordinates that exceed the original contig boundaries.
VIRify automatically detects and truncates prophage end coordinates that exceed contig lengths, while preserving the original prophage start coordinates. For example, a prophage annotated from position 1,500 to 12,300 on a 12,000 bp contig would be reported as spanning 1,500 to 12,000.
For more details, see VirSorter2 issue #243.
Note that CheckV carries over the overhang end from VirSorter2, so be mindful of this when using the results. In addition, extended genes are also trimmed in the final output of VIRify.
If you use the pipeline or ViPhOG HMMs in your work, please cite accordingly:
ViPhOGs:
VIRify:


