LanderDC/pyrodigal-rv

🔥🦠 Pyrodigal-rv

A Pyrodigal extension to predict genes in RNA viruses (with standard and alternative genetic code).

🗺️ Overview

Pyrodigal is a Python module that provides Cython bindings to Prodigal, an efficient gene finding method for genomes and metagenomes based on dynamic programming. Additionally, pyrodigal-gv is a small extension module for pyrodigal (both written by Martin Larralde) which distributes additional metagenomic models for giant viruses and viruses that use alternative genetic codes, first provided by Antônio Camargo in prodigal-gv.

Inspired by the additional metagenomic models for giant viruses and bacteriophages in pyrodigal-gv, pyrodigal-rv replaces those metagenomic models, and the bacterial models from pyrodigal, with metagenomic models for RNA viruses. Most of these models use the standard genetic code (translation table 1), but RNA virus models with alternative genetic codes are also included.

Important

Although pyrodigal-rv appears to perform well in benchmarks on Riboviria RefSeq sequences, the model chosen for gene prediction is by no means an indication of the sequence's taxonomy. In addition, pyrodigal-rv may pick the wrong translation table in a minority of cases, because pyrodigal does not distinguish between translation tables 1 and 11 (they use the same start and stop codons). Be cautious when pyrodigal-rv reports a translation table that does not match what you would expect from the sequence's taxonomy. For example, the Spinareoviridae (see here) seem to be affected by this.
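One way to act on this caveat is to compare the translation table reported for a gene with the table expected from the sequence's taxonomy, while ignoring a 1-versus-11 difference since pyrodigal cannot tell those two apart. The helper below is only a minimal sketch: `table_is_suspicious` and `INTERCHANGEABLE` are hypothetical names and not part of pyrodigal-rv.

```python
# Hypothetical helper, not part of pyrodigal-rv.
# Tables 1 and 11 are grouped because pyrodigal treats them identically.
INTERCHANGEABLE = frozenset({1, 11})


def table_is_suspicious(predicted: int, expected: int) -> bool:
    """Return True when the predicted table genuinely disagrees with the expected one."""
    if predicted == expected:
        return False
    # A 1-vs-11 difference carries no information, so it is not flagged.
    if predicted in INTERCHANGEABLE and expected in INTERCHANGEABLE:
        return False
    return True


# Example: taxonomy suggests table 1; three different predictions are checked.
for predicted in (1, 11, 6):
    print(predicted, table_is_suspicious(predicted, expected=1))
```

In practice, `predicted` would come from the `translation_table` attribute of a predicted gene, and `expected` from whatever taxonomic annotation you trust for the sequence.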

See below for which viral families and which genetic codes are included. The process of model generation is documented in a separate repo.

Code and instructions below are exactly the same as for pyrodigal-gv.

🔧 Installing

pyrodigal-rv can be installed directly from PyPI as a universal wheel that contains all required data files:

$ pip install pyrodigal-rv

Otherwise, pyrodigal-rv is also available as a Bioconda package:

$ conda install -c bioconda pyrodigal-rv

💡 Example

Just use the provided ViralGeneFinder class instead of the usual GeneFinder from pyrodigal, and the new viral models will be used automatically in meta mode:

import Bio.SeqIO
import pyrodigal_rv

record = Bio.SeqIO.read("sequence.gbk", "genbank")

orf_finder = pyrodigal_rv.ViralGeneFinder(meta=True)
for i, pred in enumerate(orf_finder.find_genes(bytes(record.seq))):
    print(f">{record.id}_{i+1}")
    print(pred.translate())

ViralGeneFinder has an additional keyword argument, viral_only, which can be set to True to run gene calling using only viral models.
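As an illustration of `viral_only`, the sketch below wraps the gene finder in a small function that returns coordinates, strand, translation table and protein sequence for each prediction. The function name `predict_viral_proteins` is hypothetical; the gene attributes used (`begin`, `end`, `strand`, `translation_table`, `translate()`) are those of pyrodigal's gene objects.

```python
def predict_viral_proteins(sequence: bytes, viral_only: bool = True):
    """Return (begin, end, strand, translation_table, protein) per predicted gene."""
    # Imported inside the function so the helper can be defined even where
    # pyrodigal-rv is not installed yet.
    import pyrodigal_rv

    finder = pyrodigal_rv.ViralGeneFinder(meta=True, viral_only=viral_only)
    return [
        (gene.begin, gene.end, gene.strand, gene.translation_table, gene.translate())
        for gene in finder.find_genes(sequence)
    ]
```

With the record loaded in the example above, `predict_viral_proteins(bytes(record.seq))` would restrict gene calling to the viral models, and passing `viral_only=False` gives the default behaviour.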

🔨 Command line

pyrodigal-rv comes with a very simple command line similar to Prodigal and pyrodigal:

$ pyrodigal-rv -i <input_file.fasta> -a <gene_translations.fasta> -d <gene_sequences.fasta>

Unlike prodigal and pyrodigal, the pyrodigal-rv script runs in meta mode by default! Single mode is available with pyrodigal-rv -p single, but the results will be exactly the same as with pyrodigal, so why would you ever do this ⁉️
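When the command line has to be run on many FASTA files, it can help to assemble the arguments programmatically. The sketch below only builds the argument list from the flags shown above (`-i`, `-a`, `-d`, `-p`); the function name `build_command` and the `.faa`/`.fna` output naming are assumptions made for illustration.

```python
import shlex
from pathlib import Path


def build_command(fasta, out_dir, mode="meta"):
    """Build the pyrodigal-rv argument list for one input FASTA file."""
    fasta = Path(fasta)
    out_dir = Path(out_dir)
    cmd = [
        "pyrodigal-rv",
        "-i", str(fasta),
        "-a", str(out_dir / f"{fasta.stem}.faa"),  # protein translations (assumed naming)
        "-d", str(out_dir / f"{fasta.stem}.fna"),  # gene nucleotide sequences (assumed naming)
    ]
    # Meta mode is the default of the pyrodigal-rv script, so -p is only
    # added when another mode is requested.
    if mode != "meta":
        cmd += ["-p", mode]
    return cmd


print(shlex.join(build_command("viruses.fasta", "results")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once pyrodigal-rv is installed and the output directory exists.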

📊 Benchmarking

The benchmarking is documented in this repo.

Accuracy

To evaluate pyrodigal-rv ORF prediction in RNA viruses, all Riboviria sequences in RefSeq indicated as "complete" by the sequence submission authors and without N's in the sequence were used as a benchmark (n=9,001).

All tools were run in closed mode (-c), and pyrodigal was forced to use genetic code 1 (-g 1) for the benchmarking, as this is the genetic code most used by RNA viruses. Compared with the CDS annotations from RefSeq, pyrodigal and pyrodigal-rv give 58.9% and 49.4% exact matches respectively, while both also predicted ~25% of CDSs with different start and/or stop sites. For pyrodigal-rv, a further 12.4% of CDSs differed from RefSeq only in the predicted translation table.

As expected, pyrodigal-gv had almost no exact matches because it contains no metagenomic models with genetic code 1; it also predicted 28.8% of CDSs with different start/stop sites (4.6% higher than pyrodigal-rv).

pyrodigal-rv also performed best in terms of extra and missing CDS predictions (considerably fewer extra predictions and only 0.4% more missing predictions than pyrodigal).
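The categories used above (exact match, different start/stop, different translation table only, extra, missing) can be illustrated with a small classifier. This is a sketch under stated assumptions, not the benchmark code from the separate repo: the `CDS` tuple, the function names, and the rule that a same-strand overlap counts as "different start/stop" are all hypothetical choices.

```python
from typing import NamedTuple


class CDS(NamedTuple):
    begin: int   # leftmost coordinate
    end: int     # rightmost coordinate (inclusive)
    strand: int  # +1 or -1
    table: int   # translation table


def _overlap(a: CDS, b: CDS) -> bool:
    """True when both CDSs are on the same strand and their intervals intersect."""
    return a.strand == b.strand and a.begin <= b.end and b.begin <= a.end


def classify(pred: CDS, refs: list[CDS]) -> str:
    """Label one predicted CDS relative to the reference annotation."""
    for ref in refs:
        if (pred.begin, pred.end, pred.strand) == (ref.begin, ref.end, ref.strand):
            return "exact" if pred.table == ref.table else "table_only"
    if any(_overlap(pred, ref) for ref in refs):
        return "different_start_stop"
    return "extra"


def missing(preds: list[CDS], refs: list[CDS]) -> list[CDS]:
    """Reference CDSs that no prediction overlaps on the same strand."""
    return [ref for ref in refs if not any(_overlap(p, ref) for p in preds)]


refs = [CDS(1, 300, 1, 1), CDS(400, 900, 1, 1), CDS(1000, 1500, -1, 1)]
preds = [CDS(1, 300, 1, 1), CDS(430, 900, 1, 11), CDS(2000, 2300, 1, 1)]
for pred in preds:
    print(pred, classify(pred, refs))
print("missing:", missing(preds, refs))
```

Coordinates are checked before the translation table, so a prediction with identical coordinates but a different table is reported separately from one whose start or stop site has moved.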

pyrodigal-rv also adds the ability to predict the genetic code of your RNA virus sequence. Compared to RefSeq, 11.7% of the sequences had a mismatch in genetic code. On closer examination, however, the majority of these sequences belong to the Atkinsviridae, Blumeviridae, Fiersviridae, Solspiviridae and Steitzviridae, which are RNA phages and should use the bacterial genetic code 11 (as predicted by pyrodigal-rv). This shows that not all sequences in RefSeq are annotated with the correct translation table, and that this benchmark underestimates pyrodigal-rv's accuracy in terms of exact matches.

Disclaimer: The training models for pyrodigal-rv contain some RefSeq sequences.

Speed

CLI speed was benchmarked with hyperfine over 10 runs of the same command on 9,000 sequences for each CLI (pyrodigal, pyrodigal-gv and pyrodigal-rv) using 10 processes (-j 10 --pool process).

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| pyrodigal | 63.883 ± 0.597 | 63.402 | 65.288 | 2.19 ± 0.03 |
| pyrodigal-gv | 29.568 ± 0.563 | 28.250 | 30.286 | 1.01 ± 0.02 |
| pyrodigal-rv | 29.150 ± 0.199 | 28.860 | 29.540 | 1.00 |
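The Relative column is each command's mean run time divided by the fastest mean. The short check below recomputes it from the mean values in the table; the variable names are chosen for illustration only.

```python
# Mean run times in seconds, copied from the table above.
means = {"pyrodigal": 63.883, "pyrodigal-gv": 29.568, "pyrodigal-rv": 29.150}

fastest = min(means.values())
relative = {name: round(mean / fastest, 2) for name, mean in means.items()}

print(relative)
# {'pyrodigal': 2.19, 'pyrodigal-gv': 1.01, 'pyrodigal-rv': 1.0}
```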

🔖 Citation

Pyrodigal is scientific software, with a published paper in the Journal of Open-Source Software. Please cite both Pyrodigal and Prodigal if you use it in academic work, for instance as:

Pyrodigal (Larralde, 2022), a Python library binding to Prodigal (Hyatt et al., 2010).

Detailed references are available on the Publications page of the online documentation.

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the GNU General Public License v3.0. The Prodigal code was written by Doug Hyatt and is distributed under the terms of the GPLv3 as well. See vendor/Prodigal/LICENSE for more information.

This project is in no way affiliated, sponsored, or otherwise endorsed by the original Prodigal authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team. RNA virus models were added by Lander De Coninck.

Models

Included models and genetic codes:

| model | parent_family | name | viral | gc_content | genetic_code | uses_sd |
|---:|:---|:---|:---:|---:|---:|---:|
| 1 | | Tymoviridae_1_model | V | 54.5 | 1 | 0 |
| 2 | | Picobirnaviridae_6_model | V | 43.5 | 6 | 0 |
| 3 | | Polymycoviridae_1_model | V | 57.7 | 1 | 0 |
| 4 | | Atkinsviridae_11_model | V | 49.0 | 11 | 1 |
| 5 | | Duinviridae_11_model | V | 43.6 | 11 | 1 |
| 6 | | Aspiviridae_1_model | V | 36.0 | 1 | 0 |
| 7 | | Narnaviridae_1_model | V | 50.5 | 1 | 0 |
| 8 | | Peribunyaviridae_1_model | V | 35.9 | 1 | 0 |
| 9 | | Nodaviridae_1_model | V | 49.5 | 1 | 0 |
| 10 | | Sedoreoviridae_1_model | V | 37.6 | 1 | 0 |
| 11 | | Narnaviridae_6_model | V | 51.3 | 6 | 0 |
| 12 | | Qinviridae_1_model | V | 46.6 | 1 | 0 |
| 13 | | Narnaviridae_4_model | V | 41.3 | 4 | 1 |
| 14 | | Tombusviridae_6_model | V | 51.5 | 6 | 1 |
| 15 | | Orthototiviridae_1_model | V | 48.3 | 1 | 0 |
| 16 | | Tombusviridae_16_model | V | 53.1 | 16 | 1 |
| 17 | | Carmotetraviridae_1_model | V | 50.7 | 1 | 0 |
| 18 | | Steitzviridae_11_model | V | 50.4 | 11 | 1 |
| 19 | | Picobirnaviridae_1_model | V | 42.0 | 1 | 1 |
| 20 | | Dicistroviridae_1_model | V | 40.7 | 1 | 0 |
| 21 | | Astroviridae_1_model | V | 45.9 | 1 | 0 |
| 22 | | Hepadnaviridae_1_model | V | 47.4 | 1 | 0 |
| 23 | | Tombusviridae_1_model | V | 49.8 | 1 | 0 |
| 24 | | Solspiviridae_11_model | V | 49.9 | 11 | 1 |
| 25 | | Cystoviridae_11_model | V | 51.3 | 11 | 1 |
| 26 | | Picobirnaviridae_5_model | V | 36.2 | 5 | 0 |
| 27 | | Blumeviridae_11_model | V | 45.2 | 11 | 1 |
| 28 | | Alphaormycoviridae_1_model | V | 44.6 | 1 | 0 |
| 29 | | Orthomyxoviridae_1_model | V | 40.0 | 1 | 0 |
| 30 | | Fiersviridae_4_model | V | 49.2 | 4 | 1 |
| 31 | | Flaviviridae_1_model | V | 42.1 | 1 | 0 |
| 32 | | Splipalmiviridae_1_model | V | 49.3 | 1 | 0 |
| 33 | | Picobirnaviridae_4_model | V | 43.0 | 4 | 0 |
| 34 | | Betaormycoviridae_1_model | V | 41.6 | 1 | 0 |
| 35 | | Tombusviridae_4_model | V | 48.4 | 4 | 0 |
| 36 | | Pseudototiviridae_1_model | V | 55.2 | 1 | 0 |
| 37 | | Fiersviridae_6_model | V | 48.7 | 6 | 0 |
| 38 | | Fimoviridae_1_model | V | 31.0 | 1 | 0 |
| 39 | | Botourmiaviridae_4_model | V | 44.8 | 4 | 1 |
| 40 | | Fiersviridae_11_model | V | 50.1 | 11 | 1 |
| 41 | | Yueviridae_1_model | V | 41.3 | 1 | 0 |
| 42 | | Dicistroviridae_6_model | V | 35.9 | 6 | 1 |
| 43 | | Spinareoviridae_1_model | V | 43.6 | 1 | 0 |
| 44 | | Matonaviridae_1_model | V | 61.3 | 1 | 0 |
| 45 | | Picornaviridae_1_model | V | 44.2 | 1 | 0 |
| 46 | | Caulimoviridae_1_model | V | 40.7 | 1 | 0 |
| 47 | | Barnaviridae_1_model | V | 50.7 | 1 | 0 |
| 48 | | Chrysoviridae_1_model | V | 49.1 | 1 | 0 |
| 49 | | Mitoviridae_16_model | V | 44.1 | 16 | 0 |
| 50 | | Picornaviridae_4_model | V | 48.3 | 4 | 0 |
| 51 | | Picornaviridae_6_model | V | 43.0 | 6 | 0 |
| 52 | | Partitiviridae_1_model | V | 44.9 | 1 | 0 |
| 53 | | Qinviridae_6_model | V | 47.8 | 6 | 1 |
| 54 | | Botourmiaviridae_1_model | V | 51.6 | 1 | 0 |
| 55 | | Potyviridae_1_model | V | 41.9 | 1 | 0 |
| 56 | | Fiersviridae_16_model | V | 50.0 | 16 | 0 |
| 57 | | Yadokariviridae_1_model | V | 45.6 | 1 | 0 |
| 58 | | Narnaviridae_16_model | V | 45.3 | 16 | 1 |
| 59 | Flaviviridae | Pestivirus_1_model | V | 45.0 | 1 | 0 |
| 60 | Flaviviridae | Pegivirus_1_model | V | 55.2 | 1 | 0 |
| 61 | Mitoviridae | Unuamitovirus_4_model | V | 35.9 | 4 | 0 |
| 62 | Flaviviridae | Hepacivirus_1_model | V | 55.0 | 1 | 0 |
| 63 | Mitoviridae | Duamitovirus_4_model | V | 41.4 | 4 | 0 |
| 64 | Mitoviridae | Triamitovirus_4_model | V | 39.9 | 4 | 0 |
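The model names listed here follow the pattern `<taxon>_<genetic code>_model`, so the taxon and genetic code can be recovered from the name alone. The parser below is a small sketch; `parse_model_name` is a hypothetical helper and not part of pyrodigal-rv.

```python
def parse_model_name(name: str) -> tuple[str, int]:
    """Split e.g. 'Picobirnaviridae_6_model' into ('Picobirnaviridae', 6)."""
    # rsplit from the right so only the last two underscores are used.
    taxon, code, suffix = name.rsplit("_", 2)
    if suffix != "model":
        raise ValueError(f"unexpected model name: {name!r}")
    return taxon, int(code)


for name in ("Tymoviridae_1_model", "Atkinsviridae_11_model", "Pestivirus_1_model"):
    print(parse_model_name(name))
```

For genus-level models such as `Pestivirus_1_model`, the parent family is given in a separate column and is not part of the name itself.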
