Potential proteins listed in results_table.csv and feature_table.txt are not labelled in the proteins.fa contig file

Hi there! plasmidVerify has been a very helpful tool so far, but I have an idea that could make it even more helpful. The feature_table.txt and result_table.csv files list the potential proteins (HMMs) found in my plasmid's FASTA file. The proteins.fa and genes.fa files include the coordinates and sequences for what I presume to be all of the protein sequences found in my input FASTA file. As I understand it, those files contain every potential protein, whereas the results file names only a smaller subset. It would be really handy to have the exact coordinates/sequences of the proteins listed in the result_table file provided somewhere in the output, so that I can find those sequences in the proteins.fa output file. If my results_table.csv looks like this:

Contig name,Prediction,Log-likelihood ratio,Predicted HMMs
Archangium_MIWBW_plasmid,Plasmid,12.37,ParB_N HNH SMODS

It would be great to have another output file that has something like this:
Predicted HMMs,Coordinates
ParB_N,24-1200
HNH,1201-2000
SMODS,2001-2400

and so on for each predicted HMM. That way I can search my proteins.fa file for the exact protein sequence assigned to each HMM.
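To make the request concrete, here is a minimal Python sketch of how such a table could be built. It assumes that the headers in proteins.fa follow Prodigal's usual FASTA layout (`>id # start # end # strand # ID=...`), which seems plausible because the genes.fa excerpt in this issue reports Prodigal v2.6.3. The function names, the `ctg_2` protein ID, and the `hmm_by_protein` mapping are all hypothetical and only for illustration; this is not how plasmidVerify works internally.

```python
import csv
import io


def parse_prodigal_header(header):
    """Split a Prodigal-style FASTA header into (protein_id, start, end, strand).

    Assumes the layout '>id # start # end # strand # attributes'.
    """
    fields = header.lstrip(">").split(" # ")
    protein_id = fields[0].strip()
    start, end, strand = int(fields[1]), int(fields[2]), int(fields[3])
    return protein_id, start, end, "+" if strand == 1 else "-"


def coordinates_table(fasta_lines, hmm_by_protein):
    """Return CSV text with one row per (HMM, protein) pair.

    hmm_by_protein is a hypothetical dict: protein ID -> list of HMM names.
    Proteins without an assigned HMM are skipped.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Predicted HMM", "Protein ID", "Coordinates", "Strand"])
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        protein_id, start, end, strand = parse_prodigal_header(line)
        for hmm in hmm_by_protein.get(protein_id, []):
            writer.writerow([hmm, protein_id, f"{start}-{end}", strand])
    return out.getvalue()


if __name__ == "__main__":
    # Illustrative header only; the real IDs in proteins.fa may differ.
    example = [">ctg_2 # 212 # 1093 # 1 # ID=1_2;partial=00\n", "MKT*\n"]
    print(coordinates_table(example, {"ctg_2": ["HNH"]}))
```

The missing piece is the protein-to-HMM mapping itself, which is exactly what I cannot currently recover from the plasmidVerify output.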

Alternatively, the genes.fa file could be modified to include a field stating whether each gene matched a predicted HMM and, if so, which one. Something like this, with an HMM= field added at the end of each /note:

";version=Prodigal.v2.6.3;run_type=Metagenomic;model="19|Erythrobacter_litoralis_HTCC2594|B|63.1|11|1";gc_cont=63.10;transl_table=11;uses_sd=1
FEATURES             Location/Qualifiers
     CDS             complement(<1..99)
                     /note="ID=1_1;partial=10;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.727;conf=95.84;score=13.65;cscore=1.88;sscore=11.76;rscore=10.54;uscore=-1.77;tscore=3.65;HMM=ParB_N;"
     CDS             212..1093
                     /note="ID=1_2;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=11-12bp;gc_cont=0.713;conf=100.00;score=112.87;cscore=108.82;sscore=4.05;rscore=-0.37;uscore=0.77;tscore=3.65;HMM=HNH;"
     CDS             1430..2212
                     /note="ID=1_3;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.603;conf=99.46;score=22.67;cscore=4.56;sscore=18.12;rscore=13.75;uscore=1.37;tscore=3.65;HMM=SMODS;"
    
If this already exists and I am just missing it, please let me know. This feature would be really helpful for comparison analyses across prediction models. 

Because the '--db' argument did not work when I tried adding it to my initial plasmidVerify run, I have been using blastp after the fact to work out which protein sequences go with which HMM. However, I am concerned about misidentifying the coordinates if more than one sequence could match a given HMM, and I want the exact coordinates at which each potential HMM was found.
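As a possible interim workaround, here is a rough Python sketch that groups hits by HMM from an HMMER `--tblout` table, so that every candidate protein for an HMM is visible rather than just one. It assumes I run `hmmsearch --tblout` myself against proteins.fa; I do not know whether plasmidVerify keeps such a file. The function name and the `max_evalue` cutoff of 1e-5 are my own arbitrary choices, not values taken from the tool.

```python
from collections import defaultdict


def hits_per_hmm(tblout_lines, max_evalue=1e-5):
    """Group hmmsearch --tblout rows by HMM name.

    In hmmsearch's tabular output the whitespace-separated columns start with:
    target (protein) name, target accession, query (HMM) name, query accession,
    full-sequence E-value, full-sequence score.
    Returns {hmm_name: [(protein_id, evalue, score), ...]}, best score first.
    """
    hits = defaultdict(list)
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.split()
        protein_id, hmm_name = cols[0], cols[2]
        evalue, score = float(cols[4]), float(cols[5])
        if evalue <= max_evalue:  # arbitrary illustrative cutoff
            hits[hmm_name].append((protein_id, evalue, score))
    for hmm_name in hits:
        hits[hmm_name].sort(key=lambda hit: hit[2], reverse=True)
    return dict(hits)
```

An HMM with more than one entry in the returned list is exactly the ambiguous case I am worried about, which is why having the coordinates reported by plasmidVerify itself would be more reliable.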

This is just a suggestion, but I figured it could be helpful to a lot of people. Thank you!
