Potential proteins listed in results_table.csv and feature_table.txt are not labelled in the proteins.fa contig file #5

@smoorsh

Hi there! plasmidVerify has been a very helpful tool so far, and I have an idea that could make it even more helpful. The feature_table.txt and result_table.csv files list the potential proteins (HMMs) found in my plasmid's fasta file. The proteins.fa and genes.fa files include the coordinates and sequences for what I presume are all of the protein sequences predicted in my input fasta file. As I understand it, those files cover every potential protein, whereas the results file names only a smaller subset. It would be really handy to have the exact coordinates/sequences of the proteins listed in the result_table file provided somewhere in the output, so that I can find those sequences in the proteins.fa output file. For example, if my results_table.csv looks like this:

Contig name,Prediction,Log-likelihood ratio,Predicted HMMs
Archangium_MIWBW_plasmid,Plasmid,12.37,ParB_N HNH SMODS

It would be great to have another output file that has something like this:
Predicted HMMs,Coordinates
ParB_N,24-1200
HNH,1201-2000
SMODS,2001-2400

And so on for each predicted HMM. That way I can look up the exact protein sequence assigned to each HMM in my proteins.fa file.
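To make the request concrete, here is a minimal sketch of how such a table could be assembled outside the tool. It is not plasmidVerify's own code. It assumes two things that may not hold for a given run: that proteins.fa keeps Prodigal-style headers (`>name # start # end # strand # ...`), and that an HMMER `--domtblout` table from the same proteins is available. The function names are hypothetical.

```python
import csv
import io


def parse_prodigal_headers(fasta_text):
    """Map protein ID -> (start, end, strand) from Prodigal-style FASTA headers."""
    coords = {}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        fields = [f.strip() for f in line[1:].split(" # ")]
        if len(fields) < 4:
            continue  # header is not in the assumed Prodigal layout
        strand = "+" if fields[3] == "1" else "-"
        coords[fields[0]] = (int(fields[1]), int(fields[2]), strand)
    return coords


def parse_domtblout_hits(domtbl_text):
    """Return unique (protein_id, hmm_name, evalue, score) hits.

    In hmmsearch --domtblout output, column 0 is the target (protein),
    column 3 the query (HMM), and columns 6 and 7 the full-sequence
    E-value and score. A protein with several domains appears on several
    lines, so each (protein, HMM) pair is kept only once.
    """
    seen = set()
    hits = []
    for line in domtbl_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split()
        if len(cols) < 22:
            continue  # malformed or truncated row
        key = (cols[0], cols[3])
        if key in seen:
            continue
        seen.add(key)
        hits.append((cols[0], cols[3], float(cols[6]), float(cols[7])))
    return hits


def hmm_coordinate_rows(fasta_text, domtbl_text):
    """Join HMM hits with gene coordinates, keeping every hit per HMM."""
    coords = parse_prodigal_headers(fasta_text)
    rows = []
    for protein_id, hmm, evalue, score in parse_domtblout_hits(domtbl_text):
        if protein_id in coords:
            start, end, strand = coords[protein_id]
            rows.append((hmm, protein_id, start, end, strand, evalue, score))
    # Group by HMM name, best-scoring protein first within each HMM.
    rows.sort(key=lambda r: (r[0], -r[6]))
    return rows


def rows_to_csv(rows):
    """Render the joined rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Predicted HMM", "Protein ID", "Start", "End",
                     "Strand", "E-value", "Score"])
    writer.writerows(rows)
    return buf.getvalue()
```

Because every hit is kept rather than only the best one, an HMM that matches more than one protein shows up on several rows, each with its own coordinates.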

Alternatively, modifying the genes.fa file to include a field stating whether each gene matched a predicted HMM, and which one, would be really helpful. Something like this:

";version=Prodigal.v2.6.3;run_type=Metagenomic;model="19|Erythrobacter_litoralis_HTCC2594|B|63.1|11|1";gc_cont=63.10;transl_table=11;uses_sd=1
FEATURES Location/Qualifiers
CDS complement(<1..99)
/note="ID=1_1;partial=10;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.727;conf=95.84;score=13.65;cscore=1.88;sscore=11.76;rscore=10.54;uscore=-1.77;tscore=3.65;HMM=ParB_N;"
CDS 212..1093
/note="ID=1_2;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=11-12bp;gc_cont=0.713;conf=100.00;score=112.87;cscore=108.82;sscore=4.05;rscore=-0.37;uscore=0.77;tscore=3.65;HMM=HNH;"
CDS 1430..2212
/note="ID=1_3;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.603;conf=99.46;score=22.67;cscore=4.56;sscore=18.12;rscore=13.75;uscore=1.37;tscore=3.65;HMM=SMODS;"
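For illustration, the annotation shown above could be produced by a small post-processing step like the sketch below. It is not part of plasmidVerify. It assumes a mapping from Prodigal gene IDs (the `ID=` value in each `/note`) to HMM names has already been built by some other means, and `annotate_notes` is a hypothetical helper name.

```python
import re

# Matches the gene ID at the start of a Prodigal /note qualifier.
NOTE_ID = re.compile(r'/note="ID=([^;"]+)')


def annotate_notes(feature_text, hmm_by_id):
    """Append HMM=<name>; to the /note of every gene found in hmm_by_id.

    Lines that are not /note qualifiers, or whose gene ID has no HMM
    assigned, are returned unchanged.
    """
    out = []
    for line in feature_text.splitlines():
        match = NOTE_ID.search(line)
        stripped = line.rstrip()
        if match and match.group(1) in hmm_by_id and stripped.endswith('"'):
            body = stripped[:-1]  # drop the closing quote
            if not body.endswith(";"):
                body += ";"
            line = f'{body}HMM={hmm_by_id[match.group(1)]};"'
        out.append(line)
    return "\n".join(out)
```

Genes without an HMM hit keep their original note, so the file stays valid for anything that already reads the Prodigal output.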

If this already exists and I am just missing it, please let me know. This feature would be really helpful for comparison analyses across prediction models.

In the meantime, I have been running blastp after the fact to work out which protein sequences go with which HMM, because the '--db' argument did not work when I added it to my initial plasmidVerify run. However, I am concerned about misidentifying the coordinates if more than one sequence could match a given HMM, and I want the exact coordinates at which each potential HMM was found.

This is just a suggestion, but I figured it could be helpful to a lot of people. Thank you!
