Fix genetic code/translation table management #367
Open
JeanMainguy wants to merge 8 commits into dev
Problem
The translation table was not managed correctly. When annotation files are used, the translation table is parsed and saved for each CDS in the `genedata` table of the HDF5 file, but it was not reused later in the cluster step (the table specified by the user, or the default, was used instead). Many commands rely on the translation table, yet users had to pass it as a parameter each time instead of reusing the one that was used to construct the pangenome.
Implementation
Added tracking of user-specified arguments:
Added a `specified_args` attribute to the args object that lists arguments explicitly set by the user. This allows distinguishing when an argument has been specified vs using a default value.
Pangenome-level genetic code:
PPanGGOLiN expects genomes in a pangenome to have the same genetic code, so a unique genetic code is determined at the pangenome level.
For annotation files (GFF, GBFF): Translation table is specified for each CDS in the genome files. This information is kept for each gene. To determine the table to use at the pangenome level, the most abundant one is determined. If more than one table is found, a warning is issued as this is not expected.
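The "most abundant table" rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `pick_pangenome_translation_table` and the warning text are assumptions made for the example.

```python
import warnings
from collections import Counter
from typing import Iterable


def pick_pangenome_translation_table(cds_tables: Iterable[int]) -> int:
    """Return the most abundant translation table among per-CDS values.

    Hypothetical helper: the name and the warning message are assumptions,
    not taken from the PPanGGOLiN code base.
    """
    counts = Counter(cds_tables)
    if not counts:
        raise ValueError("no translation table found in the annotation files")
    if len(counts) > 1:
        # More than one table is unexpected, so warn but keep going.
        warnings.warn(
            f"Multiple translation tables found in annotation files: {dict(counts)}. "
            "Using the most abundant one."
        )
    table, _ = counts.most_common(1)[0]
    return table
```

With per-CDS values such as `[11, 11, 4, 11]`, the helper returns `11` and emits one warning. With a single distinct value it returns that value silently.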
New behavior:
Storage:
After the annotation step, the translation table used is stored in:
- `pangenome.status["translation_table"]` for easy reuse in other steps
- `pangenome.parameters["annotate"]["translation_table"]`
Extra info is added to parameters (prefixed with `#` so parameters can still be used as a config file):
- `# is_translation_table_user_specified`: whether user explicitly set the value
- `# translation_table_from_annotation_files`: the value parsed from annotation files (if applicable)
Example:
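A hypothetical sketch of the comment-prefix convention, written in Python. The function names, the indentation, and the sample values are assumptions for illustration only. They do not reproduce PPanGGOLiN's actual parameter writer or its real output.

```python
from typing import Optional


def render_annotate_parameters(
    table: int, user_specified: bool, table_from_files: Optional[int] = None
) -> str:
    """Render an 'annotate' parameter block with '#'-prefixed extra info.

    Hypothetical layout: only the key names come from the PR description.
    """
    lines = [
        "annotate:",
        f"    translation_table: {table}",
        f"    # is_translation_table_user_specified: {user_specified}",
    ]
    if table_from_files is not None:
        lines.append(f"    # translation_table_from_annotation_files: {table_from_files}")
    return "\n".join(lines)


def strip_comment_lines(text: str) -> str:
    """Drop '#'-prefixed lines, leaving only what a config reader would see."""
    return "\n".join(
        line for line in text.splitlines() if not line.lstrip().startswith("#")
    )
```

Because the extra lines start with `#`, removing them leaves only the `translation_table` entry, which is the reason the block can still serve as a config file.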
Commands affected:
For commands that need translation table information, the following priority is used:
- `pangenome.status["translation_table"]` (defined during annotation)
Commands now using the stored translation table:
Projection command:
Previously had no `--translation_table` argument and used the one found in the cluster parameters. For consistency, the argument has been added to its command line and the same treatment is applied (the value from status is used if the user does not specify one).
Updated help messages:
Updated the `--translation_table` help text across all commands to explain the new behavior and suggest using `ppanggolin info` to check the current value.
Note
Translation table and genetic code are used interchangeably in the code and mean the same thing. To prevent any breakage in the API, these terms were not homogenized.
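To illustrate how tracking explicitly set arguments and reusing the stored table could fit together, here is a minimal Python sketch built on `argparse`. It is not the code from this PR. The helper names, the default of 11, and the final fallback to the parser default are assumptions. Only the rule "use the value from status if the user did not specify one" comes from the description above.

```python
import argparse

DEFAULT_TABLE = 11  # assumed default for this sketch


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--translation_table", type=int, default=DEFAULT_TABLE)
    parser.add_argument("--cpu", type=int, default=1)
    return parser


def parse_tracking_specified(parser: argparse.ArgumentParser, argv: list) -> argparse.Namespace:
    """Parse argv and attach a `specified_args` list of explicitly set options."""
    args = parser.parse_args(argv)
    # Second pass: pre-fill every destination with a sentinel. argparse only
    # applies defaults to missing attributes, so any attribute that no longer
    # holds the sentinel was set from the command line.
    unset = object()
    probe = argparse.Namespace(**{name: unset for name in vars(args)})
    parser.parse_args(argv, namespace=probe)
    args.specified_args = sorted(
        name for name, value in vars(probe).items() if value is not unset
    )
    return args


def resolve_translation_table(args: argparse.Namespace, status: dict) -> int:
    """Pick the table: explicit user value, then stored status, then default."""
    if "translation_table" in args.specified_args:
        return args.translation_table
    stored = status.get("translation_table")
    if stored is not None:
        return stored
    return args.translation_table  # parser default (assumed fallback)
```

For instance, parsing an empty command line against a status holding table 4 resolves to 4, while passing `--translation_table 11` explicitly resolves to 11 even though 11 is also the default. Comparing the parsed value with the default could not tell these two cases apart, which is why the explicit list is useful.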