feat: implement a new validate command
#220
This base is still missing some features. Still not implemented as a CLI switch. Refer to #47.
- Supports many more features
- Simpler codebase
- Straight to the point
- Logs many more errors
- Still in development: missing some features
Missing cleanup + checking against genotype file
Further Tasks:
- Testing
- Optimizing
- Bug-catching
mad props, @ayimany! This is a very well written PR. The code is super clean and easy to follow. I'm excited to try it out and will let you know once I finish reviewing. Thanks again for doing this! This will tremendously help many users of haptools.
…ools into impl-validate-command
since we may want the validate command to validate other kinds of files besides hap files in the future
Great work! Thanks for taking this project on, @ayimany. Your code is super thorough and well-considered.
I left some suggestions. Most of them are small things related to the logging or the tests.
In addition to the suggested changes, it might also be a good idea to write docstrings for all functions and classes. You can look at the sim_phenotype.py module for examples of how to do this. We follow the conventions of numpydoc outlined here:
https://numpydoc.readthedocs.io/en/latest/format.html#documenting-classes
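For reference, a docstring in the numpydoc style could look roughly like the sketch below. The function count_fields is a hypothetical helper made up for illustration; it is not part of this PR or of haptools.

```python
def count_fields(line: str, sep: str = "\t") -> int:
    """
    Count the fields in a single line of a delimited text file

    Parameters
    ----------
    line : str
        The raw text of the line, with or without a trailing newline
    sep : str, optional
        The field separator; tab-separated input is assumed by default

    Returns
    -------
    int
        The number of fields found in the line
    """
    # hypothetical helper: strip the newline, then split on the separator
    return len(line.rstrip("\n").split(sep))
```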
self.errc += 1

for i in range(
    HapFileValidator.KEY_HAPLOTYPE, HapFileValidator.KEY_VARIANT + 1
If you add a global variable that defines the line types (as I suggested in the comment above this one), can you also use the length of the variable instead of KEY_VARIANT here? That way, it'll be flexible if we add more line types.
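A minimal sketch of that idea, assuming a module-level tuple named LINE_TYPES (the name and the helper count_lines_by_type are hypothetical, not taken from the PR). The loop bound comes from the length of the tuple, so adding a line type only requires extending the tuple.

```python
# Hypothetical module-level constant: one entry per kind of data line
LINE_TYPES = ("H", "R", "V")


def count_lines_by_type(lines):
    """Count how many lines of each type appear, in LINE_TYPES order."""
    counts = [0 for _ in range(len(LINE_TYPES))]
    for line in lines:
        # the line type is assumed to be the first tab-separated field
        kind = line.split("\t", 1)[0]
        # iterate over the length of LINE_TYPES, not a hard-coded last key
        for i in range(len(LINE_TYPES)):
            if kind == LINE_TYPES[i]:
                counts[i] += 1
    return counts
```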
    self, var_ids: list[str], underscores_to_semicolons: bool = False
):
    ids: set[tuple[str, Line]] = set()
    for chrom, dt in self.vrids.items():
I forgot to tell you! In addition to checking whether 'Variant' IDs are in the PVAR file, we should also check that 'Repeat' IDs are in there.
We should probably create another parameter (called --repeat-pvar) to allow the user to specify a PVAR file for repeats
Also, we might need to verify that the files aren't file descriptors? Just in case someone is trying to use process substitution?
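One possible way to do the file-descriptor check, sketched with the standard library only. The helper name is_regular_file is an assumption for illustration. With process substitution the shell usually passes a path such as /dev/fd/63 that refers to a pipe, so it would not count as a regular file here.

```python
import os
import stat
from pathlib import Path


def is_regular_file(path: Path) -> bool:
    """Return True only if path exists and refers to a regular file."""
    try:
        # os.stat follows symlinks, so a link to a pipe is reported as a pipe
        mode = os.stat(path).st_mode
    except OSError:
        # missing path, or not enough permissions to stat it
        return False
    # directories, FIFOs (pipes), sockets and devices all fail this check
    return stat.S_ISREG(mode)
```

Whether the command should reject such inputs or only warn about them is a separate design decision.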
…ools into impl-validate-command
Resolves #47
Overview
This branch introduces a new sub-command meant to validate the structure of .hap files.
Usage and Documentation
It requires no new dependencies.
It can be invoked through:
haptools validate [--sorted/--not-sorted] [--genotypes <filename.pgen>] [--verbosity] <filename.hap>
Implementation Details
The code in this implementation is concentrated in the haptools/val_hapfile.py module.
The classes and functions that make up this module are the following:
HapFileIO (Class)
This class consists of a set of methods made to validate the existence and readability of the provided .hap file. It also cleans up the file prior to reading its content.
It requires a filename and an optional logger. The filename should be a .hap file.
The main method that should be used to verify if the file exists is HapFileIO::validate_existence. It performs several checks on the file; click arguments already check for existence as well.
The method that should be used to filter and read the content within the .hap file is HapFileIO::lines. It takes in a filename as a Path and a boolean sorted to determine if the file should be sorted. Among other things, it creates a Line object per line.
Sorting actually occurs whether sorted is True or not. What sorted does is remove unnecessary lines based on their positions, which would otherwise be necessary if sorted.
Line (Class)
Stores information about a line, such as its number and content, for future use.
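As a rough illustration of that idea (not the class from this PR), a line record could be sketched as a frozen dataclass. The field names number and content, the helper to_lines, and the choice to skip blank lines are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    """Hypothetical record of one line of a file."""

    number: int  # 1-based position of the line in the original file
    content: str  # text of the line, without the trailing newline


def to_lines(text: str) -> list[Line]:
    """Wrap each non-blank line of text in a Line, keeping original numbers."""
    return [
        Line(number, content)
        for number, content in enumerate(text.splitlines(), start=1)
        if content.strip()  # assumption: blank lines are dropped
    ]
```

Making the dataclass frozen keeps instances hashable, so they can be stored in sets or used inside tuples that are.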
HapFileValidator (Class)
The validator will have to use a HapFileIO as its main source for reading the file's content. This is done through a method and not the constructor (which only accepts a logger).
To load the data into the validator, use HapFileValidator::extract_and_store_content, which takes in the HapFileIO to be used and an optional boolean to determine whether the file should be sorted or not.
All of the following methods in the class test for different aspects of the .hap file.
- validate_version_declarations: Determines if the version declaration is present, repeated or invalid
- validate_column_additions: Determines if the extra column declarations are correctly formatted and well-formed. If so, they are added to the list of registered extra columns
- validate_columns_fulfill_minreqs: Validates if all columns fulfill the minimum requirements
- validate_haplotypes: Validates the haplotype row format
- validate_repeats: Validates the repeat row format
- validate_variants: Validates the variant row format
- store_ids: Stores the IDs of each haplotype, repeat and variant for future use. Should be called before any ID validation methods
- validate_variant_ids: Validates the ID presence of each variant. Each needs to be unique per haplotype and not collide with chromosome IDs
- validate_extra_fields: Makes sure that the added extra fields conform to their addition signature
- reorder_extra_fields: Parses the order[H|R|V] lines and reorders the extra fields if they are valid
- compare_haps_to_pvar: Compares the variants in the .hap file to those in the .pvar file
is_hapfile_valid (Function)
Performs all of the possible checks available in the HapFileValidator class. Returns a boolean which is True when there are no errors or warnings.
Tests
Only one case hasn't been fully tested due to OS limitations and the way they handle their file permissions: validating whether the user has enough permissions to read the .hap file.
- test_generated_haplotypes: Tests the dummy .hap generated by the haptools test suite
- test_with_empty_lines: Tests a .hap with empty lines
- test_with_out_of_header_metas_sorted: Tests a sorted .hap with meta lines out of the header
- test_with_out_of_header_metas_unsorted: Tests an unsorted .hap with meta lines out of the header
- test_with_10_extras_reordered: Tests a .hap file with 10 extra columns
- test_with_unexistent_reorders: Tests a .hap with an order[H|R|V] which mentions a non-existent extra column
- test_with_unexistent_fields: Tests a .hap with a data line that is not an H, R or V
- test_with_inadequate_version: Tests a .hap with an incorrectly formatted version
- test_with_no_version: Tests a .hap with no present version
- test_with_multiple_versions: Tests a .hap with several versions present
- test_with_inadequate_version_columns: Tests a .hap with a version column of only 2 fields
- test_with_invalid_column_addition_column_count: Tests a .hap with an extra column declaration of invalid column count
- test_with_invalid_column_addition_types: Tests a .hap with a column addition for a type which is not H, R or V
- test_with_invalid_column_addition_data_types: Tests a .hap with a column addition of unrecognized data type (not s, d or .nf)
- test_with_insufficient_columns: Tests a .hap with insufficient mandatory columns
- test_with_inconvertible_starts: Tests a .hap with start positions that can't be converted to integers
- test_with_inconvertible_ends: Tests a .hap with end positions that can't be converted to integers
- test_with_inconvertible_starts_var: Tests a .hap with start positions that can't be converted to integers in variants
- test_with_inconvertible_ends_var: Tests a .hap with end positions that can't be converted to integers in variants
- test_valhap_with_start_after_end: Tests a .hap with the start position placed after the end position
- test_is_directory: Tests a validation command with a filename that points to a directory
- test_with_variant_id_of_chromosome: Tests a .hap with a variant whose ID is the same as a chromosome ID
- test_with_hrid_of_chromosome: Tests a .hap with a haplotype or repeat with the same ID as a chromosome
- test_with_unexistent_col_in_order: Tests a .hap with an order[H|R|V] field that references a non-existent extra column name
- test_with_unassociated_haplotype: Tests a .hap with a haplotype that does not have at least one matching repeat
- test_with_unrecognizable_allele: Tests a .hap with a variant whose allele is not G, C, T or A
- test_with_duplicate_ids: Tests a .hap with duplicate IDs for H and R fields
- test_with_duplicate_vids_per_haplotype: Tests a .hap with duplicate IDs for variants with the same haplotype association
- test_with_excol_of_wrong_type: Tests a .hap with a data line which contains an extra column of d data type but receives s
- test_with_multiple_order_defs: Tests a .hap with multiple order[H|R|V] of the same type
- test_with_insufficient_excols_in_reorder: Tests a .hap with an order[H|R|V] that does not reference all extra columns
- test_with_variant_inexistent_haplotype_id: Tests a .hap with a variant that references a non-existent haplotype
- test_with_missing_variant_in_pvar: Tests a .hap along with a .pvar file which is missing an ID present in the .hap
- test_unreadable_hapfile: Passes a non-existent file to the validator
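To give a feel for what cases such as test_with_inconvertible_starts and test_valhap_with_start_after_end exercise, here is a small standalone sketch. The function check_positions and its error messages are hypothetical and do not reproduce the validator in this PR. It also assumes that only a start strictly greater than the end counts as an error.

```python
def check_positions(start: str, end: str) -> list[str]:
    """Return a list of problems found in a start/end pair of raw fields."""
    errors = []
    try:
        start_pos = int(start)
    except ValueError:
        errors.append(f"start position {start!r} is not an integer")
        start_pos = None
    try:
        end_pos = int(end)
    except ValueError:
        errors.append(f"end position {end!r} is not an integer")
        end_pos = None
    # only compare the positions when both could be converted
    if start_pos is not None and end_pos is not None and start_pos > end_pos:
        errors.append("start position is after the end position")
    return errors
```

Collecting messages instead of raising on the first failure mirrors the idea of a validator that logs every problem it finds in one pass.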
Future work
It would be wise to document the code further on.
Developing optimizations for this command would also be of great help, although we should first evaluate how frequently the command is used and how big the input files usually are in order to determine the severity of this issue.