feat: a validate subcommand to check whether a .hap file is valid #47

Open
@aryarm

Description

Important: Please familiarize yourself with the .hap file specification before reading this issue!


Originating from item 2 in the "Future Work" section of PR #43:

It would be useful to have a validate command that simply validates the .hap file, ensuring it follows the specification. An optional parameter to this command could turn on messages about best practices.

At first, this command should reject unsorted .hap files, but at some point we should also add a --no-sort parameter to support unsorted files, since those are also technically valid input.

For each violation of the standard, it would be nice if the validate subcommand reported the exact line that contains the issue, and ideally, it would quote the problematic part of the line, as well.
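
As a rough illustration of that kind of reporting, here is a minimal sketch that pairs each problem with its 1-based line number and a quoted excerpt. The `Violation` class and the `find_blank_or_empty` helper are hypothetical names invented for this example and are not part of haptools. The example check covers only blank lines and empty tab-separated fields.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Violation:
    """One problem found in a .hap file (hypothetical container)."""

    line_no: int                   # 1-based line number in the file
    message: str                   # human-readable description
    excerpt: Optional[str] = None  # the offending text, if we can quote it

    def __str__(self) -> str:
        text = f"line {self.line_no}: {self.message}"
        if self.excerpt is not None:
            # repr() makes tabs and other whitespace visible in the quote
            text += f" (near {self.excerpt!r})"
        return text


def find_blank_or_empty(lines: Iterable[str]) -> Iterator[Violation]:
    """Yield a Violation for every blank line and every empty field."""
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            yield Violation(line_no, "line is empty or blank")
            continue
        for col, value in enumerate(line.split("\t"), start=1):
            if value == "":
                yield Violation(line_no, f"field {col} is empty", excerpt=line)
```

Each check could then yield `Violation` objects, and the subcommand would only need to print them (or count them to pick an exit code).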

We should probably also add an optional argument that specifies the subcommand that this .hap file will be used as input for. That way, we can import its custom Haplotype class and acquire the expected extra field types from there.
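
One way this could look is sketched below. The `Haplotype` and `ExampleHaplotype` classes are stand-ins written for this example, and the `extras` attribute (a tuple of name/format pairs) is an assumed layout, so the real classes may expose their extra fields differently. The idea is only to show how a format specification ending in 's', 'd', or 'f' could be mapped to a type that values are then cast to.

```python
from typing import Callable, Dict, Tuple

# Map the final character of a format spec to a Python type.
CASTERS: Dict[str, Callable[[str], object]] = {"s": str, "d": int, "f": float}


class Haplotype:
    """Stand-in base class (hypothetical, not the haptools class)."""

    # (field name, format spec) pairs; an assumed layout for this sketch
    extras: Tuple[Tuple[str, str], ...] = ()


class ExampleHaplotype(Haplotype):
    """Stand-in for a subcommand's custom class with one extra field."""

    extras = (("beta", ".2f"),)


def expected_casters(hap_class: type) -> Dict[str, Callable[[str], object]]:
    """Return the type each extra field of hap_class should cast to."""
    return {name: CASTERS[fmt[-1]] for name, fmt in hap_class.extras}


def can_cast(value: str, caster: Callable[[str], object]) -> bool:
    """Report whether value can be converted by caster."""
    try:
        caster(value)
    except ValueError:
        return False
    return True
```

With these stand-ins, `expected_casters(ExampleHaplotype)` maps `"beta"` to `float`, and `can_cast` can be run on each extra field value in an H line.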

Here are some rules it should check the .hap file follows:

  • are the line types all supported?
  • do H and R lines have at least four fields, and do V lines have at least five?
  • are there any extra fields in a line besides those that have been declared?
  • are the lines sorted? (this check should be skipped when the --no-sort flag is set)
  • can each of the values in the file be properly cast to the expected type?
    • for example, can the "Start Position" and "End Position" fields be cast to integers?
    • and can the values in each extra field be properly cast to the expected type?
  • is the start position less than the end position for each of the H, R, and V lines?
    • and for each H line: Is its start position $\le$ all of the start positions of all of the V lines that belong to it?
    • and for each H line: Is its end position $\ge$ all of the end positions of all of the V lines that belong to it?
  • are there any haplotypes with IDs that are the same as some chromosome names?
    • this is not allowed in our format because it would break the bgzip and tabix indexing
  • do the variant alleles contain only As, Cs, Gs, and Ts?
  • do the variant IDs match those in the genotypes file? (we should check this quickly without loading the genotypes if at all possible - potentially using pysam.tabix_iterator?)
    • We could make this an optional check by having it only happen when a --genotypes parameter specifying the path to the genotypes file is present.
    • If pysam.tabix_iterator works for this, then we should consider adding read_variants() and read_samples() methods to the GenotypesVCF class.
    • we should also check that the requested allele is present in the genotypes file
  • are the haplotype IDs unique? An H line can never have the same ID as an R line, but an H (or R) line can have the same ID as a V line
  • are the variant IDs within each haplotype unique?
  • are any fields empty (i.e., do they evaluate to the empty string)?
  • are any lines empty or completely blank?
  • do any of the variant lines refer to a haplotype that simply doesn't appear within the file?
  • are all haplotypes associated with at least one variant line?
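
A few of the rules above can be sketched as follows. This is only an illustration under stated assumptions: the minimum field counts are taken to exclude the leading line-type column, the start and end positions are assumed to sit in the third and fourth columns of every line type, the haplotype ID is assumed to be the fifth column of an H line and the second column of a V line, and the allele is assumed to be the sixth column of a V line. The function names are hypothetical.

```python
VALID_ALLELE_CHARS = set("ACGT")
# Minimum number of fields AFTER the line-type column (assumption)
MIN_FIELDS = {"H": 4, "R": 4, "V": 5}


def check_line(line_no, line):
    """Per-line checks; returns a list of (line number, message) pairs."""
    fields = line.rstrip("\n").split("\t")
    line_type = fields[0]
    if line_type not in MIN_FIELDS:
        return [(line_no, f"unsupported line type {line_type!r}")]
    if len(fields) - 1 < MIN_FIELDS[line_type]:
        return [(line_no, f"too few fields for a {line_type} line")]
    problems = []
    try:
        start, end = int(fields[2]), int(fields[3])
    except ValueError:
        problems.append((line_no, "start/end cannot be cast to integers"))
    else:
        if start >= end:
            problems.append((line_no, f"start {start} is not less than end {end}"))
    if line_type == "V" and (not fields[5] or set(fields[5]) - VALID_ALLELE_CHARS):
        problems.append((line_no, f"allele {fields[5]!r} is not made of A, C, G, T"))
    return problems


def check_membership(lines):
    """Cross-line checks; assumes every line already passed check_line."""
    haps = {}      # haplotype ID -> (line number, start, end)
    variants = []  # (line number, haplotype ID, start, end)
    for line_no, line in enumerate(lines, start=1):
        f = line.rstrip("\n").split("\t")
        if f[0] == "H":
            haps[f[4]] = (line_no, int(f[2]), int(f[3]))
        elif f[0] == "V":
            variants.append((line_no, f[1], int(f[2]), int(f[3])))
    problems, used = [], set()
    for line_no, hap_id, start, end in variants:
        if hap_id not in haps:
            problems.append((line_no, f"variant refers to unknown haplotype {hap_id!r}"))
            continue
        used.add(hap_id)
        _, h_start, h_end = haps[hap_id]
        if start < h_start or end > h_end:
            problems.append((line_no, f"variant lies outside haplotype {hap_id!r}"))
    for hap_id, (line_no, _, _) in haps.items():
        if hap_id not in used:
            problems.append((line_no, f"haplotype {hap_id!r} has no variant lines"))
    return problems
```

Splitting the per-line checks from the cross-line checks keeps the first pass streaming-friendly, while the second pass only needs the IDs and positions.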

And here are some rules for the header of the .hap file:

  • is there a version declared in the header of the file? (this is considered a best practice)
    • also, is the version string in a valid format?
    • and is the version up to date? (we can check this by using Haplotypes.check_version())
  • are all metadata names recognized? To check this, we should look for lines that start with a #, followed by a tab, followed by a metadata name. The currently recognized names are "version", "orderH", "orderV", and "orderR".
  • for "orderH", "orderV", and "orderR" metadata, are all of the extra fields declared there also declared in the header later on?
  • unless --no-sort is specified, do the metadata lines appear before the extra field declarations?
    • this is considered a best practice
  • are the extra fields properly declared?
    • is the declared type a valid python format specification? at the moment, we just support 's', 'd', and 'f'
    • do the extra lines in the header have all of the required fields?
    • is the order of the extra fields properly declared within a metadata line? (this is considered a best practice)
  • are there any extra field declarations for unrecognized line types (i.e., ones that aren't H, R, or V)?
    • these are technically ok, but it is considered best practice to exclude them
    • basically, are there any lines with a # followed immediately by a symbol other than "H", "R", or "V"?
  • are any header lines duplicated?
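
Some of the header rules above could be checked along these lines. This is a sketch, not a definitive implementation: it assumes a metadata line looks like `#<TAB>name<TAB>value...` and an extra field declaration looks like `#H<TAB>name<TAB>format<TAB>description`, and it treats the last character of the format as the type. The `check_header` function is a hypothetical name, and a line number of `None` marks a file-level problem.

```python
RECOGNIZED_METADATA = {"version", "orderH", "orderV", "orderR"}
SUPPORTED_TYPES = {"s", "d", "f"}
LINE_TYPES = {"H", "R", "V"}


def check_header(lines):
    """Return (line number, message) pairs for problems in header lines."""
    problems = []
    seen = set()
    has_version = False
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.startswith("#"):
            continue  # not a header line
        if line in seen:
            problems.append((line_no, "duplicated header line"))
            continue
        seen.add(line)
        fields = line.split("\t")
        if fields[0] == "#":
            # metadata line: "#", then a tab, then the metadata name
            name = fields[1] if len(fields) > 1 else ""
            if name not in RECOGNIZED_METADATA:
                problems.append((line_no, f"unrecognized metadata name {name!r}"))
            elif name == "version":
                has_version = True
        elif fields[0][1:] in LINE_TYPES:
            # extra field declaration (assumed layout, see the lead-in)
            if len(fields) < 4:
                problems.append((line_no, "declaration is missing required fields"))
            elif fields[2][-1:] not in SUPPORTED_TYPES:
                problems.append((line_no, f"unsupported type in format {fields[2]!r}"))
        else:
            # technically allowed, but discouraged as a best practice
            problems.append((line_no, "declaration for an unrecognized line type"))
    if not has_version:
        problems.append((None, "no version declared (best practice)"))
    return problems
```

Best-practice findings are mixed in with hard errors here for brevity; a real command would probably separate them so that the optional best-practice messages can be switched on and off.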

Labels: enhancement (New feature or request)