feat: a `validate` subcommand to check whether a `.hap` file is valid #47

**Important:** Please familiarize yourself with [the `.hap` file specification](https://haptools.readthedocs.io/en/stable/formats/haplotypes.html) before reading this issue!

---

Originating from item 2 in the "Future Work" section of PR https://github.com/gymrek-lab/haptools/pull/43:

> It would be useful to have a validate command that simply validates the `.hap` file, ensuring it follows the specification. An optional parameter to this command could turn on messages about best practices.

At first, this command should reject unsorted `.hap` files, but at some point we should also add a `--no-sort` parameter to support unsorted files, since those are also technically valid input.

For each violation of the specification, it would be nice if the `validate` subcommand reported the exact line that contains the issue and, ideally, quoted the problematic part of that line as well.
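
A minimal sketch of what such a report could look like, using only the standard library. The `Violation` class and the `check_integer_positions` function are hypothetical names for illustration and are not part of haptools; the sketch assumes tab-separated lines with the start and end positions in the third and fourth columns of H, R, and V lines.

```python
from dataclasses import dataclass


@dataclass
class Violation:
    """One problem found in a .hap file (hypothetical helper, not a haptools class)."""

    line_number: int  # 1-based line number within the file
    message: str      # what is wrong
    excerpt: str      # the offending part of the line, quoted in the report

    def __str__(self):
        return f"line {self.line_number}: {self.message}: {self.excerpt!r}"


def check_integer_positions(lines):
    """Yield a Violation for each H, R, or V line with a non-integer position."""
    for number, raw in enumerate(lines, start=1):
        fields = raw.rstrip("\n").split("\t")
        # Skip header lines, unknown line types, and lines that are too short;
        # other checks would be responsible for reporting those.
        if fields[0] not in ("H", "R", "V") or len(fields) < 4:
            continue
        for name, value in (("start", fields[2]), ("end", fields[3])):
            try:
                int(value)
            except ValueError:
                yield Violation(number, f"{name} position is not an integer", value)
```

Printing each `Violation` gives one message per problem, with the line number first and the offending value quoted at the end.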

We should probably also add an optional argument that specifies the subcommand that this `.hap` file will be used as input for. That way, we can import its custom Haplotype class and acquire the expected extra field types from there.

Here are some rules it should check the `.hap` file follows:
- [x] are the line types all supported?
- [x] do H and R lines have at least four fields and do V lines have at least five?
- [x] are there any extra fields in a line besides those that have been declared?
- [x] are the lines sorted? (this check is skipped when the `--no-sort` flag is set)
- [x] can each of the values in the file be properly cast to the expected type?
    - [x] for example, can the "Start Position" and "End Position" fields be cast to integers?
    - [x] and can the values in each extra field be properly cast to the expected type?
- [x] is the start position less than the end position for each of the H, R, and V lines?
    - [x] and for each H line: Is its start position $\le$ all of the start positions of all of the V lines that belong to it?
    - [x] and for each H line: Is its end position $\ge$ all of the end positions of all of the V lines that belong to it?
- [x] are there any haplotypes with IDs that are the same as some chromosome names?
    - this is not allowed in our format because it would break the bgzip and tabix indexing
- [x] do the variant alleles contain only As, Cs, Gs, and Ts?
- [ ] do the variant IDs match those in the genotypes file? (we should check this _quickly_ without loading the genotypes if at all possible - potentially using `pysam.tabix_iterator`?)
    - [ ] We could make this an optional check by having it only happen when a `--genotypes` parameter specifying the path to the genotypes file is present.
    - [ ] If `pysam.tabix_iterator` works for this, then we should consider adding `read_variants()` and `read_samples()` methods to the GenotypesVCF class.
    - [ ] we should also check that the requested allele is present in the genotypes file
- [ ] are the haplotype IDs unique? An H line can never have the same ID as an R line, but an H (or R) line _can_ have the same ID as a V line
- [x] are the variant IDs within each haplotype unique?
- [x] are any fields empty (i.e., do they evaluate to the empty string)?
- [x] are any lines empty or completely blank?
- [x] do any of the variant lines refer to a haplotype that simply doesn't appear within the file?
- [x] are all haplotypes associated with at least one variant line?
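
Several of the rules above can be sketched as a single pass over the file followed by a cross-check between H and V lines. This is only an illustration under stated assumptions: `check_body` is a hypothetical function (not part of haptools), the field counts are taken to exclude the line-type column, and the column layout assumed is `H chrom start end ID` and `V haplotype start end ID allele`. The cross-check is deferred until all lines are read so that it also works on unsorted files.

```python
ALLELE_CHARS = set("ACGT")
# Assumption: minimum number of fields after the line-type column
MIN_FIELDS = {"H": 4, "R": 4, "V": 5}


def check_body(lines):
    """Return a list of (line number, message) pairs for a few of the rules."""
    problems = []
    haplotypes = {}  # haplotype ID -> (line number, start, end)
    variants = []    # (line number, haplotype ID, start, end)
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith("#"):
            continue  # header lines are checked separately
        if not line.strip():
            problems.append((number, "line is empty or blank"))
            continue
        kind, *fields = line.split("\t")
        if kind not in MIN_FIELDS:
            problems.append((number, f"unsupported line type {kind!r}"))
            continue
        if len(fields) < MIN_FIELDS[kind]:
            problems.append((number, "too few fields"))
            continue
        if "" in fields:
            problems.append((number, "empty field"))
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            problems.append((number, "start or end position is not an integer"))
            continue
        if start >= end:
            problems.append((number, "start position is not less than end position"))
        if kind == "H":
            haplotypes[fields[3]] = (number, start, end)
        elif kind == "V":
            variants.append((number, fields[0], start, end))
            if not set(fields[4]) <= ALLELE_CHARS:
                problems.append((number, f"allele {fields[4]!r} is not made of A, C, G, T"))
    # Cross-check V lines against H lines only after the whole file is read
    used = set()
    for number, hap_id, start, end in variants:
        if hap_id not in haplotypes:
            problems.append((number, f"variant refers to unknown haplotype {hap_id!r}"))
            continue
        used.add(hap_id)
        _, hap_start, hap_end = haplotypes[hap_id]
        if start < hap_start or end > hap_end:
            problems.append((number, "variant lies outside of its haplotype"))
    for hap_id, (number, _, _) in haplotypes.items():
        if hap_id not in used:
            problems.append((number, f"haplotype {hap_id!r} has no variants"))
    return problems
```

Checks that are not shown here (sorting, extra field types, ID uniqueness, and the genotypes comparison) would need the header declarations or additional inputs.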

And here are some rules for the header of the `.hap` file:
- [x] is there a version declared in the header of the file? (this is considered a best practice)
    - [x] also, is the version string in a valid format?
    - [x] and is the version up to date? (we can check this by using `Haplotypes.check_version()`)
- [x] are all metadata names recognized? To check this, we should look for any line with a `#`, followed by a tab, followed by a metadata name, and confirm that the name is one of the recognized ones: currently "version", "orderH", "orderV", and "orderR"
- [x] for "orderH", "orderV", and "orderR" metadata, are all of the extra fields declared there also declared in the header later on?
- [x] unless `--no-sort` is specified, do the metadata lines appear before the extra field declarations?
    - this is considered a best practice
- [x] are extra fields properly declared?
    - [x] is the declared type a valid [python format specification](https://docs.python.org/3/library/string.html#format-specification-mini-language)? at the moment, we just support 's', 'd', and 'f'
    - [x] do the extra lines in the header have all of the required fields?
    - [x] is the order of the extra fields properly declared within a metadata line? (this is considered a best practice)
- [x] are there any extra field declarations for unrecognized line types (i.e., ones that aren't H, R, or V)?
    - these are technically ok, but it is considered best practice to exclude them
    - basically, are there any lines with a `#` followed immediately by a symbol other than "H", "R", or "V"?
- [ ] are any header lines duplicated?
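
The header rules can be sketched in the same style. Again, `check_header` is a hypothetical function rather than haptools code, and it rests on a few assumptions that should be checked against the specification: that version strings look like `0.2.0`, that a format is an optional precision such as `.2` followed by a type character, and that the name and type are the required fields of an extra field declaration. It does not call `Haplotypes.check_version()`, so the "up to date" rule is not covered.

```python
import re

KNOWN_METADATA = {"version", "orderH", "orderV", "orderR"}
SUPPORTED_TYPES = {"s", "d", "f"}
# Assumption: versions are three dot-separated integers, e.g. "0.2.0"
VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+")
# Assumption: an optional precision (e.g. ".2") followed by one type character
FORMAT_PATTERN = re.compile(r"(\.\d+)?([a-zA-Z])")


def check_header(lines):
    """Return a list of (line number or None, message) pairs for the header."""
    problems = []
    declared = {"H": set(), "R": set(), "V": set()}  # extra field names per line type
    ordered = {}  # line type -> (line number, field names listed in order*)
    has_version = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.startswith("#"):
            continue
        tag, *fields = line.split("\t")
        if tag == "#":  # metadata line: "#", a tab, then the metadata name
            name = fields[0] if fields else ""
            if name not in KNOWN_METADATA:
                problems.append((number, f"unrecognized metadata name {name!r}"))
            elif name == "version":
                has_version = True
                if len(fields) < 2 or not VERSION_PATTERN.fullmatch(fields[1]):
                    problems.append((number, "version string is not in a valid format"))
            else:  # orderH, orderV, or orderR
                ordered[name[-1]] = (number, fields[1:])
        elif tag[1:] in declared:  # extra field declaration: #H, #R, or #V
            if len(fields) < 2:
                problems.append((number, "declaration is missing its name or type"))
                continue
            match = FORMAT_PATTERN.fullmatch(fields[1])
            if match is None or match.group(2) not in SUPPORTED_TYPES:
                problems.append((number, f"unsupported format {fields[1]!r}"))
            declared[tag[1:]].add(fields[0])
        else:  # "#" followed immediately by something other than H, R, or V
            problems.append((number, f"declaration for unrecognized line type {tag[1:]!r}"))
    # Every field named in an order* line must also be declared somewhere
    for kind, (number, names) in ordered.items():
        for name in names:
            if name not in declared[kind]:
                problems.append((number, f"order{kind} lists undeclared field {name!r}"))
    if not has_version:
        problems.append((None, "no version declared in the header"))
    return problems
```

In a real implementation, the best-practice findings (a missing version, declarations for unrecognized line types) would probably be reported as warnings rather than errors.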
