feat: implement a new validate command #220

Open: wants to merge 49 commits into base: main

Changes from 22 commits

Commits (49)
2b9f708
Create base for hapfile validation
Jul 14, 2023
6bd77fc
Solidify and improve validator base
Jul 20, 2023
79698fc
Raise error on type IDs which match chromosome IDs
Jul 20, 2023
ffc7935
Report errors for column additions on non-existent types
Jul 20, 2023
beabc98
Recognize extra columns and cast validation for extra column types
Jul 20, 2023
6b45589
Corrected bug where float values were unrecognized
Jul 20, 2023
2233b09
Allow parsing & reordering of extra columns
Jul 21, 2023
0cb6586
Append feature to cli
Jul 24, 2023
ba5f9d8
Fix bug where the validator would break if no repeats were provided
Jul 24, 2023
af7dbb5
Complete first working instance of the validator.
Jul 24, 2023
4c42e69
Add a pair of test files to the hapfile directory. Corrected a hapfile.
Jul 24, 2023
2693470
Format files with Black
Jul 24, 2023
79a845b
Create test for the validate command
Jul 26, 2023
76176c5
Create tests for validation command
Jul 26, 2023
c6f1f56
Remove debugging print statements
Jul 27, 2023
85a298b
fix pgenlib import issue
aryarm Jul 27, 2023
777114e
Add doc base for the valhap command
Jul 27, 2023
81d38f8
Merge branch 'impl-validate-command' of github.com:CAST-genomics/hapt…
Jul 27, 2023
747b43b
Clean up docs. Add further information.
Jul 27, 2023
f28b902
Fix indentation
Jul 27, 2023
dbe6d87
Fix format.
Jul 27, 2023
32468f9
rename from val_hapfile to to 'validate'
aryarm Jul 30, 2023
390eaeb
implement some suggestions from PR
aryarm Sep 14, 2023
8b324ac
Use relative import for logging module
aryarm Sep 14, 2023
57c81f8
accept pvar instead of pgen
aryarm Sep 15, 2023
61ac08c
change up logging to be silent by default when called from command line
aryarm Sep 16, 2023
c4ecaec
reformat test_validate.py for concision
aryarm Sep 16, 2023
e7efcf6
Merge branch 'main' into impl-validate-command
aryarm Sep 17, 2023
6bbee4b
rename test data dir and remove valhap prefix
aryarm Sep 17, 2023
4b95834
remove test code import prefix
aryarm Sep 17, 2023
1290c7b
Merge branch 'impl-validate-command' of github.com:CAST-genomics/hapt…
aryarm Sep 17, 2023
5614004
add tests for command line and add non zero exit code
aryarm Sep 17, 2023
474f9fc
clarify how sorting works
aryarm Sep 17, 2023
6b7942c
change behavior of sorting parameter
aryarm Sep 18, 2023
d16f7bd
do not skip pytest for pgenlib
aryarm Oct 1, 2023
9234bef
Merge branch 'main' into impl-validate-command
aryarm Oct 1, 2023
fc71adf
refmt with black
aryarm Oct 2, 2023
04ab0e3
Merge branch 'main' into impl-validate-command
aryarm Oct 14, 2023
6065862
remove extra files outside of test dir
aryarm Oct 14, 2023
50d5cb3
rename valhap test dir to validate
aryarm Oct 14, 2023
46ac080
add descriptions to all test commands
aryarm Oct 14, 2023
3db4522
fail validation if any lines are blank
aryarm Oct 14, 2023
6288b8d
add test for whitespace
aryarm Oct 14, 2023
0b0932c
add test for indexed hap file
aryarm Oct 14, 2023
6d81e26
start adding docstrings
aryarm Oct 14, 2023
189eed0
remove max_variants which we will instead infer from the hap file
aryarm Oct 14, 2023
c042b82
start HapFileValidator class commenting
aryarm Oct 29, 2023
3558764
add more comments to validate command
aryarm Nov 11, 2023
d91b2a3
document metadata line handling code
aryarm Feb 23, 2024
125 changes: 125 additions & 0 deletions docs/commands/validate.rst
@@ -0,0 +1,125 @@
.. _commands-validate:


validate
========

Validate the structure of a ``.hap`` file.

When a ``.hap`` file contains any errors, they will be logged accordingly.

Optionally, the variant IDs present in the ``.hap`` file can be compared against those of a ``.pgen`` file.

Usage
~~~~~
.. code-block:: bash

    haptools validate \
    --sort \
    --genotypes PATH \
    --verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
    HAPFILE

Examples
~~~~~~~~
.. code-block:: bash

    haptools validate tests/data/hapfiles/basic.hap

This outputs a message reporting the number of errors and warnings.

.. code-block::

    [ INFO] Completed HapFile validation with 0 errors and 0 warnings.

Any warnings and errors that are found will also be logged.

.. code-block:: bash

    haptools validate tests/data/hapfiles/valhap_with_no_version.hap

.. code-block::

    [ WARNING] No version declaration found. Assuming to use the latest version.
    [ INFO] Completed HapFile validation with 0 errors and 1 warnings.
    [ WARNING] Found several warnings and / or errors in the hapfile
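The version check behind the first warning can be pictured with a minimal sketch. The helper name ``find_version`` and the regular expression are assumptions made for illustration; this is not the validator's actual code.

```python
import re

# Hypothetical pattern for a "# version X.Y.Z" metadata line (an assumption,
# based only on the example .hap files shown in this document).
VERSION_LINE = re.compile(r"^#\s+version\s+(\d+\.\d+\.\d+)\s*$")


def find_version(lines):
    """Return the declared version string, or None if no line declares one."""
    for line in lines:
        match = VERSION_LINE.match(line)
        if match:
            return match.group(1)
    return None
```

A caller could log a warning and fall back to the latest version whenever ``find_version`` returns ``None``.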

Use ``--no-sort`` to avoid sorting the file.
In this mode, all out-of-order lines are removed, such as metadata lines that appear outside of the header.

.. code-block:: bash

    haptools validate --no-sort tests/data/hapfiles/valhap_with_out_of_header_metas.hap

This will turn:

.. code-block::

    # orderH ancestry beta
    # version 0.2.0
    #H ancestry s Local ancestry
    #H beta .2f Effect size in linear model
    #R beta .2f Effect size in linear model
    H 21 26928472 26941960 chr21.q.3365*1 ASW 0.73
    R 21 26938353 26938400 21_26938353_STR 0.45
    H 21 26938989 26941960 chr21.q.3365*10 CEU 0.30
    H 21 26938353 26938989 chr21.q.3365*11 MXL 0.49
    # This should cause an error if the file is sorted
    #V test_field s A field to test with
    V chr21.q.3365*1 26928472 26928472 21_26928472_C_A C
    V chr21.q.3365*1 26938353 26938353 21_26938353_T_C T
    V chr21.q.3365*1 26940815 26940815 21_26940815_T_C C
    V chr21.q.3365*1 26941960 26941960 21_26941960_A_G G
    V chr21.q.3365*10 26938989 26938989 21_26938989_G_A A
    V chr21.q.3365*10 26940815 26940815 21_26940815_T_C T
    V chr21.q.3365*10 26941960 26941960 21_26941960_A_G A
    V chr21.q.3365*11 26938353 26938353 21_26938353_T_C T
    V chr21.q.3365*11 26938989 26938989 21_26938989_G_A A

into:

.. code-block::

    # orderH ancestry beta
    # version 0.2.0
    #H ancestry s Local ancestry
    #H beta .2f Effect size in linear model
    #R beta .2f Effect size in linear model
    H 21 26928472 26941960 chr21.q.3365*1 ASW 0.73
    R 21 26938353 26938400 21_26938353_STR 0.45
    H 21 26938989 26941960 chr21.q.3365*10 CEU 0.30
    H 21 26938353 26938989 chr21.q.3365*11 MXL 0.49
    V chr21.q.3365*1 26928472 26928472 21_26928472_C_A C
    V chr21.q.3365*1 26938353 26938353 21_26938353_T_C T
    V chr21.q.3365*1 26940815 26940815 21_26940815_T_C C
    V chr21.q.3365*1 26941960 26941960 21_26941960_A_G G
    V chr21.q.3365*10 26938989 26938989 21_26938989_G_A A
    V chr21.q.3365*10 26940815 26940815 21_26940815_T_C T
    V chr21.q.3365*10 26941960 26941960 21_26941960_A_G A
    V chr21.q.3365*11 26938353 26938353 21_26938353_T_C T
    V chr21.q.3365*11 26938989 26938989 21_26938989_G_A A


If the previous example were sorted instead, validation would report several errors in the ``.hap`` file.
When a file is sorted, all of its metadata lines are parsed first, so the ``#V test_field`` declaration would apply to every ``V`` line and each of them would be incomplete.
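The effect of ``--no-sort`` on the example above can be approximated with a short sketch. The function ``drop_out_of_header_meta`` is a hypothetical name, and the rule it applies (the header is the leading run of ``#`` lines, and any later ``#`` line is dropped) is an assumption inferred from the example rather than the validator's actual logic.

```python
def drop_out_of_header_meta(lines):
    """Keep header metadata and data lines; drop '#' lines found after the header.

    Assumption: the header is the uninterrupted run of '#' lines at the top.
    """
    kept = []
    in_header = True
    for line in lines:
        is_meta = line.startswith("#")
        if in_header and not is_meta:
            # The first data line marks the end of the header
            in_header = False
        if is_meta and not in_header:
            # Metadata outside of the header is ignored in this mode
            continue
        kept.append(line)
    return kept
```

Applied to the first listing, this sketch would discard the comment line and the ``#V test_field`` declaration, which matches the second listing.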

As mentioned above, the ``--genotypes`` flag accepts a ``.pgen`` file against which the variant IDs are checked.
The following command checks whether every variant ID in the ``.hap`` file appears in the ``.pvar`` file associated with the ``.pgen`` file.

.. code-block:: bash

    haptools validate --genotypes tests/data/hapfiles/valhap_test_data.pgen tests/data/hapfiles/valhap_test_data.hap

.. warning::

    You must generate a ``.pvar`` from your ``.pgen`` file.
    This is done in order to avoid reading large amounts of
    data that are not relevant to the validation process.
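The variant ID comparison can be sketched as follows. Both function names are hypothetical, and the sketch assumes a tab-separated ``.pvar`` with a ``#CHROM`` header line that names an ``ID`` column, as well as ``V`` lines whose fifth field is the variant ID (as in the examples above). It is an illustration, not the code used by ``haptools validate``.

```python
def pvar_variant_ids(pvar_lines):
    """Collect variant IDs from the lines of a .pvar file (header assumed present)."""
    ids = set()
    id_col = None
    for line in pvar_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and '##' meta lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            id_col = fields.index("ID")  # locate the ID column by name
            continue
        if id_col is None:
            raise ValueError("expected a #CHROM header line before any variants")
        ids.add(fields[id_col])
    return ids


def missing_variant_ids(hap_lines, pvar_lines):
    """Return the variant IDs on V lines of a .hap file that the .pvar lacks."""
    known = pvar_variant_ids(pvar_lines)
    missing = []
    for line in hap_lines:
        fields = line.split()
        # Assumed V-line layout: V, haplotype ID, start, end, variant ID, allele
        if len(fields) >= 5 and fields[0] == "V" and fields[4] not in known:
            missing.append(fields[4])
    return missing
```

An empty result would mean that every variant ID in the ``.hap`` file was found in the ``.pvar`` file.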

Detailed Usage
~~~~~~~~~~~~~~

.. click:: haptools.__main__:main
    :prog: haptools
    :show-nested:
    :commands: validate
3 changes: 3 additions & 0 deletions docs/index.rst
@@ -20,6 +20,8 @@ Commands

* :doc:`haptools transform </commands/transform>`: Transform a set of genotypes via a list of haplotypes. Create a new VCF containing haplotypes instead of variants.

* :doc:`haptools validate </commands/validate>`: Validate the formatting of a haplotype file.

* :doc:`haptools index </commands/index>`: Sort, compress, and index our custom file format for haplotypes.

* :doc:`haptools clump </commands/clump>`: Convert variants in LD with one another into clumps.
@@ -95,6 +97,7 @@ There is an option to *Cite this repository* on the right sidebar of `the reposi
commands/simphenotype.rst
commands/karyogram.rst
commands/transform.rst
commands/validate.rst
commands/index.rst
commands/clump.rst
commands/ld.rst
44 changes: 44 additions & 0 deletions haptools/__main__.py
@@ -1,6 +1,7 @@
#!/usr/bin/env python

from __future__ import annotations
from enum import Flag
import sys
from pathlib import Path

@@ -1025,6 +1026,49 @@ def clump(
)


@main.command(short_help="Validate the structure of a .hap file")
@click.argument("filename", type=click.Path(exists=True, path_type=Path))
@click.option(
    "--sort/--no-sort",
    is_flag=True,
    default=True,
    show_default=True,
    help="Whether to sort the file (--sort) or leave it unsorted (--no-sort)",
)
@click.option(
    "--genotypes",
    type=click.Path(path_type=Path),
    default=None,
    show_default="optional .pvar file to compare against",
    help=(
        "A .pvar file containing variant IDs in order to compare them to the .hap file"
    ),
)
@click.option(
    "-v",
    "--verbosity",
    type=click.Choice(["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"]),
    default="INFO",
    show_default=True,
    help="The level of verbosity desired",
)
def validate(
    filename: Path,
    sort: bool,
    genotypes: Path | None = None,
    verbosity: str = "INFO",  # kept in sync with the click default above
):
    # import lazily so that other commands do not pay for these modules
    from .logging import getLogger
    from .validate import is_hapfile_valid

    log = getLogger(name="validate", level=verbosity)

    is_valid = is_hapfile_valid(filename, sorted=sort, logger=log, pgen=genotypes)

    if not is_valid:
        # Logger.warn is a deprecated alias of Logger.warning
        log.warning("Found several warnings and / or errors in the hapfile")


if __name__ == "__main__":
    # run the CLI if someone tries 'python -m haptools' on the command line
    main(prog_name="haptools")
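At this point in the diff, the command only logs a warning when validation fails, while a later commit in the list above mentions adding a non-zero exit code. One way that could look is sketched below. The helper ``exit_code_for`` is a hypothetical name and this is not the code from that commit.

```python
import logging


def exit_code_for(is_valid, log):
    """Translate a validation result into a process exit code (0 means success)."""
    if is_valid:
        return 0
    log.warning("Found several warnings and / or errors in the hapfile")
    return 1
```

A command-line entry point could pass the returned value to ``sys.exit`` so that shell scripts can detect a failed validation.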