Add per-variant constraint utility functions#824
Conversation
…me way as autosomes
…domains", "uniprot_isoform"
… into jg/add_functions_for_per_base_constraint
… into jg/add_functions_for_per_base_constraint
Resolve both-added conflict in init_scripts/vep115-init.sh by keeping the branch version with clean loftee setup and simplified VEP config. Co-authored-by: Cursor <cursoragent@cursor.com>
… exclude both "Y_PAR" and "PAR_Y" versions.
…ing GERP cutoffs - Introduced `build_constraint_consequence_groups` to create constraint groups based on consequence and LoF modifier expressions. - Added `calculate_gerp_cutoffs` to compute GERP score cutoffs at specified percentile thresholds. - Enhanced `print_global_struct` for better visualization of Hail global structs. - Implemented `convert_multi_array_to_array_of_structs` to combine parallel array fields into a single array of structs. - Updated `mane_select_over_canonical_filter_expr` to select MANE Select transcripts with a fallback to canonical transcripts.
…nstraint utilities - Simplified the `single_variant_count_expr` function by removing an unnecessary variable. - Introduced a new test suite for the constraint utility module, covering functions like `oe_confidence_interval`, `calculate_raw_z_score`, and `calculate_gerp_cutoffs`. - Added tests for the `mane_select_over_canonical_filter_expr` to ensure correct transcript selection behavior. - Implemented tests for utility functions in `file_utils`, including `print_global_struct` and `convert_multi_array_to_array_of_structs`.
- Introduced a new `CLAUDE.md` file detailing the gnomad_methods project, including an overview, package structure, code style guidelines, and best practices for Hail. - Added default fields for summing expected variants and GENCODE annotations in `constraint.py`. - Refactored `single_variant_count_expr` to `variant_observed_expr` for clarity and consistency, with updated tests reflecting this change. - Enhanced test coverage for the new `variant_observed_expr` and related functions in the constraint utilities.
…ities - Renamed `single_variant_observed_and_possible_expr` to `variant_observed_and_possible_expr` for clarity. - Updated `weighted_build_sum_agg_struct` to `weighted_sum_agg_expr` to improve consistency in naming. - Added comprehensive tests for the new `variant_observed_and_possible_expr` function, covering various scenarios including observed and unobserved variants, possible variants with and without adjustments, and filtering by maximum allele frequency.
- Renamed `compute_oe_upper_percentile_thresholds` to `compute_percentile_thresholds` for clarity and consistency. - Introduced `annotate_bins_by_threshold` function to facilitate threshold-based binning. - Updated documentation to differentiate between rank-based and threshold-based binning methods. - Adjusted tests to reflect the new function names and ensure comprehensive coverage for the updated functionality.
- Introduced a section in CLAUDE.md to encourage developers to document useful information discovered during development, including gotchas, API behavior, and schema quirks. - Emphasized the importance of concise additions placed in appropriate sections to aid future developers.
- Introduced a docstring in `__init__.py` to describe the gnomAD utilities and resources package.
- Updated logging in `_resolve_annotation_expr` to use formatted strings for better readability. - Added error handling in `_oe_ci_gamma` to check for the availability of `hl.qgamma`, raising a RuntimeError if not present, ensuring compatibility with Hail versions.
- Added a skip condition to the `test_gamma_returns_lower_and_upper` and `test_gamma_and_poisson_give_similar_results` methods in `test_constraint.py` to ensure compatibility with Hail versions that do not support `hl.qgamma`.
…ove rank annotation handling - Updated the `rank_array_element_metrics` function to return the table with its original key restored after ranking. - Enhanced the rank annotation process to use `or_missing` for unranked rows, ensuring correct typing without manual struct construction. - Simplified the rank lookup and annotation logic for better readability and maintainability.
…ix support - Added a `prefix` parameter to `rank_and_assign_bins` to allow customization of the output field names for ranks and bins. - Updated the `rank_array_element_metrics` function to accept a `rank_field_prefix` parameter, passing it through to `rank_and_assign_bins` for consistent naming in rank structs. - Adjusted documentation to reflect the new parameters and their default values.
mike-w-wilson
left a comment
There was a problem hiding this comment.
Comments are mostly doc and style/ responses to TODOs. I think CLAUDE deserves it's own PR considering the style changes it brings in and we should have general discussion/input from all main contributors for it.
| ", ".join(grouping.keys()), | ||
| ) | ||
|
|
||
| if max_af: |
There was a problem hiding this comment.
This breaks at AF 0.0, I kow we didnt touch it but since the CLAUDE.md specifically calls it out we should update
| """ | ||
| if singleton: | ||
| return hl.int(freq_expr[i].AC == 1) | ||
| elif max_af: |
There was a problem hiding this comment.
Also update because of the max_af of 0.0 silent failure
|
|
||
| The calibration model expression is a struct with the following fields: | ||
|
|
||
| - genomic_region: The genomic region of the variant ("autosome_or_par", |
There was a problem hiding this comment.
this field is not added by default -- it would need to be in that additional_grouping_expr and then would be nested inside of the model_group
| equal to 'upper_cov_cutoff' (if provided). The variant is assigned to the low | ||
| coverage model if `skip_coverage_model` is False and the exome coverage is | ||
| greater than 'low_cov_cutoff' (if provided) and less than 'high_cov_cutoff'. | ||
| - cpg: Whether the variant is a CpG (`cpg_expr`). |
There was a problem hiding this comment.
This is nested inside of the model_group struct
| hl.agg.sum(high_cov_ht.observed_variants) | ||
| / hl.agg.sum(high_cov_ht.possible_variants * high_cov_ht.mu_snp) | ||
| autosome_or_par_expr = ( | ||
| ht.build_model.model_group.genomic_region == "autosome_or_par" |
There was a problem hiding this comment.
genomic_region doesnt exist in the default calibration_model_group_expr if model_group_expr is None
| assert rows[9].bins.rank == 9 | ||
|
|
||
|
|
||
| class TestComputeOeUpperPercentileThresholds: |
There was a problem hiding this comment.
| class TestComputeOeUpperPercentileThresholds: | |
| class TestComputePercentileThresholds: |
| assert len(tied_thresh_bins) == 1, "Threshold-based should not split ties" | ||
|
|
||
|
|
||
| class TestSingleVariantCountExpr: |
There was a problem hiding this comment.
| class TestSingleVariantCountExpr: | |
| class TestVariantObesrvedExpr: |
| assert result.observed_variants == [0, 1] | ||
|
|
||
|
|
||
| class TestGetCountsAggExpr: |
There was a problem hiding this comment.
| class TestGetCountsAggExpr: | |
| class TestCountsAggExpr: |
| assert result.variant_count == 0 | ||
|
|
||
|
|
||
| class TestWeightedAggSumExpr: |
There was a problem hiding this comment.
| class TestWeightedAggSumExpr: | |
| class TestWeightedSumAggExpr: |
| @@ -0,0 +1,230 @@ | |||
| # gnomad_methods Project Reference | |||
There was a problem hiding this comment.
Lets separate this out of this constraint PR and create different CLAUDE PR
| @@ -0,0 +1 @@ | |||
| """gnomAD utilities and resources package.""" | |||
There was a problem hiding this comment.
didn't we exclude this file before to not interfere with pypi?
| :return: The resolved Hail expression. | ||
| """ | ||
| if expr is None and (t is None or annotation_name is None): | ||
| raise ValueError("Either t and annotation_name or expr must be provided.") |
There was a problem hiding this comment.
| raise ValueError("Either t and annotation_name or expr must be provided.") | |
| raise ValueError("Either 't' and 'annotation_name' or 'expr' must be provided.") |
| ) | ||
|
|
||
|
|
||
| def _resolve_annotation_expr( |
There was a problem hiding this comment.
does this not have a test because it's a helper function?
| raise ValueError("Either ht or freq_expr must be provided.") | ||
|
|
||
| if max_af is not None or singleton: | ||
| freq_expr = _resolve_annotation_expr(ht, "freq", freq_expr, "freq_expr") |
There was a problem hiding this comment.
why only set freq_expr if max_af or singleton are defined?
| class TestSingleVariantCountExpr: | ||
| """Test the variant_observed_expr function.""" | ||
|
|
||
| def test_ac_positive_counts_as_one(self): |
There was a problem hiding this comment.
should there also be a a test for the count_missing param?
| return ht.annotate(**{field_name: hl.struct(**granularities_expr)}) | ||
|
|
||
|
|
||
| def rank_array_element_metrics( |
There was a problem hiding this comment.
currently no test for this in test/utils/test_constraint.py
| return hl.agg.group_by(filter_expr, agg_expr).get(True, hl.missing(agg_expr.dtype)) | ||
|
|
||
|
|
||
| def apply_plateau_models( |
There was a problem hiding this comment.
currently no test for this in test/utils/test_constraint.py
| return _apply_model(plateau_models_expr) | ||
|
|
||
|
|
||
| def coverage_correction_expr( |
There was a problem hiding this comment.
currently no test for this in test/utils/test_constraint.py
| ) | ||
|
|
||
|
|
||
| def apply_models( |
There was a problem hiding this comment.
currently no test for this in test/utils/test_constraint.py
| return apply_expr | ||
|
|
||
|
|
||
| def aggregate_constraint_metrics_expr( |
There was a problem hiding this comment.
currently no test for this in test/utils/test_constraint.py
Summary
Adds utility functions to
gnomad.utils.constraintto support the per-variant constraint pipeline in gnomad-constraint. These were previously defined in the downstreamgnomad_constraintpackage and are being promoted to the shared library for reusability.Files changed
gnomad/utils/constraint.py— new and refactored functions (see below)gnomad/utils/file_utils.py—print_global_struct,convert_multi_array_to_array_of_structstests/utils/test_constraint.py— new test file (1,250 lines)tests/utils/test_file_utils.py— new test fileNew functions in
constraint.pyVariant counting and observation:
variant_observed_expr— returns 0/1 for whether a variant meets frequency criteria (AC, AF, singleton)variant_observed_and_possible_expr— per-variant observed and possible counts from a frequency arraycounts_agg_expr— aggregation expression for variant and singleton counts across rowsweighted_sum_agg_expr— weighted aggregate sum supporting scalar and array expressionscount_observed_and_possible_by_group— group-by aggregation of observed/possible counts by context, ref, alt, and additional groupings; replaces the counting portion ofcount_variants_by_groupwith a simpler interface suited to the per-variant pipelineModel application:
calibration_model_group_expr— builds the high/low coverage grouping annotation used by plateau and coverage modelsapply_plateau_models— applies plateau linear models to mutation rates to get predicted proportion observedcoverage_correction_expr— computes coverage correction factor from the coverage modelapply_models— top-level function that chains plateau model application and coverage correction; replacescompute_expected_variantswith a per-variant approach (annotates each variant) rather than returning aggregation expressionsAggregation and grouping:
build_constraint_consequence_groups— builds constraint groups from consequence annotation and LoF modifieraggregate_constraint_metrics_expr— sum aggregation expressions for constraint fields (mu, observed, expected, etc.); replacesoe_aggregation_exprwith a more general interface that works on arbitrary field lists_build_sum_agg_struct— helper for building sum aggregation structs_resolve_annotation_expr— helper to resolve an explicit expression or fall back to a named field on a TableRanking and binning:
rank_and_assign_bins— ranks values and assigns percentile binscompute_percentile_thresholds— computes metric values at percentile boundariesannotate_bins_by_threshold— assigns bins based on precomputed thresholdsrank_array_element_metrics— ranks elements within array fields across rowsOther:
calculate_gerp_cutoffs— computes GERP score percentile cutoffs from a context TableRefactored functions in
constraint.pyoe_confidence_interval— split into_oe_ci_gamma(new, useshl.qgamma) and_oe_ci_discretized_poisson(extracted from the original implementation); addedmethodparameter to select between themcalculate_raw_z_score— changed return type fromStructExpressiontoFloat64ExpressionNew functions in
file_utils.pyprint_global_struct— pretty-prints a Hail Table's global struct with nested indentationconvert_multi_array_to_array_of_structs— zips parallel array fields into a single array of structsTest plan
pytest tests/)test_constraint.pyandtest_file_utils.pytests pass