Skip to content

Add per-variant constraint utility functions#824

Open
jkgoodrich wants to merge 28 commits intomainfrom
jg/add_functions_for_per_base_constraint
Open

Add per-variant constraint utility functions#824
jkgoodrich wants to merge 28 commits intomainfrom
jg/add_functions_for_per_base_constraint

Conversation

@jkgoodrich
Copy link
Copy Markdown
Contributor

@jkgoodrich jkgoodrich commented Mar 18, 2026

Summary

Adds utility functions to gnomad.utils.constraint to support the per-variant constraint pipeline in gnomad-constraint. These were previously defined in the downstream gnomad_constraint package and are being promoted to the shared library for reusability.

Files changed

  • gnomad/utils/constraint.py — new and refactored functions (see below)
  • gnomad/utils/file_utils.pyprint_global_struct, convert_multi_array_to_array_of_structs
  • tests/utils/test_constraint.py — new test file (1,250 lines)
  • tests/utils/test_file_utils.py — new test file

New functions in constraint.py

Variant counting and observation:

  • variant_observed_expr — returns 0/1 for whether a variant meets frequency criteria (AC, AF, singleton)
  • variant_observed_and_possible_expr — per-variant observed and possible counts from a frequency array
  • counts_agg_expr — aggregation expression for variant and singleton counts across rows
  • weighted_sum_agg_expr — weighted aggregate sum supporting scalar and array expressions
  • count_observed_and_possible_by_group — group-by aggregation of observed/possible counts by context, ref, alt, and additional groupings; replaces the counting portion of count_variants_by_group with a simpler interface suited to the per-variant pipeline

Model application:

  • calibration_model_group_expr — builds the high/low coverage grouping annotation used by plateau and coverage models
  • apply_plateau_models — applies plateau linear models to mutation rates to get predicted proportion observed
  • coverage_correction_expr — computes coverage correction factor from the coverage model
  • apply_models — top-level function that chains plateau model application and coverage correction; replaces compute_expected_variants with a per-variant approach (annotates each variant) rather than returning aggregation expressions

Aggregation and grouping:

  • build_constraint_consequence_groups — builds constraint groups from consequence annotation and LoF modifier
  • aggregate_constraint_metrics_expr — sum aggregation expressions for constraint fields (mu, observed, expected, etc.); replaces oe_aggregation_expr with a more general interface that works on arbitrary field lists
  • _build_sum_agg_struct — helper for building sum aggregation structs
  • _resolve_annotation_expr — helper to resolve an explicit expression or fall back to a named field on a Table

Ranking and binning:

  • rank_and_assign_bins — ranks values and assigns percentile bins
  • compute_percentile_thresholds — computes metric values at percentile boundaries
  • annotate_bins_by_threshold — assigns bins based on precomputed thresholds
  • rank_array_element_metrics — ranks elements within array fields across rows

Other:

  • calculate_gerp_cutoffs — computes GERP score percentile cutoffs from a context Table

Refactored functions in constraint.py

  • oe_confidence_interval — split into _oe_ci_gamma (new, uses hl.qgamma) and _oe_ci_discretized_poisson (extracted from the original implementation); added method parameter to select between them
  • calculate_raw_z_score — changed return type from StructExpression to Float64Expression

New functions in file_utils.py

  • print_global_struct — pretty-prints a Hail Table's global struct with nested indentation
  • convert_multi_array_to_array_of_structs — zips parallel array fields into a single array of structs

Test plan

  • Verify existing tests pass (pytest tests/)
  • Verify new test_constraint.py and test_file_utils.py tests pass
  • Validated against full v4.1.1 constraint pipeline run (prepare-context through aggregate-by-constraint-groups)

jkgoodrich and others added 22 commits February 7, 2025 07:59
… into jg/add_functions_for_per_base_constraint
… into jg/add_functions_for_per_base_constraint
Resolve both-added conflict in init_scripts/vep115-init.sh by keeping
the branch version with clean loftee setup and simplified VEP config.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ing GERP cutoffs

- Introduced `build_constraint_consequence_groups` to create constraint groups based on consequence and LoF modifier expressions.
- Added `calculate_gerp_cutoffs` to compute GERP score cutoffs at specified percentile thresholds.
- Enhanced `print_global_struct` for better visualization of Hail global structs.
- Implemented `convert_multi_array_to_array_of_structs` to combine parallel array fields into a single array of structs.
- Updated `mane_select_over_canonical_filter_expr` to select MANE Select transcripts with a fallback to canonical transcripts.
…nstraint utilities

- Simplified the `single_variant_count_expr` function by removing an unnecessary variable.
- Introduced a new test suite for the constraint utility module, covering functions like `oe_confidence_interval`, `calculate_raw_z_score`, and `calculate_gerp_cutoffs`.
- Added tests for the `mane_select_over_canonical_filter_expr` to ensure correct transcript selection behavior.
- Implemented tests for utility functions in `file_utils`, including `print_global_struct` and `convert_multi_array_to_array_of_structs`.
- Introduced a new `CLAUDE.md` file detailing the gnomad_methods project, including an overview, package structure, code style guidelines, and best practices for Hail.
- Added default fields for summing expected variants and GENCODE annotations in `constraint.py`.
- Refactored `single_variant_count_expr` to `variant_observed_expr` for clarity and consistency, with updated tests reflecting this change.
- Enhanced test coverage for the new `variant_observed_expr` and related functions in the constraint utilities.
…ities

- Renamed `single_variant_observed_and_possible_expr` to `variant_observed_and_possible_expr` for clarity.
- Updated `weighted_build_sum_agg_struct` to `weighted_sum_agg_expr` to improve consistency in naming.
- Added comprehensive tests for the new `variant_observed_and_possible_expr` function, covering various scenarios including observed and unobserved variants, possible variants with and without adjustments, and filtering by maximum allele frequency.
- Renamed `compute_oe_upper_percentile_thresholds` to `compute_percentile_thresholds` for clarity and consistency.
- Introduced `annotate_bins_by_threshold` function to facilitate threshold-based binning.
- Updated documentation to differentiate between rank-based and threshold-based binning methods.
- Adjusted tests to reflect the new function names and ensure comprehensive coverage for the updated functionality.
- Introduced a section in CLAUDE.md to encourage developers to document useful information discovered during development, including gotchas, API behavior, and schema quirks.
- Emphasized the importance of concise additions placed in appropriate sections to aid future developers.
@jkgoodrich jkgoodrich requested a review from a team as a code owner March 18, 2026 19:44
@jkgoodrich jkgoodrich self-assigned this Mar 18, 2026
- Introduced a docstring in `__init__.py` to describe the gnomAD utilities and resources package.
- Updated logging in `_resolve_annotation_expr` to use formatted strings for better readability.
- Added error handling in `_oe_ci_gamma` to check for the availability of `hl.qgamma`, raising a RuntimeError if not present, ensuring compatibility with Hail versions.
- Added a skip condition to the `test_gamma_returns_lower_and_upper` and `test_gamma_and_poisson_give_similar_results` methods in `test_constraint.py` to ensure compatibility with Hail versions that do not support `hl.qgamma`.
…ove rank annotation handling

- Updated the `rank_array_element_metrics` function to return the table with its original key restored after ranking.
- Enhanced the rank annotation process to use `or_missing` for unranked rows, ensuring correct typing without manual struct construction.
- Simplified the rank lookup and annotation logic for better readability and maintainability.
…ix support

- Added a `prefix` parameter to `rank_and_assign_bins` to allow customization of the output field names for ranks and bins.
- Updated the `rank_array_element_metrics` function to accept a `rank_field_prefix` parameter, passing it through to `rank_and_assign_bins` for consistent naming in rank structs.
- Adjusted documentation to reflect the new parameters and their default values.
@mike-w-wilson mike-w-wilson self-requested a review March 30, 2026 14:16
@mike-w-wilson mike-w-wilson self-assigned this Mar 30, 2026
Copy link
Copy Markdown
Contributor

@mike-w-wilson mike-w-wilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments are mostly doc and style/ responses to TODOs. I think CLAUDE deserves it's own PR considering the style changes it brings in and we should have general discussion/input from all main contributors for it.

", ".join(grouping.keys()),
)

if max_af:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks at AF 0.0, I kow we didnt touch it but since the CLAUDE.md specifically calls it out we should update

"""
if singleton:
return hl.int(freq_expr[i].AC == 1)
elif max_af:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also update because of the max_af of 0.0 silent failure


The calibration model expression is a struct with the following fields:

- genomic_region: The genomic region of the variant ("autosome_or_par",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this field is not added by default -- it would need to be in that additional_grouping_expr and then would be nested inside of the model_group

equal to 'upper_cov_cutoff' (if provided). The variant is assigned to the low
coverage model if `skip_coverage_model` is False and the exome coverage is
greater than 'low_cov_cutoff' (if provided) and less than 'high_cov_cutoff'.
- cpg: Whether the variant is a CpG (`cpg_expr`).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nested inside of the model_group struct

hl.agg.sum(high_cov_ht.observed_variants)
/ hl.agg.sum(high_cov_ht.possible_variants * high_cov_ht.mu_snp)
autosome_or_par_expr = (
ht.build_model.model_group.genomic_region == "autosome_or_par"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

genomic_region doesnt exist in the default calibration_model_group_expr if model_group_expr is None

assert rows[9].bins.rank == 9


class TestComputeOeUpperPercentileThresholds:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class TestComputeOeUpperPercentileThresholds:
class TestComputePercentileThresholds:

assert len(tied_thresh_bins) == 1, "Threshold-based should not split ties"


class TestSingleVariantCountExpr:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class TestSingleVariantCountExpr:
class TestVariantObesrvedExpr:

assert result.observed_variants == [0, 1]


class TestGetCountsAggExpr:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class TestGetCountsAggExpr:
class TestCountsAggExpr:

assert result.variant_count == 0


class TestWeightedAggSumExpr:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class TestWeightedAggSumExpr:
class TestWeightedSumAggExpr:

@@ -0,0 +1,230 @@
# gnomad_methods Project Reference
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets separate this out of this constraint PR and create different CLAUDE PR

@@ -0,0 +1 @@
"""gnomAD utilities and resources package."""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't we exclude this file before to not interfere with pypi?

:return: The resolved Hail expression.
"""
if expr is None and (t is None or annotation_name is None):
raise ValueError("Either t and annotation_name or expr must be provided.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError("Either t and annotation_name or expr must be provided.")
raise ValueError("Either 't' and 'annotation_name' or 'expr' must be provided.")

)


def _resolve_annotation_expr(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this not have a test because it's a helper function?

raise ValueError("Either ht or freq_expr must be provided.")

if max_af is not None or singleton:
freq_expr = _resolve_annotation_expr(ht, "freq", freq_expr, "freq_expr")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only set freq_expr if max_af or singleton are defined?

class TestSingleVariantCountExpr:
"""Test the variant_observed_expr function."""

def test_ac_positive_counts_as_one(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there also be a a test for the count_missing param?

return ht.annotate(**{field_name: hl.struct(**granularities_expr)})


def rank_array_element_metrics(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently no test for this in test/utils/test_constraint.py

return hl.agg.group_by(filter_expr, agg_expr).get(True, hl.missing(agg_expr.dtype))


def apply_plateau_models(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently no test for this in test/utils/test_constraint.py

return _apply_model(plateau_models_expr)


def coverage_correction_expr(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently no test for this in test/utils/test_constraint.py

)


def apply_models(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently no test for this in test/utils/test_constraint.py

return apply_expr


def aggregate_constraint_metrics_expr(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently no test for this in test/utils/test_constraint.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants