Add per-variant constraint utility functions by jkgoodrich · Pull Request #824 · broadinstitute/gnomad_methods

jkgoodrich · 2026-03-18T19:44:54Z

Summary

Adds utility functions to gnomad.utils.constraint to support the per-variant constraint pipeline in gnomad-constraint. These were previously defined in the downstream gnomad_constraint package and are being promoted to the shared library for reusability.

Files changed

gnomad/utils/constraint.py — new and refactored functions (see below)
gnomad/utils/file_utils.py — print_global_struct, convert_multi_array_to_array_of_structs
tests/utils/test_constraint.py — new test file (1,250 lines)
tests/utils/test_file_utils.py — new test file

New functions in `constraint.py`

Variant counting and observation:

variant_observed_expr — returns 0/1 for whether a variant meets frequency criteria (AC, AF, singleton)
variant_observed_and_possible_expr — per-variant observed and possible counts from a frequency array
counts_agg_expr — aggregation expression for variant and singleton counts across rows
weighted_sum_agg_expr — weighted aggregate sum supporting scalar and array expressions
count_observed_and_possible_by_group — group-by aggregation of observed/possible counts by context, ref, alt, and additional groupings; replaces the counting portion of count_variants_by_group with a simpler interface suited to the per-variant pipeline
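To make the grouping behaviour concrete, here is a minimal plain-Python sketch of a group-by aggregation of observed and possible counts. It is not the Hail implementation from this PR: the `count_by_group` name, the dict-based rows, and the `observed`/`possible` keys are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def count_by_group(rows, keys=("context", "ref", "alt")):
    """Sum per-variant observed/possible counts within each group (illustrative only)."""
    totals = defaultdict(lambda: {"observed": 0, "possible": 0})
    for row in rows:
        # The group key is the tuple of grouping field values for this row.
        group = tuple(row[k] for k in keys)
        totals[group]["observed"] += row["observed"]
        totals[group]["possible"] += row["possible"]
    return dict(totals)


# Hypothetical per-variant rows: each row already carries a 0/1 observed flag.
rows = [
    {"context": "ACG", "ref": "C", "alt": "T", "observed": 1, "possible": 1},
    {"context": "ACG", "ref": "C", "alt": "T", "observed": 0, "possible": 1},
    {"context": "TCA", "ref": "C", "alt": "G", "observed": 1, "possible": 1},
]
counts = count_by_group(rows)
```

Additional groupings would simply extend `keys` in this sketch.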

Model application:

calibration_model_group_expr — builds the high/low coverage grouping annotation used by plateau and coverage models
apply_plateau_models — applies plateau linear models to mutation rates to get predicted proportion observed
coverage_correction_expr — computes coverage correction factor from the coverage model
apply_models — top-level function that chains plateau model application and coverage correction; replaces compute_expected_variants with a per-variant approach (annotates each variant) rather than returning aggregation expressions
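The chaining described above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the code in this PR: it assumes the plateau model is a `(slope, intercept)` pair applied to the mutation rate, that the coverage correction is 1 at or above a high-coverage cutoff and linear in log10 coverage below it, and that the expected count is the product of possible count, predicted proportion observed, and coverage correction. All function and parameter names are hypothetical.

```python
import math


def predicted_proportion_observed(mu, slope, intercept):
    """Plateau model, assumed linear in the mutation rate."""
    return slope * mu + intercept


def coverage_correction(coverage, cov_slope, cov_intercept, high_cov_cutoff):
    """Assumed form: 1 at/above the cutoff, 0 at zero coverage, log10-linear otherwise."""
    if coverage >= high_cov_cutoff:
        return 1.0
    if coverage == 0:
        return 0.0
    return cov_slope * math.log10(coverage) + cov_intercept


def expected_variants(mu, possible, coverage, plateau, cov_model, high_cov_cutoff):
    """Per-variant expected count: possible * proportion observed * coverage correction."""
    ppo = predicted_proportion_observed(mu, *plateau)
    return possible * ppo * coverage_correction(coverage, *cov_model, high_cov_cutoff)


# Made-up model coefficients, for illustration only.
low_cov = expected_variants(1e-8, 1, 10, (1e7, 0.05), (0.5, 0.2), 30)
high_cov = expected_variants(1e-8, 1, 40, (1e7, 0.05), (0.5, 0.2), 30)
```

Because each variant gets its own value here, downstream code can sum the results over any grouping, which matches the per-variant approach the description contrasts with returning aggregation expressions.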

Aggregation and grouping:

build_constraint_consequence_groups — builds constraint groups from consequence annotation and LoF modifier
aggregate_constraint_metrics_expr — sum aggregation expressions for constraint fields (mu, observed, expected, etc.); replaces oe_aggregation_expr with a more general interface that works on arbitrary field lists
_build_sum_agg_struct — helper for building sum aggregation structs
_resolve_annotation_expr — helper to resolve an explicit expression or fall back to a named field on a Table

Ranking and binning:

rank_and_assign_bins — ranks values and assigns percentile bins
compute_percentile_thresholds — computes metric values at percentile boundaries
annotate_bins_by_threshold — assigns bins based on precomputed thresholds
rank_array_element_metrics — ranks elements within array fields across rows
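The difference between rank-based and threshold-based binning is easiest to see with tied values. The following plain-Python sketch uses hypothetical helper names and a simple nearest-rank percentile rule; it illustrates the idea only and does not reproduce the Hail functions listed above.

```python
import bisect


def rank_bins_sketch(values, n_bins):
    """Rank ascending (ties broken by input order), then cut ranks into equal-size bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    bins = [rank * n_bins // len(values) for rank in ranks]
    return ranks, bins


def thresholds_sketch(values, percentiles):
    """Metric value at each percentile boundary (nearest-rank on the sorted values)."""
    ordered = sorted(values)
    return [ordered[min(len(ordered) - 1, p * len(ordered) // 100)] for p in percentiles]


def threshold_bins_sketch(values, thresholds):
    """Bin index = number of thresholds less than or equal to the value."""
    return [bisect.bisect_right(thresholds, v) for v in values]


values = [0.1, 0.5, 0.5, 0.5, 0.5, 0.9]
_, rank_bins = rank_bins_sketch(values, n_bins=2)
thresh_bins = threshold_bins_sketch(values, thresholds_sketch(values, [50]))
```

In this sketch the four tied `0.5` values end up in two different rank-based bins, while threshold-based binning keeps them together, since equal values always compare the same way against a fixed threshold.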

Other:

calculate_gerp_cutoffs — computes GERP score percentile cutoffs from a context Table

Refactored functions in `constraint.py`

oe_confidence_interval — split into _oe_ci_gamma (new, uses hl.qgamma) and _oe_ci_discretized_poisson (extracted from the original implementation); added method parameter to select between them
calculate_raw_z_score — changed return type from StructExpression to Float64Expression
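For readers unfamiliar with the discretized Poisson approach, here is a minimal plain-Python sketch of the general idea: evaluate the Poisson density of the observed count over a grid of candidate observed/expected ratios, normalize the cumulative sum, and read the interval off the tails. The grid (0 to 2 in steps of 0.001), the `alpha` default, and all names are assumptions for illustration; the sketch does not claim to match `_oe_ci_discretized_poisson` in detail.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam (fine for small counts only)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def oe_ci_grid_sketch(obs, exp, alpha=0.05, steps=2000, step=0.001):
    """Scan candidate o/e ratios and take the CI from the normalized cumulative density."""
    grid = [i * step for i in range(steps)]
    density = [poisson_pmf(obs, exp * ratio) for ratio in grid]
    total = sum(density)
    cumulative, running = [], 0.0
    for d in density:
        running += d
        cumulative.append(running / total)
    # Lower bound: last grid point whose cumulative mass is still below alpha.
    lower = max((r for r, c in zip(grid, cumulative) if c < alpha), default=0.0)
    # Upper bound: first grid point whose cumulative mass exceeds 1 - alpha.
    upper = min((r for r, c in zip(grid, cumulative) if c > 1 - alpha), default=grid[-1])
    return lower, upper


lower, upper = oe_ci_grid_sketch(obs=10, exp=20)
```

A production version would likely work in log space to avoid overflow for large counts.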

New functions in `file_utils.py`

print_global_struct — pretty-prints a Hail Table's global struct with nested indentation
convert_multi_array_to_array_of_structs — zips parallel array fields into a single array of structs
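As a rough illustration of what "zipping parallel array fields" means, here is a plain-Python sketch with lists and dicts standing in for Hail arrays and structs. The `arrays_to_structs` name and the record layout are hypothetical.

```python
def arrays_to_structs(record, fields):
    """Zip equal-length parallel list fields of a record into one list of dicts."""
    lengths = {len(record[f]) for f in fields}
    if len(lengths) > 1:
        raise ValueError("All array fields must have the same length.")
    return [dict(zip(fields, values)) for values in zip(*(record[f] for f in fields))]


# Two parallel arrays become one array whose elements carry both fields.
record = {"obs": [3, 5], "exp": [4.0, 6.5]}
structs = arrays_to_structs(record, ["obs", "exp"])
```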

Test plan

Verify existing tests pass (pytest tests/)
Verify new test_constraint.py and test_file_utils.py tests pass
Validated against full v4.1.1 constraint pipeline run (prepare-context through aggregate-by-constraint-groups)

…ing GERP cutoffs
- Introduced `build_constraint_consequence_groups` to create constraint groups based on consequence and LoF modifier expressions.
- Added `calculate_gerp_cutoffs` to compute GERP score cutoffs at specified percentile thresholds.
- Enhanced `print_global_struct` for better visualization of Hail global structs.
- Implemented `convert_multi_array_to_array_of_structs` to combine parallel array fields into a single array of structs.
- Updated `mane_select_over_canonical_filter_expr` to select MANE Select transcripts with a fallback to canonical transcripts.

…nstraint utilities
- Simplified the `single_variant_count_expr` function by removing an unnecessary variable.
- Introduced a new test suite for the constraint utility module, covering functions like `oe_confidence_interval`, `calculate_raw_z_score`, and `calculate_gerp_cutoffs`.
- Added tests for the `mane_select_over_canonical_filter_expr` to ensure correct transcript selection behavior.
- Implemented tests for utility functions in `file_utils`, including `print_global_struct` and `convert_multi_array_to_array_of_structs`.

- Introduced a new `CLAUDE.md` file detailing the gnomad_methods project, including an overview, package structure, code style guidelines, and best practices for Hail.
- Added default fields for summing expected variants and GENCODE annotations in `constraint.py`.
- Refactored `single_variant_count_expr` to `variant_observed_expr` for clarity and consistency, with updated tests reflecting this change.
- Enhanced test coverage for the new `variant_observed_expr` and related functions in the constraint utilities.

…ities
- Renamed `single_variant_observed_and_possible_expr` to `variant_observed_and_possible_expr` for clarity.
- Updated `weighted_build_sum_agg_struct` to `weighted_sum_agg_expr` to improve consistency in naming.
- Added comprehensive tests for the new `variant_observed_and_possible_expr` function, covering various scenarios including observed and unobserved variants, possible variants with and without adjustments, and filtering by maximum allele frequency.

- Renamed `compute_oe_upper_percentile_thresholds` to `compute_percentile_thresholds` for clarity and consistency.
- Introduced `annotate_bins_by_threshold` function to facilitate threshold-based binning.
- Updated documentation to differentiate between rank-based and threshold-based binning methods.
- Adjusted tests to reflect the new function names and ensure comprehensive coverage for the updated functionality.

- Introduced a section in CLAUDE.md to encourage developers to document useful information discovered during development, including gotchas, API behavior, and schema quirks.
- Emphasized the importance of concise additions placed in appropriate sections to aid future developers.

- Updated logging in `_resolve_annotation_expr` to use formatted strings for better readability.
- Added error handling in `_oe_ci_gamma` to check for the availability of `hl.qgamma`, raising a RuntimeError if not present, ensuring compatibility with Hail versions.

…ove rank annotation handling
- Updated the `rank_array_element_metrics` function to return the table with its original key restored after ranking.
- Enhanced the rank annotation process to use `or_missing` for unranked rows, ensuring correct typing without manual struct construction.
- Simplified the rank lookup and annotation logic for better readability and maintainability.

…ix support
- Added a `prefix` parameter to `rank_and_assign_bins` to allow customization of the output field names for ranks and bins.
- Updated the `rank_array_element_metrics` function to accept a `rank_field_prefix` parameter, passing it through to `rank_and_assign_bins` for consistent naming in rank structs.
- Adjusted documentation to reflect the new parameters and their default values.

mike-w-wilson

Comments are mostly doc and style/responses to TODOs. I think CLAUDE deserves its own PR considering the style changes it brings in, and we should have general discussion/input from all main contributors for it.

mike-w-wilson · 2026-03-30T14:21:40Z

gnomad/utils/constraint.py

        ", ".join(grouping.keys()),
    )

    if max_af:


This breaks at AF 0.0. I know we didn't touch it, but since the CLAUDE.md specifically calls it out, we should update it.
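The failure mode can be reproduced without Hail. A minimal sketch, using hypothetical function names, of why a truthiness check drops a threshold of 0.0 and how an explicit `None` comparison avoids it:

```python
def passes_af_filter_buggy(af, max_af=None):
    # `if max_af:` treats 0.0 like None, so the filter is silently skipped.
    if max_af:
        return af <= max_af
    return True


def passes_af_filter_fixed(af, max_af=None):
    # Compare against None explicitly so that 0.0 is honored as a real threshold.
    if max_af is not None:
        return af <= max_af
    return True


# With max_af=0.0, a variant at AF 0.01 should be excluded.
buggy = passes_af_filter_buggy(0.01, max_af=0.0)
fixed = passes_af_filter_fixed(0.01, max_af=0.0)
```

Here `buggy` is `True` (the filter was never applied) while `fixed` is `False`.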

mike-w-wilson · 2026-03-30T14:24:37Z

gnomad/utils/constraint.py

        """
        if singleton:
            return hl.int(freq_expr[i].AC == 1)
        elif max_af:


Also update this, because a max_af of 0.0 fails silently.

mike-w-wilson · 2026-03-30T14:36:42Z

gnomad/utils/constraint.py

+
+    The calibration model expression is a struct with the following fields:
+
+        - genomic_region: The genomic region of the variant ("autosome_or_par",


This field is not added by default. It would need to be in `additional_grouping_expr`, and it would then be nested inside of the model_group.

mike-w-wilson · 2026-03-30T14:37:02Z

gnomad/utils/constraint.py

+          equal to 'upper_cov_cutoff' (if provided). The variant is assigned to the low
+          coverage model if `skip_coverage_model` is False and the exome coverage is
+          greater than 'low_cov_cutoff' (if provided) and less than 'high_cov_cutoff'.
+        - cpg: Whether the variant is a CpG (`cpg_expr`).


This is nested inside of the model_group struct

mike-w-wilson · 2026-03-30T14:39:11Z

gnomad/utils/constraint.py

-            hl.agg.sum(high_cov_ht.observed_variants)
-            / hl.agg.sum(high_cov_ht.possible_variants * high_cov_ht.mu_snp)
+        autosome_or_par_expr = (
+            ht.build_model.model_group.genomic_region == "autosome_or_par"


genomic_region doesn't exist in the default calibration_model_group_expr if model_group_expr is None

mike-w-wilson · 2026-03-30T17:24:17Z

tests/utils/test_constraint.py

+        assert rows[9].bins.rank == 9
+
+
+class TestComputeOeUpperPercentileThresholds:


Suggested change

class TestComputeOeUpperPercentileThresholds:

class TestComputePercentileThresholds:

mike-w-wilson · 2026-03-30T17:25:03Z

tests/utils/test_constraint.py

+        assert len(tied_thresh_bins) == 1, "Threshold-based should not split ties"
+
+
+class TestSingleVariantCountExpr:


Suggested change

class TestSingleVariantCountExpr:

class TestVariantObservedExpr:

mike-w-wilson · 2026-03-30T17:25:25Z

tests/utils/test_constraint.py

+        assert result.observed_variants == [0, 1]
+
+
+class TestGetCountsAggExpr:


Suggested change

class TestGetCountsAggExpr:

class TestCountsAggExpr:

mike-w-wilson · 2026-03-30T17:25:51Z

tests/utils/test_constraint.py

+        assert result.variant_count == 0
+
+
+class TestWeightedAggSumExpr:


Suggested change

class TestWeightedAggSumExpr:

class TestWeightedSumAggExpr:

mike-w-wilson · 2026-03-30T17:38:19Z

CLAUDE.md

@@ -0,0 +1,230 @@
+# gnomad_methods Project Reference


Let's separate this out of this constraint PR and create a separate CLAUDE PR.

klaricch · 2026-03-31T15:30:56Z

gnomad/__init__.py

@@ -0,0 +1 @@
+"""gnomAD utilities and resources package."""


didn't we exclude this file before to not interfere with pypi?

klaricch · 2026-04-01T13:18:00Z

gnomad/utils/constraint.py

+    :return: The resolved Hail expression.
+    """
+    if expr is None and (t is None or annotation_name is None):
+        raise ValueError("Either t and annotation_name or expr must be provided.")


Suggested change

raise ValueError("Either t and annotation_name or expr must be provided.")

raise ValueError("Either 't' and 'annotation_name' or 'expr' must be provided.")
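For context, the "explicit expression or named field" fallback that this error message guards can be sketched in plain Python. This is an illustration only: a dict stands in for the Table, and the `resolve_annotation` name and signature are hypothetical rather than the helper's actual interface.

```python
def resolve_annotation(t=None, annotation_name=None, expr=None):
    """Return `expr` if given, otherwise look up `annotation_name` on `t`."""
    if expr is None and (t is None or annotation_name is None):
        raise ValueError(
            "Either 't' and 'annotation_name' or 'expr' must be provided."
        )
    # An explicit expression always wins over the named field.
    if expr is not None:
        return expr
    return t[annotation_name]


table = {"freq": [0.1, 0.2]}  # dict standing in for a Table
from_field = resolve_annotation(t=table, annotation_name="freq")
from_expr = resolve_annotation(t=table, annotation_name="freq", expr=[0.3])
```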

klaricch · 2026-04-01T13:44:23Z

gnomad/utils/constraint.py

    )


+def _resolve_annotation_expr(


does this not have a test because it's a helper function?

klaricch · 2026-04-01T14:29:35Z

gnomad/utils/constraint.py

+        raise ValueError("Either ht or freq_expr must be provided.")
+
+    if max_af is not None or singleton:
+        freq_expr = _resolve_annotation_expr(ht, "freq", freq_expr, "freq_expr")


why only set freq_expr if max_af or singleton are defined?

klaricch · 2026-04-01T14:55:03Z

tests/utils/test_constraint.py

+class TestSingleVariantCountExpr:
+    """Test the variant_observed_expr function."""
+
+    def test_ac_positive_counts_as_one(self):


should there also be a test for the count_missing param?

klaricch · 2026-04-02T17:55:52Z

gnomad/utils/constraint.py

+    return ht.annotate(**{field_name: hl.struct(**granularities_expr)})
+
+
+def rank_array_element_metrics(


currently no test for this in tests/utils/test_constraint.py

klaricch · 2026-04-03T15:17:40Z

gnomad/utils/constraint.py

    return hl.agg.group_by(filter_expr, agg_expr).get(True, hl.missing(agg_expr.dtype))


+def apply_plateau_models(


currently no test for this in tests/utils/test_constraint.py

klaricch · 2026-04-03T15:18:08Z

gnomad/utils/constraint.py

+    return _apply_model(plateau_models_expr)
+
+
+def coverage_correction_expr(


currently no test for this in tests/utils/test_constraint.py

klaricch · 2026-04-03T15:19:22Z

gnomad/utils/constraint.py

+    )
+
+
+def apply_models(


currently no test for this in tests/utils/test_constraint.py

klaricch · 2026-04-03T15:19:48Z

gnomad/utils/constraint.py

+    return apply_expr
+
+
+def aggregate_constraint_metrics_expr(


currently no test for this in tests/utils/test_constraint.py

jkgoodrich and others added 22 commits February 7, 2025 07:59

Merge branch 'jg/determine_end_trunc_filter_from_gerp' of https://git…

7ad8b20

…hub.com/broadinstitute/gnomad_methods

Add several functions that are helpful for per base constraint

7a89505

Fixes while testing

f69d2f0

Allow _sum_agg_expr to take a StructExpression as input too

02e0e21

modify transform_grch38_methylation to handle par regions in the sa…

a426cc4

…me way as autosomes

Add mu to agg

5e5b0c9

Add filter to autosomes and par for coverage correction

2626118

Add coverage correction to the mu sum

476e364

Keep additional annotations in vep: "sift_score", "polyphen_score", "…

7d85a75

…domains", "uniprot_isoform"

Add several functions that are helpful for per base constraint

1dc7df1

Merge branch 'main' of https://github.com/broadinstitute/gnomad_methods…

26ba4d0

… into jg/add_functions_for_per_base_constraint

Add change to import_gencode to include version

bd74b03

Merge branch 'main' of https://github.com/broadinstitute/gnomad_methods…

86c333b

… into jg/add_functions_for_per_base_constraint

add codons to vep annotations to keep

5dc4cee

Merge branch 'main' into jg/add_functions_for_per_base_constraint

0543031

Resolve both-added conflict in init_scripts/vep115-init.sh by keeping the branch version with clean loftee setup and simplified VEP config. Co-authored-by: Cursor <cursoragent@cursor.com>

Enhance transcript filtering in add_gencode_transcript_annotations to…

8288e33

… exclude both "Y_PAR" and "PAR_Y" versions.

jkgoodrich requested a review from a team as a code owner March 18, 2026 19:44

jkgoodrich self-assigned this Mar 18, 2026

jkgoodrich added 6 commits March 18, 2026 13:53

Add gnomAD package initialization with module documentation

270921c

- Introduced a docstring in `__init__.py` to describe the gnomAD utilities and resources package.

Add skip condition for tests requiring hl.qgamma

74d2de5

- Added a skip condition to the `test_gamma_returns_lower_and_upper` and `test_gamma_and_poisson_give_similar_results` methods in `test_constraint.py` to ensure compatibility with Hail versions that do not support `hl.qgamma`.
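The general pattern of guarding an optional function and skipping tests that depend on it can be sketched as follows. This is a generic illustration, not the test code from this PR: `fake_hl` is a hypothetical stand-in for a module that may or may not expose `qgamma`, `require_function` is an invented helper, and the sketch uses the standard library's `unittest` rather than pytest.

```python
import types
import unittest

# Stand-in for a library module that lacks `qgamma` (hypothetical).
fake_hl = types.SimpleNamespace()


def require_function(module, name):
    """Fail loudly, with a clear message, when an optional function is missing."""
    if not hasattr(module, name):
        raise RuntimeError(f"'{name}' is not available in the installed version.")
    return getattr(module, name)


class TestGammaCI(unittest.TestCase):
    # The test is skipped, rather than failed, when the function is unavailable.
    @unittest.skipUnless(hasattr(fake_hl, "qgamma"), "requires qgamma support")
    def test_gamma_returns_lower_and_upper(self):
        self.assertTrue(callable(require_function(fake_hl, "qgamma")))
```

With pytest, the same condition could be expressed through `pytest.mark.skipif`.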

Cache the result of rank_array_element_metrics to improve performance

592f646

mike-w-wilson self-requested a review March 30, 2026 14:16

mike-w-wilson self-assigned this Mar 30, 2026

mike-w-wilson reviewed Mar 30, 2026

mike-w-wilson assigned klaricch Mar 30, 2026

mike-w-wilson requested a review from klaricch March 30, 2026 17:57

klaricch reviewed Apr 3, 2026
