Shared Hail utility library for gnomAD pipelines. Provides reusable functions for constraint analysis, variant QC, sample QC, VEP processing, resource management, and general genomics operations. Used as a dependency by gnomad_constraint, gnomad_qc, gnomad_mnv, and other gnomAD repos.
| Directory | Purpose |
|---|---|
gnomad/utils/constraint.py |
Constraint pipeline utilities (mutation rate, model building, o/e, pLI, z-scores) |
gnomad/utils/vep.py |
VEP annotation processing (consequence filtering, LOFTEE, MANE Select) |
gnomad/utils/annotations.py |
General variant annotations (age, frequency, quality) |
gnomad/utils/file_utils.py |
File I/O helpers, struct printing, array conversion |
gnomad/utils/filtering.py |
Frequency and variant filtering utilities |
gnomad/utils/gen_stats.py |
Statistical helper functions |
gnomad/utils/reference_genome.py |
Reference genome utilities |
gnomad/utils/release.py |
Release table formatting |
gnomad/utils/sparse_mt.py |
Sparse MatrixTable operations |
gnomad/resources/resource_utils.py |
Resource classes (TableResource, VersionedTableResource, etc.) |
gnomad/resources/grch38/gnomad.py |
GRCh38 resource paths and constants |
gnomad/resources/grch37/gnomad.py |
GRCh37 resource paths and constants |
gnomad/sample_qc/ |
Sample QC (ancestry, relatedness, sex, platform) |
gnomad/variant_qc/ |
Variant QC (random forest, evaluation, training) |
gnomad/assessment/ |
Summary stats, validity checks |
Code is formatted with black (preview mode, line length 88), isort (profile "black"), and autopep8 (aggressive=1, ignoring E201/E202/E203/E731). Linting uses pylint and pydocstyle (PEP 257 convention, ignoring D105/D107). Config is in pyproject.toml, .pydocstylerc, and .pylintrc.
Pre-commit hooks are installed. Run pre-commit run --all-files to check.
# Manual formatting
black gnomad/ tests/
isort --profile black --filter-files gnomad/ tests/Use Sphinx-style (:param:, :return:) docstrings:
- Summary line: Concise one-line description, followed by blank line.
- Body: Extended description if needed. Use
.. note::for caveats. - Params:
:param name: Description.— include default values and when params are conditionally required. - Return:
:return: Description.— describe the structure, not just the type. - Code references: Use double backticks (
field_name). - Constants: Document with a docstring on the line immediately after the assignment.
COVERAGE_CUTOFF = 30
"""Minimum median exome coverage differentiating high from low coverage sites."""
def my_function(
ht: hl.Table,
max_af: Optional[float] = None,
) -> hl.Table:
"""Short summary of what the function does.
Extended description with more detail about behavior, edge cases, or
design decisions.
.. note::
Any important caveats go here.
:param ht: Input Table with ``freq`` and ``context`` fields.
:param max_af: Maximum allele frequency threshold. Default is None (no
cutoff).
:return: Table with ``expected_variants`` and ``observed_variants`` fields
added.
"""- All functions must have type annotations on parameters and return values.
- Use
typing.List,typing.Optional,typing.Union,typing.Tuple,typing.Dictfor generic types. - For Hail expressions, use the
hl.expr.*prefix:hl.expr.StructExpression,hl.expr.BooleanExpression,hl.expr.Float64Expression,hl.expr.Int32Expression,hl.expr.ArrayExpression,hl.expr.NumericExpression, etc. - For Hail table/matrix types:
hl.Table,hl.MatrixTable. - Use
Tuple[str, ...]for variable-length tuples (notTuple[str]which means exactly one element). - Never use mutable defaults (
List,Dict) in function signatures — useNoneand assign inside the function body.
- Expression-based where possible: Prefer functions that take and return Hail expressions rather than Tables. This makes them composable and reusable in different contexts. Use
expr._indices.sourceto recover the source Table when needed internally. - Single Table param named
ht: When a function takes one Table, name itht. Use descriptive names only when taking multiple Tables (e.g.,mutation_ht,gencode_ht). - Pure transformations: Utils functions should be HTs in / HTs out. All file I/O (read, write, checkpoint) belongs in pipeline scripts, not utility functions. Exceptions exist for historical reasons but should not be added.
- No lazy imports: Always use top-level imports unless needed to resolve circular imports.
Optionalfor nullable params: Always wrap withOptional[...]when the default isNone.
- Policy: Any new or modified function in a PR must have tests.
- Format: Class-based tests with
pytest. One test class per function. - Test data: Use
hl.Table.parallelize()to create small inline test tables. - Fixtures: Use
@pytest.fixturefor shared setup. Hail init is handled by a session-scoped fixture intests/conftest.py. - Docstrings: Both test classes and test methods should have docstrings.
class TestMyFunction:
"""Test the my_function function."""
def test_basic_case(self):
"""Test that basic input produces expected output."""
ht = hl.Table.parallelize(
[{"x": 1, "y": 2.0}],
hl.tstruct(x=hl.tint32, y=hl.tfloat64),
)
result = ht.annotate(z=my_function(ht.x, ht.y)).collect()[0]
assert result.z == 3.0checkpoint(new_temp_file(...)): Use for intermediate results that feed into multiple downstream operations or that follow expensive computations (joins, aggregations). Materializes to disk and breaks the query plan, preventing Hail from re-executing the upstream DAG..cache(): Use for small results that will be reused immediately. Keeps data in memory but doesn't break the query plan as reliably.- After checkpoint,
.count()is free: It reads materialized metadata rather than re-executing the query. Place.count()after a checkpoint when you need the count.
Never use .count() just to log how many rows a table has. On large tables this forces full materialization and can cause Spark shuffle failures. Only call .count() when the result is actually needed for computation.
When filtering a large table down to a small subset (e.g., LOFTEE HC LoF from all variants), most partitions become empty. This causes shuffle skew in downstream group_by aggregations. Call .naive_coalesce(N) after the filter to rebalance.
Use ht.aggregate(expr, _localize=False) when the result will be used as a Hail expression in downstream operations (e.g., annotating another table). This avoids collecting to Python and back.
Use field_name in ht.row to check if a field exists on a Table. Hail Tables do not have .get().
hl.or_else(expr, default): Substitutedefaultwhenexpris missing.hl.or_missing(condition, expr): Returnexprwhenconditionis True, missing otherwise.hl.is_defined(expr): Returns True/False (never missing) — no need to wrap inhl.or_else.divide_null(num, denom): Safe division returning null when denominator is 0. Import fromhail.utils.misc.
hl.agg.approx_quantiles uses the t-digest algorithm and returns approximate percentiles. Document this with a .. note:: block when using it in functions.
When checking optional numeric parameters, always use is not None instead of truthiness checks. if max_af: will skip max_af=0.0, which is a valid value. Same applies to any numeric parameter where 0 is meaningful.
All elements of a Hail array field must have identical struct schemas. You cannot annotate only array[0] with extra fields while leaving array[1+] unchanged — Hail will reject the mixed schema. Promote such metadata to the parent struct level instead.
ht.order_by(expr) destroys the key. To rejoin ranked results:
ht.add_index("_rank_idx")before ordering.rank_ht.key_by("_rank_idx")after ordering.- Use
hl.scan.count()to assign 0-based ascending ranks:ht.order_by(val).annotate(rank=hl.scan.count()).
hl.Table.parallelize(hl.eval(ht.my_array_global), schema=...) reconstructs a small Hail Table from a global array without re-running any jobs.
from gnomad.resources.resource_utils import (
TableResource, # .ht() to read
MatrixTableResource, # .mt() to read
VersionedTableResource, # .ht() with version param
VariantDatasetResource, # .vds() to read
)Usage:
resource = TableResource(path="gs://gnomad/v4.1/constraint/metrics.ht")
ht = resource.ht() # Reads the tablefrom gnomad.utils.constraint import (
# Mutation rate
annotate_mutation_type,
annotate_with_mu,
calibration_model_group_expr,
# Consequence grouping
build_constraint_consequence_groups,
# Counting
single_variant_count_expr,
single_variant_observed_and_possible_expr,
count_observed_and_possible_by_group,
get_counts_agg_expr,
# Model building & application
build_models,
apply_models,
aggregate_expected_variants_expr,
# GERP
calculate_gerp_cutoffs,
# Constraint metrics
compute_pli,
oe_confidence_interval,
calculate_raw_z_score,
calculate_raw_z_score_sd,
get_constraint_flags,
# Ranking & binning
rank_and_assign_bins,
compute_percentile_thresholds,
annotate_bins_by_threshold,
)When working on any gnomAD repo that has a CLAUDE.md file, proactively add useful information you discover during development — gotchas, non-obvious API behavior, schema quirks, resource path conventions, or anything else that would save a future developer (or Claude session) time. Keep additions concise and placed in the appropriate section.
- Pre-commit hooks: black, autopep8, pydocstyle, isort (run automatically on commit)
- CI (
.github/workflows/ci.yml): Runs on push to main and PRs — black check, isort check, pydocstyle, autopep8 check, pylint, pytest - Publishing (
.github/workflows/publish.yml): Triggered by version tags (v[0-9]+.[0-9]+.[0-9]+) — validates version, runs tests, publishes to PyPI