-
Notifications
You must be signed in to change notification settings - Fork 32
Add per-variant constraint utility functions #824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
jkgoodrich
wants to merge
28
commits into
main
Choose a base branch
from
jg/add_functions_for_per_base_constraint
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
28 commits
Select commit
Hold shift + click to select a range
7ad8b20
Merge branch 'jg/determine_end_trunc_filter_from_gerp' of https://git…
jkgoodrich 7a89505
Add several functions that are helpful for per base constraint
jkgoodrich f69d2f0
Fixes while testing
jkgoodrich 02e0e21
Allow _sum_agg_expr to take a StructExpression as input too
jkgoodrich a426cc4
modify `transform_grch38_methylation` to handle par regions in the sa…
jkgoodrich 5e5b0c9
Add mu to agg
jkgoodrich 2626118
Add filter to autosomes and par for coverage correction
jkgoodrich 476e364
Add coverage correction to the mu sum
jkgoodrich 7d85a75
Keep additional annotations in vep: "sift_score", "polyphen_score", …
jkgoodrich 1dc7df1
Add several functions that are helpful for per base constraint
jkgoodrich 26ba4d0
Merge branch 'main' of https://github.com/broadinstitute/gnomad_metho…
jkgoodrich bd74b03
Add change to import_gencode to include version
jkgoodrich 86c333b
Merge branch 'main' of https://github.com/broadinstitute/gnomad_metho…
jkgoodrich 5dc4cee
add codons to vep annotations to keep
jkgoodrich 0543031
Merge branch 'main' into jg/add_functions_for_per_base_constraint
jkgoodrich 8288e33
Enhance transcript filtering in add_gencode_transcript_annotations to…
jkgoodrich bbef83a
Add constraint functions for building consequence groups and calculat…
jkgoodrich eea1b21
Refactor single_variant_count_expr and add comprehensive tests for co…
jkgoodrich b0e2755
Add CLAUDE project reference and enhance constraint utilities
jkgoodrich a287db5
Refactor and enhance variant observation functions in constraint util…
jkgoodrich 54932fd
Refactor percentile threshold functions in constraint utilities
jkgoodrich 65f46ea
Add guidelines for maintaining CLAUDE.md
jkgoodrich 270921c
Add gnomAD package initialization with module documentation
jkgoodrich 8efabe3
Enhance error handling and logging in constraint utilities
jkgoodrich 74d2de5
Add skip condition for tests requiring hl.qgamma
jkgoodrich a50dcc6
Refactor rank_array_element_metrics to restore original keys and impr…
jkgoodrich 9d91d1f
Enhance rank_and_assign_bins and rank_array_element_metrics with pref…
jkgoodrich 592f646
Cache the result of rank_array_element_metrics to improve performance
jkgoodrich File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,230 @@ | ||
| # gnomad_methods Project Reference | ||
|
|
||
| ## Project Overview | ||
|
|
||
| Shared Hail utility library for gnomAD pipelines. Provides reusable functions for constraint analysis, variant QC, sample QC, VEP processing, resource management, and general genomics operations. Used as a dependency by `gnomad_constraint`, `gnomad_qc`, `gnomad_mnv`, and other gnomAD repos. | ||
|
|
||
| ## Package Structure | ||
|
|
||
| | Directory | Purpose | | ||
| |-----------|---------| | ||
| | `gnomad/utils/constraint.py` | Constraint pipeline utilities (mutation rate, model building, o/e, pLI, z-scores) | | ||
| | `gnomad/utils/vep.py` | VEP annotation processing (consequence filtering, LOFTEE, MANE Select) | | ||
| | `gnomad/utils/annotations.py` | General variant annotations (age, frequency, quality) | | ||
| | `gnomad/utils/file_utils.py` | File I/O helpers, struct printing, array conversion | | ||
| | `gnomad/utils/filtering.py` | Frequency and variant filtering utilities | | ||
| | `gnomad/utils/gen_stats.py` | Statistical helper functions | | ||
| | `gnomad/utils/reference_genome.py` | Reference genome utilities | | ||
| | `gnomad/utils/release.py` | Release table formatting | | ||
| | `gnomad/utils/sparse_mt.py` | Sparse MatrixTable operations | | ||
| | `gnomad/resources/resource_utils.py` | Resource classes (`TableResource`, `VersionedTableResource`, etc.) | | ||
| | `gnomad/resources/grch38/gnomad.py` | GRCh38 resource paths and constants | | ||
| | `gnomad/resources/grch37/gnomad.py` | GRCh37 resource paths and constants | | ||
| | `gnomad/sample_qc/` | Sample QC (ancestry, relatedness, sex, platform) | | ||
| | `gnomad/variant_qc/` | Variant QC (random forest, evaluation, training) | | ||
| | `gnomad/assessment/` | Summary stats, validity checks | | ||
|
|
||
| ## Code Style | ||
|
|
||
| ### Formatting | ||
|
|
||
| Code is formatted with **black** (preview mode, line length 88), **isort** (profile `"black"`), and **autopep8** (aggressive=1, ignoring E201/E202/E203/E731). Linting uses **pylint** and **pydocstyle** (PEP 257 convention, ignoring D105/D107). Config is in `pyproject.toml`, `.pydocstylerc`, and `.pylintrc`. | ||
|
|
||
| Pre-commit hooks are installed. Run `pre-commit run --all-files` to check. | ||
|
|
||
| ```bash | ||
| # Manual formatting | ||
| black gnomad/ tests/ | ||
| isort --profile black --filter-files gnomad/ tests/ | ||
| ``` | ||
|
|
||
| ### Docstrings | ||
|
|
||
| Use **Sphinx-style** (`:param:`, `:return:`) docstrings: | ||
|
|
||
| - **Summary line**: Concise one-line description, followed by blank line. | ||
| - **Body**: Extended description if needed. Use `.. note::` for caveats. | ||
| - **Params**: `:param name: Description.` — include default values and when params are conditionally required. | ||
| - **Return**: `:return: Description.` — describe the structure, not just the type. | ||
| - **Code references**: Use double backticks (````field_name````). | ||
| - **Constants**: Document with a docstring on the line immediately after the assignment. | ||
|
|
||
| ```python | ||
| COVERAGE_CUTOFF = 30 | ||
| """Minimum median exome coverage differentiating high from low coverage sites.""" | ||
|
|
||
|
|
||
| def my_function( | ||
| ht: hl.Table, | ||
| max_af: Optional[float] = None, | ||
| ) -> hl.Table: | ||
| """Short summary of what the function does. | ||
|
|
||
| Extended description with more detail about behavior, edge cases, or | ||
| design decisions. | ||
|
|
||
| .. note:: | ||
|
|
||
| Any important caveats go here. | ||
|
|
||
| :param ht: Input Table with ``freq`` and ``context`` fields. | ||
| :param max_af: Maximum allele frequency threshold. Default is None (no | ||
| cutoff). | ||
| :return: Table with ``expected_variants`` and ``observed_variants`` fields | ||
| added. | ||
| """ | ||
| ``` | ||
|
|
||
| ### Type Annotations | ||
|
|
||
| - **All functions** must have type annotations on parameters and return values. | ||
| - Use `typing.List`, `typing.Optional`, `typing.Union`, `typing.Tuple`, `typing.Dict` for generic types. | ||
| - For Hail expressions, use the `hl.expr.*` prefix: `hl.expr.StructExpression`, `hl.expr.BooleanExpression`, `hl.expr.Float64Expression`, `hl.expr.Int32Expression`, `hl.expr.ArrayExpression`, `hl.expr.NumericExpression`, etc. | ||
| - For Hail table/matrix types: `hl.Table`, `hl.MatrixTable`. | ||
| - Use `Tuple[str, ...]` for variable-length tuples (not `Tuple[str]` which means exactly one element). | ||
| - Never use mutable defaults (`List`, `Dict`) in function signatures — use `None` and assign inside the function body. | ||
|
|
||
| ### Function Design | ||
|
|
||
| - **Expression-based where possible**: Prefer functions that take and return Hail expressions rather than Tables. This makes them composable and reusable in different contexts. Use `expr._indices.source` to recover the source Table when needed internally. | ||
| - **Single Table param named `ht`**: When a function takes one Table, name it `ht`. Use descriptive names only when taking multiple Tables (e.g., `mutation_ht`, `gencode_ht`). | ||
| - **Pure transformations**: Utils functions should be HTs in / HTs out. All file I/O (read, write, checkpoint) belongs in pipeline scripts, not utility functions. Exceptions exist for historical reasons but should not be added. | ||
| - **No lazy imports**: Always use top-level imports unless needed to resolve circular imports. | ||
| - **`Optional` for nullable params**: Always wrap with `Optional[...]` when the default is `None`. | ||
|
|
||
| ### Testing | ||
|
|
||
| - **Policy**: Any new or modified function in a PR must have tests. | ||
| - **Format**: Class-based tests with `pytest`. One test class per function. | ||
| - **Test data**: Use `hl.Table.parallelize()` to create small inline test tables. | ||
| - **Fixtures**: Use `@pytest.fixture` for shared setup. Hail init is handled by a session-scoped fixture in `tests/conftest.py`. | ||
| - **Docstrings**: Both test classes and test methods should have docstrings. | ||
|
|
||
| ```python | ||
| class TestMyFunction: | ||
| """Test the my_function function.""" | ||
|
|
||
| def test_basic_case(self): | ||
| """Test that basic input produces expected output.""" | ||
| ht = hl.Table.parallelize( | ||
| [{"x": 1, "y": 2.0}], | ||
| hl.tstruct(x=hl.tint32, y=hl.tfloat64), | ||
| ) | ||
| result = ht.annotate(z=my_function(ht.x, ht.y)).collect()[0] | ||
| assert result.z == 3.0 | ||
| ``` | ||
|
|
||
| ## Hail Best Practices | ||
|
|
||
| ### When to checkpoint vs cache | ||
|
|
||
| - **`checkpoint(new_temp_file(...))`**: Use for intermediate results that feed into multiple downstream operations or that follow expensive computations (joins, aggregations). Materializes to disk and breaks the query plan, preventing Hail from re-executing the upstream DAG. | ||
| - **`.cache()`**: Use for small results that will be reused immediately. Keeps data in memory but doesn't break the query plan as reliably. | ||
| - **After checkpoint, `.count()` is free**: It reads materialized metadata rather than re-executing the query. Place `.count()` after a checkpoint when you need the count. | ||
|
|
||
| ### Avoid `.count()` for logging | ||
|
|
||
| Never use `.count()` just to log how many rows a table has. On large tables this forces full materialization and can cause Spark shuffle failures. Only call `.count()` when the result is actually needed for computation. | ||
|
|
||
| ### `naive_coalesce()` after aggressive filters | ||
|
|
||
| When filtering a large table down to a small subset (e.g., LOFTEE HC LoF from all variants), most partitions become empty. This causes shuffle skew in downstream `group_by` aggregations. Call `.naive_coalesce(N)` after the filter to rebalance. | ||
|
|
||
| ### `_localize=False` for expression results | ||
|
|
||
| Use `ht.aggregate(expr, _localize=False)` when the result will be used as a Hail expression in downstream operations (e.g., annotating another table). This avoids collecting to Python and back. | ||
|
|
||
| ### Field existence checks | ||
|
|
||
| Use `field_name in ht.row` to check if a field exists on a Table. Hail Tables do not have `.get()`. | ||
|
|
||
| ### `hl.or_else`, `hl.or_missing`, `hl.is_defined` | ||
|
|
||
| - **`hl.or_else(expr, default)`**: Substitute `default` when `expr` is missing. | ||
| - **`hl.or_missing(condition, expr)`**: Return `expr` when `condition` is True, missing otherwise. | ||
| - **`hl.is_defined(expr)`**: Returns True/False (never missing) — no need to wrap in `hl.or_else`. | ||
| - **`divide_null(num, denom)`**: Safe division returning null when denominator is 0. Import from `hail.utils.misc`. | ||
|
|
||
| ### `approx_quantiles` is approximate | ||
|
|
||
| `hl.agg.approx_quantiles` uses the t-digest algorithm and returns approximate percentiles. Document this with a `.. note::` block when using it in functions. | ||
|
|
||
| ### Falsy value gotchas | ||
|
|
||
| When checking optional numeric parameters, always use `is not None` instead of truthiness checks. `if max_af:` will skip `max_af=0.0`, which is a valid value. Same applies to any numeric parameter where 0 is meaningful. | ||
|
|
||
| ### Array schema uniformity | ||
|
|
||
| All elements of a Hail array field must have identical struct schemas. You cannot annotate only `array[0]` with extra fields while leaving `array[1+]` unchanged — Hail will reject the mixed schema. Promote such metadata to the parent struct level instead. | ||
|
|
||
| ### Rank assignment with `order_by` | ||
|
|
||
| `ht.order_by(expr)` destroys the key. To rejoin ranked results: | ||
| 1. `ht.add_index("_rank_idx")` before ordering. | ||
| 2. `rank_ht.key_by("_rank_idx")` after ordering. | ||
| 3. Use `hl.scan.count()` to assign 0-based ascending ranks: `ht.order_by(val).annotate(rank=hl.scan.count())`. | ||
|
|
||
| ### Small table reconstruction | ||
|
|
||
| `hl.Table.parallelize(hl.eval(ht.my_array_global), schema=...)` reconstructs a small Hail Table from a global array without re-running any jobs. | ||
|
|
||
| ## Key Resource Classes | ||
|
|
||
| ```python | ||
| from gnomad.resources.resource_utils import ( | ||
| TableResource, # .ht() to read | ||
| MatrixTableResource, # .mt() to read | ||
| VersionedTableResource, # .ht() with version param | ||
| VariantDatasetResource, # .vds() to read | ||
| ) | ||
| ``` | ||
|
|
||
| Usage: | ||
| ```python | ||
| resource = TableResource(path="gs://gnomad/v4.1/constraint/metrics.ht") | ||
| ht = resource.ht() # Reads the table | ||
| ``` | ||
|
|
||
| ## Key Constraint API | ||
|
|
||
| ```python | ||
| from gnomad.utils.constraint import ( | ||
| # Mutation rate | ||
| annotate_mutation_type, | ||
| annotate_with_mu, | ||
| calibration_model_group_expr, | ||
| # Consequence grouping | ||
| build_constraint_consequence_groups, | ||
| # Counting | ||
| single_variant_count_expr, | ||
| single_variant_observed_and_possible_expr, | ||
| count_observed_and_possible_by_group, | ||
| get_counts_agg_expr, | ||
| # Model building & application | ||
| build_models, | ||
| apply_models, | ||
| aggregate_expected_variants_expr, | ||
| # GERP | ||
| calculate_gerp_cutoffs, | ||
| # Constraint metrics | ||
| compute_pli, | ||
| oe_confidence_interval, | ||
| calculate_raw_z_score, | ||
| calculate_raw_z_score_sd, | ||
| get_constraint_flags, | ||
| # Ranking & binning | ||
| rank_and_assign_bins, | ||
| compute_percentile_thresholds, | ||
| annotate_bins_by_threshold, | ||
| ) | ||
| ``` | ||
|
|
||
| ## Maintaining CLAUDE.md | ||
|
|
||
| When working on any gnomAD repo that has a CLAUDE.md file, proactively add useful information you discover during development — gotchas, non-obvious API behavior, schema quirks, resource path conventions, or anything else that would save a future developer (or Claude session) time. Keep additions concise and placed in the appropriate section. | ||
|
|
||
| ## CI/CD | ||
|
|
||
| - **Pre-commit hooks**: black, autopep8, pydocstyle, isort (run automatically on commit) | ||
| - **CI** (`.github/workflows/ci.yml`): Runs on push to main and PRs — black check, isort check, pydocstyle, autopep8 check, pylint, pytest | ||
| - **Publishing** (`.github/workflows/publish.yml`): Triggered by version tags (`v[0-9]+.[0-9]+.[0-9]+`) — validates version, runs tests, publishes to PyPI | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| """gnomAD utilities and resources package.""" | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. didn't we exclude this file before to not interfere with pypi? |
||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets separate this out of this constraint PR and create different CLAUDE PR