Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
7ad8b20
Merge branch 'jg/determine_end_trunc_filter_from_gerp' of https://git…
jkgoodrich Feb 7, 2025
7a89505
Add several functions that are helpful for per base constraint
jkgoodrich Mar 10, 2025
f69d2f0
Fixes while testing
jkgoodrich Mar 12, 2025
02e0e21
Allow _sum_agg_expr to take a StructExpression as input too
jkgoodrich Mar 12, 2025
a426cc4
modify `transform_grch38_methylation` to handle par regions in the sa…
jkgoodrich Mar 13, 2025
5e5b0c9
Add mu to agg
jkgoodrich Mar 14, 2025
2626118
Add filter to autosomes and par for coverage correction
jkgoodrich Mar 14, 2025
476e364
Add coverage correction to the mu sum
jkgoodrich Mar 14, 2025
7d85a75
Keep additional annotations in vep: "sift_score", "polyphen_score", …
jkgoodrich Sep 10, 2025
1dc7df1
Add several functions that are helpful for per base constraint
jkgoodrich Mar 10, 2025
26ba4d0
Merge branch 'main' of https://github.com/broadinstitute/gnomad_metho…
jkgoodrich Sep 15, 2025
bd74b03
Add change to import_gencode to include version
jkgoodrich Sep 24, 2025
86c333b
Merge branch 'main' of https://github.com/broadinstitute/gnomad_metho…
jkgoodrich Sep 24, 2025
5dc4cee
add codons to vep annotations to keep
jkgoodrich Jan 7, 2026
0543031
Merge branch 'main' into jg/add_functions_for_per_base_constraint
jkgoodrich Feb 15, 2026
8288e33
Enhance transcript filtering in add_gencode_transcript_annotations to…
jkgoodrich Mar 17, 2026
bbef83a
Add constraint functions for building consequence groups and calculat…
jkgoodrich Mar 17, 2026
eea1b21
Refactor single_variant_count_expr and add comprehensive tests for co…
jkgoodrich Mar 17, 2026
b0e2755
Add CLAUDE project reference and enhance constraint utilities
jkgoodrich Mar 18, 2026
a287db5
Refactor and enhance variant observation functions in constraint util…
jkgoodrich Mar 18, 2026
54932fd
Refactor percentile threshold functions in constraint utilities
jkgoodrich Mar 18, 2026
65f46ea
Add guidelines for maintaining CLAUDE.md
jkgoodrich Mar 18, 2026
270921c
Add gnomAD package initialization with module documentation
jkgoodrich Mar 18, 2026
8efabe3
Enhance error handling and logging in constraint utilities
jkgoodrich Mar 18, 2026
74d2de5
Add skip condition for tests requiring hl.qgamma
jkgoodrich Mar 18, 2026
a50dcc6
Refactor rank_array_element_metrics to restore original keys and impr…
jkgoodrich Mar 19, 2026
9d91d1f
Enhance rank_and_assign_bins and rank_array_element_metrics with pref…
jkgoodrich Mar 23, 2026
592f646
Cache the result of rank_array_element_metrics to improve performance
jkgoodrich Mar 23, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
230 changes: 230 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,230 @@
# gnomad_methods Project Reference
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets separate this out of this constraint PR and create different CLAUDE PR


## Project Overview

Shared Hail utility library for gnomAD pipelines. Provides reusable functions for constraint analysis, variant QC, sample QC, VEP processing, resource management, and general genomics operations. Used as a dependency by `gnomad_constraint`, `gnomad_qc`, `gnomad_mnv`, and other gnomAD repos.

## Package Structure

| Directory | Purpose |
|-----------|---------|
| `gnomad/utils/constraint.py` | Constraint pipeline utilities (mutation rate, model building, o/e, pLI, z-scores) |
| `gnomad/utils/vep.py` | VEP annotation processing (consequence filtering, LOFTEE, MANE Select) |
| `gnomad/utils/annotations.py` | General variant annotations (age, frequency, quality) |
| `gnomad/utils/file_utils.py` | File I/O helpers, struct printing, array conversion |
| `gnomad/utils/filtering.py` | Frequency and variant filtering utilities |
| `gnomad/utils/gen_stats.py` | Statistical helper functions |
| `gnomad/utils/reference_genome.py` | Reference genome utilities |
| `gnomad/utils/release.py` | Release table formatting |
| `gnomad/utils/sparse_mt.py` | Sparse MatrixTable operations |
| `gnomad/resources/resource_utils.py` | Resource classes (`TableResource`, `VersionedTableResource`, etc.) |
| `gnomad/resources/grch38/gnomad.py` | GRCh38 resource paths and constants |
| `gnomad/resources/grch37/gnomad.py` | GRCh37 resource paths and constants |
| `gnomad/sample_qc/` | Sample QC (ancestry, relatedness, sex, platform) |
| `gnomad/variant_qc/` | Variant QC (random forest, evaluation, training) |
| `gnomad/assessment/` | Summary stats, validity checks |

## Code Style

### Formatting

Code is formatted with **black** (preview mode, line length 88), **isort** (profile `"black"`), and **autopep8** (aggressive=1, ignoring E201/E202/E203/E731). Linting uses **pylint** and **pydocstyle** (PEP 257 convention, ignoring D105/D107). Config is in `pyproject.toml`, `.pydocstylerc`, and `.pylintrc`.

Pre-commit hooks are installed. Run `pre-commit run --all-files` to check.

```bash
# Manual formatting
black gnomad/ tests/
isort --profile black --filter-files gnomad/ tests/
```

### Docstrings

Use **Sphinx-style** (`:param:`, `:return:`) docstrings:

- **Summary line**: Concise one-line description, followed by blank line.
- **Body**: Extended description if needed. Use `.. note::` for caveats.
- **Params**: `:param name: Description.` — include default values and when params are conditionally required.
- **Return**: `:return: Description.` — describe the structure, not just the type.
- **Code references**: Use double backticks (````field_name````).
- **Constants**: Document with a docstring on the line immediately after the assignment.

```python
COVERAGE_CUTOFF = 30
"""Minimum median exome coverage differentiating high from low coverage sites."""


def my_function(
ht: hl.Table,
max_af: Optional[float] = None,
) -> hl.Table:
"""Short summary of what the function does.

Extended description with more detail about behavior, edge cases, or
design decisions.

.. note::

Any important caveats go here.

:param ht: Input Table with ``freq`` and ``context`` fields.
:param max_af: Maximum allele frequency threshold. Default is None (no
cutoff).
:return: Table with ``expected_variants`` and ``observed_variants`` fields
added.
"""
```

### Type Annotations

- **All functions** must have type annotations on parameters and return values.
- Use `typing.List`, `typing.Optional`, `typing.Union`, `typing.Tuple`, `typing.Dict` for generic types.
- For Hail expressions, use the `hl.expr.*` prefix: `hl.expr.StructExpression`, `hl.expr.BooleanExpression`, `hl.expr.Float64Expression`, `hl.expr.Int32Expression`, `hl.expr.ArrayExpression`, `hl.expr.NumericExpression`, etc.
- For Hail table/matrix types: `hl.Table`, `hl.MatrixTable`.
- Use `Tuple[str, ...]` for variable-length tuples (not `Tuple[str]` which means exactly one element).
- Never use mutable defaults (`List`, `Dict`) in function signatures — use `None` and assign inside the function body.

### Function Design

- **Expression-based where possible**: Prefer functions that take and return Hail expressions rather than Tables. This makes them composable and reusable in different contexts. Use `expr._indices.source` to recover the source Table when needed internally.
- **Single Table param named `ht`**: When a function takes one Table, name it `ht`. Use descriptive names only when taking multiple Tables (e.g., `mutation_ht`, `gencode_ht`).
- **Pure transformations**: Utils functions should be HTs in / HTs out. All file I/O (read, write, checkpoint) belongs in pipeline scripts, not utility functions. Exceptions exist for historical reasons but should not be added.
- **No lazy imports**: Always use top-level imports unless needed to resolve circular imports.
- **`Optional` for nullable params**: Always wrap with `Optional[...]` when the default is `None`.

### Testing

- **Policy**: Any new or modified function in a PR must have tests.
- **Format**: Class-based tests with `pytest`. One test class per function.
- **Test data**: Use `hl.Table.parallelize()` to create small inline test tables.
- **Fixtures**: Use `@pytest.fixture` for shared setup. Hail init is handled by a session-scoped fixture in `tests/conftest.py`.
- **Docstrings**: Both test classes and test methods should have docstrings.

```python
class TestMyFunction:
"""Test the my_function function."""

def test_basic_case(self):
"""Test that basic input produces expected output."""
ht = hl.Table.parallelize(
[{"x": 1, "y": 2.0}],
hl.tstruct(x=hl.tint32, y=hl.tfloat64),
)
result = ht.annotate(z=my_function(ht.x, ht.y)).collect()[0]
assert result.z == 3.0
```

## Hail Best Practices

### When to checkpoint vs cache

- **`checkpoint(new_temp_file(...))`**: Use for intermediate results that feed into multiple downstream operations or that follow expensive computations (joins, aggregations). Materializes to disk and breaks the query plan, preventing Hail from re-executing the upstream DAG.
- **`.cache()`**: Use for small results that will be reused immediately. Keeps data in memory but doesn't break the query plan as reliably.
- **After checkpoint, `.count()` is free**: It reads materialized metadata rather than re-executing the query. Place `.count()` after a checkpoint when you need the count.

### Avoid `.count()` for logging

Never use `.count()` just to log how many rows a table has. On large tables this forces full materialization and can cause Spark shuffle failures. Only call `.count()` when the result is actually needed for computation.

### `naive_coalesce()` after aggressive filters

When filtering a large table down to a small subset (e.g., LOFTEE HC LoF from all variants), most partitions become empty. This causes shuffle skew in downstream `group_by` aggregations. Call `.naive_coalesce(N)` after the filter to rebalance.

### `_localize=False` for expression results

Use `ht.aggregate(expr, _localize=False)` when the result will be used as a Hail expression in downstream operations (e.g., annotating another table). This avoids collecting to Python and back.

### Field existence checks

Use `field_name in ht.row` to check if a field exists on a Table. Hail Tables do not have `.get()`.

### `hl.or_else`, `hl.or_missing`, `hl.is_defined`

- **`hl.or_else(expr, default)`**: Substitute `default` when `expr` is missing.
- **`hl.or_missing(condition, expr)`**: Return `expr` when `condition` is True, missing otherwise.
- **`hl.is_defined(expr)`**: Returns True/False (never missing) — no need to wrap in `hl.or_else`.
- **`divide_null(num, denom)`**: Safe division returning null when denominator is 0. Import from `hail.utils.misc`.

### `approx_quantiles` is approximate

`hl.agg.approx_quantiles` uses the t-digest algorithm and returns approximate percentiles. Document this with a `.. note::` block when using it in functions.

### Falsy value gotchas

When checking optional numeric parameters, always use `is not None` instead of truthiness checks. `if max_af:` will skip `max_af=0.0`, which is a valid value. Same applies to any numeric parameter where 0 is meaningful.

### Array schema uniformity

All elements of a Hail array field must have identical struct schemas. You cannot annotate only `array[0]` with extra fields while leaving `array[1+]` unchanged — Hail will reject the mixed schema. Promote such metadata to the parent struct level instead.

### Rank assignment with `order_by`

`ht.order_by(expr)` destroys the key. To rejoin ranked results:
1. `ht.add_index("_rank_idx")` before ordering.
2. `rank_ht.key_by("_rank_idx")` after ordering.
3. Use `hl.scan.count()` to assign 0-based ascending ranks: `ht.order_by(val).annotate(rank=hl.scan.count())`.

### Small table reconstruction

`hl.Table.parallelize(hl.eval(ht.my_array_global), schema=...)` reconstructs a small Hail Table from a global array without re-running any jobs.

## Key Resource Classes

```python
from gnomad.resources.resource_utils import (
TableResource, # .ht() to read
MatrixTableResource, # .mt() to read
VersionedTableResource, # .ht() with version param
VariantDatasetResource, # .vds() to read
)
```

Usage:
```python
resource = TableResource(path="gs://gnomad/v4.1/constraint/metrics.ht")
ht = resource.ht() # Reads the table
```

## Key Constraint API

```python
from gnomad.utils.constraint import (
# Mutation rate
annotate_mutation_type,
annotate_with_mu,
calibration_model_group_expr,
# Consequence grouping
build_constraint_consequence_groups,
# Counting
single_variant_count_expr,
single_variant_observed_and_possible_expr,
count_observed_and_possible_by_group,
get_counts_agg_expr,
# Model building & application
build_models,
apply_models,
aggregate_expected_variants_expr,
# GERP
calculate_gerp_cutoffs,
# Constraint metrics
compute_pli,
oe_confidence_interval,
calculate_raw_z_score,
calculate_raw_z_score_sd,
get_constraint_flags,
# Ranking & binning
rank_and_assign_bins,
compute_percentile_thresholds,
annotate_bins_by_threshold,
)
```

## Maintaining CLAUDE.md

When working on any gnomAD repo that has a CLAUDE.md file, proactively add useful information you discover during development — gotchas, non-obvious API behavior, schema quirks, resource path conventions, or anything else that would save a future developer (or Claude session) time. Keep additions concise and placed in the appropriate section.

## CI/CD

- **Pre-commit hooks**: black, autopep8, pydocstyle, isort (run automatically on commit)
- **CI** (`.github/workflows/ci.yml`): Runs on push to main and PRs — black check, isort check, pydocstyle, autopep8 check, pylint, pytest
- **Publishing** (`.github/workflows/publish.yml`): Triggered by version tags (`v[0-9]+.[0-9]+.[0-9]+`) — validates version, runs tests, publishes to PyPI
1 change: 1 addition & 0 deletions gnomad/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
"""gnomAD utilities and resources package."""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't we exclude this file before to not interfere with pypi?

Loading
Loading