Commit 2b327c1

Merge pull request #149 from bigbio/dev
Integrate dev → main: plugin system + 1000 Genomes + Wave 1 (#119#123)
2 parents d255d33 + 3e1d216 commit 2b327c1

164 files changed

Lines changed: 5258 additions & 3225 deletions

File tree

README.md

Lines changed: 4 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -60,13 +60,14 @@ everyone depends on — neither imports upward.
6060

6161
### Data model
6262

63-
Three semantic artifact types live in [`hvantk/core/models/`](hvantk/core/models/),
63+
Four semantic artifact types live in [`hvantk/core/models/`](hvantk/core/models/),
6464
each backed by one of several native engines:
6565

6666
| Artifact | Backends | On-disk format | Used for |
6767
|---|---|---|---|
6868
| [`AnnotationTable`](hvantk/core/models/annotation_table.py) | `hail` / `pandas` | `.ht/` or `.parquet` | variants, gene-disease pairs, eQTLs, PTM sites |
69-
| [`ExpressionMatrix`](hvantk/core/models/expression_matrix.py) | `anndata` / `hail-mt` | `.h5ad` or `.mt/` | bulk + single-cell expression, proteomics matrices |
69+
| [`ExpressionMatrix`](hvantk/core/models/expression_matrix.py) | `anndata` | `.h5ad` | bulk + single-cell expression, proteomics matrices |
70+
| [`VariantMatrix`](hvantk/core/models/variant_matrix.py) | `hail-mt` | `.mt/` | multi-sample variant cohorts (genotypes × samples × multi-field entries) |
7071
| [`GeneSet`](hvantk/core/models/gene_set.py) | (in-memory `frozenset`) | `.geneset.json` | curated gene collections (CHD, MSigDB, …) |
7172

7273
Every artifact carries a [`Provenance`](hvantk/core/models/provenance.py)
@@ -130,7 +131,7 @@ Twenty plugins ship today: `clinvar`, `clingen`, `gencc`, `gwas-catalog`,
130131
```
131132
hvantk/
132133
├── core/ # platform substrate — stable contracts
133-
│ ├── models/ # AnnotationTable, ExpressionMatrix, GeneSet,
134+
│ ├── models/ # AnnotationTable, ExpressionMatrix, VariantMatrix, GeneSet,
134135
│ │ # Provenance, BuildContext, Expr DSL,
135136
│ │ # AlgorithmMeta (@algorithm decorator)
136137
│ ├── io/ # save / load / save_native / load_native,
@@ -163,7 +164,6 @@ hvantk/
163164
164165
├── tools/ # CLI wiring + workflow orchestration
165166
│ ├── plugins/ # download, drift, reprocess, plugins list
166-
│ ├── build/ # build_1k_genome (standalone reference panel build)
167167
│ ├── hgc/ # joint-genotyping CLI (lazy-loaded)
168168
│ ├── ancestry/, enrichex/, expression/, ptm/, qtl/, infra/, genesets/
169169
│ └── tools_cli.py # tool registry inspection
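
After this change each on-disk format in the README table maps to exactly one artifact type, with `.mt/` now belonging to `VariantMatrix` alone. A minimal sketch of that mapping follows; the helper `artifact_type_for` is hypothetical and not part of hvantk, and only the suffixes listed in the table are assumed:

```python
from pathlib import PurePosixPath

# Suffix -> artifact type, copied from the README table in this diff.
# The lookup helper below is illustrative only; hvantk's loader may differ.
SUFFIX_TO_ARTIFACT = {
    ".geneset.json": "GeneSet",
    ".parquet": "AnnotationTable",
    ".h5ad": "ExpressionMatrix",
    ".ht": "AnnotationTable",
    ".mt": "VariantMatrix",
}


def artifact_type_for(path: str) -> str:
    """Return the artifact type name implied by a path's suffix."""
    # Hail tables and matrix tables are directories, so drop a trailing slash.
    name = PurePosixPath(path.rstrip("/")).name
    for suffix, artifact in SUFFIX_TO_ARTIFACT.items():
        if name.endswith(suffix):
            return artifact
    raise ValueError(f"no artifact type registered for {path!r}")
```

Under the old layout a `.mt/` path was an `ExpressionMatrix` backend choice; with the split, the suffix alone identifies the artifact type.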

docs_site/architecture.md

Lines changed: 11 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -43,12 +43,11 @@ hvantk/
4343
│ ├── config.py # Configuration management
4444
│ ├── constants.py # Shared constants
4545
│ ├── protocols.py # Protocol definitions (Builder, Streamer, Downloader)
46-
│ ├── builders/ # Generic builder helpers
47-
│ │ └── table.py # _create_table_base, _cleanup_temp_file, etc.
4846
│ ├── io/ # Artifact loader (load/save Hail Tables, AnnData, etc.)
4947
│ ├── models/ # Domain model types
5048
│ │ ├── annotation_table.py # AnnotationTable artifact
51-
│ │ ├── expression_matrix.py # ExpressionMatrix artifact
49+
│ │ ├── expression_matrix.py # ExpressionMatrix artifact (AnnData-only)
50+
│ │ ├── variant_matrix.py # VariantMatrix artifact (Hail MatrixTable)
5251
│ │ ├── gene_set.py # GeneSet artifact
5352
│ │ ├── artifact.py # Artifact base + type registry
5453
│ │ ├── backends.py # AlgorithmMeta, Backend, @algorithm decorator
@@ -143,19 +142,15 @@ The codebase is organized by function and biological domain:
143142
**Data Builders** (`skills/<provider>/builder.py`):
144143
- Each plugin under `hvantk/skills/` owns its Phase B builder
145144
(`build_<provider>_<dataset>`). Builders return `AnnotationTable`,
146-
`ExpressionMatrix`, or `GeneSet` artifacts, stamped with `Provenance` by
147-
the platform via `run_builder_for_spec`.
145+
`ExpressionMatrix`, `VariantMatrix`, or `GeneSet` artifacts, stamped
146+
with `Provenance` by the platform via `run_builder_for_spec`.
148147
- `hvantk reprocess <provider>:<dataset>` is the **only** public build
149148
path. There is no separate programmatic API; in-process callers that
150149
need to build a table inside a tool/pipeline invoke
151150
`hvantk.core.plugin.run_builder.run_builder_for_spec` directly.
152151
- Generic Hail helpers live in `hvantk/core/utils/hail_helpers.py`
153152
(`create_table_base`, `cleanup_temp_file`); QTL-shared helpers in
154153
`hvantk/core/utils/qtl_helpers.py`.
155-
- **Exception:** the 1000 Genomes builder
156-
(`hvantk/core/builders/genome.py`) is the one remaining non-plugin
157-
builder, tracked by [#116] for plugin migration. Until then, the legacy
158-
`hvantk build-1k-genome` CLI remains the entry point for that dataset.
159154

160155
**Analysis Pipelines** (separate modules):
161156
- `hgc/` - Joint genotyping and cohort analysis
@@ -181,7 +176,8 @@ Every data product is one of three semantic artifact types in
181176
| Artifact | Backends | On-disk format | Used for |
182177
|---|---|---|---|
183178
| `AnnotationTable` | `hail` / `pandas` | `.ht/` or `.parquet` | variants, gene-disease pairs, eQTLs, PTM sites |
184-
| `ExpressionMatrix` | `anndata` / `hail-mt` | `.h5ad` or `.mt/` | bulk + single-cell expression, proteomics matrices |
179+
| `ExpressionMatrix` | `anndata` | `.h5ad` | bulk + single-cell expression, proteomics matrices |
180+
| `VariantMatrix` | `hail-mt` | `.mt/` | multi-sample variant cohorts (genotypes × samples × multi-field entries) |
185181
| `GeneSet` | (in-memory `frozenset`) | `.geneset.json` | curated gene collections |
186182

187183
Each artifact carries a `Provenance` record (plugin, version, source
@@ -401,7 +397,7 @@ annotated = variants.annotate(
401397
- `constants.py` - Shared constants (e.g., Ensembl field definitions)
402398
- `utils/hail_context.py` - Hail session initialization and management
403399
- `protocols.py` - Protocol definitions for extensibility
404-
- `models/` - Domain artifact types (`AnnotationTable`, `ExpressionMatrix`, `GeneSet`)
400+
- `models/` - Domain artifact types (`AnnotationTable`, `ExpressionMatrix`, `VariantMatrix`, `GeneSet`)
405401
- `plugin/` - Plugin schema (`api.py`), discovery (`loader.py`), and builder dispatch (`run_builder.py`)
406402

407403
**Design principle**: No domain logic, only infrastructure
@@ -410,11 +406,12 @@ annotated = variants.annotate(
410406

411407
**Purpose**: Per-provider data plugins. Each provider folder contains `plugin.yaml`, `builder.py`, `cli.py`, `drift_probe.py`, `SKILL.md`, `catalog/datasets.json`, and `tests/`. Multi-dataset providers (e.g., `cptac/`) have one sub-folder per dataset.
412408

413-
**Current providers** (20): `alphagenome`, `clingen`, `clinvar`, `cosmic_cgc`, `cptac`, `dbnsfp`, `ensembl_gene`, `expression_atlas`, `gencc`, `gevir`, `gnomad_metrics`, `gtex_eqtl`, `gwas_catalog`, `hgnc`, `insider`, `msigdb`, `peptideatlas`, `pqtl`, `ucsc_cellbrowser`, `uniprot_ptm`.
409+
**Current providers** (21): `alphagenome`, `clingen`, `clinvar`, `cosmic_cgc`, `cptac`, `dbnsfp`, `ensembl_gene`, `expression_atlas`, `gencc`, `gevir`, `gnomad_metrics`, `gtex_eqtl`, `gwas_catalog`, `hgnc`, `insider`, `msigdb`, `onek_genomes`, `peptideatlas`, `pqtl`, `ucsc_cellbrowser`, `uniprot_ptm`.
414410

415411
**Builder outputs**:
416412
- Variant / gene tables keyed by `(locus, alleles)` or `gene_id` → `AnnotationTable`
417413
- Expression matrices rows=genes, columns=samples/cells → `ExpressionMatrix`
414+
- Multi-sample variant cohorts (variants × samples × genotypes) → `VariantMatrix`
418415
- Gene set collections → `GeneSet`
419416

420417
### Tools Module (`tools/`)
@@ -459,7 +456,7 @@ Peer of `core/`, not a layer above it.
459456
| `resources/schemas/` | JSON schemas that describe catalog / dataset metadata, shared across plugins. |
460457
| `resources/unified_registry.py` | Code that aggregates per-plugin catalog JSON with the legacy registry. |
461458
| `skills/<provider>/catalog/datasets.json` | Per-plugin dataset metadata (the canonical location for new providers). |
462-
| `core/models/` | Artifact types (`AnnotationTable`, `ExpressionMatrix`, `GeneSet`) — runtime data shapes, not catalog metadata. |
459+
| `core/models/` | Artifact types (`AnnotationTable`, `ExpressionMatrix`, `VariantMatrix`, `GeneSet`) — runtime data shapes, not catalog metadata. |
463460

464461
**Dependency direction**: `resources/` may be imported by `algorithms/`,
465462
`skills/`, and `tools/`. It must NOT import from any of those — like
@@ -490,7 +487,7 @@ See the "Plugin Contract" section above for the full pattern. The minimal
490487
checklist:
491488

492489
1. Create `hvantk/skills/<provider>/` with `plugin.yaml`, `builder.py` (returns
493-
`AnnotationTable` / `ExpressionMatrix` / `GeneSet` via `ctx.provenance(schema_id=…)`),
490+
`AnnotationTable` / `ExpressionMatrix` / `VariantMatrix` / `GeneSet` via `ctx.provenance(schema_id=…)`),
494491
`drift_probe.py`, `SKILL.md`, `catalog/datasets.json`, and `tests/`.
495492
2. Loader picks it up automatically — no edits to `hvantk/hvantk.py` or
496493
`hvantk/tools/plugins/download_cli.py` required.
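
The dependency-direction rule in `docs_site/architecture.md` (`resources/` must not import from `algorithms/`, `skills/`, or `tools/`) can be checked statically. The following is a minimal sketch using the standard-library `ast` module. The function name `forbidden_imports` is hypothetical, and this is not hvantk's actual guard, whose implementation is not shown in this commit:

```python
import ast
from typing import List, Sequence


def forbidden_imports(source: str, forbidden_prefixes: Sequence[str]) -> List[str]:
    """List absolute imports in `source` that fall under a forbidden package prefix."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Relative imports (level > 0) are skipped in this sketch.
            names = [node.module]
        else:
            continue
        for name in names:
            if any(name == p or name.startswith(p + ".") for p in forbidden_prefixes):
                hits.append(name)
    return hits
```

Because `ast.walk` visits nested nodes, imports placed inside function bodies are reported as well as module-level ones.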

docs_site/guide/data-sources.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -194,7 +194,7 @@ hvantk reprocess insider:variants \
194194
Ensembl gene annotations (gene name, gene ID, biotype, transcript ID).
195195
URL: https://www.ensembl.org/info/data/ftp/index.html
196196
197-
**Download**: Export from BioMart with the required attributes matching `ENSEMBL_BIOMART_FIELDS` in `hvantk/core/constants.py`. Alternatively, download from the Ensembl FTP:
197+
**Download**: Export from BioMart with the required attributes matching `ENSEMBL_BIOMART_FIELDS` in `hvantk/skills/ensembl_gene/shared/constants.py`. Alternatively, download from the Ensembl FTP:
198198
https://www.ensembl.org/info/data/ftp/index.html
199199
200200
**Build**:

examples/1k_genome/README.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,8 +6,8 @@ Scripts for building a Hail MatrixTable from 1000 Genomes NYGC high-coverage dat
66

77
| Script | Description |
88
|--------|-------------|
9-
| `build_1kg_nygc.py` | Python API: build reference MatrixTable from per-chromosome VCFs |
10-
| `build_1kg_nygc_cli.py` | CLI wrapper: exercises `hvantk utils build-1k-genome` as a subprocess |
9+
| `build_1kg_nygc.py` | Stages NYGC-pattern VCFs and shells out to `hvantk reprocess onek-genomes:variants` |
10+
| `build_1kg_nygc_cli.py` | Same as `build_1kg_nygc.py`, kept as a separate file for backward compatibility |
1111

1212
## Prerequisites
1313

examples/1k_genome/build_1kg_nygc.py

Lines changed: 27 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -179,29 +179,36 @@ def main(
179179
f"No recalibrated genotype VCFs found in {vcf_dir}"
180180
)
181181

182-
# -- Build MatrixTable --
183-
from hvantk.core.builders.genome import build_1k_genome_mt, resolve_delimiter
184-
185-
mt = build_1k_genome_mt(
186-
input_vcfs=stage_dir,
187-
output_mt=output_mt,
188-
sample_annotations=sample_annotations_path,
189-
sample_annotations_delimiter=resolve_delimiter(
190-
sample_annotations_delimiter
191-
),
192-
reference_genome=reference_genome,
193-
chromosomes=chrom_list,
194-
overwrite=overwrite,
195-
auto_convert_bgz=auto_convert_bgz,
196-
)
197-
198-
n_variants = mt.count_rows()
199-
n_samples = mt.count_cols()
182+
# -- Build via hvantk reprocess onek-genomes:variants --
183+
# The legacy `build_1k_genome_mt` Python helper was retired in #116;
184+
# all builds now route through the plugin system.
185+
import subprocess
186+
187+
cmd = [
188+
"hvantk", "reprocess", "onek-genomes:variants",
189+
"--raw-dir", stage_dir,
190+
"--output", output_mt,
191+
"--skip-download",
192+
"--plugin-arg", f"reference_genome={reference_genome}",
193+
"--plugin-arg", f"auto_convert_bgz={'true' if auto_convert_bgz else 'false'}",
194+
]
195+
if chrom_list:
196+
cmd += ["--plugin-arg", f"chromosomes={','.join(chrom_list)}"]
197+
if sample_annotations_path is not None:
198+
logger.warning(
199+
"--sample-annotations is no longer baked into the variants build "
200+
"(post-#116). Run `hvantk reprocess onek-genomes:samples` to "
201+
"build the canonical IGSR samples table, and join post-load. "
202+
"Provided sample_annotations argument (%s) will be ignored.",
203+
sample_annotations_path,
204+
)
205+
206+
logger.info("Invoking: %s", " ".join(cmd))
207+
subprocess.run(cmd, check=True)
208+
200209
logger.info("=" * 42)
201210
logger.info(" Completed : %s", datetime.now().isoformat(timespec="seconds"))
202211
logger.info(" MatrixTable : %s", output_mt)
203-
logger.info(" Variants : %s", f"{n_variants:,}")
204-
logger.info(" Samples : %s", f"{n_samples:,}")
205212
logger.info("=" * 42)
206213
finally:
207214
shutil.rmtree(stage_dir, ignore_errors=True)
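
The command assembly added in this hunk can be isolated into a small function, which makes it testable without launching a subprocess. A sketch under the assumption that the flags are exactly those shown in the diff (`--raw-dir`, `--output`, `--skip-download`, `--plugin-arg`); the function name `build_reprocess_cmd` and the example paths are hypothetical:

```python
import shlex
from typing import List, Optional, Sequence


def build_reprocess_cmd(
    raw_dir: str,
    output_mt: str,
    reference_genome: str,
    auto_convert_bgz: bool = False,
    chromosomes: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble the argv list for `hvantk reprocess onek-genomes:variants`."""
    cmd = [
        "hvantk", "reprocess", "onek-genomes:variants",
        "--raw-dir", raw_dir,
        "--output", output_mt,
        "--skip-download",
        "--plugin-arg", f"reference_genome={reference_genome}",
        "--plugin-arg", f"auto_convert_bgz={'true' if auto_convert_bgz else 'false'}",
    ]
    if chromosomes:
        # Chromosomes travel as one comma-joined plugin argument.
        cmd += ["--plugin-arg", f"chromosomes={','.join(chromosomes)}"]
    return cmd


cmd = build_reprocess_cmd("/tmp/stage", "/data/1kg.mt", "GRCh38", chromosomes=["chr21", "chr22"])
# shlex.join quotes each argument, so the logged line can be pasted into a shell.
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` as the script does; `shlex.join` is only used for the log line and is more robust than `" ".join(cmd)` when paths contain spaces.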

examples/1k_genome/build_1kg_nygc_cli.py

Lines changed: 23 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,9 @@
11
#!/usr/bin/env python3
22
"""End-to-end CLI wrapper for building a 1000 Genomes MatrixTable via ``hvantk``.
33
4-
Unlike ``build_1kg_nygc.py`` (which calls the Python API directly), this script
5-
invokes ``hvantk build-1k-genome`` as a subprocess so the full CLI path is
6-
exercised — useful for catching argument parsing bugs, entry-point issues, etc.
4+
Stages NYGC-pattern VCFs and invokes ``hvantk reprocess onek-genomes:variants``
5+
as a subprocess so the full CLI path is exercised — useful for catching argument
6+
parsing bugs, entry-point issues, etc.
77
88
Example usage::
99
@@ -141,7 +141,7 @@ def main(
141141
\b
142142
Stages only recalibrated genotype VCFs (*_chr*.recalibrated_variants.vcf.gz)
143143
into a temporary directory, excluding annotated and "others" contig files,
144-
then invokes ``hvantk build-1k-genome`` as a subprocess.
144+
then invokes ``hvantk reprocess onek-genomes:variants`` as a subprocess.
145145
"""
146146
# Resolve sample annotations: absolute path used as-is, otherwise relative to vcf_dir
147147
sample_annotations_path = None
@@ -176,27 +176,31 @@ def main(
176176
f"No recalibrated genotype VCFs found in {vcf_dir}"
177177
)
178178

179-
# -- Build CLI command --
179+
# -- Build CLI command (post-#116: routes through the plugin system) --
180180
cmd = [
181181
"hvantk",
182-
"build-1k-genome",
183-
"--input-vcfs",
182+
"reprocess",
183+
"onek-genomes:variants",
184+
"--raw-dir",
184185
stage_dir,
185-
"--output-mt",
186+
"--output",
186187
output_mt,
187-
"--reference-genome",
188-
reference_genome,
188+
"--skip-download",
189+
"--plugin-arg",
190+
f"reference_genome={reference_genome}",
191+
"--plugin-arg",
192+
f"auto_convert_bgz={'true' if auto_convert_bgz else 'false'}",
189193
]
190194
if chromosomes:
191-
cmd += ["--chromosomes", chromosomes]
195+
cmd += ["--plugin-arg", f"chromosomes={chromosomes}"]
192196
if sample_annotations_path:
193-
cmd += ["--sample-annotations", sample_annotations_path]
194-
if sample_annotations_delimiter is not None:
195-
cmd += ["--sample-annotations-delimiter", sample_annotations_delimiter]
196-
if overwrite:
197-
cmd.append("--overwrite")
198-
if auto_convert_bgz:
199-
cmd.append("--auto-convert-bgz")
197+
logger.warning(
198+
"--sample-annotations is no longer accepted by the variants build "
199+
"(post-#116). Run `hvantk reprocess onek-genomes:samples` to "
200+
"build the canonical IGSR samples table, and join post-load. "
201+
"Provided sample_annotations argument (%s) will be ignored.",
202+
sample_annotations_path,
203+
)
200204

201205
logger.info("Running: %s", " ".join(cmd))
202206
try:
@@ -206,7 +210,7 @@ def main(
206210

207211
if result.returncode != 0:
208212
raise click.ClickException(
209-
f"hvantk build-1k-genome exited with code {result.returncode}"
213+
f"hvantk reprocess onek-genomes:variants exited with code {result.returncode}"
210214
)
211215

212216
logger.info("=" * 42)
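
Both wrappers now encode builder options as `--plugin-arg key=value` strings, with booleans written as `true`/`false` and chromosomes comma-joined. How hvantk decodes these values is not shown in this diff. The sketch below is only one plausible decoding, written to make the encoding convention explicit; `parse_plugin_args` is a hypothetical name:

```python
from typing import Dict, Iterable, List, Union

Value = Union[str, bool, List[str]]


def parse_plugin_args(pairs: Iterable[str]) -> Dict[str, Value]:
    """Decode `key=value` strings as emitted by the example scripts (assumed format)."""
    parsed: Dict[str, Value] = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        if raw in ("true", "false"):
            # Booleans are serialized as lowercase literals by the wrappers.
            parsed[key] = raw == "true"
        elif key == "chromosomes":
            # Comma-joined list; empty items are dropped.
            parsed[key] = [c for c in raw.split(",") if c]
        else:
            parsed[key] = raw
    return parsed
```

Splitting on the first `=` only (via `str.partition`) keeps values that themselves contain `=` intact.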

examples/clingen/run_ontology_categorization.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -74,7 +74,7 @@ def main():
7474
"""Main workflow with ontology-based categorization."""
7575
import hail as hl
7676
from hvantk.skills.clingen.streamer import ClinGenStreamer
77-
from hvantk.core.utils.mondo_parser import MONDO_DISEASE_CATEGORIES
77+
from hvantk.core.ontology.mondo import MONDO_DISEASE_CATEGORIES
7878

7979
# Ensure data files exist
8080
ensure_data_files()

hvantk/algorithms/ancestry/pipeline.py

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -786,11 +786,10 @@ def run_ancestry_inference(
786786
Algorithm-side input type
787787
-------------------------
788788
This algorithm consumes raw ``hl.MatrixTable`` instances (genotype
789-
data) rather than the platform's ``ExpressionMatrix`` artifact.
790-
Phase J added a ``hail-mt`` backend to ``ExpressionMatrix``, so
791-
callers may wrap their MatrixTables via
792-
``ExpressionMatrix.from_hail_mt(mt, provenance=...)`` and call
793-
``em.to_hail_mt()`` to retrieve the native object; the
789+
data) rather than the platform's ``VariantMatrix`` artifact.
790+
Callers may wrap their MatrixTables via
791+
``VariantMatrix.from_hail_mt(mt, provenance=...)`` and call
792+
``vm.to_hail_mt()`` to retrieve the native object; the
794793
``required_backend="hail"`` declaration on the ``@algorithm``
795794
decorator still applies because the algorithm body operates
796795
natively on Hail. The wrapping pattern is the canonical way to

hvantk/algorithms/annotation/annotate.py

Lines changed: 21 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,27 @@ def annotate_clinvar_clnsig(t: hl.Table) -> hl.Table:
1818
1919
Variants are annotated with a clinical significance label based on ClinVar data: "P" for pathogenic, "B" for benign, or missing if neither applies. The annotation is determined by matching ClinVar CLNSIG values against predefined sets of pathogenic and benign labels.
2020
"""
21-
from hvantk.core.constants import CLINVAR_PATHOGENIC_LABELS, CLINVAR_BENIGN_LABELS
21+
# TEMP duplication — tracked by https://github.com/bigbio/hvantk/issues/133
22+
#
23+
# These label sets are also defined in
24+
# hvantk/skills/clinvar/shared/constants.py. Importing them from
25+
# there would violate the algorithms-must-not-import-from-skills
26+
# dependency guard.
27+
#
28+
# The proper fix — parameterizing the schema and vocabulary so this
29+
# function accepts any conformant pathogenicity-labeled table, not
30+
# just ClinVar — is tracked by issue #133. Remove this duplication
31+
# when that parameterization lands.
32+
CLINVAR_PATHOGENIC_LABELS = [
33+
"Pathogenic/Likely_pathogenic",
34+
"Likely_pathogenic",
35+
"Pathogenic",
36+
]
37+
CLINVAR_BENIGN_LABELS = [
38+
"Benign/Likely_benign",
39+
"Likely_benign",
40+
"Benign",
41+
]
2242

2343
logger.info("Annotating ClinVar CLNSIG")
2444
clinvar_ht = load_legacy_table("clinvar")
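
The labeling rule in the docstring ("P" for pathogenic, "B" for benign, missing otherwise) can be illustrated in plain Python, with the vocabularies passed as parameters in the spirit of the parameterization proposed in issue #133. This is a stand-in sketch, not the Hail expression used by `annotate_clinvar_clnsig`. It assumes CLNSIG arrives as a collection of strings and that a pathogenic match takes precedence when both sets match; neither detail is confirmed by this diff:

```python
from typing import FrozenSet, Iterable, Optional

# Label sets copied from the diff above.
PATHOGENIC = frozenset({"Pathogenic/Likely_pathogenic", "Likely_pathogenic", "Pathogenic"})
BENIGN = frozenset({"Benign/Likely_benign", "Likely_benign", "Benign"})


def label_clnsig(
    clnsig: Iterable[str],
    pathogenic: FrozenSet[str] = PATHOGENIC,
    benign: FrozenSet[str] = BENIGN,
) -> Optional[str]:
    """Return "P", "B", or None for a collection of CLNSIG values.

    Pass a list or set of strings, not a bare string.
    """
    values = set(clnsig)
    if values & pathogenic:
        return "P"  # assumed precedence: pathogenic wins over benign
    if values & benign:
        return "B"
    return None
```

Accepting the label sets as arguments means the same function works for any table that supplies its own pathogenic and benign vocabularies, which would remove the need for the duplicated constants.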
