@@ -43,12 +43,11 @@ hvantk/
 │   ├── config.py           # Configuration management
 │   ├── constants.py        # Shared constants
 │   ├── protocols.py        # Protocol definitions (Builder, Streamer, Downloader)
-│   ├── builders/           # Generic builder helpers
-│   │   └── table.py        # _create_table_base, _cleanup_temp_file, etc.
 │   ├── io/                 # Artifact loader (load/save Hail Tables, AnnData, etc.)
 │   ├── models/             # Domain model types
 │   │   ├── annotation_table.py    # AnnotationTable artifact
-│   │   ├── expression_matrix.py   # ExpressionMatrix artifact
+│   │   ├── expression_matrix.py   # ExpressionMatrix artifact (AnnData-only)
+│   │   ├── variant_matrix.py      # VariantMatrix artifact (Hail MatrixTable)
 │   │   ├── gene_set.py            # GeneSet artifact
 │   │   ├── artifact.py            # Artifact base + type registry
 │   │   ├── backends.py            # AlgorithmMeta, Backend, @algorithm decorator
@@ -143,19 +142,15 @@ The codebase is organized by function and biological domain:
 **Data Builders** (`skills/<provider>/builder.py`):
 - Each plugin under `hvantk/skills/` owns its Phase B builder
   (`build_<provider>_<dataset>`). Builders return `AnnotationTable`,
-  `ExpressionMatrix`, or `GeneSet` artifacts, stamped with `Provenance` by
-  the platform via `run_builder_for_spec`.
+  `ExpressionMatrix`, `VariantMatrix`, or `GeneSet` artifacts, stamped
+  with `Provenance` by the platform via `run_builder_for_spec`.
 - `hvantk reprocess <provider>:<dataset>` is the **only** public build
   path. There is no separate programmatic API; in-process callers that
   need to build a table inside a tool/pipeline invoke
   `hvantk.core.plugin.run_builder.run_builder_for_spec` directly.
 - Generic Hail helpers live in `hvantk/core/utils/hail_helpers.py`
   (`create_table_base`, `cleanup_temp_file`); QTL-shared helpers in
   `hvantk/core/utils/qtl_helpers.py`.
-- **Exception:** the 1000 Genomes builder
-  (`hvantk/core/builders/genome.py`) is the one remaining non-plugin
-  builder, tracked by [#116] for plugin migration. Until then, the legacy
-  `hvantk build-1k-genome` CLI remains the entry point for that dataset.
 
 **Analysis Pipelines** (separate modules):
 - `hgc/` - Joint genotyping and cohort analysis
@@ -181,7 +176,8 @@ Every data product is one of three semantic artifact types in
 | Artifact | Backends | On-disk format | Used for |
 |---|---|---|---|
 | `AnnotationTable` | `hail` / `pandas` | `.ht/` or `.parquet` | variants, gene-disease pairs, eQTLs, PTM sites |
-| `ExpressionMatrix` | `anndata` / `hail-mt` | `.h5ad` or `.mt/` | bulk + single-cell expression, proteomics matrices |
+| `ExpressionMatrix` | `anndata` | `.h5ad` | bulk + single-cell expression, proteomics matrices |
+| `VariantMatrix` | `hail-mt` | `.mt/` | multi-sample variant cohorts (genotypes × samples × multi-field entries) |
 | `GeneSet` | (in-memory `frozenset`) | `.geneset.json` | curated gene collections |
 
 Each artifact carries a `Provenance` record (plugin, version, source
@@ -401,7 +397,7 @@ annotated = variants.annotate(
 - `constants.py` - Shared constants (e.g., Ensembl field definitions)
 - `utils/hail_context.py` - Hail session initialization and management
 - `protocols.py` - Protocol definitions for extensibility
-- `models/` - Domain artifact types (`AnnotationTable`, `ExpressionMatrix`, `GeneSet`)
+- `models/` - Domain artifact types (`AnnotationTable`, `ExpressionMatrix`, `VariantMatrix`, `GeneSet`)
 - `plugin/` - Plugin schema (`api.py`), discovery (`loader.py`), and builder dispatch (`run_builder.py`)
 
 **Design principle**: No domain logic, only infrastructure
@@ -410,11 +406,12 @@ annotated = variants.annotate(
 
 **Purpose**: Per-provider data plugins. Each provider folder contains `plugin.yaml`, `builder.py`, `cli.py`, `drift_probe.py`, `SKILL.md`, `catalog/datasets.json`, and `tests/`. Multi-dataset providers (e.g., `cptac/`) have one sub-folder per dataset.
 
-**Current providers** (20): `alphagenome`, `clingen`, `clinvar`, `cosmic_cgc`, `cptac`, `dbnsfp`, `ensembl_gene`, `expression_atlas`, `gencc`, `gevir`, `gnomad_metrics`, `gtex_eqtl`, `gwas_catalog`, `hgnc`, `insider`, `msigdb`, `peptideatlas`, `pqtl`, `ucsc_cellbrowser`, `uniprot_ptm`.
+**Current providers** (21): `alphagenome`, `clingen`, `clinvar`, `cosmic_cgc`, `cptac`, `dbnsfp`, `ensembl_gene`, `expression_atlas`, `gencc`, `gevir`, `gnomad_metrics`, `gtex_eqtl`, `gwas_catalog`, `hgnc`, `insider`, `msigdb`, `onek_genomes`, `peptideatlas`, `pqtl`, `ucsc_cellbrowser`, `uniprot_ptm`.
 
 **Builder outputs**:
 - Variant / gene tables keyed by `(locus, alleles)` or `gene_id` → `AnnotationTable`
 - Expression matrices rows=genes, columns=samples/cells → `ExpressionMatrix`
+- Multi-sample variant cohorts (variants × samples × genotypes) → `VariantMatrix`
 - Gene set collections → `GeneSet`
 
 ### Tools Module (`tools/`)
@@ -459,7 +456,7 @@ Peer of `core/`, not a layer above it.
 | `resources/schemas/` | JSON schemas that describe catalog / dataset metadata, shared across plugins. |
 | `resources/unified_registry.py` | Code that aggregates per-plugin catalog JSON with the legacy registry. |
 | `skills/<provider>/catalog/datasets.json` | Per-plugin dataset metadata (the canonical location for new providers). |
-| `core/models/` | Artifact types (`AnnotationTable`, `ExpressionMatrix`, `GeneSet`) — runtime data shapes, not catalog metadata. |
+| `core/models/` | Artifact types (`AnnotationTable`, `ExpressionMatrix`, `VariantMatrix`, `GeneSet`) — runtime data shapes, not catalog metadata. |
 
 **Dependency direction**: `resources/` may be imported by `algorithms/`,
 `skills/`, and `tools/`. It must NOT import from any of those — like
@@ -490,7 +487,7 @@ See the "Plugin Contract" section above for the full pattern. The minimal
 checklist:
 
 1. Create `hvantk/skills/<provider>/` with `plugin.yaml`, `builder.py` (returns
-   `AnnotationTable` / `ExpressionMatrix` / `GeneSet` via `ctx.provenance(schema_id=…)`),
+   `AnnotationTable` / `ExpressionMatrix` / `VariantMatrix` / `GeneSet` via `ctx.provenance(schema_id=…)`),
    `drift_probe.py`, `SKILL.md`, `catalog/datasets.json`, and `tests/`.
 2. Loader picks it up automatically — no edits to `hvantk/hvantk.py` or
    `hvantk/tools/plugins/download_cli.py` required.
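
Note on the artifact table changed above: after this change each on-disk format belongs to exactly one artifact type, since `.mt/` moves from `ExpressionMatrix` to the new `VariantMatrix`. A minimal sketch of that mapping as a suffix lookup follows. The helper `classify_artifact` and the dictionary name are hypothetical illustrations, not part of hvantk; only the suffixes, type names, and backend names come from the table.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Suffix -> (artifact type, backend), transcribed from the artifact table
# in the diff. The lookup itself is a hypothetical helper, not hvantk API.
ARTIFACT_BY_SUFFIX = {
    ".geneset.json": ("GeneSet", "frozenset"),
    ".parquet": ("AnnotationTable", "pandas"),
    ".h5ad": ("ExpressionMatrix", "anndata"),
    ".ht": ("AnnotationTable", "hail"),
    ".mt": ("VariantMatrix", "hail-mt"),
}


def classify_artifact(path: str) -> tuple[str, str]:
    """Return (artifact type, backend) for an on-disk artifact path."""
    # Hail tables and matrix tables are directories, so drop a trailing slash.
    name = PurePosixPath(path.rstrip("/")).name
    for suffix, info in ARTIFACT_BY_SUFFIX.items():
        if name.endswith(suffix):
            return info
    raise ValueError(f"unrecognized artifact path: {path!r}")


if __name__ == "__main__":
    for example in ("cohort.mt/", "clinvar.ht", "atlas.h5ad", "panel.geneset.json"):
        print(example, "->", classify_artifact(example))
```

Because the table no longer lists `.mt/` under `ExpressionMatrix`, a lookup like this needs no tie-breaking rule between the two matrix types.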