bigbio
diff --git a/README.md b/README.md
Lines changed: 4 additions & 4 deletions
diff --git a/docs_site/architecture.md b/docs_site/architecture.md
Lines changed: 11 additions & 14 deletions
diff --git a/docs_site/guide/data-sources.md b/docs_site/guide/data-sources.md
Lines changed: 1 addition & 1 deletion
diff --git a/examples/1k_genome/README.md b/examples/1k_genome/README.md
Lines changed: 2 additions & 2 deletions
diff --git a/examples/1k_genome/build_1kg_nygc.py b/examples/1k_genome/build_1kg_nygc.py
Lines changed: 27 additions & 20 deletions
diff --git a/examples/1k_genome/build_1kg_nygc_cli.py b/examples/1k_genome/build_1kg_nygc_cli.py
Lines changed: 23 additions & 19 deletions
diff --git a/examples/clingen/run_ontology_categorization.py b/examples/clingen/run_ontology_categorization.py
Lines changed: 1 addition & 1 deletion
diff --git a/hvantk/algorithms/ancestry/pipeline.py b/hvantk/algorithms/ancestry/pipeline.py
Lines changed: 4 additions & 5 deletions
diff --git a/hvantk/algorithms/annotation/annotate.py b/hvantk/algorithms/annotation/annotate.py
Lines changed: 21 additions & 1 deletion
@@ -60,13 +60,14 @@ everyone depends on — neither imports upward.
 
 ### Data model
 
-Three semantic artifact types live in [`hvantk/core/models/`](hvantk/core/models/),
+Four semantic artifact types live in [`hvantk/core/models/`](hvantk/core/models/),
 each backed by one of several native engines:
 
 | Artifact | Backends | On-disk format | Used for |
 |---|---|---|---|
 | [`AnnotationTable`](hvantk/core/models/annotation_table.py) | `hail` / `pandas` | `.ht/` or `.parquet` | variants, gene-disease pairs, eQTLs, PTM sites |
-| [`ExpressionMatrix`](hvantk/core/models/expression_matrix.py) | `anndata` / `hail-mt` | `.h5ad` or `.mt/` | bulk + single-cell expression, proteomics matrices |
+| [`ExpressionMatrix`](hvantk/core/models/expression_matrix.py) | `anndata` | `.h5ad` | bulk + single-cell expression, proteomics matrices |
+| [`VariantMatrix`](hvantk/core/models/variant_matrix.py) | `hail-mt` | `.mt/` | multi-sample variant cohorts (variants × samples, with multi-field genotype entries) |
 | [`GeneSet`](hvantk/core/models/gene_set.py) | (in-memory `frozenset`) | `.geneset.json` | curated gene collections (CHD, MSigDB, …) |
 
 Every artifact carries a [`Provenance`](hvantk/core/models/provenance.py)
@@ -130,7 +131,7 @@ Twenty plugins ship today: `clinvar`, `clingen`, `gencc`, `gwas-catalog`,
 ```
 hvantk/
 ├── core/                       # platform substrate — stable contracts
-│   ├── models/                 # AnnotationTable, ExpressionMatrix, GeneSet,
+│   ├── models/                 # AnnotationTable, ExpressionMatrix, VariantMatrix, GeneSet,
 │   │                           #   Provenance, BuildContext, Expr DSL,
 │   │                           #   AlgorithmMeta (@algorithm decorator)
 │   ├── io/                     # save / load / save_native / load_native,
@@ -163,7 +164,6 @@ hvantk/
 │
 ├── tools/                      # CLI wiring + workflow orchestration
 │   ├── plugins/                # download, drift, reprocess, plugins list
-│   ├── build/                  # build_1k_genome (standalone reference panel build)
 │   ├── hgc/                    # joint-genotyping CLI (lazy-loaded)
 │   ├── ancestry/, enrichex/, expression/, ptm/, qtl/, infra/, genesets/
 │   └── tools_cli.py            # tool registry inspection
 
@@ -43,12 +43,11 @@ hvantk/
 │   ├── config.py          # Configuration management
 │   ├── constants.py       # Shared constants
 │   ├── protocols.py       # Protocol definitions (Builder, Streamer, Downloader)
-│   ├── builders/          # Generic builder helpers
-│   │   └── table.py       # _create_table_base, _cleanup_temp_file, etc.
 │   ├── io/                # Artifact loader (load/save Hail Tables, AnnData, etc.)
 │   ├── models/            # Domain model types
 │   │   ├── annotation_table.py  # AnnotationTable artifact
-│   │   ├── expression_matrix.py # ExpressionMatrix artifact
+│   │   ├── expression_matrix.py # ExpressionMatrix artifact (AnnData-only)
+│   │   ├── variant_matrix.py    # VariantMatrix artifact (Hail MatrixTable)
 │   │   ├── gene_set.py          # GeneSet artifact
 │   │   ├── artifact.py          # Artifact base + type registry
 │   │   ├── backends.py          # AlgorithmMeta, Backend, @algorithm decorator
@@ -143,19 +142,15 @@ The codebase is organized by function and biological domain:
 **Data Builders** (`skills/<provider>/builder.py`):
 - Each plugin under `hvantk/skills/` owns its Phase B builder
   (`build_<provider>_<dataset>`). Builders return `AnnotationTable`,
-  `ExpressionMatrix`, or `GeneSet` artifacts, stamped with `Provenance` by
-  the platform via `run_builder_for_spec`.
+  `ExpressionMatrix`, `VariantMatrix`, or `GeneSet` artifacts, stamped
+  with `Provenance` by the platform via `run_builder_for_spec`.
 - `hvantk reprocess <provider>:<dataset>` is the **only** public build
   path. There is no separate programmatic API; in-process callers that
   need to build a table inside a tool/pipeline invoke
   `hvantk.core.plugin.run_builder.run_builder_for_spec` directly.
 - Generic Hail helpers live in `hvantk/core/utils/hail_helpers.py`
   (`create_table_base`, `cleanup_temp_file`); QTL-shared helpers in
   `hvantk/core/utils/qtl_helpers.py`.
-- **Exception:** the 1000 Genomes builder
-  (`hvantk/core/builders/genome.py`) is the one remaining non-plugin
-  builder, tracked by [#116] for plugin migration. Until then, the legacy
-  `hvantk build-1k-genome` CLI remains the entry point for that dataset.
 
 **Analysis Pipelines** (separate modules):
 - `hgc/` - Joint genotyping and cohort analysis
@@ -181,7 +176,8 @@ Every data product is one of three semantic artifact types in
 | Artifact | Backends | On-disk format | Used for |
 |---|---|---|---|
 | `AnnotationTable` | `hail` / `pandas` | `.ht/` or `.parquet` | variants, gene-disease pairs, eQTLs, PTM sites |
-| `ExpressionMatrix` | `anndata` / `hail-mt` | `.h5ad` or `.mt/` | bulk + single-cell expression, proteomics matrices |
+| `ExpressionMatrix` | `anndata` | `.h5ad` | bulk + single-cell expression, proteomics matrices |
+| `VariantMatrix` | `hail-mt` | `.mt/` | multi-sample variant cohorts (variants × samples, with multi-field genotype entries) |
 | `GeneSet` | (in-memory `frozenset`) | `.geneset.json` | curated gene collections |
 
 Each artifact carries a `Provenance` record (plugin, version, source
@@ -401,7 +397,7 @@ annotated = variants.annotate(
 - `constants.py` - Shared constants (e.g., Ensembl field definitions)
 - `utils/hail_context.py` - Hail session initialization and management
 - `protocols.py` - Protocol definitions for extensibility
-- `models/` - Domain artifact types (`AnnotationTable`, `ExpressionMatrix`, `GeneSet`)
+- `models/` - Domain artifact types (`AnnotationTable`, `ExpressionMatrix`, `VariantMatrix`, `GeneSet`)
 - `plugin/` - Plugin schema (`api.py`), discovery (`loader.py`), and builder dispatch (`run_builder.py`)
 
 **Design principle**: No domain logic, only infrastructure
@@ -410,11 +406,12 @@ annotated = variants.annotate(
 
 **Purpose**: Per-provider data plugins. Each provider folder contains `plugin.yaml`, `builder.py`, `cli.py`, `drift_probe.py`, `SKILL.md`, `catalog/datasets.json`, and `tests/`. Multi-dataset providers (e.g., `cptac/`) have one sub-folder per dataset.
 
-**Current providers** (20): `alphagenome`, `clingen`, `clinvar`, `cosmic_cgc`, `cptac`, `dbnsfp`, `ensembl_gene`, `expression_atlas`, `gencc`, `gevir`, `gnomad_metrics`, `gtex_eqtl`, `gwas_catalog`, `hgnc`, `insider`, `msigdb`, `peptideatlas`, `pqtl`, `ucsc_cellbrowser`, `uniprot_ptm`.
+**Current providers** (21): `alphagenome`, `clingen`, `clinvar`, `cosmic_cgc`, `cptac`, `dbnsfp`, `ensembl_gene`, `expression_atlas`, `gencc`, `gevir`, `gnomad_metrics`, `gtex_eqtl`, `gwas_catalog`, `hgnc`, `insider`, `msigdb`, `onek_genomes`, `peptideatlas`, `pqtl`, `ucsc_cellbrowser`, `uniprot_ptm`.
 
 **Builder outputs**:
 - Variant / gene tables keyed by `(locus, alleles)` or `gene_id` → `AnnotationTable`
 - Expression matrices rows=genes, columns=samples/cells → `ExpressionMatrix`
+- Multi-sample variant cohorts (variants × samples × genotypes) → `VariantMatrix`
 - Gene set collections → `GeneSet`
 
 ### Tools Module (`tools/`)
@@ -459,7 +456,7 @@ Peer of `core/`, not a layer above it.
 | `resources/schemas/` | JSON schemas that describe catalog / dataset metadata, shared across plugins. |
 | `resources/unified_registry.py` | Code that aggregates per-plugin catalog JSON with the legacy registry. |
 | `skills/<provider>/catalog/datasets.json` | Per-plugin dataset metadata (the canonical location for new providers). |
-| `core/models/` | Artifact types (`AnnotationTable`, `ExpressionMatrix`, `GeneSet`) — runtime data shapes, not catalog metadata. |
+| `core/models/` | Artifact types (`AnnotationTable`, `ExpressionMatrix`, `VariantMatrix`, `GeneSet`) — runtime data shapes, not catalog metadata. |
 
 **Dependency direction**: `resources/` may be imported by `algorithms/`,
 `skills/`, and `tools/`. It must NOT import from any of those — like
@@ -490,7 +487,7 @@ See the "Plugin Contract" section above for the full pattern. The minimal
 checklist:
 
 1. Create `hvantk/skills/<provider>/` with `plugin.yaml`, `builder.py` (returns
-   `AnnotationTable` / `ExpressionMatrix` / `GeneSet` via `ctx.provenance(schema_id=…)`),
+   `AnnotationTable` / `ExpressionMatrix` / `VariantMatrix` / `GeneSet` via `ctx.provenance(schema_id=…)`),
    `drift_probe.py`, `SKILL.md`, `catalog/datasets.json`, and `tests/`.
 2. Loader picks it up automatically — no edits to `hvantk/hvantk.py` or
    `hvantk/tools/plugins/download_cli.py` required.
 
@@ -194,7 +194,7 @@ hvantk reprocess insider:variants \
 Ensembl gene annotations (gene name, gene ID, biotype, transcript ID).
 URL: https://www.ensembl.org/info/data/ftp/index.html
 
-**Download**: Export from BioMart with the required attributes matching `ENSEMBL_BIOMART_FIELDS` in `hvantk/core/constants.py`. Alternatively, download from the Ensembl FTP:
+**Download**: Export from BioMart with the required attributes matching `ENSEMBL_BIOMART_FIELDS` in `hvantk/skills/ensembl_gene/shared/constants.py`. Alternatively, download from the Ensembl FTP:
 https://www.ensembl.org/info/data/ftp/index.html
 
 **Build**:
 
@@ -6,8 +6,8 @@ Scripts for building a Hail MatrixTable from 1000 Genomes NYGC high-coverage dat
 
 | Script | Description |
 |--------|-------------|
-| `build_1kg_nygc.py` | Python API: build reference MatrixTable from per-chromosome VCFs |
-| `build_1kg_nygc_cli.py` | CLI wrapper: exercises `hvantk utils build-1k-genome` as a subprocess |
+| `build_1kg_nygc.py` | Stages NYGC-pattern VCFs and shells out to `hvantk reprocess onek-genomes:variants` |
+| `build_1kg_nygc_cli.py` | Same as `build_1kg_nygc.py`, kept as a separate file for backward compatibility |
 
 ## Prerequisites
 
 
@@ -179,29 +179,36 @@ def main(
                 f"No recalibrated genotype VCFs found in {vcf_dir}"
             )
 
-        # -- Build MatrixTable --
-        from hvantk.core.builders.genome import build_1k_genome_mt, resolve_delimiter
-
-        mt = build_1k_genome_mt(
-            input_vcfs=stage_dir,
-            output_mt=output_mt,
-            sample_annotations=sample_annotations_path,
-            sample_annotations_delimiter=resolve_delimiter(
-                sample_annotations_delimiter
-            ),
-            reference_genome=reference_genome,
-            chromosomes=chrom_list,
-            overwrite=overwrite,
-            auto_convert_bgz=auto_convert_bgz,
-        )
-
-        n_variants = mt.count_rows()
-        n_samples = mt.count_cols()
+        # -- Build via hvantk reprocess onek-genomes:variants --
+        # The legacy `build_1k_genome_mt` Python helper was retired in #116;
+        # all builds now route through the plugin system.
+        import subprocess
+
+        cmd = [
+            "hvantk", "reprocess", "onek-genomes:variants",
+            "--raw-dir", stage_dir,
+            "--output", output_mt,
+            "--skip-download",
+            "--plugin-arg", f"reference_genome={reference_genome}",
+            "--plugin-arg", f"auto_convert_bgz={'true' if auto_convert_bgz else 'false'}",
+        ]
+        if chrom_list:
+            cmd += ["--plugin-arg", f"chromosomes={','.join(chrom_list)}"]
+        if sample_annotations_path is not None:
+            logger.warning(
+                "--sample-annotations is no longer baked into the variants build "
+                "(post-#116). Run `hvantk reprocess onek-genomes:samples` to "
+                "build the canonical IGSR samples table, and join post-load. "
+                "Provided sample_annotations argument (%s) will be ignored.",
+                sample_annotations_path,
+            )
+
+        logger.info("Invoking: %s", " ".join(cmd))
+        subprocess.run(cmd, check=True)
+
         logger.info("=" * 42)
         logger.info("  Completed : %s", datetime.now().isoformat(timespec="seconds"))
         logger.info("  MatrixTable : %s", output_mt)
-        logger.info("  Variants    : %s", f"{n_variants:,}")
-        logger.info("  Samples     : %s", f"{n_samples:,}")
         logger.info("=" * 42)
     finally:
         shutil.rmtree(stage_dir, ignore_errors=True)
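
Review note: the command assembly in this hunk could be factored into a small pure function, so the argument mapping can be tested without invoking `hvantk`. The following is a minimal sketch, assuming the flags behave exactly as they are used in the diff (`--raw-dir`, `--output`, `--skip-download`, `--plugin-arg`). `build_reprocess_cmd` is a hypothetical helper name, not part of hvantk, and `shlex.join` is swapped in for `" ".join` only so that the logged command is shell-quoted when a path contains spaces.

```python
import shlex
from typing import List, Optional, Sequence


def build_reprocess_cmd(
    stage_dir: str,
    output_mt: str,
    reference_genome: str,
    auto_convert_bgz: bool,
    chromosomes: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble the argv list (hypothetical helper mirroring the hunk above)."""
    cmd = [
        "hvantk", "reprocess", "onek-genomes:variants",
        "--raw-dir", str(stage_dir),
        "--output", str(output_mt),
        "--skip-download",
        "--plugin-arg", f"reference_genome={reference_genome}",
        "--plugin-arg", f"auto_convert_bgz={'true' if auto_convert_bgz else 'false'}",
    ]
    # The chromosome filter is optional and is passed as one comma-separated value.
    if chromosomes:
        cmd += ["--plugin-arg", f"chromosomes={','.join(chromosomes)}"]
    return cmd


cmd = build_reprocess_cmd(
    "/tmp/stage dir", "out.mt", "GRCh38", True, ["chr21", "chr22"]
)
# shlex.join quotes arguments that contain whitespace, so the log line
# can be pasted back into a shell unchanged.
print(shlex.join(cmd))
```

The list returned by the helper can be passed directly to `subprocess.run(cmd, check=True)`, as the hunk already does.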
 
@@ -1,9 +1,9 @@
 #!/usr/bin/env python3
 """End-to-end CLI wrapper for building a 1000 Genomes MatrixTable via ``hvantk``.
 
-Unlike ``build_1kg_nygc.py`` (which calls the Python API directly), this script
-invokes ``hvantk build-1k-genome`` as a subprocess so the full CLI path is
-exercised — useful for catching argument parsing bugs, entry-point issues, etc.
+Stages NYGC-pattern VCFs and invokes ``hvantk reprocess onek-genomes:variants``
+as a subprocess so the full CLI path is exercised — useful for catching argument
+parsing bugs, entry-point issues, etc.
 
 Example usage::
 
@@ -141,7 +141,7 @@ def main(
     \b
     Stages only recalibrated genotype VCFs (*_chr*.recalibrated_variants.vcf.gz)
     into a temporary directory, excluding annotated and "others" contig files,
-    then invokes ``hvantk build-1k-genome`` as a subprocess.
+    then invokes ``hvantk reprocess onek-genomes:variants`` as a subprocess.
     """
     # Resolve sample annotations: absolute path used as-is, otherwise relative to vcf_dir
     sample_annotations_path = None
@@ -176,27 +176,31 @@ def main(
                 f"No recalibrated genotype VCFs found in {vcf_dir}"
             )
 
-        # -- Build CLI command --
+        # -- Build CLI command (post-#116: routes through the plugin system) --
         cmd = [
             "hvantk",
-            "build-1k-genome",
-            "--input-vcfs",
+            "reprocess",
+            "onek-genomes:variants",
+            "--raw-dir",
             stage_dir,
-            "--output-mt",
+            "--output",
             output_mt,
-            "--reference-genome",
-            reference_genome,
+            "--skip-download",
+            "--plugin-arg",
+            f"reference_genome={reference_genome}",
+            "--plugin-arg",
+            f"auto_convert_bgz={'true' if auto_convert_bgz else 'false'}",
         ]
         if chromosomes:
-            cmd += ["--chromosomes", chromosomes]
+            cmd += ["--plugin-arg", f"chromosomes={chromosomes}"]
         if sample_annotations_path:
-            cmd += ["--sample-annotations", sample_annotations_path]
-        if sample_annotations_delimiter is not None:
-            cmd += ["--sample-annotations-delimiter", sample_annotations_delimiter]
-        if overwrite:
-            cmd.append("--overwrite")
-        if auto_convert_bgz:
-            cmd.append("--auto-convert-bgz")
+            logger.warning(
+                "--sample-annotations is no longer accepted by the variants build "
+                "(post-#116). Run `hvantk reprocess onek-genomes:samples` to "
+                "build the canonical IGSR samples table, and join post-load. "
+                "Provided sample_annotations argument (%s) will be ignored.",
+                sample_annotations_path,
+            )
 
         logger.info("Running: %s", " ".join(cmd))
         try:
@@ -206,7 +210,7 @@ def main(
 
         if result.returncode != 0:
             raise click.ClickException(
-                f"hvantk build-1k-genome exited with code {result.returncode}"
+                f"hvantk reprocess onek-genomes:variants exited with code {result.returncode}"
             )
 
         logger.info("=" * 42)
 
@@ -74,7 +74,7 @@ def main():
     """Main workflow with ontology-based categorization."""
     import hail as hl
     from hvantk.skills.clingen.streamer import ClinGenStreamer
-    from hvantk.core.utils.mondo_parser import MONDO_DISEASE_CATEGORIES
+    from hvantk.core.ontology.mondo import MONDO_DISEASE_CATEGORIES
 
     # Ensure data files exist
     ensure_data_files()
 
@@ -786,11 +786,10 @@ def run_ancestry_inference(
     Algorithm-side input type
     -------------------------
     This algorithm consumes raw ``hl.MatrixTable`` instances (genotype
-    data) rather than the platform's ``ExpressionMatrix`` artifact.
-    Phase J added a ``hail-mt`` backend to ``ExpressionMatrix``, so
-    callers may wrap their MatrixTables via
-    ``ExpressionMatrix.from_hail_mt(mt, provenance=...)`` and call
-    ``em.to_hail_mt()`` to retrieve the native object; the
+    data) rather than the platform's ``VariantMatrix`` artifact.
+    Callers may wrap their MatrixTables via
+    ``VariantMatrix.from_hail_mt(mt, provenance=...)`` and call
+    ``vm.to_hail_mt()`` to retrieve the native object; the
     ``required_backend="hail"`` declaration on the ``@algorithm``
     decorator still applies because the algorithm body operates
     natively on Hail. The wrapping pattern is the canonical way to
 
@@ -18,7 +18,27 @@ def annotate_clinvar_clnsig(t: hl.Table) -> hl.Table:
 
     Variants are annotated with a clinical significance label based on ClinVar data: "P" for pathogenic, "B" for benign, or missing if neither applies. The annotation is determined by matching ClinVar CLNSIG values against predefined sets of pathogenic and benign labels.
     """
-    from hvantk.core.constants import CLINVAR_PATHOGENIC_LABELS, CLINVAR_BENIGN_LABELS
+    # TEMP duplication — tracked by https://github.com/bigbio/hvantk/issues/133
+    #
+    # These label sets are also defined in
+    # hvantk/skills/clinvar/shared/constants.py. Importing them from
+    # there would violate the algorithms-must-not-import-from-skills
+    # dependency guard.
+    #
+    # The proper fix — parameterizing the schema and vocabulary so this
+    # function accepts any conformant pathogenicity-labeled table, not
+    # just ClinVar — is tracked by issue #133. Remove this duplication
+    # when that parameterization lands.
+    CLINVAR_PATHOGENIC_LABELS = [
+        "Pathogenic/Likely_pathogenic",
+        "Likely_pathogenic",
+        "Pathogenic",
+    ]
+    CLINVAR_BENIGN_LABELS = [
+        "Benign/Likely_benign",
+        "Likely_benign",
+        "Benign",
+    ]
 
     logger.info("Annotating ClinVar CLNSIG")
     clinvar_ht = load_legacy_table("clinvar")
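
Review note: the labeling rule described in the docstring can be illustrated without Hail. The following is a minimal pure-Python sketch, assuming one CLNSIG string per variant and exact membership matching. `classify_clnsig` is a hypothetical name; the real function operates on Hail expressions, and how it handles multi-valued CLNSIG fields is not shown in this hunk.

```python
from typing import Optional

# Label vocabularies copied from the hunk above.
CLINVAR_PATHOGENIC_LABELS = frozenset(
    {"Pathogenic/Likely_pathogenic", "Likely_pathogenic", "Pathogenic"}
)
CLINVAR_BENIGN_LABELS = frozenset(
    {"Benign/Likely_benign", "Likely_benign", "Benign"}
)


def classify_clnsig(clnsig: Optional[str]) -> Optional[str]:
    """Return "P", "B", or None (missing) for a single CLNSIG value."""
    if clnsig in CLINVAR_PATHOGENIC_LABELS:
        return "P"
    if clnsig in CLINVAR_BENIGN_LABELS:
        return "B"
    # Anything else, including a missing value, stays unlabeled.
    return None


for value in ("Pathogenic", "Likely_benign", "Uncertain_significance", None):
    print(value, "->", classify_clnsig(value))
```

Keeping the two vocabularies as explicit arguments rather than module constants would be one way to approach the parameterization that issue #133 tracks, though the hunk does not say which design that issue will adopt.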