hvantk is a multi-omics variant annotation toolkit. It is organized as a
five-package layout — four code layers (core/, algorithms/, skills/,
tools/) plus a substrate-level data registry (resources/) — with a
strict one-way dependency rule enforced by
hvantk/tests/test_dependency_directions.py.
resources/ is a peer of core/ (not a layer above it): both are
substrate that the code layers depend on, neither imports upward.
Placement rule: dataset-registry JSON, JSON schemas, and the validator /
aggregator code that operates on them go in resources/. Per-plugin
catalog JSON (skills/<provider>/catalog/datasets.json) stays with its
plugin and is aggregated by resources/unified_registry.py. See the
Resources Module section below for the
full rule.
Design priorities:
- Stable contracts: algorithms consume typed artifacts; source adapters can rot when upstream APIs change without breaking analysis code.
- Manifest-driven: plugins declare themselves via plugin.yaml; the loader discovers descriptively first (no imports) and binds executable callables lazily.
- Provenance everywhere: every artifact carries a Provenance(plugin, version, source fingerprint, schema id, build timestamp, derivation parents).
- Backend-portable + native escape hatch: the artifact API works on either Hail or pandas; algorithms that legitimately need raw Hail use core_io.load_native / save_native (zero-cost passthrough with provenance).
hvantk/
├── __main__.py # Main CLI entry point
│
├── core/ # L1: Core infrastructure
│ ├── config.py # Configuration management
│ ├── constants.py # Shared constants
│ ├── protocols.py # Protocol definitions (Builder, Streamer, Downloader)
│ ├── io/ # Artifact loader (load/save Hail Tables, AnnData, etc.)
│ ├── models/ # Domain model types
│ │ ├── annotation_table.py # AnnotationTable artifact
│ │ ├── expression_matrix.py # ExpressionMatrix artifact (AnnData-only)
│ │ ├── variant_matrix.py # VariantMatrix artifact (Hail MatrixTable)
│ │ ├── gene_set.py # GeneSet artifact
│ │ ├── artifact.py # Artifact base + type registry
│ │ ├── backends.py # AlgorithmMeta, Backend, @algorithm decorator
│ │ ├── build_context.py # BuildContext passed to plugin builders
│ │ ├── anndata_utils.py # annotate_column_summary_ad (AnnData obs summary)
│ │ ├── metadata.py # Metadata structs and source descriptions
│ │ └── provenance.py # Source-fingerprint provenance stamping
│ ├── plugin/ # Plugin system
│ │ ├── api.py # Provider, DatasetSpec, PROBE_FINGERPRINT_IGNORED_KEYS
│ │ ├── loader.py # Plugin discovery (filesystem + entry points)
│ │ ├── run_builder.py # run_builder_for_spec() — Phase B orchestrator
│ │ └── drift_runner.py# Drift probe execution
│ ├── utils/ # Cross-cutting utilities
│ │ ├── hail_context.py # Idempotent Hail init
│ │ ├── hail_helpers.py # create_table_base, cleanup_temp_file
│ │ ├── qtl_helpers.py # GTEx variant-ID parsing (shared by eqtl/pqtl)
│ │ ├── bgzf.py # BGZF utilities
│ │ ├── catalog.py # Catalog helpers
│ │ ├── file_utils.py # File I/O helpers
│ │ ├── gene_sets.py # Gene set utilities
│ │ ├── geneset_io.py # Gene set parsing / validation
│ │ ├── genome.py # Genome/contig utilities (contig_recoding)
│ │ ├── table_utils.py # Hail Table manipulation helpers
│ │ └── writers.py # HailTableWriter
│ ├── streamers/ # Base streamer classes (consumed by algorithms/ + skills/)
│ │ ├── gene_disease_table.py # GeneDiseaseTableStreamer (clingen/gencc/cosmic-cgc base)
│ │ ├── variant_table.py # VariantTableStreamer (clinvar base)
│ │ └── gene_catalog.py # GeneCatalogStreamer (hgnc base)
│ └── ontology/ # OBO / MONDO ontology parsers
│ ├── obo.py # Generic OBO parser
│ └── mondo.py # MONDO disease-category map (MONDO_DISEASE_CATEGORIES)
│
├── algorithms/ # L4-L5: Analysis pipelines
│ ├── annotation/ # Variant annotation pipeline
│ ├── ancestry/ # Population ancestry inference
│ ├── enrichex/ # Gene set enrichment analysis
│ ├── expression/ # Expression data processing
│ ├── hgc/ # Joint genotyping (HGC) pipeline
│ ├── psroc/ # Pathogenicity score evaluation
│ ├── ptm/ # Post-translational modification analysis
│ ├── qtlcascade/ # QTL cascade analysis
│ ├── statistics/ # Statistical utilities
│ ├── training_sets/ # Training set construction
│ └── visualization/ # Shared visualization helpers
│
├── skills/ # L2-L3: Per-provider data plugins
│ ├── _conventions/ # Shared plugin contract (SKILL.md)
│ ├── _hooks/ # Plugin lifecycle hooks
│ ├── clingen/ # ClinGen gene-disease validity
│ ├── clinvar/ # ClinVar variant annotations
│ ├── cptac/ # CPTAC proteomics (expression/, phospho/)
│ ├── expression_atlas/ # Expression Atlas bulk RNA-seq
│ ├── gencc/ # GenCC gene-disease assertions
│ ├── gtex_eqtl/ # GTEx eQTL data
│ ├── gwas_catalog/ # GWAS Catalog
│ ├── hgnc/ # HGNC gene nomenclature
│ ├── insider/ # INSIDER protein-protein interaction sites
│ ├── msigdb/ # MSigDB gene sets
│ ├── peptideatlas/ # PeptideAtlas proteomics
│ ├── ucsc_cellbrowser/ # UCSC Cell Browser single-cell RNA-seq
│ └── uniprot_ptm/ # UniProt PTM annotations
│
├── tools/ # CLI command implementations (replaces legacy commands/)
│ ├── ancestry/ # Ancestry CLI subcommands
│ ├── annotation/ # Annotation CLI subcommands
│ ├── build/ # standalone reference panel builds (1k genomes)
│ ├── enrichex/ # EnrichEx CLI subcommands
│ ├── expression/ # Expression analysis commands
│ ├── genesets/ # Gene set extraction/preparation
│ ├── hgc/ # HGC CLI subcommands
│ ├── infra/ # Installation check, BGZF validation, utils
│ ├── plugins/ # hvantk plugins / hvantk drift commands
│ ├── ptm/ # PTM CLI subcommands
│ └── qtl/ # QTL CLI subcommands
│
├── resources/ # Data catalog and schemas
│ ├── registry/ # Surviving legacy per-domain dataset metadata (genomics only)
│ ├── schemas/ # JSON schema definitions
│ └── unified_registry.py# Aggregates per-plugin catalog/datasets.json + legacy registry
│
└── tests/ # Test suite
├── conftest.py # Pytest fixtures (hail_session, etc.)
├── testdata/ # Test data fixtures
├── hgc/ # HGC tests
├── ancestry/ # Ancestry tests
├── psroc/ # PSROC tests
├── enrichex/ # EnrichEx tests
└── test_*.py # Unit and integration tests
The codebase is organized by function and biological domain:
Data Builders (skills/<provider>/builder.py):
- Each plugin under hvantk/skills/ owns its Phase B builder (build_<provider>_<dataset>). Builders return AnnotationTable, ExpressionMatrix, VariantMatrix, or GeneSet artifacts, stamped with Provenance by the platform via run_builder_for_spec.
- hvantk reprocess <provider>:<dataset> is the only public build path. There is no separate programmatic API; in-process callers that need to build a table inside a tool/pipeline invoke hvantk.core.plugin.run_builder.run_builder_for_spec directly.
- Generic Hail helpers live in hvantk/core/utils/hail_helpers.py (create_table_base, cleanup_temp_file); QTL-shared helpers live in hvantk/core/utils/qtl_helpers.py.
Analysis Pipelines (separate modules):
- hgc/ - Joint genotyping and cohort analysis
- ancestry/ - Population ancestry inference
- psroc/ - Pathogenicity score evaluation
- enrichex/ - Gene set enrichment analysis
- ptm/ - Post-translational modification variant classification. Includes constraint.py + helpers for the stratified AF-depletion analysis (hvantk ptm constraint), a tissue/cell-type-aware complement to landscape and population.
Data Product Keying:
- Variants - Keyed by (locus, alleles)
- Genes - Keyed by gene_id
- Proteins - Keyed by protein_id or interval
- Expression - AnnData matrices (.h5ad) with rows=samples/cells (obs), columns=genes (var)
Every data product is one of four semantic artifact types in
hvantk/core/models/:
| Artifact | Backends | On-disk format | Used for |
|---|---|---|---|
| AnnotationTable | hail / pandas | .ht/ or .parquet | variants, gene-disease pairs, eQTLs, PTM sites |
| ExpressionMatrix | anndata | .h5ad | bulk + single-cell expression, proteomics matrices |
| VariantMatrix | hail-mt | .mt/ | multi-sample variant cohorts (genotypes × samples × multi-field entries) |
| GeneSet | (in-memory frozenset) | .geneset.json | curated gene collections |
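
As a rough illustration of the table above, the on-disk suffix alone is enough to tell which artifact type a path holds. The helper name infer_artifact_type and the mapping constant are hypothetical and not part of hvantk's API; the real dispatch lives in core/io and may work differently.

```python
from pathlib import PurePosixPath

# Hypothetical suffix table mirroring the artifact table above.
# Entries are checked in order, most specific suffix first.
_SUFFIX_TO_ARTIFACT = (
    (".geneset.json", "GeneSet"),
    (".parquet", "AnnotationTable"),
    (".h5ad", "ExpressionMatrix"),
    (".ht", "AnnotationTable"),
    (".mt", "VariantMatrix"),
)


def infer_artifact_type(path: str) -> str:
    """Return the artifact type name implied by a path's suffix."""
    # Hail tables and matrix tables are directories, so tolerate a trailing slash.
    name = PurePosixPath(path.rstrip("/")).name
    for suffix, artifact in _SUFFIX_TO_ARTIFACT:
        if name.endswith(suffix):
            return artifact
    raise ValueError(f"unrecognized artifact path: {path!r}")
```

For example, infer_artifact_type("cohort.mt/") returns "VariantMatrix", while an unknown suffix raises ValueError instead of guessing.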
Each artifact carries a Provenance record (plugin, version, source
fingerprint, schema id, build timestamp, derivation parents). The
@algorithm decorator chains input provenances onto output artifacts
automatically, so the build graph is preserved end-to-end.
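
The chaining behaviour described above can be sketched with plain dataclasses. This is a toy model, not hvantk's implementation: the Provenance field names, the Artifact class, and the chain_provenance decorator are all assumptions chosen to mirror the fields listed in the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from functools import wraps


@dataclass(frozen=True)
class Provenance:
    """Toy stand-in carrying the fields listed above (names are assumptions)."""
    plugin: str
    version: str
    source_fingerprint: str
    schema_id: str
    built_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    parents: tuple = ()


@dataclass
class Artifact:
    """Toy artifact: a payload plus its provenance record."""
    data: object
    provenance: Provenance


def chain_provenance(name: str, version: str = "0.0.0"):
    """Hypothetical decorator: the output records every input as a parent."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*artifacts: Artifact) -> Artifact:
            result = fn(*artifacts)
            prov = Provenance(
                plugin=name,
                version=version,
                source_fingerprint="derived",
                schema_id=f"{name}-v1",
                parents=tuple(a.provenance for a in artifacts),
            )
            return Artifact(data=result, provenance=prov)
        return wrapper
    return decorate


@chain_provenance("intersect_genes")
def intersect_genes(left: Artifact, right: Artifact):
    return sorted(set(left.data) & set(right.data))
```

Because each parent is itself a Provenance with its own parents tuple, walking the tuple recursively recovers the whole build graph.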
Artifacts expose a portable query API (filter, select, join,
with_columns, group_by().agg()) via the col(...)
expression DSL compiled to either backend at execution time. Algorithms
can be written backend-agnostically:
from hvantk.core.models import AnnotationTable, col

def filter_high_impact(ann: AnnotationTable) -> AnnotationTable:
    return ann.filter((col("score") > 0.5) & (col("chrom") == "chr17"))

When an algorithm legitimately needs the raw native object (Hail-distributed joins, genotype matrices), use the native passthrough:
from hvantk.core import io as core_io
from hvantk.core.models.provenance import Provenance  # import path assumed from the layout above

# zero-cost when backend matches file format
ht, source_prov = core_io.load_native("variants.ht")  # → hl.Table
filtered = ht.filter(ht.AC > 0)
core_io.save_native(filtered, "filtered.ht", provenance=Provenance(
    ..., parents=(source_prov,)
))

Each plugin under hvantk/skills/<plugin>/ declares itself via plugin.yaml
and provides a builder that returns a typed artifact. The platform's
run_builder_for_spec orchestrates the build:
sequenceDiagram
participant CLI as hvantk reprocess
participant Reg as plugin registry
participant Probe as drift_probe()
participant Build as build_fn(parsed, ctx)
participant IO as core/io
CLI->>Reg: get_dataset("clinvar:variants")
Reg-->>CLI: DatasetSpec (lazy bind on first access)
CLI->>Probe: compute source fingerprint
Probe-->>CLI: probe dict
CLI->>CLI: BuildContext(plugin, version, fingerprint, …)
CLI->>Build: (parsed_input, ctx, **params)
Build-->>CLI: Artifact(provenance=ctx.provenance(schema_id=…))
CLI->>CLI: validate artifact_type + schema_id
CLI->>IO: artifact.save(path)
IO-->>IO: write data + sidecar .provenance.json
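
The sequence above can be sketched in plain Python. This is a minimal stand-in under stated assumptions, not the real run_builder_for_spec: specs and artifacts are plain dicts, the ignored probe key (checked_at) is invented for illustration, and the sidecar file name is a guess based on the ".provenance.json" suffix mentioned in the diagram.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(probe: dict, ignored: frozenset = frozenset({"checked_at"})) -> str:
    """Hash a probe dict, skipping volatile keys (the ignored set is an assumption)."""
    stable = {k: v for k, v in probe.items() if k not in ignored}
    payload = json.dumps(stable, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def run_build(spec: dict, probe_fn, build_fn, parsed_input, output_path: Path) -> dict:
    """Toy version of the flow: probe, build, validate, save, write sidecar."""
    ctx = {
        "plugin": spec["plugin"],
        "version": spec["version"],
        "source_fingerprint": fingerprint(probe_fn()),
    }
    artifact = build_fn(parsed_input, ctx)
    # Validate what the builder returned against what the manifest declared.
    for key in ("artifact_type", "schema_id"):
        if artifact[key] != spec[key]:
            raise ValueError(f"{key} mismatch: {artifact[key]!r} != {spec[key]!r}")
    provenance = {**ctx, "schema_id": artifact["schema_id"]}
    # Write the data, then the provenance sidecar next to it.
    output_path.write_text(json.dumps(artifact["rows"]))
    sidecar = output_path.with_name(output_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(provenance, indent=2))
    return provenance
```

Validating artifact_type and schema_id before anything is written means a builder that drifts from its manifest fails without leaving a partial output on disk.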
A minimal plugin.yaml:
api_version: 2
name: my-source
version: 0.1.0
description: My data source - variant table
datasets:
  - name: variants
    domain: genomics
    backend: hail
    artifact_type: AnnotationTable
    schema_id: my-source-variants-v1
    builder:
      module: hvantk.skills.my_source.builder
      function: build_my_source_variants
    drift_probe:
      module: hvantk.skills.my_source.drift_probe
      function: fetch_fingerprint
    skill: SKILL.md
    tests:
      command: pytest hvantk/skills/my_source/tests -m hail
      fixture: tests/testdata/raw/my-source
      schema_snapshot: tests/snapshots/schema.json
      row_snapshot: tests/snapshots/sample_rows.json
      drift_fingerprint: tests/drift_fingerprint.json
cli:
  - command: my-source-download
    module: hvantk.skills.my_source.cli
    function: download_cmd

A matching builder:
# hvantk/skills/my_source/builder.py
import hail as hl
from hvantk.core.models import AnnotationTable, BuildContext

def build_my_source_variants(parsed_input, ctx: BuildContext, **params) -> AnnotationTable:
    ht = hl.import_vcf(str(parsed_input), force=True).rows().key_by("locus", "alleles")
    return AnnotationTable.from_hail(
        ht, provenance=ctx.provenance(schema_id="my-source-variants-v1")
    )

The plugin loader (hvantk/core/plugin/loader.py) discovers manifests via a
two-pass mechanism:
- Pass 1 (descriptive, eager): reads YAML, populates DatasetManifest. No imports. registry.list_manifests() works without optional runtimes.
- Pass 2 (executable, lazy): on first get_dataset(name), imports the builder/probe modules and caches a DatasetSpec. Missing optional runtimes only affect the single dataset that needs them.
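
The two passes can be illustrated with a small lazy registry. This is a sketch only: the LazyRegistry class is hypothetical, manifests are plain dicts rather than parsed YAML, and standard-library callables stand in for real builders.

```python
import importlib


class LazyRegistry:
    """Toy two-pass registry: manifests are plain dicts, callables bind lazily."""

    def __init__(self, manifests: dict):
        # Pass 1: descriptive only. Nothing is imported here.
        self._manifests = dict(manifests)
        self._specs = {}

    def list_manifests(self) -> list:
        return sorted(self._manifests)

    def get_dataset(self, name: str) -> dict:
        # Pass 2: import the builder module on first access, then cache the result.
        if name not in self._specs:
            manifest = self._manifests[name]
            module = importlib.import_module(manifest["builder"]["module"])
            build_fn = getattr(module, manifest["builder"]["function"])
            self._specs[name] = {"name": name, "build_fn": build_fn}
        return self._specs[name]


# Standard-library callables stand in for real builders; the second entry
# points at a module that does not exist, mimicking a missing optional runtime.
registry = LazyRegistry({
    "demo:ok": {"builder": {"module": "json", "function": "loads"}},
    "demo:broken": {"builder": {"module": "no_such_runtime_xyz", "function": "build"}},
})
```

Listing manifests succeeds even though one entry cannot be imported; the ImportError surfaces only when registry.get_dataset("demo:broken") is called, leaving "demo:ok" usable.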
Downloader CLI commands are wired automatically from the manifest's cli:
block — no manual edits in hvantk/tools/plugins/download_cli.py needed.
Streamers (classes that yield batches over a built table) are split between a
base class that lives in core/ (below the skills layer) and per-provider
subclasses that ship with their plugin. This keeps the one-way
skills → algorithms → tools dependency direction clean: the sibling-skill
rule (skills/X cannot import from skills/Y) forces any base class shared by
multiple skills to live in core/streamers/.
| Streamer kind | Lives in | Example |
|---|---|---|
| Shared base class | core/streamers/ | GeneDiseaseTableStreamer (gene_disease_table.py), VariantTableStreamer (variant_table.py), GeneCatalogStreamer (gene_catalog.py) |
| Per-provider subclass | skills/<provider>/streamers.py | ClinGenGeneDiseaseTableStreamer (clingen/streamers.py), GenCCGeneDiseaseTableStreamer (gencc/streamers.py), ClinVarVariantTableStreamer (clinvar/streamers.py) |
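
The base/subclass split in the table can be sketched as follows. The class names, the transform hook, and the in-memory row source are assumptions for illustration; the real streamers read from built Hail tables and their interface may differ.

```python
from typing import Iterable, Iterator, List


class TableStreamer:
    """Hypothetical shared base: yields fixed-size batches of rows."""

    def __init__(self, rows: Iterable[dict], batch_size: int = 2):
        self._rows = rows
        self._batch_size = batch_size

    def transform(self, row: dict) -> dict:
        # Subclasses override this to apply provider-specific shaping.
        return row

    def __iter__(self) -> Iterator[List[dict]]:
        batch: List[dict] = []
        for row in self._rows:
            batch.append(self.transform(row))
            if len(batch) == self._batch_size:
                yield batch
                batch = []
        if batch:  # flush the final partial batch
            yield batch


class DemoProviderStreamer(TableStreamer):
    """Stand-in for a per-provider subclass shipped inside skills/<provider>/."""

    def transform(self, row: dict) -> dict:
        return {**row, "source": "demo"}
```

Only the subclass knows about the provider, so two providers can share the batching logic without importing from each other.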
The primary interface is a well-structured CLI with domain-specific commands:
# Download data
hvantk download ucsc --dataset adultPancreas --output-dir data/
# Build any dataset (full pipeline: download -> parse -> build -> drift check)
hvantk reprocess clinvar:variants --raw-dir data/ --output clinvar.ht
hvantk reprocess ucsc-cellbrowser:adultPancreas --raw-dir data/ --output ucsc.h5ad
# Joint genotyping (HGC)
hvantk hgc gvcf-combine -g /data/gvcfs -o cohort.vds
hvantk hgc compute-qc -i cohort.mt -o cohort_qc.mt

Raw File (VCF/TSV/BED) → Builder → Hail Table → Disk (.ht)
Example:
# The reprocess CLI runs the full Phase B pipeline: download → parse → build → drift-check
hvantk reprocess clinvar:variants --raw-dir data/ --output clinvar.ht

In-process callers drive the same Phase B builder via
run_builder_for_spec(spec, *, parsed_input, output_path, plugin_version, **params)
(in hvantk/core/plugin/run_builder.py), which takes a resolved DatasetSpec
and returns the stamped Provenance.
Variant Table + Gene Table + Expression Matrix → Integrated Analysis
Example:
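import anndata as ad
import hail as hl
import numpy as np
import pandas as pd

# Load different data types
variants = hl.read_table('clinvar.ht')
genes = hl.read_table('ensembl.ht')
expression = ad.read_h5ad('ucsc.h5ad')  # rows = samples/cells (obs), columns = genes (var)

# Summarize expression per gene so it can be joined as a Hail Table
mean_expr = pd.DataFrame({
    'gene_id': expression.var_names,
    'mean_expression': np.asarray(expression.X.mean(axis=0)).ravel(),
})
expr_ht = hl.Table.from_pandas(mean_expr, key='gene_id')

# Join for integrated analysis (assumes variants carries a gene_id field)
annotated = variants.annotate(
    gene=genes[variants.gene_id],
    mean_expression=expr_ht[variants.gene_id].mean_expression,
)

Purpose: Shared infrastructure used by all other modules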
Key Components:
- config.py - Configuration management, context settings
- constants.py - Shared constants (e.g., Ensembl field definitions)
- utils/hail_context.py - Hail session initialization and management
- protocols.py - Protocol definitions for extensibility
- models/ - Domain artifact types (AnnotationTable, ExpressionMatrix, VariantMatrix, GeneSet)
- plugin/ - Plugin schema (api.py), discovery (loader.py), and builder dispatch (run_builder.py)
Design principle: No domain logic, only infrastructure
Purpose: Per-provider data plugins. Each provider folder contains plugin.yaml, builder.py, cli.py, drift_probe.py, SKILL.md, catalog/datasets.json, and tests/. Multi-dataset providers (e.g., cptac/) have one sub-folder per dataset.
Current providers (21): alphagenome, clingen, clinvar, cosmic_cgc, cptac, dbnsfp, ensembl_gene, expression_atlas, gencc, gevir, gnomad_metrics, gtex_eqtl, gwas_catalog, hgnc, insider, msigdb, onek_genomes, peptideatlas, pqtl, ucsc_cellbrowser, uniprot_ptm.
Builder outputs:
- Variant / gene tables keyed by (locus, alleles) or gene_id → AnnotationTable
- Expression matrices with rows=samples/cells (obs), columns=genes (var) → ExpressionMatrix
- Multi-sample variant cohorts (variants × samples × genotypes) → VariantMatrix
- Gene set collections → GeneSet
Purpose: Top-level CLI command implementations (replaces the legacy commands/ directory)
Key sub-packages:
- plugins/ - hvantk reprocess, hvantk drift, hvantk plugins list/show/reload commands
- build/ - standalone reference panel builds (e.g. 1000 Genomes)
- hgc/ - HGC joint genotyping subcommands (combine, convert, QC, pipeline)
- ancestry/, enrichex/, ptm/, qtl/ - Per-pipeline CLI subcommands
- infra/ - Installation check, BGZF validation, utils
Purpose: High-performance joint genotyping workflows
Features:
- GVCF combination at scale
- Format conversion (VDS ↔ MatrixTable ↔ VCF)
- Comprehensive QC metrics and visualization
- Optimized for large cohorts (1000s of samples)
Status: Feature-complete, well-established module
Purpose: Substrate-level data registry and schema definitions.
Peer of core/, not a layer above it.
Contents:
- registry/ - Surviving legacy per-domain dataset metadata (genomics only; transcriptomics / proteomics / epigenomics moved into per-plugin hvantk/skills/<provider>/catalog/datasets.json)
- unified_registry.py - HvantkRegistry aggregator surfaced via hvantk catalog {list,show,stats,search}. Reads both the legacy per-domain registry above and the per-plugin catalog JSON under skills/<provider>/catalog/.
- schemas/ - JSON schema definitions used by schema_validator.py to validate catalog entries.
- schema_validator.py - Validation entry point invoked by the unified registry.
Placement rule (what goes here vs. nearby alternatives):
| Lives in | Use for |
|---|---|
| resources/registry/ | Cross-plugin / legacy per-domain catalog JSON that hasn't been migrated to a per-plugin folder. |
| resources/schemas/ | JSON schemas that describe catalog / dataset metadata, shared across plugins. |
| resources/unified_registry.py | Code that aggregates per-plugin catalog JSON with the legacy registry. |
| skills/<provider>/catalog/datasets.json | Per-plugin dataset metadata (the canonical location for new providers). |
| core/models/ | Artifact types (AnnotationTable, ExpressionMatrix, VariantMatrix, GeneSet): runtime data shapes, not catalog metadata. |
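
The aggregation step performed by the unified registry can be approximated like this. The sketch makes several assumptions that the text does not confirm: each JSON file holds a list of objects with an "id" field, per-plugin entries win over legacy entries on conflict, and the function name aggregate_catalogs is invented.

```python
import json
from pathlib import Path


def aggregate_catalogs(skills_dir: Path, legacy_dir: Path) -> dict:
    """Merge per-plugin catalog JSON with legacy registry JSON, keyed by dataset id."""
    merged = {}
    # Legacy entries load first so per-plugin catalogs take precedence
    # (this precedence rule is an assumption of the sketch).
    for path in sorted(legacy_dir.glob("*.json")):
        for entry in json.loads(path.read_text()):
            merged[entry["id"]] = {**entry, "origin": "legacy"}
    # Per-plugin catalogs live at skills/<provider>/catalog/datasets.json.
    for path in sorted(skills_dir.glob("*/catalog/datasets.json")):
        provider = path.parent.parent.name
        for entry in json.loads(path.read_text()):
            merged[entry["id"]] = {**entry, "origin": provider}
    return merged
```

Tagging each entry with its origin makes it easy to see which datasets still come from the legacy registry and are candidates for migration.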
Dependency direction: resources/ may be imported by algorithms/,
skills/, and tools/. It must NOT import from any of those — like
core/, it is substrate. This is enforced by
test_resources_does_not_import_upward in
hvantk/tests/test_dependency_directions.py.
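
A dependency-direction check of this kind can be written with the standard ast module. The sketch below is not the contents of test_dependency_directions.py; the function name and the package tuple are assumptions, and relative imports are deliberately ignored to keep it short.

```python
import ast

# Packages that substrate code (core/, resources/) must never import.
UPWARD_PACKAGES = ("hvantk.algorithms", "hvantk.skills", "hvantk.tools")


def upward_imports(source: str) -> list:
    """Return the forbidden module names imported by a piece of source code."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not lookalike prefixes.
            if any(name == p or name.startswith(p + ".") for p in UPWARD_PACKAGES):
                found.append(name)
    return found
```

A test built on this would read every .py file under hvantk/resources/ and assert that upward_imports returns an empty list for each one.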
Tests are organized to mirror the module structure:
hvantk/tests/
├── conftest.py # Pytest fixtures (hail_session, etc.)
├── test_*.py # Unit and integration tests
├── hgc/ # HGC module tests
├── ancestry/ # Ancestry module tests
├── psroc/ # PSROC module tests
├── enrichex/ # EnrichEx module tests
└── testdata/ # Test data fixtures
See the "Plugin Contract" section above for the full pattern. The minimal checklist:
- Create hvantk/skills/<provider>/ with plugin.yaml, builder.py (returns AnnotationTable / ExpressionMatrix / VariantMatrix / GeneSet via ctx.provenance(schema_id=…)), drift_probe.py, SKILL.md, catalog/datasets.json, and tests/.
- Loader picks it up automatically; no edits to hvantk/hvantk.py or hvantk/tools/plugins/download_cli.py are required.
- CLI downloader command is wired from the manifest's cli: block.
- Add a conformance test using the run_builder_for_spec orchestrator (see hvantk/tests/test_plugin_conformance.py for the template).
- Run hvantk plugins list; your plugin should appear.
See hvantk/skills/_conventions/SKILL.md for the full contract.
- Place the algorithm body in hvantk/algorithms/<domain>/.
- Decorate the entry point with @algorithm:

      from hvantk.core.models.backends import Backend, algorithm

      @algorithm(
          name="my_algorithm",
          backends=[Backend.PANDAS],
          inputs={"data": "AnnotationTable", "gene_set": "GeneSet"},
          outputs={"result": "AnnotationTable"},
      )
      def my_algorithm(data, gene_set):
          ...

- The decorator handles provenance chaining automatically. Output artifacts inherit parents = (data.provenance, gene_set.provenance).
- For Hail-native algorithms, declare required_backend="hail" and use core_io.load_native to read inputs natively.
- Place the click command in hvantk/tools/<domain>/.
- Add a <basename>.tool.yaml manifest for discoverability via hvantk tools list (descriptive metadata; not authoritative for wiring today, which is a Phase Q follow-up).
- Wire the command in hvantk/hvantk.py's top-level CLI group.
- Hail - Distributed data processing framework
- gnomAD - Utilities for gnomAD data
- Click - CLI framework
- Pandas - Data manipulation
- PyYAML - YAML manifest parsing (plugin.yaml)
- Matplotlib/Seaborn/Plotly - Visualization (optional)
- Partitioning: Builders use appropriate partitioning for Hail operations
- Checkpointing: Large intermediate results are checkpointed
- Memory: HGC module optimized for memory-efficient large cohort processing
- Caching: Hail's lazy evaluation allows for optimization