SPEC.md is the v0.1.0 build contract and is complete: all fifteen phases are
implemented, the §16 definition-of-done is backed by an executable acceptance
suite, and CI is green across lint, type, test, docs, examples, and the Rust
crate (with a native↔Python FM-index parity run).
This document is the contract for what comes after v0.1.0 — the work to "bake"
the release: turning the swappable interfaces and weight-free stubs into pinned,
verified, real implementations, wiring the native kernels into the actual hot
paths, and earning the validation a v1.0 deserves. It uses the same structure as
SPEC.md: each phase lists Context, Deliverables, Defaults &
decisions, and Tests, and a phase is "done" only when its deliverables exist,
ruff/mypy --strict pass, CI is green, and its tests pass.
The guiding rule from SPEC.md still holds: when a decision is unspecified, prefer
the option that maximizes reproducibility, honest uncertainty, and
population-aware safety — in that order. Nothing here promises clinical
applicability; AlleleForge generates rigorously uncertain hypotheses.
R0 (release hardening) ─┬─> R1 (real weights) ──> R5 (validation) ──> R6 (v1.0)
                        ├─> R2 (native kernels on hot paths)
                        └─> R3 (external adapters)
R4 (scale) draws on R1+R2 and feeds R5.
R0 gates a public v0.1.0. R1, R2, and R3 are independent and can proceed in parallel once R0 lands. R5 (validation) needs R1. R6 (v1.0) needs R1 + R5.
Status legend: ☐ not started · ◐ in progress · ☑ done.
Context. Everything required to cut a trustworthy v0.1.0 that others can
install, reproduce, and cite. The code is done; this is the operational freeze.
Deliverables.
- Pin every artifact. Replace each checkpoint_sha256: null / datasetsha256: null with the real content hash of a frozen release artifact, so the consent-gated downloaders will actually fetch (an unverifiable artifact is refused by design). Record the pinned versions in docs/data.md and each model card. (☐ blocked on freezing the real artifacts, the only remaining R0 item; the gate already refuses a null-hash fetch.)
- Supply-chain (☑ landed). Dependabot covers pip + cargo + github-actions (.github/dependabot.yml); a CI security job runs pip-audit + cargo audit; the release pipeline emits a CycloneDX SBOM (sbom job) and attaches it to the GitHub Release.
- Reproducibility audit (☑ landed). scripts/reproduce.py (and make reproduce) re-derives the canonical weight-free design run from config + seed, asserts run-to-run determinism, and diffs a canonicalized digest against a committed golden manifest; a CI reproduce job gates it.
- Version bump to 0.1.0 (drop .dev0) at tag time; confirm the aforge_native constant and _version.py agree.
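The determinism check and the canonicalized digest in the reproducibility audit can be sketched as below. This is a minimal illustration, not the contents of scripts/reproduce.py: `canonical_digest` and `run_design` are hypothetical names, and the result is assumed to be a JSON-serializable dict.

```python
import hashlib
import json
import random


def canonical_digest(result: dict) -> str:
    """Hash a result after canonicalizing it (sorted keys, fixed separators)."""
    payload = json.dumps(result, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_design(seed: int) -> dict:
    """Hypothetical stand-in for the weight-free design run: seeded, so deterministic."""
    rng = random.Random(seed)
    return {"seed": seed, "scores": [round(rng.random(), 6) for _ in range(3)]}


first = canonical_digest(run_design(seed=7))
second = canonical_digest(run_design(seed=7))
assert first == second  # run-to-run determinism

# Key order must not influence the digest.
assert canonical_digest({"a": 1, "b": 2}) == canonical_digest({"b": 2, "a": 1})
```

A golden-manifest gate would then compare `first` against a committed digest string and fail the job on any difference.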
Defaults & decisions. First public tag is v0.1.0; PyPI Trusted Publishing + multi-arch Docker + Zenodo DOI are already wired in release.yml. Artifacts are pinned by content hash, never by mutable tag.
Tests. A test asserts no bundled card/descriptor ships a null hash once R0
closes; the reproduce target is exercised in CI against the stubs.
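The null-hash test could take roughly this shape. It is a sketch under assumptions: cards are modeled as plain dicts, any key ending in `sha256` is treated as a hash field, and `unpinned_hashes` is a hypothetical helper rather than part of the test suite.

```python
def unpinned_hashes(cards: list[dict]) -> list[str]:
    """List "card.field" for every hash field that is still null or empty."""
    missing = []
    for card in cards:
        for field, value in card.items():
            if field.endswith("sha256") and not value:
                missing.append(f"{card.get('name', '<unnamed>')}.{field}")
    return missing


# Hypothetical card contents; real cards would be loaded from the bundled files.
cards = [
    {"name": "stub-backbone", "checkpoint_sha256": None},
    {"name": "pinned-backbone", "checkpoint_sha256": "ab" * 32},
]
print(unpinned_hashes(cards))  # ['stub-backbone.checkpoint_sha256']
```

Once R0 closes, the gating assertion is simply that this list is empty for every shipped card and descriptor.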
Context. v0.1.0 ships correct, swappable interfaces exercised by weight-free stubs. R1 makes the real predictors load — through the license-gated, consent-required, checksum-verified model zoo — so a user who opts in gets the published models, with the checkpoint recorded in every result's provenance.
Deliverables.
- Backbone download/consent flow (the first slice, landing with this spec). Route _HuggingFaceEmbedder (Nucleotide Transformer v2 / Caduceus / Evo 2) through the model zoo instead of a bare from_pretrained(model_id):
  - ModelRegistry.authorize(name, *, use, consent): the license + consent gate for hub-resolved models, returning the provenance ModelCheckpoint.
  - SequenceEmbedder.resolve_weights(...): uses registry.checkpoint(...) to fetch-and-checksum a pinned single artifact when the card pins a hash, else authorize(...); records the resolved ModelCheckpoint. No consent ⇒ ConsentError; wrong license-for-use ⇒ LicenseError; bad bytes ⇒ ChecksumError. The whole flow is CI-tested with an injected downloader (no network, no torch); the actual tensor load stays real_weights-gated.
  - model_checkpoint() on the embedder so a scorer can stamp the backbone into provenance.
- Menu provenance records every model invoked (☑ landed). design() stamps the card-backed ModelCheckpoint of each eligible chemistry's default scorers into RankedMenu.provenance.models (deduped by name + version, scoped to the chemistries that were eligible), and the HTML/PDF report footers render them. The reproducibility golden captures the (deterministic) set. The real-backbone hash fills the stub's null once R0 pins it.
- Shared weight gate (◐ landed). One model_zoo.loader.WeightGate mixin implements the consent/license/checksum resolution for every trained model, so the flow lives in one place rather than per chemistry.
- Per-chemistry real scorers, each behind its card and the shared gate. The consent/license/checksum resolution is wired for all of them (◐); the trained forward pass over the loaded weights is the remaining step (needs the real weights / real_weights):
  - Cas9 efficiency: the backbone resolves through the gate; loading the fitted Rule Set 3 coefficients + deep-ensemble heads is next.
  - Cas9 outcome: inDelphi / Lindel / X-CRISP adapters gated (◐); forward pass next.
  - Base-edit outcome: BE-DICT / BE-Hive adapters gated (◐); forward pass next.
  - Prime efficiency: DeepPrime / GenET adapters gated (◐); PRIDICT2.0 trained weights replace the heuristic next.
- ONNX export path (export_onnx, ◐ landed): the backbone exports to a dynamic-axes ONNX graph (batch + sequence dims dynamic, opset 17) by tracing the consent-resolved model on a sample sequence, for portable inference under any ONNX runtime. The export code is wired now; running it needs the ml extra + real weights, so it stays real_weights-gated like the tensor load.
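The consent/license/checksum resolution described above can be sketched with an injected downloader, which is what lets it run without network or torch. The exception and `ModelCheckpoint` names come from this spec; the `Card` fields, the function signature, and the order of the consent and license checks are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


class ConsentError(Exception): ...
class LicenseError(Exception): ...
class ChecksumError(Exception): ...


@dataclass(frozen=True)
class Card:  # hypothetical card shape
    name: str
    version: str
    allowed_uses: frozenset
    sha256: Optional[str]  # None means hub-resolved, nothing to checksum
    url: str


@dataclass(frozen=True)
class ModelCheckpoint:
    name: str
    version: str
    sha256: Optional[str]


def resolve_weights(
    card: Card, *, use: str, consent: bool, download: Callable[[str], bytes]
) -> tuple[ModelCheckpoint, Optional[bytes]]:
    """Gate a fetch on consent and license, then verify the bytes against the card."""
    if not consent:
        raise ConsentError(f"{card.name}: explicit consent is required")
    if use not in card.allowed_uses:
        raise LicenseError(f"{card.name}: license does not permit {use!r} use")
    checkpoint = ModelCheckpoint(card.name, card.version, card.sha256)
    if card.sha256 is None:
        return checkpoint, None  # authorize only; no pinned artifact to fetch
    data = download(card.url)
    if hashlib.sha256(data).hexdigest() != card.sha256:
        raise ChecksumError(f"{card.name}: artifact does not match the pinned hash")
    return checkpoint, data


# Fake artifact and downloader, as a CI test would inject them.
payload = b"fake-weights"
card = Card("demo-backbone", "1.0", frozenset({"research"}),
            hashlib.sha256(payload).hexdigest(), "https://example.invalid/weights.bin")
checkpoint, data = resolve_weights(card, use="research", consent=True,
                                   download=lambda url: payload)
```

Returning the `ModelCheckpoint` alongside the bytes is what allows a scorer to stamp the resolved backbone into provenance.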
Defaults & decisions. Default backbone stays Nucleotide Transformer v2 (500M); it is CC-BY-NC-SA — loadable for research, refused for commercial use by the license gate. Real weights are never vendored; they are fetched at runtime with explicit consent and verified against the pinned card hash. The stub path remains the CI default so the suite needs no weights.
Tests. Consent/license/checksum behavior is unit-tested with a fake downloader
(CI). Real embedding/scoring parity-vs-published-numbers tests are marked
real_weights (opt-in, skipped in CI). A provenance test asserts a real-backbone
scorer records the backbone ModelCheckpoint.
Context. v0.1.0 ships the native FM-index (bwt) with a Python-parity
test, but it is opt-in and not yet on a production hot path. The spec layout also
reserves kmer and haplotype kernels. R2 implements them and wires them into
the call sites that need them, so the native build delivers real speedups — not
dead code.
Deliverables.
- True-linear suffix-array construction: SA-IS (◐ landed). bwt.rs builds the suffix array by SA-IS (sais.rs, induced sorting, O(n)), replacing the prefix-doubling (O(n log² n)) build behind the same interface, with no degradation on the long poly-A / poly-N runs and tandem repeats real genomes contain. Output is byte-identical to the direct sort (unique sentinel ⇒ unique SA), pinned directly by a parity test of the exposed fm_suffix_array against the ground-truth direct sort (pathological + fuzz inputs) and end-to-end by the FM-index count/locate parity over low-complexity and random-long inputs.
- kmer kernel (◐ landed). A native Rust k-mer kernel (kmer.rs) + pure-Python fallback (offtarget._kmer), wired into the off-target scan as a seed-and-extend prefilter (scan_sequence(..., seed=...)). It is a proven superset (pigeonhole: ≥1 uncut, substitution-free block of length k = ⌊n/(E+1)⌋ survives any in-budget alignment), pinned by an exhaustive randomized seeded ≡ brute-force test. Honest finding from the R2 micro-benchmark (scripts/native_speedup.py): the seed must run before the PAM check to prune, and it only pays off when selective (k ≥ 5, i.e. low edit budget). It measured ~2–4x there and is a no-op at AlleleForge's default ≤4-mismatch + bulge budget (the seed is too short to prune). So it auto-engages only when k ≥ 5; the FM-index seed-and-extend remains the genome-scale path for the default budget.
- FM-index wired into the reference scan (◐ landed). The engine's stage-1 reference search now runs FM-index seed-and-extend (scan_sequence(..., use_fm_index=...)): each concrete PAM is located in a content-addressed FM-index (the PAM is the seed) and only those anchors are extended by the shared alignment, replacing the linear O(n) PAM pass. It returns byte-identical hits to the brute-force scan (pinned by a randomized parity test on both the low-level scan and the engine report), and auto-engages per region past FM_INDEX_AUTO_THRESHOLD (1 Mb) so genome-scale contigs take the indexed path while small inputs stay on the linear scan. The persistent, memory-mapped whole-genome variant of this index is R4's GenomeIndex.
- haplotype kernel (◐ landed). A native Rust haplotype-walk kernel (haplotype.rs: haplotype_apply_variants) + pure-Python fallback (offtarget._haplotype) wired into the haplotype off-target engine (haplotype.py::_apply_all): it materializes a common haplotype's alternative sequence by applying the haplotype's full variant set to the reference window (right-to-left so indels keep coordinates valid; a reference-base clash yields None and the engine skips it). It is byte-identical to the Python path, pinned by a fuzz parity test (lowercase refs, N bases, indels, overlaps, out-of-window positions), and measures ~4x in the R2 micro-benchmark. With this the three spec kernels (bwt/kmer/haplotype) are all on their hot paths behind the fallback-plus-parity discipline.
- A bench/native_speedup.py micro-benchmark recording native-vs-Python wall time per kernel (reported, not gated).
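The pigeonhole argument behind the k-mer prefilter can be checked in a few lines. The sketch below is restricted to substitutions (no bulges), scans windows linearly rather than through a k-mer index, and uses hypothetical function names; it illustrates the superset property, not the offtarget._kmer implementation.

```python
import random


def seed_blocks(pattern: str, max_edits: int) -> list[tuple[int, str]]:
    """Split the pattern into max_edits + 1 disjoint blocks of length k = n // (E + 1)."""
    k = len(pattern) // (max_edits + 1)
    return [(i * k, pattern[i * k:(i + 1) * k]) for i in range(max_edits + 1)]


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def brute_force(text: str, pattern: str, max_edits: int) -> list[int]:
    n = len(pattern)
    return [i for i in range(len(text) - n + 1)
            if hamming(text[i:i + n], pattern) <= max_edits]


def seeded(text: str, pattern: str, max_edits: int) -> list[int]:
    n = len(pattern)
    blocks = seed_blocks(pattern, max_edits)
    hits = []
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        # If every one of the E + 1 disjoint blocks holds a mismatch, the window
        # has at least E + 1 mismatches, so it can be pruned without alignment.
        if not any(window[off:off + len(block)] == block for off, block in blocks):
            continue
        if hamming(window, pattern) <= max_edits:
            hits.append(i)
    return hits


# Randomized seeded-versus-brute-force parity over small inputs.
rng = random.Random(0)
for _ in range(100):
    text = "".join(rng.choice("ACGT") for _ in range(60))
    pattern = "".join(rng.choice("ACGT") for _ in range(12))
    for budget in range(4):
        assert seeded(text, pattern, budget) == brute_force(text, pattern, budget)
```

With a 20-nt pattern, k = 5 at E = 3 and k = 4 at E = 4, which matches the observation that the seed stops being selective at the default budget.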
Defaults & decisions. Every native kernel keeps a correct pure-Python
fallback and a parity test pinning byte-identical results; prefer_native
selects it when built. The library never requires the crate.
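As an illustration of what a pure-Python fallback looks like, here is a sketch of the right-to-left variant application used by the haplotype kernel. The tuple layout, the case-insensitive reference check, and the choice to skip out-of-window variants are assumptions; offtarget._haplotype may differ.

```python
from typing import Optional

# (0-based offset within the window, reference allele, alternative allele)
Variant = tuple[int, str, str]


def apply_variants(window: str, variants: list[Variant]) -> Optional[str]:
    """Apply a haplotype's variants to a reference window, rightmost first."""
    seq = window
    # Editing from the right keeps every untouched offset to the left valid,
    # even when an indel changes the sequence length.
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        if pos < 0 or pos + len(ref) > len(window):
            continue  # assumption: out-of-window variants are ignored
        if seq[pos:pos + len(ref)].upper() != ref.upper():
            return None  # reference-base clash: the caller skips this haplotype
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


print(apply_variants("ACGTACGT", [(1, "C", "T"), (4, "AC", "A")]))  # ATGTAGT
print(apply_variants("ACGT", [(0, "G", "T")]))  # None
```

A parity test then feeds the same windows and variant sets to the native kernel and asserts equal output, including the `None` cases.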
Tests. Parity tests per kernel (native == Python) run in the CI rust job;
the off-target engine's existing tests run on both paths (fallback in the main
matrix, native in the rust job).
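A parity test of this kind compares an indexed answer against an obviously correct one. The sketch below builds a suffix array by direct sort (the ground truth SA-IS is checked against), derives a small FM-index from it, and compares backward-search counts with naive counting. All names are illustrative, and the input is assumed not to contain the `$` sentinel.

```python
def suffix_array(text: str) -> list[int]:
    """Ground-truth suffix array by direct sort; text must end in a unique sentinel."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def fm_count(text: str, pattern: str) -> int:
    """Count occurrences of a non-empty pattern by FM-index backward search."""
    text = text + "$"  # '$' sorts before A, C, G, T, N in ASCII
    sa = suffix_array(text)
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to the sentinel
    alphabet = sorted(set(text))
    # first[c] = number of characters in text strictly smaller than c
    first, total = {}, 0
    for ch in alphabet:
        first[ch] = total
        total += text.count(ch)
    # occ[c][i] = occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    lo, hi = 0, len(bwt)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


def naive_count(text: str, pattern: str) -> int:
    return sum(text.startswith(pattern, i) for i in range(len(text)))


# Low-complexity inputs are the cases the spec singles out.
for text in ["A" * 32, "ACACACACACAC", "NNNNACGTNNNN", "GATTACAGATTACA"]:
    for pattern in ["A", "AC", "ACA", "NN", "GATTACA", "T"]:
        assert fm_count(text, pattern) == naive_count(text, pattern)
```

The native parity test would swap `suffix_array` for the exposed native builder and assert the two arrays are equal before comparing counts.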
Context. Three NotImplementedError adapters were wired but inert:
cas_offinder_adapter (off-target cross-check), variant/effect VEP REST, and
the HGVS projection backend. R3 makes them real, behind the same consent/registry
discipline as data and models. All three now have a real implementation behind
recorded-fixture tests (◐ landed); only the live network/binary calls are
opt-in (live_integration-marked) and never run in CI.
Deliverables.
- Cas-OFFinder adapter (◐ landed): format_input builds the binary's input deck (spacer-Ns + PAM pattern, query + mismatch budget); parse_output reads its results in both the legacy 6-column and bulge-aware 8-column layouts; run(..., runner=...) orchestrates write→invoke→parse with an injectable runner, so CI tests everything but the subprocess call itself, and disagreements are surfaced via the existing disagreements() cross-check.
- VEP consequence (◐ landed): VepRestPredictor issues the region-endpoint GET through an injectable fetcher and parse_vep_response maps the JSON to a VariantEffect (MANE/canonical or named-transcript selection, most-severe SO term, impact tier), with response caching keyed by (variant, assembly, transcript). CI replays a recorded VEP response; only the live GET is opt-in.
- HGVS projection (◐ landed): HgvsLibraryProjector wraps the real hgvs library (UTA + SeqRepo, AssemblyMapper.c_to_g) behind the existing HgvsProjector interface; the import guard degrades to a clear RuntimeError when the optional library is absent (tested), and the live projection is opt-in.
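The injectable-fetcher pattern shared by these adapters can be sketched for the VEP case as follows. `CachedVepSketch` and `parse_response` are hypothetical stand-ins, the fixture is hand-written rather than a real recorded response, and the URL layout and JSON field names reflect the Ensembl REST VEP region endpoint as best understood here, so treat them as assumptions to verify against the Ensembl documentation.

```python
import json
from typing import Callable, Optional

Fetcher = Callable[[str], str]  # URL in, JSON text out; injected so CI needs no network


def parse_response(payload: list, transcript: Optional[str]) -> dict:
    """Pick the named transcript if given, else the one flagged canonical."""
    record = payload[0]
    consequences = record.get("transcript_consequences", [])
    if transcript is not None:
        chosen = [c for c in consequences if c.get("transcript_id") == transcript]
    else:
        chosen = [c for c in consequences if c.get("canonical") == 1]
    picked = chosen[0] if chosen else None
    return {
        "most_severe": record.get("most_severe_consequence"),
        "transcript": picked.get("transcript_id") if picked else None,
        "impact": picked.get("impact") if picked else None,
    }


class CachedVepSketch:
    def __init__(self, fetcher: Fetcher, server: str = "https://rest.ensembl.org") -> None:
        self._fetch = fetcher
        self._server = server
        self._cache: dict = {}

    def predict(self, region: str, allele: str, assembly: str,
                transcript: Optional[str] = None) -> dict:
        # Cache key mirrors (variant, assembly, transcript); in this sketch the
        # assembly only scopes the key and does not change the URL.
        key = ((region, allele), assembly, transcript)
        if key not in self._cache:
            url = f"{self._server}/vep/human/region/{region}/{allele}?canonical=1"
            self._cache[key] = parse_response(json.loads(self._fetch(url)), transcript)
        return self._cache[key]


# Hand-written fixture standing in for a recorded response.
RECORDED = json.dumps([{
    "most_severe_consequence": "missense_variant",
    "transcript_consequences": [
        {"transcript_id": "ENST00000000001", "impact": "MODIFIER"},
        {"transcript_id": "ENST00000000002", "impact": "MODERATE", "canonical": 1},
    ],
}])
calls: list = []


def replay(url: str) -> str:
    calls.append(url)
    return RECORDED


client = CachedVepSketch(replay)
effect = client.predict("9:22125503-22125503:1", "C", "GRCh38")
again = client.predict("9:22125503-22125503:1", "C", "GRCh38")
```

The second call is served from the cache, so `calls` holds one URL; only a live fetcher would ever touch the network.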
Defaults & decisions. External tools are optional; their absence degrades gracefully to the native engine with an explicit flag, never a crash. Network calls require explicit opt-in and are cached.
Tests. Adapters are tested against recorded fixtures (no live network in CI); a live-integration test is marked and opt-in.
Context. Make the genome-scale and cohort-scale paths real.
Deliverables.
- Whole-genome on-disk FM-index (◐ landed). genome.GenomeIndex builds one content-addressed FM-index per contig (both strands) over a reference, driven by R2's native SA-IS (the on-disk FMIndex build now uses the linear-time kernel, so the persistent path scales to whole chromosomes). It survives across runs (a re-run memory-maps the cached contig index instead of rebuilding) and is queried over its memory map without pinning the index in RAM. The off-target engine consumes it via search(..., genome_index=...) for the reference scan: identical hits to the per-call build (a parity test pins this), but built once and reused. Validated in CI on a downsampled-chromosome fixture in the rust job (native SA-IS build → mmap query → linear-scan parity, plus cross-run reuse). Full hg38 / T2T-CHM13 builds remain an opt-in nightly.
- Cohort throughput (◐ landed). design.design_many(variants, ...) streams a whole cohort through design: the input is consumed lazily (any iterable: a cyvcf2 stream, a generator, a list) and only the per-item working set is held (each menu is summarized/optionally written to disk, then released), so peak memory does not grow with cohort size; pass on_result for a truly O(1) run. It is resumable via a JSONL run manifest (a re-run skips items already recorded) that opens with a provenance header, isolates per-item failures (an unresolvable variant is recorded, not fatal), and offers a thread-parallel path (max_workers + a reference_factory, since a pyfaidx handle is not thread-safe to share). The cyvcf2 fast path landed (variant.iter_vcf): a streaming adapter reads a VCF with cyvcf2 and yields one VcfRecord per concrete ALT allele (multi-allelic split; symbolic / spanning / non-ACGTN alleles skipped; non-PASS dropped by default), feeding design_many lazily. Its reader is injectable (duck-typed to the cyvcf2 Variant shape), so the split/filter logic is CI-tested without the native library; a path open raises a clear RuntimeError naming the genome extra when cyvcf2 is absent. Whole-genome scale validation on a real VCF remains an opt-in nightly.
- Content-addressed cross-run caches (◐ landed). A shared alleleforge.cache.ContentAddressedCache (sharded, atomically-written disk K/V under the cache dir) backs two cross-run memos: CachedEmbedder.persistent reuses embeddings across runs (scoped per backbone identity), and OffTargetCache (via search(..., cache=...)) reuses the reference scan. The off-target cache is safety-gated: it is used only for a reference-only search with the default scorer (no gnomAD/haplotype/patient augmentation, which the key cannot fully capture), so a stale entry can never be served for a danger scan.
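A sharded, atomically written disk key/value store of the kind described for the cross-run caches can be sketched as below. `DiskCacheSketch` is a hypothetical name and the layout (two-character shard directories, SHA-256 of the key material) is an assumption, not the layout of alleleforge.cache.ContentAddressedCache.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Optional


class DiskCacheSketch:
    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path(self, key: bytes) -> Path:
        digest = hashlib.sha256(key).hexdigest()
        return self.root / digest[:2] / digest[2:]  # shard by the first byte

    def get(self, key: bytes) -> Optional[bytes]:
        try:
            return self._path(key).read_bytes()
        except FileNotFoundError:
            return None

    def put(self, key: bytes, value: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory, then rename into place,
        # so a reader never observes a half-written entry.
        fd, tmp = tempfile.mkstemp(dir=path.parent)
        try:
            with os.fdopen(fd, "wb") as handle:
                handle.write(value)
            os.replace(tmp, path)
        except BaseException:
            os.unlink(tmp)
            raise


with tempfile.TemporaryDirectory() as tmp_dir:
    cache = DiskCacheSketch(Path(tmp_dir))
    key = b"backbone=demo-1.0|seq=ACGT"  # key material scoped per backbone identity
    miss = cache.get(key)
    cache.put(key, b"embedding-bytes")
    hit = DiskCacheSketch(Path(tmp_dir)).get(key)  # a fresh instance models a later run
```

Anything that changes the answer has to be part of the key material, which is why the spec restricts the off-target cache to searches whose inputs the key fully captures.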
Defaults & decisions. Streaming over materializing; bounded memory is a hard requirement; every batch run emits a provenance manifest.
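The streaming, resumable shape of a cohort run can be sketched with a generator and a JSONL manifest. This is an illustration under assumptions: the record fields, the header contents, and the names `design_many_sketch` and `fake_design` are hypothetical and do not describe design.design_many's actual manifest schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator


def design_many_sketch(
    items: Iterable[str],
    design_one: Callable[[str], dict],
    manifest: Path,
) -> Iterator[dict]:
    """Stream items through design_one, appending one JSONL record per item."""
    done = set()
    if manifest.exists():
        with manifest.open() as handle:
            for line in handle:
                record = json.loads(line)
                if record.get("kind") == "item":
                    done.add(record["id"])  # recorded earlier, success or failure
    fresh = not manifest.exists() or manifest.stat().st_size == 0
    with manifest.open("a") as handle:
        if fresh:
            handle.write(json.dumps({"kind": "header", "tool": "sketch"}) + "\n")
        for item in items:  # consumed lazily; no per-cohort list is built
            if item in done:
                continue
            try:
                record = {"kind": "item", "id": item, "status": "ok",
                          "summary": design_one(item)}
            except Exception as exc:  # one bad variant must not stop the run
                record = {"kind": "item", "id": item, "status": "error",
                          "error": str(exc)}
            handle.write(json.dumps(record) + "\n")
            handle.flush()
            yield record


def fake_design(variant_id: str) -> dict:
    """Hypothetical stand-in for design(); fails on one input to show isolation."""
    if variant_id == "bad":
        raise ValueError("unresolvable variant")
    return {"n_candidates": len(variant_id)}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "run.jsonl"
    first = list(design_many_sketch(iter(["v1", "bad", "v2"]), fake_design, path))
    second = list(design_many_sketch(iter(["v1", "bad", "v2", "v3"]), fake_design, path))
    manifest_lines = path.read_text().splitlines()
```

The second run processes only `v3`, because every earlier item, including the failed one, is already in the manifest.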
Tests. Scale tests run on a downsampled chromosome fixture in CI; full-genome runs are an opt-in nightly.
Context. The honesty claims must be earned on real data, not asserted.
Deliverables.
- Reproduce published efficiency/outcome numbers for each real scorer on its source benchmark split (R1), recorded as signed CRISPR-Bench results.
- Calibration study. The machinery and the regeneration script have landed (◐), on the weight-free splits; the real-data numbers fill in with R1. scoring.ConformalCalibrator recalibrates predictive intervals to a target coverage with the finite-sample split-conformal guarantee (the regression analog of IsotonicCalibrator for probabilities), preserving relative interval shape; empirical_coverage measures whether intervals need it. The cross-cell-type generalization gap is quantified by benchmark.generalization_gap: a task's primary metric on an in-context fold (a training-seen cell type) vs the held-out cell type, oriented so positive means worse generalization. scripts/calibration_study.py regenerates the calibration report: per-task ECE from CRISPR-Bench, the generalization-gap table, and a conformal recalibration demonstration (coverage before/after at the spec levels). Measured ECE on real data per task remains (needs R1); the gap machinery runs now on the weight-free cross-context splits.
- Methods preprint (◐ draft landed). docs/paper/preprint.md drafts the outline into a full manuscript: abstract, methods, the CRISPR-Bench design, the weight-free end-to-end results (the reference-bias reproduction and the split-conformal coverage table), reproducibility, and discussion. The per-task accuracy-vs-published numbers are explicitly marked [pending R1]; they fill in with the real-weights integration, at which point the draft becomes the posted preprint.
- Reproducible figures (☑ landed). alleleforge.viz renders four figures from the weight-free, deterministic pipeline with a dependency-free SVG renderer (no plotting stack): the reference-bias reproduction, the split-conformal coverage restoration, per-task ECE, and the cross-cell-type generalization gap. They regenerate byte-for-byte (scripts/figures.py / make figures), are committed under docs/assets/figures/, and are embedded in the preprint and README. The per-task values fill in with R1; the renderer and machinery are final.
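Split-conformal recalibration that preserves relative interval shape can be sketched by rescaling each interval's half-width with one calibrated factor. This is a standard normalized-score construction chosen for illustration; it is not taken from scoring.ConformalCalibrator, and the function names and synthetic data are assumptions.

```python
import math
import random


def conformal_scale(y_true, centers, half_widths, alpha: float) -> float:
    """Smallest scale whose rescaled intervals carry the split-conformal guarantee."""
    # Normalized nonconformity score: how many half-widths the truth lies from the center.
    scores = sorted(abs(y - c) / h for y, c, h in zip(y_true, centers, half_widths))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected quantile rank
    return math.inf if rank > n else scores[rank - 1]


def recalibrate(center: float, half_width: float, scale: float) -> tuple[float, float]:
    return center - scale * half_width, center + scale * half_width


def empirical_coverage(y_true, intervals) -> float:
    return sum(lo <= y <= hi for y, (lo, hi) in zip(y_true, intervals)) / len(y_true)


# Synthetic demo: unit-variance noise with intervals that are far too narrow.
rng = random.Random(0)
centers = [rng.uniform(0, 10) for _ in range(2500)]
truth = [c + rng.gauss(0, 1) for c in centers]
half_widths = [0.5] * len(centers)

cal, test = slice(0, 500), slice(500, None)
scale = conformal_scale(truth[cal], centers[cal], half_widths[cal], alpha=0.1)
before = empirical_coverage(
    truth[test], [recalibrate(c, h, 1.0) for c, h in zip(centers[test], half_widths[test])])
after = empirical_coverage(
    truth[test], [recalibrate(c, h, scale) for c, h in zip(centers[test], half_widths[test])])
```

Because every interval is multiplied by the same factor, a prediction that was twice as wide as another before recalibration is still twice as wide afterwards.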
Defaults & decisions. ECE is reported on every task (already enforced by CRISPR-Bench); a scorer whose intervals are miscalibrated on real data is recalibrated or shipped with the OOD flag dominant, never silently.
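For reference, the usual binned form of expected calibration error for probabilistic predictions is sketched below. How CRISPR-Bench defines ECE per task (bin count, handling of regression tasks) is not stated in this document, so the equal-width ten-bin choice here is an assumption.

```python
def expected_calibration_error(confidences, outcomes, n_bins: int = 10) -> float:
    """Weighted mean gap between predicted confidence and observed frequency per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_confidence - observed)
    return ece


# Ten predictions at 0.9 confidence with only five positives: a gap of about 0.4.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
calibrated = expected_calibration_error([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1])
```

A scorer with a large value on real data is the case the rule above covers: recalibrate it, or ship it with the OOD flag dominant.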
Tests. Benchmark runner produces signed, provenance-stamped result JSON for each real scorer; leaderboard renders them; calibration figures regenerate from a script.
v1.0 is cut only when:
- Every shipped card/descriptor pins a real, verified artifact hash (R0).
- At least the Cas9-efficiency, PE-efficiency, and one outcome scorer load real weights through the consent/checksum flow and reproduce their published numbers within tolerance (R1 + R5).
- The native bwt/kmer/haplotype kernels are on their hot paths with parity tests and a recorded speedup (R2).
- Calibration (ECE) is measured on real data and intervals are calibrated or honestly flagged (R5); the cross-context generalization gap is documented.
- The methods preprint is posted and the Zenodo DOI minted (R5 + R0).
Until then the public release stays v0.1.0: three chemistries end to end with honest uncertainty and the benchmark, baked but not yet externally validated.
This document extends SPEC.md. When this document (v2) and SPEC.md (v1) disagree, v2 wins for post-v0.1.0 work; otherwise SPEC.md remains the contract for the shipped surface.