AlleleForge

Variant in, corrective edit out.

A variant-driven, multi-modality, uncertainty-aware CRISPR guide & edit design framework — across SpCas9 nuclease, base editors, and prime editors, with population-aware off-target nomination and a public benchmark.

Warning

AlleleForge is a research tool. It is not a medical device and does not provide medical advice. It produces ranked, explicitly uncertain design hypotheses. Every off-target nomination it makes is computational and must be experimentally validated before any wet-lab or therapeutic use. See Scope & responsible use.

Why AlleleForge

Most monogenic disease is, in effect, a copy-paste error at the allele level. The job of a genome editor is to forge the corrective edit. Today that job is fragmented across a dozen single-purpose tools — one to pick a guide, another to predict efficiency, a third to enumerate prime-editing extensions, a fourth to scan for off-targets — none of which speak the same language and few of which agree on what "uncertain" means.

AlleleForge unifies the journey behind one typed interface: you supply a variant, it returns a ranked, safety-annotated menu of candidate edits spanning every applicable modality, each carrying a calibrated uncertainty interval, a predicted edit outcome, and a population- and haplotype-aware off-target report.

The four-axis gap it fills

For prime editing in particular, no existing open-source tool combines all four of:

Axis	PRIDICT2.0	PrimeDesign / PrimeVar	CRISPRme	AlleleForge
Therapeutic variant front-end	✗	✓	✗	✓
ML efficiency with calibrated uncertainty	✓	✗	✗	✓
Outcome / byproduct prediction	partial	✗	✗	✓
Population-aware off-target	✗	✗	✓	✓

AlleleForge's contribution is to wrap the best existing models (PRIDICT2.0, BE-Hive, BE-DICT, inDelphi, Cas-OFFinder, …) behind a unified, typed, uncertainty-honest interface and add value at the seams.

Design principles

Variant-first. The canonical journey starts from what is broken, not from a guide.
Honest uncertainty. Every numeric prediction ships with a calibrated interval. No scorer returns a bare float.
Population-aware by default. Reference-only off-target analysis is a known safety gap (the Casgevy / BCL11A rs114518452 case is the canonical cautionary tale). AlleleForge searches population variation by default.
Wrap, don't rebuild. Integrate proven tools; add new ML only at genuine coverage gaps.
Reproducible to the byte. Pinned environments, versioned datasets, deterministic seeds, content-hashed checkpoints.
Three audiences, one core. The library is the source of truth; CLI and web are thin shells over it.
Typed and tested. mypy --strict, ruff, and Hypothesis property tests on all core logic.
Cite everything. Every dataset, model, and scoring function carries a literature citation and a version.

Architecture

AlleleForge is strictly layered: lower layers know nothing about higher ones. The Designer is the only component that sees the whole pipeline; every domain service is independently testable and usable.

flowchart TB
    subgraph I["Interfaces"]
        PY["Python library"]
        CLI["aforge CLI"]
        WEB["Web UI (FastAPI + Next.js)"]
    end
    subgraph O["Orchestration"]
        DES["Designer: variant → routing → candidates → score → outcome → off-target → rank → report"]
    end
    subgraph D["Domain services"]
        VR["Variant resolver<br/>(HGVS, ClinVar)"]
        EN["Guide enumerators<br/>(cas9, base, prime)"]
        SC["Scoring<br/>(efficiency, outcome, uncertainty)"]
        OT["Off-target engine<br/>(population / haplotype)"]
    end
    subgraph F["Foundations"]
        GA["Genome access<br/>(FASTA, FM-index)"]
        DR["Data registry<br/>(DVC, gnomAD, ClinVar)"]
        MZ["Model zoo<br/>(ckpt hashing)"]
        CT["Core types and schemas"]
    end
    RUST["Rust / PyO3 — aforge_native: BWT off-target search · k-mer hashing · haplotype walking"]

    I --> O --> D --> F
    OT -.calls.-> RUST
    EN -.calls.-> RUST

The variant-first journey

sequenceDiagram
    actor U as User
    participant R as Resolver
    participant Rt as Router
    participant E as Enumerators
    participant S as Scorers
    participant X as Off-target engine
    participant K as Ranker

    U->>R: ClinVar / rsID / HGVS / VCF / coords
    R->>Rt: normalized Variant + consequence
    Rt->>E: eligible modalities (nuclease / base / prime)
    E->>S: candidate guides and pegRNAs
    S->>S: efficiency + outcome (calibrated Prediction)
    E->>X: spacers / nicks
    X->>X: reference → population → haplotype → patient VCF
    S->>K: scored candidates
    X->>K: ancestry-stratified off-target reports
    K-->>U: RankedMenu (+ Pareto front, provenance, disclaimer)

Build status & roadmap

AlleleForge is built in ordered phases (see SPEC.md, the authoritative build contract). Phases 0–5 establish the spine before any modality or ML code.

Phase	Component	Status
0	Repo bootstrap, CI, packaging, Rust toolchain	✅ done
1	Core domain types & schemas (`types/`)	✅ done
2	Genome access & indexing (`genome/`)	✅ done
3	Data registry & population datasets (`data/`)	✅ done
4	Variant resolver (`variant/`)	✅ done
5	Off-target engine — population & haplotype aware (`offtarget/`)	✅ done
6	Scoring foundations: model zoo, embeddings, uncertainty (`scoring/`, `model_zoo/`)	✅ done
7	Chemistry: SpCas9 nuclease (`enumerate/`, `scoring/`, `design/`)	✅ done
8	Chemistry: base editing — ABE / CBE (`enumerate/`, `scoring/`, `design/`)	✅ done
9	Chemistry: prime editing	⏳ next
10	Designer: routing, candidate menu, ranking	◻️ planned
11	Reporting & oligo output	◻️ planned
12	CLI (`aforge`)	◻️ planned
13	Web UI & API	◻️ planned
14	CRISPR-Bench: benchmark, splits, leaderboard	◻️ planned
15	Docs, examples, release	◻️ planned

Install

AlleleForge targets Python ≥ 3.11. The core install is deliberately light; heavy scientific, ML, and web stacks live in optional dependency groups so the base package installs fast and CI stays reliable.

# Core library (light: pydantic types, config, model-card parsing — no torch/numpy)
pip install alleleforge            # once published to PyPI

# From source, with the optional groups you need
git clone https://github.com/clay-good/alleleforge
cd alleleforge
pip install -e ".[core,genome,variant,ml,dev]"

Optional dependency groups

Group	Pulls in	Needed for
`core`	polars, pyarrow, numpy	tabular I/O
`genome`	pyfaidx, pysam, cyvcf2, mappy, pyliftover	reference access, indexing (Phase 2)
`variant`	hgvs	HGVS resolution (Phase 4)
`ml`	torch, transformers, lightning, scikit-learn	real embedding backbones (Phase 6+); the uncertainty core needs none of these
`web`	fastapi, uvicorn	API server (Phase 13)
`docs`	mkdocs-material, mkdocstrings	documentation site
`dev`	ruff, mypy, pytest, hypothesis, maturin	development

Native acceleration (optional)

The performance kernels live in a PyO3 crate built with maturin. AlleleForge imports and runs cleanly without it (pure-Python mode); build it for speed:

pip install maturin
cd rust && maturin develop --release      # builds & installs aforge_native

alleleforge._native.NATIVE_AVAILABLE reports whether the compiled extension is present.
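An optional-extension bridge of this kind is commonly written as a guarded import with a pure-Python fallback. The sketch below is only an illustration of that pattern: the module name `aforge_native` comes from the build step above, while the `hamming` attribute and its signature are hypothetical and are not taken from AlleleForge.

```python
# Sketch of an optional-extension bridge. The module name aforge_native
# comes from the build step above; the attribute `hamming` is hypothetical.
try:
    import aforge_native  # compiled PyO3 extension, if it was built
    NATIVE_AVAILABLE = True
except ImportError:
    aforge_native = None
    NATIVE_AVAILABLE = False


def hamming(a: str, b: str) -> int:
    """Count mismatches between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if NATIVE_AVAILABLE and hasattr(aforge_native, "hamming"):
        return aforge_native.hamming(a, b)  # fast path (assumed signature)
    return sum(x != y for x, y in zip(a, b))  # pure-Python fallback


print(NATIVE_AVAILABLE, hamming("GATTACA", "GACTATA"))
```

Callers never need to know which path ran; the flag exists only for diagnostics.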

Quickstart

The end-to-end design pipeline lands incrementally across Phases 6–12. Today the package exposes the core vocabulary, genome access, the data registry, and the variant resolver, which together form the entire front half of the variant-first journey; the off-target engine, the scoring substrate, and the SpCas9 and base-editing verticals (Phases 5–8) are covered in their own sections below. The snippets below work now; the full design() call arrives as the remaining phases complete.

from alleleforge.types import DNASequence, Prediction, UncertaintyMethod

seq = DNASequence("ACGTRYN")           # validates IUPAC alphabet
print(seq.reverse_complement())        # ambiguity-aware: R↔Y, N↔N → "NRYACGT"

# Every numeric prediction carries a calibrated interval, never a bare float.
p = Prediction(value=0.72, interval=(0.61, 0.83), method=UncertaintyMethod.ENSEMBLE,
               in_distribution=True, calibrated=True)
print(p.interval_level)                # 0.80 by default
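The ambiguity-aware reverse complement shown above can be reproduced with the standard library alone. The following is a minimal sketch using the standard IUPAC complement table; the function name `reverse_complement_iupac` is illustrative and this is not the `DNASequence` implementation.

```python
# Illustrative sketch only; not the DNASequence implementation.
# Standard IUPAC nucleotide complements: each code maps to the code
# for the complementary base set (R<->Y, K<->M, B<->V, D<->H; S, W, N fixed).
_IUPAC = "ACGTRYSWKMBDHVN"
_IUPAC_COMPLEMENT = str.maketrans(_IUPAC, "TGCAYRSWMKVHDBN")


def reverse_complement_iupac(seq: str) -> str:
    """Return the reverse complement of an upper-case IUPAC DNA string."""
    unknown = set(seq) - set(_IUPAC)
    if unknown:
        raise ValueError(f"non-IUPAC symbols: {sorted(unknown)}")
    return seq.translate(_IUPAC_COMPLEMENT)[::-1]


print(reverse_complement_iupac("ACGTRYN"))  # NRYACGT
```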

Resolve a variant — every input form normalizes to one canonical, left-aligned record:

from alleleforge.variant import resolve, RawTarget
from alleleforge.types import DNASequence

# A raw target sequence with a marked edit — no reference file needed.
rv = resolve(RawTarget(sequence=DNASequence("ACGTAACGTACGT"), position=4, ref="A", alt="G"))
print(rv.variant)            # target:4:A>G
print(rv.working_interval)   # 0-based half-open analysis window around it

# With a reference genome, indels are left-aligned and the asserted ref is
# validated against the build (a mismatch is a hard error — likely wrong build):
#   resolve("chr2:g.5226001del", reference=hg38, dbsnp=dbsnp_db)
#   resolve("VCV000012345", clinvar=clinvar_db)   # ClinVar accession → Variant

Inspect the data registry — every external dataset is versioned and license-aware:

from alleleforge.data import DEFAULT_REGISTRY

print(DEFAULT_REGISTRY.names)                 # ('1000g', 'clinvar', 'dbsnp', 'encode', ...)
clinvar = DEFAULT_REGISTRY.get("clinvar")
print(clinvar.version, clinvar.license)       # 2024-05  public-domain (NCBI)
# Non-redistributable sources are never vendored; downloads are consent-gated
# and checksum-verified. See docs/data.md for the full provenance table.

The target journey (Phase 12 CLI, planned):

# Variant → ranked, safety-annotated menu of candidate edits
aforge design --clinvar VCV000012345 --intent correct --populations all

# Standalone population/haplotype-aware off-target for a spacer
aforge offtarget --spacer GACGGAGGCTAAGCGTCGCAA --pam NGG

# Normalize any input form and show its consequence (debugging aid)
aforge resolve --hgvs "NM_000518.5:c.20A>T"

The variant-first front end (Phases 2–4, shipping now)

Phases 2–4 implement everything from an input to a validated, annotated variant with its genomic context — the foundation every modality plugs into.

flowchart LR
    subgraph IN["Accepted inputs"]
        A1["ClinVar accession"]
        A2["dbSNP rsID"]
        A3["HGVS g./c./p."]
        A4["VCF record"]
        A5["raw coordinates"]
        A6["raw target seq"]
    end
    R["resolve()"]
    subgraph NORM["Normalize"]
        N1["left-align + trim<br/>(bcftools-norm)"]
        N2["validate ref vs build<br/>(hard error on mismatch)"]
    end
    OUT["ResolvedVariant<br/>variant · working interval ·<br/>consequence · T2T recommendation"]

    A1 & A2 & A3 & A4 & A5 & A6 --> R --> NORM --> OUT
    R -. ClinVar/dbSNP/HGVS lookups .- DATA["Data registry<br/>(versioned, license-aware)"]
    NORM -. fetch + flag ambiguous loci .- GEN["Genome access<br/>(FASTA, FM-index, liftover)"]
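The two normalization steps in the diagram (reference validation, then left-align and trim) follow a well-known pattern. The sketch below shows one common formulation of it on a plain string reference, with 0-based positions. The function names are hypothetical, and this is not the resolver's code; it only illustrates the technique.

```python
def validate_ref(seq: str, pos: int, ref: str) -> None:
    """Raise if the asserted ref allele disagrees with the reference."""
    observed = seq[pos:pos + len(ref)]
    if observed != ref:
        raise ValueError(f"ref mismatch at {pos}: asserted {ref!r}, found {observed!r}")


def left_align(seq: str, pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Left-align and trim a variant; pos is 0-based on seq."""
    if not ref or not alt or ref == alt:
        raise ValueError("ref and alt must be non-empty and different")
    validate_ref(seq, pos, ref)
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 0:
                raise ValueError("cannot left-extend past the contig start")
            pos -= 1
            ref, alt = seq[pos] + ref, seq[pos] + alt
            changed = True
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt


# A CA deletion written at the right end of a CACACA repeat moves to the left end.
print(left_align("TTCACACAGG", 5, "ACA", "A"))  # (1, 'TCA', 'T')
```

Because every equivalent representation of the same indel collapses to one record, downstream lookups can compare variants by simple equality.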

Coordinate convention cheat-sheet. Internals are uniformly 0-based half-open; 1-based coordinates appear only at I/O boundaries, and every parser converts on read.

Surface	System	Converted by
AlleleForge internals (`GenomicInterval`, `Variant.pos`)	0-based half-open	— (canonical)
ClinVar / gnomAD / dbSNP VCF	1-based	`pos − 1` on read
GENCODE GTF	1-based inclusive	`[start − 1, end)` on read
ENCODE bedGraph	0-based half-open	unchanged
HGVS (`g.`), human-readable reports	1-based	boundary helpers only

Dataset provenance (pinned, versioned, citation-stamped — full table in docs/data.md):

Dataset	Version	License	Role
ClinVar	2024-05	Public domain	accession → variant + significance
gnomAD	v4.1	CC0-1.0	per-population allele frequencies
1000 Genomes	phase 3 high-cov	Public (IGSR)	phased common haplotypes
HGDP	gnomAD v3.1	CC0-1.0	ancestry breadth
dbSNP	b156	Public domain	rsID ↔ locus
GENCODE	v47	Open	gene models / transcripts
ENCODE	2024	Open	chromatin tracks

The off-target engine (Phase 5, shipping now)

AlleleForge's safety core, and its clearest point of novelty: off-target nomination that is reference-, population-, and haplotype-aware for every chemistry, behind one search() call that returns an ancestry-stratified report. Reference-only off-target analysis has a known blind spot — a minor allele can create a de novo PAM the reference never shows — and because allele frequencies differ by ancestry, that blind spot concentrates risk in under-represented populations.

flowchart TB
    SP["spacer + PAM"] --> S1
    subgraph ENG["search() — five stages"]
        direction TB
        S1["1 · Reference scan<br/>PAM-anchored · ≤4 mismatch · ≤1 DNA + ≤1 RNA bulge · both strands"]
        S2["2 · Population augmentation<br/>gnomAD alt-allele re-scan → de-novo PAMs / strengthened seed sites"]
        S3["3 · Haplotype walk<br/>common 1000G / HGDP haplotypes (variant combinations)"]
        S4["4 · Patient VCF (optional)<br/>personalize to one genome"]
        S5["5 · Score · threshold · de-dup · stratify"]
        S1 --> S5
        S2 --> S5
        S3 --> S5
        S4 --> S5
    end
    S5 --> R["OffTargetReport<br/>ancestry-stratified · every site tagged<br/>reference / population / patient + causal allele + freq"]

Every site records where it came from — the reference, a population variant (which allele, which populations, at what frequency), or a patient's VCF — so a nomination can be audited, not trusted blindly. The report's worst-case is computed against the worst-affected ancestry, never the average.
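The blind spot is easy to demonstrate on a toy sequence. The sketch below is a deliberately simplified, forward-strand-only scan with no bulges; it is not the engine's algorithm, and the function names are hypothetical. It shows a site that is invisible on the reference and appears once a single alternate allele completes an NGG PAM.

```python
def scan_forward(seq: str, spacer: str, max_mismatches: int = 4) -> list[tuple[int, int]]:
    """Return (start, mismatches) for forward-strand NGG-anchored sites."""
    n = len(spacer)
    hits = []
    for start in range(len(seq) - n - 3 + 1):
        pam = seq[start + n:start + n + 3]
        if pam[1:] != "GG":  # NGG: any base, then GG
            continue
        mismatches = sum(a != b for a, b in zip(seq[start:start + n], spacer))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


def apply_snv(seq: str, pos: int, ref: str, alt: str) -> str:
    """Substitute one base at 0-based pos after checking the ref allele."""
    if seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at {pos}, found {seq[pos]!r}")
    return seq[:pos] + alt + seq[pos + 1:]


spacer = "GACCATGCAACCTTGAACGT"
reference = "TTTT" + spacer + "TGA" + "TTTT"    # TGA is not an NGG PAM
alternate = apply_snv(reference, 26, "A", "G")  # minor allele turns TGA into TGG

print(scan_forward(reference, spacer))  # []
print(scan_forward(alternate, spacer))  # [(4, 0)]
```

Tagging the second hit with the allele that produced it is what makes the nomination auditable.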

Reference bias, reproduced

The canonical cautionary tale is rs114518452, a variant whose minor allele creates an off-target site for a BCL11A-enhancer-targeting guide (Cancellieri et al., Nat Genet 2023). AlleleForge reproduces it as an integration test: a reference-only scan returns zero sites, while the population-aware scan nominates the high-CFD off-target that the minor allele creates. The report is ancestry-stratified and records the allele's African-ancestry-enriched frequency.

from alleleforge.offtarget import search
from alleleforge.types.guide import PAM

# Assumes `spacer`, `hg38`, and `gnomad_db` were created earlier (setup not shown here).
report = search(spacer, PAM(pattern="NGG"), reference=hg38, gnomad=gnomad_db)
for site in report.sites:
    print(site.origin, round(site.score, 2), site.causal_allele, site.populations)
worst = report.worst_ancestry()        # ('afr', 1.0) — flagged, not averaged away

Specificity scoring cheat-sheet

Score	Source	Status in AlleleForge
MIT / Hsu	Hsu et al., Nat Biotechnol 2013	Exact — published 20-position weight table
CFD	Doench et al., Nat Biotechnol 2016	Published PAM table; mismatch weights default to a transparent seed model, injectable with the exact Doench matrix
CFD-Cas12a	analog	Seed at the PAM-proximal 5' end, `TTTV` PAM

All three sit behind one swappable OffTargetScorer protocol, so a Phase 6 ML scorer drops in without touching the engine. Reporting thresholds default to CFD ≥ 0.20 or MIT ≥ 0.10.
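A swappable scorer interface of this kind can be expressed with `typing.Protocol`. The sketch below is an assumption about the general shape only: the class names, the `score` signature, and the toy scorer are hypothetical, and only the two default thresholds are taken from the text above.

```python
from typing import Protocol


class OffTargetScorerSketch(Protocol):
    """Hypothetical stand-in for a swappable scorer interface."""
    name: str

    def score(self, spacer: str, site: str) -> float: ...


class MatchFraction:
    """Toy scorer: fraction of matching positions (neither MIT nor CFD)."""
    name = "match_fraction"

    def score(self, spacer: str, site: str) -> float:
        return sum(a == b for a, b in zip(spacer, site)) / len(spacer)


def should_report(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """Report a site if any score meets its threshold (logical OR)."""
    return any(scores.get(k, 0.0) >= t for k, t in thresholds.items())


DEFAULT_THRESHOLDS = {"cfd": 0.20, "mit": 0.10}

scorer: OffTargetScorerSketch = MatchFraction()  # satisfies the protocol structurally
print(scorer.score("ACGT", "ACGA"))                                    # 0.75
print(should_report({"cfd": 0.05, "mit": 0.12}, DEFAULT_THRESHOLDS))   # True
print(should_report({"cfd": 0.05, "mit": 0.02}, DEFAULT_THRESHOLDS))   # False
```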

The genome-scale search runs on the Rust FM-index seed-and-extend kernel; when that crate has not been built, AlleleForge falls back to a correct pure-Python linear scan, so CI never blocks on the native build.

The scoring substrate (Phase 6, shipping now)

Before any chemistry-specific predictor, AlleleForge establishes the reusable ML substrate: a license-gated model zoo, a swappable embedding backbone, and the calibrated-uncertainty machinery that realizes the honest-uncertainty principle. The whole substrate is pure stdlib in its core path — no numpy or torch — so it runs in CI on a weight-free stub embedder; real 500M-parameter backbones are gated behind the real_weights marker.

flowchart LR
    SEQ["DNA sequence"] --> EMB["SequenceEmbedder<br/>(NT v2 · Caduceus · Evo 2 · Stub)"]
    EMB --> CACHE["embedding cache<br/>(by sequence hash)"]
    EMB --> OOD["OODDetector<br/>distance vs training reference"]
    CACHE --> MODEL["scorer / ensemble"]
    MODEL --> U{"uncertainty"}
    U -->|N=5 default| ENS["deep ensemble<br/>mean ± z·σ (disagreement)"]
    U -->|fallback| EV["evidential<br/>aleatoric + epistemic"]
    U -->|if quantiles| QT["quantile interval"]
    ENS & EV & QT --> CAL["isotonic calibration<br/>(reduces ECE)"]
    OOD --> CAL
    CAL --> PRED["Prediction[float]<br/>value · 80% interval · method ·<br/>in_distribution · calibrated"]

No bare floats. Every scorer returns a Prediction, never a number; ensure_prediction is the runtime guard at the orchestration seam.

No undocumented models. Every checkpoint loads through the model zoo, which refuses a missing card, a license that forbids the use, or an unverifiable hash, and surfaces a ModelCheckpoint into result provenance.
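A runtime guard for the "no bare floats" rule could be as small as a type check at the boundary. The sketch below uses a hypothetical `PredictionSketch` class and `ensure_prediction_sketch` function as stand-ins; the real `Prediction` type and `ensure_prediction` may differ in fields and behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionSketch:
    """Hypothetical minimal stand-in for a prediction with an interval."""
    value: float
    interval: tuple[float, float]

    def __post_init__(self) -> None:
        low, high = self.interval
        if low > high:
            raise ValueError("interval bounds are out of order")


def ensure_prediction_sketch(result: object) -> PredictionSketch:
    """Reject anything that is not a PredictionSketch, including bare floats."""
    if not isinstance(result, PredictionSketch):
        raise TypeError(f"scorer returned {type(result).__name__}, expected a prediction")
    return result


ok = ensure_prediction_sketch(PredictionSketch(0.72, (0.61, 0.83)))
print(ok.value, ok.interval)
```

Placing the check at one seam means individual scorers cannot silently regress to returning numbers.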

Uncertainty method cheat-sheet

Method	Role	Interval
Deep ensemble (N=5)	default	`mean ± z·σ` from member disagreement — widens on OOD
Evidential (NIG)	single-model fallback	splits aleatoric (data) vs epistemic (model) variance
Quantile	when the model emits quantiles	read off the `(1±level)/2` quantiles
Isotonic calibration	post-hoc, all of the above	PAV fit; `expected_calibration_error` quantifies the gain
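The two interval rules in the table are short enough to work through by hand. The sketch below computes the central quantile levels for a given interval level and a `mean ± z·σ` interval from ensemble members. It assumes the sample standard deviation and a normal quantile for `z`; the source does not state which estimator AlleleForge uses, and the function names are illustrative.

```python
from statistics import NormalDist, fmean, stdev


def central_quantiles(level: float) -> tuple[float, float]:
    """Lower and upper quantile levels of a central interval."""
    return (1 - level) / 2, (1 + level) / 2


def ensemble_interval(members: list[float], level: float = 0.80) -> tuple[float, float, float]:
    """Return (mean, low, high) using mean ± z·σ over ensemble members."""
    if len(members) < 2:
        raise ValueError("need at least two members to estimate disagreement")
    _, upper = central_quantiles(level)
    z = NormalDist().inv_cdf(upper)  # about 1.28 for an 80% interval
    mean, sigma = fmean(members), stdev(members)
    return mean, mean - z * sigma, mean + z * sigma


print(central_quantiles(0.80))                            # roughly (0.1, 0.9)
print(ensemble_interval([0.70, 0.72, 0.74, 0.71, 0.73]))  # mean 0.72, half-width about 0.02
```

When members disagree more, σ grows and the interval widens, which is the behavior the table describes for out-of-distribution inputs.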

from alleleforge.scoring import DeepEnsemble, ensemble_prediction, OODDetector, StubEmbedder

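# m1..m5, features, and training_reference are placeholders defined elsewhere (setup not shown here).
ens = DeepEnsemble([m1, m2, m3, m4, m5])                 # five members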
emb = StubEmbedder().embed(["GACCATGCAACCTTGAACGT"])[0]   # NT v2 in production
ood = OODDetector(training_reference)                     # embedding-space density
pred = ensemble_prediction(ens.predict(features), in_distribution=ood.is_in_distribution(emb))
print(pred.value, pred.interval, pred.method, pred.in_distribution)   # honest by construction
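The isotonic calibration step in the cheat-sheet relies on a pool-adjacent-violators (PAV) fit. The sketch below is a textbook PAV implementation for a non-decreasing least-squares fit with equal weights; it illustrates the algorithm only, and how AlleleForge applies it to predictions is not specified here. The function name `pav_fit` is hypothetical.

```python
def pav_fit(y: list[float]) -> list[float]:
    """Isotonic (non-decreasing) least-squares fit of y, in its given order."""
    blocks: list[list[float]] = []  # each block is [sum, count]
    for value in y:
        blocks.append([value, 1.0])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted: list[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


# Observed outcomes ordered by raw score; the out-of-order pair is pooled.
print(pav_fit([0.1, 0.3, 0.2, 0.6]))  # [0.1, 0.25, 0.25, 0.6]
```

In a calibration setting, `y` would hold observed outcomes sorted by the model's raw output, and the fitted values become the calibrated map.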

The first chemistry: SpCas9 nuclease (Phase 7, shipping now)

The most mature chemistry, and the right one to prove the full vertical slice end to end. From a resolved variant, design_cas9 enumerates guides, scores efficiency and outcome with calibrated uncertainty, runs the population-aware off-target engine, and returns ranked candidates.

flowchart LR
    V["ResolvedVariant<br/>+ intent"] --> EN["enumerate_cas9<br/>PAM-anchored · strand-aware ·<br/>cut 3 bp 5' of PAM · actionable window"]
    EN --> EF["efficiency<br/>RS3 baseline / deep ensemble<br/>(80% interval + OOD)"]
    EN --> OUT["outcome<br/>microhomology / MMEJ +<br/>1-bp insertion spectrum"]
    EN --> OT["off-target<br/>(Phase 5 engine,<br/>ancestry-stratified)"]
    EF & OUT & OT --> C["DesignCandidate[]<br/>ranked: efficiency then safety"]
    EN -.precise intent.-> HDR["HDR donor template"]

Defaults & decisions. Primary PAM NGG; NG (SpCas9-NG) and NRN/NYN (SpRY) guides are emitted only when the user opts in and no NGG guide is actionable. The cut site is 3 bp 5' of the PAM. The actionable window is tight around the edit for precise intents (HDR efficiency falls off with cut-to-edit distance) and spans the whole working interval for a knock-out, which marks frameshift outcomes as intended.
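The PAM-anchored, strand-aware enumeration and the cut-site rule can be sketched on a plain string. The code below is a simplified illustration, not `enumerate_cas9`: it handles only NGG, ignores the actionable window, and uses a hypothetical `GuideSketch` record. Cut positions are given in forward-strand coordinates as the 0-based index of the first base to the right of the cut.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass(frozen=True)
class GuideSketch:
    strand: str  # "+" or "-"
    spacer: str  # always written 5'→3' on its own strand
    cut: int     # forward coords: index of the first base right of the cut


def enumerate_ngg(seq: str, length: int = 20) -> list[GuideSketch]:
    """List NGG guides on both strands with their blunt cut positions."""
    guides: list[GuideSketch] = []
    for i in range(len(seq) - length - 3 + 1):
        # Forward strand: protospacer [i, i+length), PAM right after it.
        if seq[i + length + 1:i + length + 3] == "GG":
            guides.append(GuideSketch("+", seq[i:i + length], i + length - 3))
        # Reverse strand: the PAM appears as CCN at [i, i+3) on the forward strand.
        if seq[i:i + 2] == "CC":
            proto = seq[i + 3:i + 3 + length]
            guides.append(GuideSketch("-", proto.translate(_COMPLEMENT)[::-1], i + 6))
    return guides


spacer = "GACCATGCAACCTTGAACGT"
for g in enumerate_ngg("TTTT" + spacer + "AGG" + "TTTT"):
    print(g.strand, g.cut)  # "+ 21", then "- 12"
```

On the forward strand the cut falls 3 bp upstream of the PAM; on the reverse strand the same rule places it 3 bp to the right of the CCN in forward coordinates.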

Axis	Default (CI, weight-free)	Trained alternative (model zoo, `ml` extra)
Efficiency	RS3-style feature baseline + backbone deep ensemble	Rule Set 3; fine-tuned NT v2 ensemble
Outcome	microhomology/MMEJ + 1-bp insertion model	inDelphi (default) · Lindel · X-CRISP + agreement
Off-target	Phase 5 engine (pure-Python fallback)	Phase 5 engine (Rust FM-index)

Every efficiency score carries an 80% interval and an OOD flag; every outcome is a normalized distribution over indel alleles; every candidate carries an ancestry-stratified off-target report — so a ranked menu is honest about what it does and does not know.

Base editing: the bystander problem (Phase 8, shipping now)

Base editors install a single transition (ABE: A·T→G·C; CBE: C·G→T·A) without a double-strand break, within a narrow activity window. The hard part is the window outcome: of the editable bases in the window, which get edited — and what bystanders ride along. AlleleForge enumerates every sgRNA placing the target base in-window per editor, predicts the window-allele distribution, and ranks by the probability of the exact intended allele while minimizing bystander burden.

flowchart LR
    V["ResolvedVariant<br/>(transition SNV)"] --> EL{"editor eligible?<br/>ABE: A·T→G·C<br/>CBE: C·G→T·A"}
    EL --> EN["enumerate_base_edits<br/>target base in window 4–8 ·<br/>strand-aware · bystanders flagged"]
    EN --> WO["window outcome<br/>per-position p(edit) × motif →<br/>2ᵏ allele distribution"]
    WO --> M["p_intended_exact<br/>+ bystander_burden"]
    EN --> OT["off-target<br/>(Phase 5, ancestry-stratified)"]
    M & OT --> C["DesignCandidate[]<br/>ranked: clean-edit then bystander<br/>cleanest = recommended"]
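The window-outcome step in the diagram can be illustrated with a small enumeration. The sketch below assumes each editable position is edited independently with its own probability, which is a simplification, and it defines bystander burden as the expected number of edited non-target positions. Both choices are assumptions made for illustration; the source does not give AlleleForge's exact definitions, and the function names are hypothetical.

```python
from itertools import product
from math import prod


def window_alleles(p_edit: dict[int, float]) -> dict[frozenset[int], float]:
    """Map each set of edited positions to its probability (independence assumed)."""
    positions = sorted(p_edit)
    dist: dict[frozenset[int], float] = {}
    for mask in product((False, True), repeat=len(positions)):
        edited = frozenset(p for p, on in zip(positions, mask) if on)
        dist[edited] = prod(
            p_edit[p] if on else 1 - p_edit[p] for p, on in zip(positions, mask)
        )
    return dist


def summarize(p_edit: dict[int, float], target: int) -> tuple[float, float]:
    """Return (p_intended_exact, expected number of edited bystanders)."""
    dist = window_alleles(p_edit)
    p_exact = dist[frozenset({target})]
    burden = sum(prob * len(edited - {target}) for edited, prob in dist.items())
    return p_exact, burden


# Target at protospacer position 5, one bystander at position 7.
p_exact, burden = summarize({5: 0.6, 7: 0.2}, target=5)
print(round(p_exact, 2), round(burden, 2))  # 0.48 0.2
```

With k editable bases the distribution has 2ᵏ alleles, so exact enumeration stays cheap for the narrow windows involved.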

Declarative editor registry. ABE8e, CBE4max, and evoCDA1 ship as data; adding an editor (deaminase, chemistry, window, PAM, motif preference) is a one-descriptor change, not code.

Editor	Deaminase	Edit	Window	Motif preference
ABE8e	TadA-8e	A→G	4–8	none (broad)
CBE4max	APOBEC1	C→T	4–8	TC (prefers 5′ T)
evoCDA1	evoCDA1	C→T	2–10	none (broad window)
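A declarative registry like the table above maps naturally onto frozen dataclasses. The sketch below encodes the three rows as data and filters them by chemistry and window. The `EditorSketch` type, its field names, and the `eligible` helper are hypothetical, and protospacer position 1 is assumed to be the PAM-distal end.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EditorSketch:
    name: str
    deaminase: str
    edit: tuple[str, str]        # (from_base, to_base) on the protospacer strand
    window: tuple[int, int]      # inclusive protospacer positions, 1 = PAM-distal (assumed)
    motif: Optional[str] = None  # preferred sequence context, if any


EDITORS = {
    e.name: e
    for e in (
        EditorSketch("ABE8e", "TadA-8e", ("A", "G"), (4, 8)),
        EditorSketch("CBE4max", "APOBEC1", ("C", "T"), (4, 8), motif="TC"),
        EditorSketch("evoCDA1", "evoCDA1", ("C", "T"), (2, 10)),
    )
}


def eligible(from_base: str, to_base: str, position: int) -> list[str]:
    """Editors whose chemistry and window cover an edit at a protospacer position."""
    return [
        e.name
        for e in EDITORS.values()
        if e.edit == (from_base, to_base) and e.window[0] <= position <= e.window[1]
    ]


print(eligible("C", "T", 9))  # ['evoCDA1']
print(eligible("A", "G", 6))  # ['ABE8e']
```

Adding an editor in this style means adding one more record to the tuple, with no change to the filtering logic.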

Every candidate carries the tradeoff explicitly — bystander-present:N / clean, a bystander-burden score, the full window-allele distribution, and an ancestry-stratified off-target report — so the recommendation is the cleanest editor/guide combination, not just the first one found.

Defaults cheat-sheet

Every default is overridable; these are the spec-mandated starting points.

Topic	Default	Notes
Reference / coordinates	hg38, 0-based half-open	T2T-CHM13 auto-recommended for ambiguous loci; mm39 for mouse
Strand	always explicit	no implicit "default strand"; spacers stored 5'→3'
SpCas9 PAM	`NGG` (primary), `NAG` low-stringency	NG / SpRY opt-in when no NGG is actionable
Off-target search	≤ 4 mismatches, ≤ 1 DNA + ≤ 1 RNA bulge	report CFD ≥ 0.20 or MIT ≥ 0.10
Population inclusion	MAF ≥ 0.001, all populations	de-novo PAM & seed-mismatch changes always evaluated
Base-editing window	protospacer positions 4–8	ABE8e (A→G), CBE4max / evoCDA1 (C→T); bystanders always reported
Prime editing	PE5max + epegRNA (tevopreQ1)	PBS 8–17 nt, RTT 7–34 nt; PE3b nicking guide when seed-disrupting
Uncertainty	80% predictive interval	deep ensemble (N=5) + isotonic calibration
Seed	`20240501`	threaded through every stochastic step, recorded in provenance

Project layout

alleleforge/
├── pyproject.toml            # hatchling build, deps, ruff/mypy/pytest config
├── SPEC.md                   # the authoritative, phase-by-phase build contract
├── rust/                     # PyO3 crate: aforge_native (BWT, k-mer, haplotype)
├── src/alleleforge/
│   ├── config.py             # typed Settings (pydantic-settings), defaults, paths
│   ├── _native.py            # optional Rust bridge
│   ├── types/                # Phase 1: core domain vocabulary
│   ├── genome/               # Phase 2: reference access, FM-index, liftover
│   ├── data/                 # Phase 3: registry, ClinVar, gnomAD, 1000G/HGDP, dbSNP, annotations
│   ├── variant/              # Phase 4: resolver, HGVS adapter, consequence
│   ├── offtarget/            # Phase 5: population/haplotype-aware off-target
│   ├── model_zoo/            # Phase 6: license-gated model cards + checkpoints
│   ├── scoring/              # Phase 6: embeddings, uncertainty, Scorer (this release)
│   ├── enumerate/            # Phases 7–8: SpCas9 guide + base-editor window enumeration
│   ├── design/               # Phases 7–8: SpCas9 + base-editor verticals (designer: Phase 10)
│   ├── report/ cli/ web/                   # Phases 11–13 (interfaces)
│   └── ...
├── tests/                    # mirrors src/; pytest + hypothesis
├── benchmark/                # CRISPR-Bench (Phase 14)
└── docs/                     # mkdocs-material site

Development

pip install -e ".[dev]"
ruff check src tests           # lint + import order + docstrings
ruff format --check src tests  # formatting
mypy src                       # strict type-check
pytest                         # tests + ≥85% coverage gate on core
cd rust && cargo test && maturin develop   # native crate

CI (GitHub Actions) runs lint, type-check, tests (Python 3.11 + 3.12 on Linux & macOS), the Rust build, and a docs build on every push and PR. See .github/workflows/ci.yml.

Contributions are welcome — please read CONTRIBUTING.md and the Contributor Covenant 2.1 code of conduct.

Scope & responsible use

Research use only. AlleleForge produces hypotheses and rankings, not medical advice or clinical decisions. Every generated report repeats this.
Off-target predictions require experimental validation. Computational nomination narrows the search; it does not replace GUIDE-seq / CHANGE-seq / amplicon confirmation.
No telemetry, no phone-home. All computation runs locally or on user-controlled infrastructure. User sequences are never transmitted externally.
Honest uncertainty over false confidence. Where models are out of distribution (e.g., prime-editing efficiency outside PRIDICT's HEK293T / K562 training context), AlleleForge flags it rather than hiding it.
Dual-use awareness. This is a design and safety-analysis tool for legitimate therapeutic and basic research. It contains no wet-lab protocols or synthesis instructions.

License

AlleleForge is released under the MIT License — all code, schemas, benchmark, and any first-party model weights. It is fully open source and free to use, modify, and redistribute.

Each wrapped third-party tool or model retains its own upstream license, recorded in its model/tool card; the registry refuses to bundle any component whose license is incompatible with redistribution and fetches it at runtime with the user's consent instead.

Citation

If you use AlleleForge, please cite it via CITATION.cff. A Zenodo DOI is minted on the first tagged release.
