Model Registry — AGENTS.md

Read the root AGENTS.md for project-wide rules. This file covers only conventions specific to transformer_lens/tools/model_registry/.

TL;DR

Use verify_models, not main_benchmark. By default, only verify_models writes data/supported_models.json. main_benchmark runs the same computations but does not write the registry unless given --update-registry, and even then it omits the Phase 7/8 scores and the resume checkpoint. If you ran main_benchmark, treat the registry as stale.

update_model_status() in registry_io.py is the only mutator of status/phase/note on existing entries. Never set these fields by hand.
Adding a new model-ID entry is allowed (and required before verify_models --model <repo> can find it). See Adding a new model entry.
Never run verify_models in parallel: a single CUDA/MPS device will OOM (AGENTS.md §10).
An HF token is required for parameter estimation on gated models. Load it from .env first: set -a; source .env; set +a.
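When a run is launched from a Python wrapper rather than a shell, the .env step can be approximated as follows. This is a minimal sketch, assuming .env contains only plain KEY=VALUE lines (no shell expansion, no multi-line values); `load_env_file` is a hypothetical helper and not part of the registry tooling.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines and export them into os.environ.

    Hypothetical helper: handles plain assignments, an optional
    `export ` prefix, and matching surrounding quotes only.
    """
    loaded: dict[str, str] = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        key = key.strip().removeprefix("export ").strip()
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]  # drop matching quotes
        os.environ[key] = value
        loaded[key] = value
    return loaded
```

Calling `load_env_file()` and then checking `"HF_TOKEN" in os.environ` before starting verification catches a missing token early instead of partway through a batch.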

Canonical invocations

Goal	Command
Verify one specific model + update registry	`uv run python -m transformer_lens.tools.model_registry.verify_models --model <hf_repo>`
Verify N models per architecture family	`uv run python -m transformer_lens.tools.model_registry.verify_models --architectures <HFClassName> --per-arch <n>`
Verify N models across all architectures	`uv run python -m transformer_lens.tools.model_registry.verify_models --per-arch <n>`
Resume after Ctrl-C / crash	Re-run the same command with `--resume` (reads `data/verification_checkpoint.json`)
Re-verify already-verified models for an arch	`--reverify --architectures <HFClassName>`
See what would run without doing it	Add `--dry-run`
Restrict to specific phases	`--phases 1 2 3`
Override device / dtype / memory cap	`--device cuda --dtype float32 --max-memory 16`
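For scripted use, the invocations above can be assembled programmatically. The sketch below only builds the argument list from flags documented in this file; `build_verify_command` and `run_verify` are hypothetical wrappers, and nothing here is confirmed as an API of the tool itself.

```python
from __future__ import annotations

import subprocess

MODULE = "transformer_lens.tools.model_registry.verify_models"


def build_verify_command(
    model: str | None = None,
    architectures: list[str] | None = None,
    per_arch: int | None = None,
    resume: bool = False,
    dry_run: bool = False,
) -> list[str]:
    """Assemble the argv list for one verify_models invocation."""
    cmd = ["uv", "run", "python", "-m", MODULE]
    if model:
        cmd += ["--model", model]
    if architectures:
        cmd += ["--architectures", *architectures]
    if per_arch is not None:
        cmd += ["--per-arch", str(per_arch)]
    if resume:
        cmd.append("--resume")
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run_verify(**kwargs) -> int:
    """Run one verification in the foreground and return its exit code."""
    return subprocess.run(build_verify_command(**kwargs), check=False).returncode
```

Because `run_verify` blocks until the process exits, calling it in a plain loop keeps runs serial, which matches the no-parallel rule.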

HFClassName matches the strings in HF_SUPPORTED_ARCHITECTURES (see __init__.py) — e.g. LlamaForCausalLM, GPT2LMHeadModel, Olmo2ForCausalLM.

Flag reference

Flag	Meaning
`--model <repo>`	Verify a single HF repo (must already exist as an entry in `supported_models.json`)
`--architectures <ClassName...>`	Restrict to one or more HF architecture classes
`--per-arch <n>`	Verify the top-N unverified models per architecture (default 10)
`--limit <n>`	Cap total models verified across all architectures
`--device <cpu|cuda|mps>`	Override automatic device selection
`--dtype <float32|bfloat16>`	Override automatic dtype selection
`--max-memory <gb>`	Skip models whose parameter-count estimate exceeds this GB cap (default: tries every model that fits available device memory). Use this to avoid OOM on a small device — e.g. `--max-memory 16` on a 24 GB GPU leaves head-room for activations.
`--phases <n...>`	Restrict to specific phases (default `1 2 3 4`; Phase 7/8 are auto-skipped for non-applicable architectures)
`--resume`	Read `data/verification_checkpoint.json` and skip models already tested in the in-flight run
`--reverify`	Re-test already-verified models (default skips status=1 entries)
`--retry-failed`	Re-test status=3 (failed) entries
`--dry-run`	Print what would be tested without running
`--no-hf-reference` / `--no-ht-reference`	Skip the HF / HT comparison passes (faster, lower confidence)
`--quiet`	Suppress per-model logging
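To pick a sensible --max-memory value, a back-of-envelope weight estimate helps. This file does not say how verify_models turns a parameter count into GB, so the sketch below assumes the common rule of parameters times bytes per element (4 for float32, 2 for bfloat16) with 1 GB = 10^9 bytes; both function names are hypothetical.

```python
# Assumed element sizes; the tool's own estimate may differ.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def estimate_weight_gb(n_params: int, dtype: str = "float32") -> float:
    """Approximate GB needed for the weights alone (no activations)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_memory_cap(n_params: int, dtype: str, max_memory_gb: float) -> bool:
    """True if the weight estimate is within the given cap."""
    return estimate_weight_gb(n_params, dtype) <= max_memory_gb
```

Under these assumptions a 7-billion-parameter model needs roughly 28 GB of weights in float32 and 14 GB in bfloat16, so with `--max-memory 16` it would be skipped in float32 but attempted in bfloat16.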

File roles

File	Role
`verify_models.py`	Canonical CLI for batch verification + registry updates
`registry_io.py`	I/O for `supported_models.json`; `update_model_status()` is the only writer
`verification.py`	`VerificationRecord` / `VerificationHistory` dataclasses (audit-trail schema)
`validate.py`	JSON-schema validation for registry files
`api.py`	Read-only programmatic access (`is_model_supported`, `get_architecture_models`, …)
`schemas.py`	Dataclasses for model entries, scan info, architecture stats
`exceptions.py`	Custom exception types
`alias_drift.py`	Detects when legacy `MODEL_ALIASES` and the registry have diverged
`discover_architectures.py`	Lightweight HF scan to enumerate architecture classes
`hf_scraper.py`	Full HF Hub scan; builds initial supported/unsupported model lists
`relevancy.py`	Filters models by download count, foundation-org provenance
`generate_report.py`	Renders human-readable status summaries; holds `ARCHITECTURE_DESCRIPTIONS`

__init__.py exports the canonical HF_SUPPORTED_ARCHITECTURES set and CANONICAL_AUTHORS_BY_ARCH map; agents adding a new HF architecture must update both.

Adding a new model entry

To verify a model not yet in data/supported_models.json, hand-add the entry first. This is the only allowed hand-edit:

{
  "architecture_id": "MyArchForCausalLM",
  "model_id": "org/repo-name",
  "status": 0,
  "verified_date": null, "metadata": null, "note": null,
  "phase1_score": null, "phase2_score": null, "phase3_score": null,
  "phase4_score": null, "phase7_score": null, "phase8_score": null
}

verify_models --model org/repo-name then populates status/score/note via update_model_status(). Never set those fields manually.
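To avoid typos when hand-adding, the blank entry can be generated and pasted in. A minimal sketch: `new_registry_entry` is a hypothetical helper, and because the surrounding layout of supported_models.json is not shown here, it only prints the entry rather than writing the file.

```python
import json


def new_registry_entry(architecture_id: str, model_id: str) -> dict:
    """Build an unverified entry: status 0 and every result field null."""
    entry = {
        "architecture_id": architecture_id,
        "model_id": model_id,
        "status": 0,
        "verified_date": None,
        "metadata": None,
        "note": None,
    }
    # Score fields stay null until verify_models fills them in.
    for phase in (1, 2, 3, 4, 7, 8):
        entry[f"phase{phase}_score"] = None
    return entry


print(json.dumps(new_registry_entry("MyArchForCausalLM", "org/repo-name"), indent=2))
```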

`data/verification_checkpoint.json` (gitignored)

Resume state for long-running runs (tested/verified/failed/skipped IDs + timestamp):

Ctrl-C → SIGINT handler finishes current model, persists checkpoint, exits cleanly.
--resume reads it, skips already-tested models.
Deleted on successful full run; missing/corrupt → fresh run (safe).

Never edit manually.
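The lifecycle above (finish the current model on Ctrl-C, persist, skip tested models on resume, delete on success, fall back to a fresh run on a bad file) can be sketched as follows. This is an illustration of the pattern, not the code in verify_models.py: the key names, `run_batch`, and the `verify_one` callback are all assumptions.

```python
import json
import signal
import threading
import time
from pathlib import Path

# Assumed layout: the real checkpoint's key names are not shown in this file.
STATE_KEYS = ("tested", "verified", "failed", "skipped")
CHECKPOINT = Path("data/verification_checkpoint.json")

stop_requested = threading.Event()


def fresh_state() -> dict:
    return {key: [] for key in STATE_KEYS}


def load_checkpoint(path: Path = CHECKPOINT) -> dict:
    """Return the saved state, or a fresh one if the file is missing or corrupt."""
    try:
        state = json.loads(path.read_text())
    except (OSError, ValueError):  # missing file, bad JSON, bad encoding
        return fresh_state()
    if not isinstance(state, dict) or any(
        not isinstance(state.get(key), list) for key in STATE_KEYS
    ):
        return fresh_state()
    return state


def save_checkpoint(state: dict, path: Path = CHECKPOINT) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({**state, "timestamp": time.time()}))


def install_sigint_handler() -> None:
    """Turn Ctrl-C into a stop request instead of an immediate KeyboardInterrupt."""
    signal.signal(signal.SIGINT, lambda signum, frame: stop_requested.set())


def run_batch(model_ids, verify_one, path: Path = CHECKPOINT, resume: bool = False) -> dict:
    state = load_checkpoint(path) if resume else fresh_state()
    for model_id in model_ids:
        if model_id in state["tested"]:
            continue  # already handled by the interrupted run
        passed = verify_one(model_id)  # the current model always runs to completion
        state["tested"].append(model_id)
        state["verified" if passed else "failed"].append(model_id)
        if stop_requested.is_set():
            save_checkpoint(state, path)
            return state
    path.unlink(missing_ok=True)  # full run finished: checkpoint no longer needed
    return state
```

Checking the stop flag only between models is what lets the in-flight model finish before the checkpoint is written.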

Phase reference

verify_models runs the model through phases and writes per-phase scores back into the registry entry. Phases (some don't apply to every architecture — see applicable_phases on the adapter):

Phase	Checks
1	Core forward correctness vs HuggingFace logits
2	Hook firing + gradient flow
3	Weight processing (compatibility mode, fold/centre)
4	Text-generation quality
7	Multimodal (vision/text alignment) — only Llava / Gemma3-multimodal
8	Audio — only Hubert

Phase-score thresholds

verify_models enforces hard pass/fail at the thresholds in _MIN_PHASE_SCORES (verify_models.py:508). For every phase except Phase 4, a score below threshold OR a required-test failure → STATUS_FAILED. The contract:

Phase	Min score	Required tests	Effect when below threshold or required tests fail
1	100%	—	`STATUS_FAILED`
2	75%	`logits_equivalence`, `loss_equivalence`	`STATUS_FAILED`
3	75%	`logits_equivalence`, `loss_equivalence`	`STATUS_FAILED`
4	50%	—	Non-gating. Below 50% adds `"low text quality"` to the registry `note`; never causes `STATUS_FAILED`.
7	75%	`multimodal_forward`	`STATUS_FAILED`. NULL score (processor unavailable) also fails.
8	75%	`audio_forward`	`STATUS_FAILED`. NULL score also fails.

Phase 4 is intentionally lenient — source (verify_models.py:554) calls it "a quality metric, not a correctness check." The 50% bar asks "is the text coherent at all?" not "is this adapter clean?"
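The contract in the table can be restated as a small decision function. This is a sketch of the rules as written here, not the logic in verify_models.py: `PHASE_RULES` and `decide_status` are hypothetical names, scores are assumed to be percentages, the note strings are simplified, and treating a NULL score as a failure for Phases 1 to 3 is an assumption (the table states it only for Phases 7 and 8).

```python
from __future__ import annotations

# Status values as used in this file: 1 = verified, 3 = failed.
STATUS_VERIFIED, STATUS_FAILED = 1, 3

# phase -> (minimum score in percent, required tests, gating?)
PHASE_RULES = {
    1: (100.0, (), True),
    2: (75.0, ("logits_equivalence", "loss_equivalence"), True),
    3: (75.0, ("logits_equivalence", "loss_equivalence"), True),
    4: (50.0, (), False),  # quality metric only, never fails the model
    7: (75.0, ("multimodal_forward",), True),
    8: (75.0, ("audio_forward",), True),
}


def decide_status(
    scores: dict[int, float | None],
    failed_tests: dict[int, set[str]] | None = None,
) -> tuple[int, str]:
    """Return (status, note) for the phases that were actually run."""
    failed_tests = failed_tests or {}
    flags = []
    for phase, score in scores.items():
        minimum, required, gating = PHASE_RULES[phase]
        if not gating:
            if score is not None and score < minimum:
                flags.append("low text quality")
            continue
        broken = failed_tests.get(phase, set()) & set(required)
        if score is None or score < minimum or broken:
            return STATUS_FAILED, f"Phase {phase} below gate"
    return STATUS_VERIFIED, "; ".join(flags) or "Full verification completed"
```

For example, `decide_status({1: 100, 2: 80, 3: 90, 4: 40})` returns `(1, "low text quality")`: the low Phase 4 score is flagged in the note but the model is still marked verified.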

For adapter authors: a STATUS_VERIFIED entry with P4 well below 100% on a small parity-test model can still indicate a real bug the system doesn't gate on (e.g. missing preprocess_weights fold). Investigate manually even when VERIFIED.

Reading the result:

status==1 + note="Full verification completed" → all gates passed, no quality flag. Good.
status==1 + note mentions "low text quality" → P4 < 50%; investigate.
status==1 + P4 < 100% on a small model, no quality flag → potential weight-fold/tokenizer bug; investigate.
status==3 (FAILED) → note carries the failure reason; debug from there.
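The reading rules above can be applied mechanically to a registry entry. A minimal sketch, assuming phase4_score is stored as a percentage and that the entry is a dict with the fields shown in Adding a new model entry; `triage` is a hypothetical helper, and it cannot tell whether a model counts as "small", so the P4 < 100% case is flagged for every model.

```python
def triage(entry: dict) -> str:
    """Map one registry entry to the follow-up action described above."""
    status = entry.get("status")
    note = entry.get("note") or ""
    p4 = entry.get("phase4_score")
    if status == 3:
        return f"failed: {note}"
    if status != 1:
        return "not verified yet"
    if "low text quality" in note:
        return "verified, P4 below 50%: investigate"
    if p4 is not None and p4 < 100:
        return "verified, P4 below 100%: check weight folds and tokenizer if the model is small"
    return "verified, all gates passed"
```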

For P1/P3 failures, see supported_architectures/AGENTS.md §When to override preprocess_weights and debugging_numerical_divergence.md. For P4 drift, see §Tokenizer policy (logit-scale / embedding-scale folds typically degrade P4 without crossing the 50% gate).

Hard "don'ts"

No main_benchmark for registry updates — misses P7/P8, no checkpoint, no registry write without --update-registry.
No parallel verify_models — device OOM (AGENTS.md §10).
No manual edits to existing entries' status/verified_date/note/phaseN_score — only update_model_status() writes those. (New entries OK — see Adding a new model entry.)
No deleting data/verification_checkpoint.json mid-run: stop with Ctrl-C and let the SIGINT handler persist it; a successful full run removes it.
No skipping .env — gated-model verification needs HF_TOKEN.
