Read the root AGENTS.md for project-wide rules. This file covers conventions specific to tests/.
Just running tests? Run
make test-prfor the standard PR-review surface (unit + docstring + acceptance + integration).make unit-testfor fast feedback. Full tier breakdown in Test tiers below.
- Tier placement matters — a unit test loading a model belongs in
integration/. See tier table. - No mocking model loads or HF Hub. Session-scoped fixtures amortize the cost.
- Cached models for fast tests (
gpt2,attn-only-{1,2,3,4}l,tiny-stories-1M). Anything else →@pytest.mark.slow. - MPS is a carve-out —
TRANSFORMERLENS_ALLOW_MPS=1required; onlytests/mps/runs there. - AGENTS.md §10 applies: no
xfail/skipifto dodge CI; no platform skips outside MPS. - Check QUARANTINES.md before debugging a failing test — known quarantines have documented reasons.
| Tier | Path | Run | Loads models? | Hits HF Hub? | Scope | Example |
|---|---|---|---|---|---|---|
unit |
tests/unit/ |
make unit-test |
None / synthetic (rare exceptions) | No | Function or single module | tests/unit/test_key_value_cache_entry.py |
integration |
tests/integration/ |
make integration-test |
1–2 cached models, module-scoped | Yes | Cross-component | tests/integration/test_generation_compatibility.py |
acceptance |
tests/acceptance/ |
make acceptance-test |
Full models (gpt2, bloom-560m), session-scoped |
Yes | End-to-end behaviour | tests/acceptance/conftest.py |
benchmarks |
tests/benchmarks/ |
make benchmark-test |
Varies; performance focus | Yes | Throughput / memory | tests/benchmarks/test_boot_memory.py |
mps |
tests/mps/ |
pytest tests/mps -v (needs TRANSFORMERLENS_ALLOW_MPS=1) |
TinyStories-1M, fp32 only | Yes | macOS-MPS smoke only | tests/mps/test_mps_basic.py |
Common combinations: make test-pr (unit + docstring + acceptance + integration — the PR-review surface), make test (everything including benchmarks + notebooks).
Rule of thumb: new tests that load a model should land in integration/ by default. The unit/ tier has a few legitimate model-loading exceptions (e.g. test_bridge_vs_hooked_transformer_*.py compares numerics across architectures, which is conceptually unit-scoped) — match that pattern only when the test really is testing isolated behaviour that happens to need a model.
tests/conftest.py — root, provides:
cleanup_memory(function autouse),cleanup_class_memory— CUDA/MPS cache + GC_enable_hf_retry_for_tests(session autouse) — wraps HFfrom_pretrainedwith 429 retry- Seeded RNG (numpy/torch/Python @ 42)
gpt2_tokenizer(session)gpt2_hooked_processed(session)temp_dir
Sub-folder conftests:
| Path | Provides |
|---|---|
tests/acceptance/conftest.py |
gpt2_model, bloom_560m_hooked, bloom_560m_hf_model, bloom_560m_hf_tokenizer (all session) |
tests/acceptance/model_bridge/conftest.py |
Bridge variants of gpt2 with/without compat mode |
tests/integration/model_bridge/conftest.py |
distilgpt2 + gpt2 Bridge variants × {compat, no-compat, no-processing} |
Two cross-cutting rules:
- All
transformer_lensimports inside conftest fixtures live in fixture bodies, not at module top — jaxtyping'spytest_configurehook must install before the package is first imported. - Session-scoped model fixtures (
gpt2_hooked_processed,gpt2_bridge, …) are read-only — mutating them leaks across the entire test session.
CI cache (checks.yml) covers: gpt2, gpt2-xl, distilgpt2, pythia-70m, gpt-neo-125M, gemma-2-2b-it, bloom-560m, Qwen2-0.5B, bert-base-cased, NeelNanda/Attn_Only*, roneneldan/TinyStories-1M*, NeelNanda/SoLU*, redwood_attn_2l, tiny-random-llama-2, DialoGPT-medium.
Prefer attn-only-{1,2,3,4}l and tiny-stories-1M for fast tests — gpt2 is slow on CI's CPU runners. Use gpt2 only when you need GPT-2 numerics. Anything outside the cached set → @pytest.mark.slow.
pyproject.toml: "slow: marks tests as slow (deselect with '-m \"not slow\"')". Add when the test:
- loads a non-cached model
- iterates exhaustively over many param combos
- takes >5 s per invocation
Deselect with pytest -m "not slow". Default make targets do NOT filter; the marker is for ad-hoc runs.
mps-checkssetsTRANSFORMERLENS_ALLOW_MPS=1and runstests/unit,tests/integration,tests/mpsonmacos-latest.get_device()returns"cpu"unlessTRANSFORMERLENS_ALLOW_MPS=1— protects against silent MPS divergence.- The workflow's long
--ignore=list documents existing MPS divergence (MoE, optimizer compat, KV-cache layout); it's not a license to add new skips. tests/mps/test_mps_basic.pyis the template: float32 only (no bfloat16 on MPS), TinyStories-1M only (50 MB fits the runner),torch.mps.empty_cache()+gc.collect()between tests.- MPS-only modules need:
pytestmark = pytest.mark.skipif(not torch.backends.mps.is_available(), reason="MPS not available").
Plus AGENTS.md §10:
- No mocking model loads — session-scoped fixtures are cheap enough.
- No mocking the HF Hub — tests hit the real hub with
enable_hf_retry()handling 429s. - No platform
skipifoutside MPS — noskipif(sys.platform == 'win32')orskipif(not torch.cuda.is_available())to dodge CI. - No
xfailto dodge a failing test — fix the bug, even if pre-existing. - No copying acceptance-tier tests as unit-test templates — their model fixtures time out / OOM at the unit tier.