
Commit c7704b1

dacorvo and claude committed
docs: document test fixture system and add tests/AGENTS.md
Add comprehensive documentation for the export_models fixture system:

- Module-level docstring covering caching strategy, invalidation rules, fixture usage patterns, and CLI usage
- Inline docstrings for key functions
- New tests/AGENTS.md with agent-facing guidance on test infrastructure, pre-export workflow, expected test durations, and compiler cache policy
- Reference tests/AGENTS.md from root AGENTS.md context loading section
- Fix stale path reference to export_models.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent b7a8216 commit c7704b1

File tree

3 files changed: +225 -1 lines changed


AGENTS.md

Lines changed: 2 additions & 1 deletion
@@ -14,6 +14,7 @@ directories, but must be read manually when working from the project root:
 - `optimum/neuron/models/inference/backend/modules/attention/AGENTS.md` — attention or NKI kernel work
 - `optimum/neuron/models/inference/<model>/AGENTS.md` — model-specific work (gemma3, llama, qwen3, etc.)
 - `optimum/neuron/vllm/AGENTS.md` — vLLM integration work
+- `tests/AGENTS.md` — test infrastructure, fixtures, and cache management

 When adding a new model, create a `CLAUDE.md` containing `@AGENTS.md` in its directory
 so this auto-loading applies to it automatically.

@@ -115,7 +116,7 @@ All test workflows follow the same pattern:
 - Static shapes: runtime input shapes must match compiled shapes.
 - Export and load in separate processes to avoid device conflicts.
 - Neuron runtime does not release devices reliably within the same process.
-- Decoder graph changes require cache prune when using the fixtures defined under `tests/fixtures/export_models.py`: `python tools/prune_test_models.py`.
+- Decoder graph changes require cache prune when using the fixtures defined under `tests/fixtures/llm/export_models.py`: `python tools/prune_test_models.py`.

 ## Environment Variables
tests/AGENTS.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Test Infrastructure Agent Guide

Read this before working on any test code under `tests/`.

## Directory Structure

```
tests/
  conftest.py                # Root conftest — registers fixture plugins via pytest_plugins
  fixtures/llm/
    export_models.py         # Model export fixtures + CLI (see detailed docs in the module docstring)
    vllm_service.py          # vLLM OpenAI-compatible service launcher fixture
    vllm_docker_service.py   # vLLM Docker service launcher fixture
  decoder/                   # LLM decoder tests (generation, export, pipelines, modules, etc.)
    conftest.py              # Reorders @subprocess_test tests to run before session fixtures
  vllm/                      # vLLM integration tests
    engine/                  # Direct engine API tests
    service/                 # OpenAI-compatible serving tests
    docker/                  # Docker container tests
```
## Fixture Registration

Fixtures in `tests/fixtures/llm/` are registered as pytest plugins in `tests/conftest.py`:

```python
pytest_plugins = [
    "fixtures.llm.vllm_docker_service",
    "fixtures.llm.vllm_service",
    "fixtures.llm.export_models",
]
```

This makes `neuron_llm_config`, `any_generate_model`, and `speculation` available to all tests.
## The Export Models Fixture System

Full documentation is in the `tests/fixtures/llm/export_models.py` module docstring. Key points:

### Model Configurations

Each model is compiled for two shapes: `(batch_size=4, seq_len=1024)` and `(batch_size=1, seq_len=8192)`.
Config names follow the pattern `<model>-<batch>x<seq>`, e.g. `llama-4x1024`.
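As a minimal sketch of the naming scheme, configurations can be derived by crossing model ids with shapes. The model id and shape list below are illustrative placeholders, not the actual `GENERATE_LLM_MODEL_IDS` or shape tables from `export_models.py`:

```python
# Hypothetical subset of model ids; the real list lives in export_models.py.
MODEL_IDS = {"llama": "unsloth/Llama-3.2-1B-Instruct"}
SHAPES = [(4, 1024), (1, 8192)]  # (batch_size, sequence_length)

# Each (model, shape) pair becomes one named configuration.
configurations = {
    f"{name}-{batch}x{seq}": {
        "model_id": model_id,
        "batch_size": batch,
        "sequence_length": seq,
    }
    for name, model_id in MODEL_IDS.items()
    for batch, seq in SHAPES
}
```

With these placeholder inputs, the resulting config names are `llama-4x1024` and `llama-1x8192`, matching the pattern above.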
### Using Fixtures in Tests

**`neuron_llm_config`** — for tests that need a specific model config:

```python
@pytest.mark.parametrize("neuron_llm_config", ["llama-4x1024"], indirect=True)
def test_something(neuron_llm_config: dict[str, Any]):
    model = NeuronModelForCausalLM.from_pretrained(neuron_llm_config["neuron_model_path"])
```

**`any_generate_model`** — for tests that should run across all generation models:

```python
def test_greedy_expectations(any_generate_model):
    model_path = any_generate_model["neuron_model_path"]
    config_name = any_generate_model["name"]  # e.g. "llama-4x1024"
```

Both yield a dict with keys: `name`, `model_id`, `task`, `export_kwargs`, `neuron_model_id`, `neuron_model_path`.
### Hub Caching

Compiled models are pushed to private HF Hub repos. The repo name encodes all invalidation keys:
`<org>/optimum-neuron-testing-<version>-<sdk>-<instance>-<code_hash>-<config_name>`.

The code hash changes when `pyproject.toml` or anything under `optimum/neuron/models/inference/` changes.
Old repos must be pruned manually: `python tools/prune_test_models.py`.
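A minimal sketch of how such a repo name composes its invalidation keys. All concrete values below (org, versions, hash) are illustrative placeholders, not values read from the real helpers:

```python
# Illustrative composition of the cache repo name; the real code builds this
# in _get_hub_neuron_model_prefix(). Every value here is an example.
org = "optimum-internal-testing"   # hypothetical org
version = "0.4.2"                  # optimum-neuron package version (example)
sdk_version = "2.21.0"             # Neuron SDK version (example)
instance_type = "inf2.8xlarge"
code_hash = "a1b2c3d4"             # truncated content hash (example)
config_name = "llama-4x1024"

repo_id = (
    f"{org}/optimum-neuron-testing-"
    f"{version}-{sdk_version}-{instance_type}-{code_hash}-{config_name}"
)
# Changing any single component (version, SDK, instance, code hash, config)
# yields a different repo id, which is exactly the invalidation rule.
```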
## Always Pre-Export Models Before Running Tests

**CI always runs `python tests/fixtures/llm/export_models.py` as a separate step before any pytest invocation** (see `.github/workflows/test_inf2_llm.yml` and `test_inf2_vllm.yml`). You must do the same locally:

```bash
# Export all models (or use a pattern like 'llama*')
python tests/fixtures/llm/export_models.py

# Then run tests
pytest -sv tests/decoder/test_decoder_generation.py
```

If you skip the pre-export step, fixtures will auto-export on first use. This causes:

- **Long hangs**: compilation takes 10-30+ minutes per model, making it hard to tell if a test is stuck or just compiling.
- **NeuronCore conflicts**: the compilation process may conflict with subprocess-isolated tests that also need device access.
### Expected Test Durations

Based on CI logs (inf2.8xlarge, models pre-exported), entire test groups complete within:

| Test group | Duration |
|---|---|
| LLM utils / hub / CLI / embedding | < 1 min each |
| LLM export tests | ~4 min |
| LLM generation tests | ~7 min |
| LLM pipeline tests | ~5 min |
| LLM module tests (NKI kernels) | ~19 min |
| LLM cache tests | ~5 min |
| vLLM engine generation | ~20 min |
| vLLM service tests | ~15 min |

An individual test should complete within **2 minutes**. If a test hangs longer than that, the most likely cause is a missing pre-export triggering compilation inside the fixture. Pre-export, then re-run.
## Never Wipe the Neuron Compiler Cache

The Neuron compiler cache is **content-addressed**: each compiled NEFF is keyed by the SHA hash of the HLO graph that produced it. The hash space is large enough to make collisions practically impossible.

**There is no such thing as a "stale compiler cache entry."** If the HLO graph changes (because you changed model code), the hash changes, and a new entry is created. The old entry is simply never matched again — it does no harm.

Wiping the compiler cache (e.g. `rm -rf /var/tmp/neuron-compile-cache`) only forces expensive recompilation with zero benefit. **Never suggest or perform cache deletion as a debugging step.**
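The content-addressed lookup can be sketched as a toy in-memory cache. This is an illustration of the principle, not the actual Neuron compiler cache implementation; the class and key derivation are assumptions:

```python
import hashlib

class ContentAddressedCache:
    """Toy content-addressed cache: artifacts keyed by a hash of their input."""

    def __init__(self):
        self._store = {}  # key -> compiled artifact

    def _key(self, hlo_graph: bytes) -> str:
        # Any change to the input graph yields a different key,
        # so an entry can never be "stale", only unmatched.
        return hashlib.sha256(hlo_graph).hexdigest()

    def get_or_compile(self, hlo_graph: bytes, compile_fn):
        key = self._key(hlo_graph)
        if key not in self._store:
            self._store[key] = compile_fn(hlo_graph)  # cache miss: compile once
        return self._store[key]

cache = ContentAddressedCache()
calls = []
compile_fn = lambda g: calls.append(g) or f"neff-for-{hashlib.sha256(g).hexdigest()[:8]}"
a = cache.get_or_compile(b"graph-v1", compile_fn)
b = cache.get_or_compile(b"graph-v1", compile_fn)  # hit: no recompilation
c = cache.get_or_compile(b"graph-v2", compile_fn)  # changed graph: new entry, old one untouched
```

Deleting entries buys nothing here: an old key is simply never looked up again once its graph changes.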
## Subprocess Test Ordering

`tests/decoder/conftest.py` contains a `pytest_collection_modifyitems` hook that moves `@subprocess_test`-decorated tests to run **before** all other tests. This prevents session-scoped fixtures (which load models onto NeuronCores) from blocking subprocess tests that need device access in a child process.
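A minimal sketch of such a hook, assuming the subprocess tests are identifiable via a `subprocess_test` marker (an assumption for illustration; the real hook lives in `tests/decoder/conftest.py`). `FakeItem` stands in for pytest's `Item` so the sketch runs without pytest:

```python
class FakeItem:
    """Stand-in for pytest.Item, just enough for this sketch."""

    def __init__(self, name, is_subprocess):
        self.name = name
        self._is_subprocess = is_subprocess

    def get_closest_marker(self, marker_name):
        # pytest returns a Mark object or None; a truthy sentinel suffices here.
        if marker_name == "subprocess_test" and self._is_subprocess:
            return object()
        return None


def pytest_collection_modifyitems(config, items):
    # Stable sort: subprocess tests sort first (False < True), and each
    # group keeps its original relative order.
    items.sort(key=lambda item: item.get_closest_marker("subprocess_test") is None)


items = [FakeItem("test_a", False), FakeItem("test_sub", True), FakeItem("test_b", False)]
pytest_collection_modifyitems(config=None, items=items)
# test_sub now runs before test_a and test_b
```

The in-place `items.sort(...)` matters: pytest uses the `items` list it passed in, so the hook must mutate it rather than rebind it.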

tests/fixtures/llm/export_models.py

Lines changed: 109 additions & 0 deletions
@@ -1,3 +1,93 @@
"""Pytest fixtures and CLI for provisioning compiled Neuron LLM test models.

Overview
--------
This module provides session-scoped pytest fixtures that compile (export) HF models
to Neuron NEFFs and make them available as local directories during test runs. It is
registered as a pytest plugin via ``pytest_plugins`` in ``tests/conftest.py``.

Model configurations
--------------------
Two dictionaries define every (model, batch_size, sequence_length) combination that
will be compiled:

- ``GENERATE_LLM_MODEL_CONFIGURATIONS`` — text-generation models, built from
  ``GENERATE_LLM_MODEL_IDS`` × [(4, 1024), (1, 8192)].
- ``EMBED_LLM_MODEL_CONFIGURATIONS`` — embedding models, built from
  ``EMBED_LLM_MODEL_IDS`` × [(4, 8192), (6, 8192)].

Configuration names follow the pattern ``<model>-<batch_size>x<sequence_length>``,
e.g. ``llama-4x1024``. The merged dict ``LLM_MODEL_CONFIGURATIONS`` is the union of
both.

Caching strategy
----------------
Compiled models are expensive to produce (10-30+ min each on Neuron hardware). To
avoid recompilation, every exported model is pushed to a private HF Hub repo whose
name encodes all the variables that would change the compilation output::

    <org>/optimum-neuron-testing-<version>-<sdk_version>-<instance_type>-<code_hash>-<config_name>

The ``<code_hash>`` (see ``get_neuron_models_hash()``) is a truncated SHA-256 of the
git tree hashes of ``pyproject.toml`` and ``optimum/neuron/models/inference/``. When
*any* file inside those paths changes (even on an unrelated branch), the hash changes,
causing a fresh export on the next run.

Compiled artifacts are also synchronized to a shared cache repo
(``optimum-internal-testing/neuron-testing-cache``) so that the Neuron compiler cache
on other machines can hit them.

Cache invalidation
------------------
The hub repo name changes (and a re-export is triggered) when **any** of these change:

1. ``optimum-neuron`` package version (``optimum.neuron.version.__version__``).
2. Neuron SDK version (``optimum.neuron.version.__sdk_version__``).
3. Instance type (e.g. ``inf2.8xlarge`` vs ``trn1.32xlarge``).
4. Git content of ``pyproject.toml`` or ``optimum/neuron/models/inference/``.

Old hub repos are **not** auto-deleted. Prune them manually::

    python tools/prune_test_models.py [--version <ver>] [--pattern <pat>] [--yes]

Fixtures
--------
``any_generate_model``
    Parametrized over *all* generation configs. Each test using this fixture runs
    once per config. Use for broad cross-model validation (e.g. greedy expectations).

``neuron_llm_config``
    Provides a *single* config, chosen via ``@pytest.mark.parametrize`` with
    ``indirect=True``::

        @pytest.mark.parametrize("neuron_llm_config", ["llama-4x1024"], indirect=True)
        def test_something(neuron_llm_config):
            model_path = neuron_llm_config["neuron_model_path"]

    Defaults to the first config (``llama-4x1024``) if no param is given.

``speculation``
    Session-scoped fixture that provides a ``(model_path, draft_model_path)`` tuple
    for speculative decoding tests.

All fixtures yield a ``dict`` with keys: ``name``, ``model_id``, ``task``,
``export_kwargs``, ``neuron_model_id``, ``neuron_model_path``.

CLI usage
---------
Run this file directly to pre-export models before running tests (this is what CI
does)::

    # Export all models
    python tests/fixtures/llm/export_models.py

    # Export only llama configs
    python tests/fixtures/llm/export_models.py 'llama*'

    # List available configs
    python tests/fixtures/llm/export_models.py --list
"""
import copy
import hashlib
import logging
@@ -83,6 +173,13 @@

def get_neuron_models_hash():
    """Compute a short content hash that changes when inference code or build config changes.

    Uses ``git ls-tree HEAD`` to get the tree SHA of ``pyproject.toml`` and the
    ``optimum/neuron/models/inference/`` directory, then combines them into a
    truncated SHA-256. This means any file change inside those paths — even on a
    feature branch — produces a different hash and forces a re-export of test models.
    """
    import subprocess

    res = subprocess.run(["git", "rev-parse", "--show-toplevel"], capture_output=True, text=True)
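The hash combination described in that docstring can be sketched as follows. This is an illustration of the idea only: the real function obtains the SHAs from ``git ls-tree HEAD``, and the 16-character truncation length here is an assumed value, not taken from the source:

```python
import hashlib

def combine_tree_hashes(tree_shas, length=16):
    # Sort so the result is independent of argument order, then hash the
    # concatenation and truncate. Placeholder logic, not the real helper.
    digest = hashlib.sha256("".join(sorted(tree_shas)).encode()).hexdigest()
    return digest[:length]

code_hash = combine_tree_hashes([
    "3b18e512dba79e4c8300dd08aeb37f8e728b8dad",  # pyproject.toml tree SHA (example)
    "aab1c2d3e4f5061728394a5b6c7d8e9f00112233",  # inference/ tree SHA (example)
])
```

Because the hash covers whole git trees, touching any file under the watched paths changes a tree SHA and therefore the combined hash.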
@@ -106,14 +203,20 @@ def get_sha(path):

def _get_hub_neuron_model_prefix():
    """Build the HF Hub repo name prefix that encodes all invalidation keys.

    Format: ``<org>/optimum-neuron-testing-<version>-<sdk>-<instance>-<code_hash>``
    """
    return f"{TEST_HUB_ORG}/optimum-neuron-testing-{version}-{sdk_version}-{current_instance_type()}-{get_neuron_models_hash()}"


def _get_hub_neuron_model_id(config_name: str, model_config: dict[str, str]):
    """Return the full HF Hub repo id for a specific model configuration."""
    return f"{_get_hub_neuron_model_prefix()}-{config_name}"


def _export_model(model_id, task, export_kwargs, neuron_model_path):
    """Compile a model to Neuron NEFFs and save to ``neuron_model_path``."""
    if task == "text-generation":
        auto_class = NeuronModelForCausalLM
    elif task == "feature-extraction":
@@ -221,6 +324,12 @@ def neuron_llm_config(request):

@pytest.fixture(scope="session")
def speculation():
    """Provide compiled target + draft models for speculative decoding tests.

    Yields a ``(neuron_model_path, draft_neuron_model_path)`` tuple. The target
    model is compiled with ``speculation_length=5``; the draft model is a standard
    single-token model. Both use ``batch_size=1, sequence_length=4096``.
    """
    model_id = "unsloth/Llama-3.2-1B-Instruct"
    neuron_model_id = f"{_get_hub_neuron_model_prefix()}-speculation"
    draft_neuron_model_id = f"{_get_hub_neuron_model_prefix()}-speculation-draft"
