
Commit c7704b1

dacorvo and claude committed
docs: document test fixture system and add tests/AGENTS.md
Add comprehensive documentation for the export_models fixture system:

- Module-level docstring covering caching strategy, invalidation rules, fixture usage patterns, and CLI usage
- Inline docstrings for key functions
- New tests/AGENTS.md with agent-facing guidance on test infrastructure, pre-export workflow, expected test durations, and compiler cache policy
- Reference tests/AGENTS.md from root AGENTS.md context loading section
- Fix stale path reference to export_models.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent b7a8216 commit c7704b1

File tree

3 files changed: +225 -1 lines changed


AGENTS.md

Lines changed: 2 additions & 1 deletion
@@ -14,6 +14,7 @@ directories, but must be read manually when working from the project root:
 - `optimum/neuron/models/inference/backend/modules/attention/AGENTS.md` — attention or NKI kernel work
 - `optimum/neuron/models/inference/<model>/AGENTS.md` — model-specific work (gemma3, llama, qwen3, etc.)
 - `optimum/neuron/vllm/AGENTS.md` — vLLM integration work
+- `tests/AGENTS.md` — test infrastructure, fixtures, and cache management

 When adding a new model, create a `CLAUDE.md` containing `@AGENTS.md` in its directory
 so this auto-loading applies to it automatically.

@@ -115,7 +116,7 @@ All test workflows follow the same pattern:
 - Static shapes: runtime input shapes must match compiled shapes.
 - Export and load in separate processes to avoid device conflicts.
 - Neuron runtime does not release devices reliably within the same process.
-- Decoder graph changes require cache prune when using the fixtures defined under `tests/fixtures/export_models.py`: `python tools/prune_test_models.py`.
+- Decoder graph changes require cache prune when using the fixtures defined under `tests/fixtures/llm/export_models.py`: `python tools/prune_test_models.py`.

 ## Environment Variables
tests/AGENTS.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Test Infrastructure Agent Guide

Read this before working on any test code under `tests/`.

## Directory Structure

```
tests/
  conftest.py                # Root conftest — registers fixture plugins via pytest_plugins
  fixtures/llm/
    export_models.py         # Model export fixtures + CLI (see detailed docs in the module docstring)
    vllm_service.py          # vLLM OpenAI-compatible service launcher fixture
    vllm_docker_service.py   # vLLM Docker service launcher fixture
  decoder/                   # LLM decoder tests (generation, export, pipelines, modules, etc.)
    conftest.py              # Reorders @subprocess_test tests to run before session fixtures
  vllm/                      # vLLM integration tests
    engine/                  # Direct engine API tests
    service/                 # OpenAI-compatible serving tests
    docker/                  # Docker container tests
```
## Fixture Registration

Fixtures in `tests/fixtures/llm/` are registered as pytest plugins in `tests/conftest.py`:

```python
pytest_plugins = [
    "fixtures.llm.vllm_docker_service",
    "fixtures.llm.vllm_service",
    "fixtures.llm.export_models",
]
```

This makes `neuron_llm_config`, `any_generate_model`, and `speculation` available to all tests.
## The Export Models Fixture System

Full documentation is in the `tests/fixtures/llm/export_models.py` module docstring. Key points:

### Model Configurations

Each model is compiled for two shapes: `(batch_size=4, seq_len=1024)` and `(batch_size=1, seq_len=8192)`.
Config names follow the pattern `<model>-<batch>x<seq>`, e.g. `llama-4x1024`.
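As a minimal sketch of the naming scheme, configurations can be derived by crossing model ids with shapes. The model id and shape list below are illustrative placeholders, not the actual `GENERATE_LLM_MODEL_IDS` or shape tables from `export_models.py`:

```python
# Hypothetical subset of model ids; the real list lives in export_models.py.
MODEL_IDS = {"llama": "unsloth/Llama-3.2-1B-Instruct"}
SHAPES = [(4, 1024), (1, 8192)]  # (batch_size, sequence_length)

# Each (model, shape) pair becomes one named configuration.
configurations = {
    f"{name}-{batch}x{seq}": {
        "model_id": model_id,
        "batch_size": batch,
        "sequence_length": seq,
    }
    for name, model_id in MODEL_IDS.items()
    for batch, seq in SHAPES
}
```

With these placeholder inputs, the resulting config names are `llama-4x1024` and `llama-1x8192`, matching the pattern above.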
### Using Fixtures in Tests

**`neuron_llm_config`** — for tests that need a specific model config:

```python
@pytest.mark.parametrize("neuron_llm_config", ["llama-4x1024"], indirect=True)
def test_something(neuron_llm_config: dict[str, Any]):
    model = NeuronModelForCausalLM.from_pretrained(neuron_llm_config["neuron_model_path"])
```

**`any_generate_model`** — for tests that should run across all generation models:

```python
def test_greedy_expectations(any_generate_model):
    model_path = any_generate_model["neuron_model_path"]
    config_name = any_generate_model["name"]  # e.g. "llama-4x1024"
```

Both yield a dict with keys: `name`, `model_id`, `task`, `export_kwargs`, `neuron_model_id`, `neuron_model_path`.
### Hub Caching

Compiled models are pushed to private HF Hub repos. The repo name encodes all invalidation keys:
`<org>/optimum-neuron-testing-<version>-<sdk>-<instance>-<code_hash>-<config_name>`.

The code hash changes when `pyproject.toml` or anything under `optimum/neuron/models/inference/` changes.
Old repos must be pruned manually: `python tools/prune_test_models.py`.
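A minimal sketch of how such a repo name composes its invalidation keys. All concrete values below (org, versions, hash) are illustrative placeholders, not values read from the real helpers:

```python
# Illustrative composition of the cache repo name; the real code builds this
# in _get_hub_neuron_model_prefix(). Every value here is an example.
org = "optimum-internal-testing"   # hypothetical org
version = "0.4.2"                  # optimum-neuron package version (example)
sdk_version = "2.21.0"             # Neuron SDK version (example)
instance_type = "inf2.8xlarge"
code_hash = "a1b2c3d4"             # truncated content hash (example)
config_name = "llama-4x1024"

repo_id = (
    f"{org}/optimum-neuron-testing-"
    f"{version}-{sdk_version}-{instance_type}-{code_hash}-{config_name}"
)
# Changing any single component (version, SDK, instance, code hash, config)
# yields a different repo id, which is exactly the invalidation rule.
```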
## Always Pre-Export Models Before Running Tests

**CI always runs `python tests/fixtures/llm/export_models.py` as a separate step before any pytest invocation** (see `.github/workflows/test_inf2_llm.yml` and `test_inf2_vllm.yml`). You must do the same locally:

```bash
# Export all models (or use a pattern like 'llama*')
python tests/fixtures/llm/export_models.py

# Then run tests
pytest -sv tests/decoder/test_decoder_generation.py
```

If you skip the pre-export step, fixtures will auto-export on first use. This causes:

- **Long hangs**: compilation takes 10-30+ minutes per model, making it hard to tell if a test is stuck or just compiling.
- **NeuronCore conflicts**: the compilation process may conflict with subprocess-isolated tests that also need device access.
### Expected Test Durations

Based on CI logs (inf2.8xlarge, models pre-exported), entire test groups complete within:

| Test group | Duration |
|---|---|
| LLM utils / hub / CLI / embedding | < 1 min each |
| LLM export tests | ~4 min |
| LLM generation tests | ~7 min |
| LLM pipeline tests | ~5 min |
| LLM module tests (NKI kernels) | ~19 min |
| LLM cache tests | ~5 min |
| vLLM engine generation | ~20 min |
| vLLM service tests | ~15 min |

An individual test should complete within **2 minutes**. If a test hangs longer than that, the most likely cause is a missing pre-export triggering compilation inside the fixture. Pre-export, then re-run.
## Never Wipe the Neuron Compiler Cache

The Neuron compiler cache is **content-addressed**: each compiled NEFF is keyed by the SHA hash of the HLO graph that produced it. The hash space is large enough to make collisions practically impossible.

**There is no such thing as a "stale compiler cache entry."** If the HLO graph changes (because you changed model code), the hash changes, and a new entry is created. The old entry is simply never matched again — it does no harm.

Wiping the compiler cache (e.g. `rm -rf /var/tmp/neuron-compile-cache`) only forces expensive recompilation with zero benefit. **Never suggest or perform cache deletion as a debugging step.**
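The content-addressed lookup can be sketched as a toy in-memory cache. This is an illustration of the principle, not the actual Neuron compiler cache implementation; the class and key derivation are assumptions:

```python
import hashlib

class ContentAddressedCache:
    """Toy content-addressed cache: artifacts keyed by a hash of their input."""

    def __init__(self):
        self._store = {}  # key -> compiled artifact

    def _key(self, hlo_graph: bytes) -> str:
        # Any change to the input graph yields a different key,
        # so an entry can never be "stale", only unmatched.
        return hashlib.sha256(hlo_graph).hexdigest()

    def get_or_compile(self, hlo_graph: bytes, compile_fn):
        key = self._key(hlo_graph)
        if key not in self._store:
            self._store[key] = compile_fn(hlo_graph)  # cache miss: compile once
        return self._store[key]

cache = ContentAddressedCache()
calls = []
compile_fn = lambda g: calls.append(g) or f"neff-for-{hashlib.sha256(g).hexdigest()[:8]}"
a = cache.get_or_compile(b"graph-v1", compile_fn)
b = cache.get_or_compile(b"graph-v1", compile_fn)  # hit: no recompilation
c = cache.get_or_compile(b"graph-v2", compile_fn)  # changed graph: new entry, old one untouched
```

Deleting entries buys nothing here: an old key is simply never looked up again once its graph changes.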
## Subprocess Test Ordering

`tests/decoder/conftest.py` contains a `pytest_collection_modifyitems` hook that moves `@subprocess_test`-decorated tests to run **before** all other tests. This prevents session-scoped fixtures (which load models onto NeuronCores) from blocking subprocess tests that need device access in a child process.
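A minimal sketch of such a hook, assuming the subprocess tests are identifiable via a `subprocess_test` marker (an assumption for illustration; the real hook lives in `tests/decoder/conftest.py`). `FakeItem` stands in for pytest's `Item` so the sketch runs without pytest:

```python
class FakeItem:
    """Stand-in for pytest.Item, just enough for this sketch."""

    def __init__(self, name, is_subprocess):
        self.name = name
        self._is_subprocess = is_subprocess

    def get_closest_marker(self, marker_name):
        # pytest returns a Mark object or None; a truthy sentinel suffices here.
        if marker_name == "subprocess_test" and self._is_subprocess:
            return object()
        return None


def pytest_collection_modifyitems(config, items):
    # Stable sort: subprocess tests sort first (False < True), and each
    # group keeps its original relative order.
    items.sort(key=lambda item: item.get_closest_marker("subprocess_test") is None)


items = [FakeItem("test_a", False), FakeItem("test_sub", True), FakeItem("test_b", False)]
pytest_collection_modifyitems(config=None, items=items)
# test_sub now runs before test_a and test_b
```

The in-place `items.sort(...)` matters: pytest uses the `items` list it passed in, so the hook must mutate it rather than rebind it.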

tests/fixtures/llm/export_models.py

Lines changed: 109 additions & 0 deletions
@@ -1,3 +1,93 @@
"""Pytest fixtures and CLI for provisioning compiled Neuron LLM test models.

Overview
--------
This module provides session-scoped pytest fixtures that compile (export) HF models
to Neuron NEFFs and make them available as local directories during test runs. It is
registered as a pytest plugin via ``pytest_plugins`` in ``tests/conftest.py``.

Model configurations
--------------------
Two dictionaries define every (model, batch_size, sequence_length) combination that
will be compiled:

- ``GENERATE_LLM_MODEL_CONFIGURATIONS`` — text-generation models, built from
  ``GENERATE_LLM_MODEL_IDS`` × [(4, 1024), (1, 8192)].
- ``EMBED_LLM_MODEL_CONFIGURATIONS`` — embedding models, built from
  ``EMBED_LLM_MODEL_IDS`` × [(4, 8192), (6, 8192)].

Configuration names follow the pattern ``<model>-<batch_size>x<sequence_length>``,
e.g. ``llama-4x1024``. The merged dict ``LLM_MODEL_CONFIGURATIONS`` is the union of
both.

Caching strategy
----------------
Compiled models are expensive to produce (10-30+ min each on Neuron hardware). To
avoid recompilation, every exported model is pushed to a private HF Hub repo whose
name encodes all the variables that would change the compilation output::

    <org>/optimum-neuron-testing-<version>-<sdk_version>-<instance_type>-<code_hash>-<config_name>

The ``<code_hash>`` (see ``get_neuron_models_hash()``) is a truncated SHA-256 of the
git tree hashes of ``pyproject.toml`` and ``optimum/neuron/models/inference/``. When
*any* file inside those paths changes (even on an unrelated branch), the hash changes,
causing a fresh export on the next run.

Compiled artifacts are also synchronized to a shared cache repo
(``optimum-internal-testing/neuron-testing-cache``) so that the Neuron compiler cache
on other machines can hit them.

Cache invalidation
------------------
The hub repo name changes (and a re-export is triggered) when **any** of these change:

1. ``optimum-neuron`` package version (``optimum.neuron.version.__version__``).
2. Neuron SDK version (``optimum.neuron.version.__sdk_version__``).
3. Instance type (e.g. ``inf2.8xlarge`` vs ``trn1.32xlarge``).
4. Git content of ``pyproject.toml`` or ``optimum/neuron/models/inference/``.

Old hub repos are **not** auto-deleted. Prune them manually::

    python tools/prune_test_models.py [--version <ver>] [--pattern <pat>] [--yes]

Fixtures
--------
``any_generate_model``
    Parametrized over *all* generation configs. Each test using this fixture runs
    once per config. Use for broad cross-model validation (e.g. greedy expectations).

``neuron_llm_config``
    Provides a *single* config, chosen via ``@pytest.mark.parametrize`` with
    ``indirect=True``::

        @pytest.mark.parametrize("neuron_llm_config", ["llama-4x1024"], indirect=True)
        def test_something(neuron_llm_config):
            model_path = neuron_llm_config["neuron_model_path"]

    Defaults to the first config (``llama-4x1024``) if no param is given.

``speculation``
    Session-scoped fixture that provides a ``(model_path, draft_model_path)`` tuple
    for speculative decoding tests.

All fixtures yield a ``dict`` with keys: ``name``, ``model_id``, ``task``,
``export_kwargs``, ``neuron_model_id``, ``neuron_model_path``.

CLI usage
---------
Run this file directly to pre-export models before running tests (this is what CI
does)::

    # Export all models
    python tests/fixtures/llm/export_models.py

    # Export only llama configs
    python tests/fixtures/llm/export_models.py 'llama*'

    # List available configs
    python tests/fixtures/llm/export_models.py --list
"""
import copy
import hashlib
import logging
@@ -83,6 +173,13 @@

def get_neuron_models_hash():
    """Compute a short content hash that changes when inference code or build config changes.

    Uses ``git ls-tree HEAD`` to get the tree SHA of ``pyproject.toml`` and the
    ``optimum/neuron/models/inference/`` directory, then combines them into a
    truncated SHA-256. This means any file change inside those paths — even on a
    feature branch — produces a different hash and forces a re-export of test models.
    """
    import subprocess

    res = subprocess.run(["git", "rev-parse", "--show-toplevel"], capture_output=True, text=True)
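The hash combination described in that docstring can be sketched as follows. This is an illustration of the idea only: the real function obtains the SHAs from ``git ls-tree HEAD``, and the 16-character truncation length here is an assumed value, not taken from the source:

```python
import hashlib

def combine_tree_hashes(tree_shas, length=16):
    # Sort so the result is independent of argument order, then hash the
    # concatenation and truncate. Placeholder logic, not the real helper.
    digest = hashlib.sha256("".join(sorted(tree_shas)).encode()).hexdigest()
    return digest[:length]

code_hash = combine_tree_hashes([
    "3b18e512dba79e4c8300dd08aeb37f8e728b8dad",  # pyproject.toml tree SHA (example)
    "aab1c2d3e4f5061728394a5b6c7d8e9f00112233",  # inference/ tree SHA (example)
])
```

Because the hash covers whole git trees, touching any file under the watched paths changes a tree SHA and therefore the combined hash.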
@@ -106,14 +203,20 @@ def get_sha(path):

def _get_hub_neuron_model_prefix():
    """Build the HF Hub repo name prefix that encodes all invalidation keys.

    Format: ``<org>/optimum-neuron-testing-<version>-<sdk>-<instance>-<code_hash>``
    """
    return f"{TEST_HUB_ORG}/optimum-neuron-testing-{version}-{sdk_version}-{current_instance_type()}-{get_neuron_models_hash()}"


def _get_hub_neuron_model_id(config_name: str, model_config: dict[str, str]):
    """Return the full HF Hub repo id for a specific model configuration."""
    return f"{_get_hub_neuron_model_prefix()}-{config_name}"


def _export_model(model_id, task, export_kwargs, neuron_model_path):
    """Compile a model to Neuron NEFFs and save to ``neuron_model_path``."""
    if task == "text-generation":
        auto_class = NeuronModelForCausalLM
    elif task == "feature-extraction":
@@ -221,6 +324,12 @@ def neuron_llm_config(request):

@pytest.fixture(scope="session")
def speculation():
    """Provide compiled target + draft models for speculative decoding tests.

    Yields a ``(neuron_model_path, draft_neuron_model_path)`` tuple. The target
    model is compiled with ``speculation_length=5``; the draft model is a standard
    single-token model. Both use ``batch_size=1, sequence_length=4096``.
    """
    model_id = "unsloth/Llama-3.2-1B-Instruct"
    neuron_model_id = f"{_get_hub_neuron_model_prefix()}-speculation"
    draft_neuron_model_id = f"{_get_hub_neuron_model_prefix()}-speculation-draft"
