feat: runtime fingerprinting, identity verification, and lockfile (#19)
* Add runtime fingerprinting, lockfile, and pre-submit validation
Recipes are moving toward being self-contained documents of intent.
This adds the infrastructure for reproducibility and environment tracking:
- **Fingerprint module** (`core/fingerprint.py`): Captures pip freeze, GPU
info, CUDA/torch/NCCL versions inside running containers. Deterministic
output (sorted packages, fixed key order) for clean diffs.
- **Lockfile** (`core/lockfile.py`): Aggregates per-worker fingerprints into
`recipe.lock.yaml` written to the output directory after each run.
- **Pre-submit validation** (`core/validation.py`): Background checks that
HF models exist, Docker images resolve, and local paths are real.
Fire-and-forget — never blocks job submission.
- **Schema**: Optional `name`, `revision`, `container_image`, `container_digest`
fields on ModelConfig for virtual identity tracking.
- **CLI commands**: `srtctl diff` to compare two runs, `srtctl check` to
verify environment against a reference fingerprint.
- **Worker preamble**: Fingerprint capture injected after setup/pip install,
before server launch — captures the real runtime state.
All fault-tolerant: every probe, check, and write can fail independently
without affecting the job. 133 new tests, all existing tests unaffected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
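The deterministic-output idea can be sketched as follows. This is an illustration, not the actual `core/fingerprint.py` API — the function names and dict keys are assumptions:

```python
import json

def build_fingerprint(pip_lines, gpu_info, versions):
    """Assemble a fingerprint dict with deterministic ordering.

    Sorting the package list means two runs with identical
    environments serialize to byte-identical output, so diffs
    show only real changes.
    """
    return {
        "pip_packages": sorted(pip_lines),  # stable package order
        "gpu": gpu_info,
        "versions": versions,
    }

def dump_fingerprint(fp):
    # sort_keys fixes dict key order; compact separators avoid whitespace drift
    return json.dumps(fp, sort_keys=True, separators=(",", ":"))
```

Any serialization with the same two properties (sorted collections, fixed key order) would give the same clean-diff behavior.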
* Add SLURM context to lockfile and write early at job start
The lockfile now captures the SLURM environment (job ID, account,
partition, nodelist, user, cwd) in the _meta.slurm section. This is
written at the start of the sweep so even crashed jobs have a lockfile
with config + cluster context. The postprocess stage rewrites it with
the aggregated runtime fingerprint after workers complete.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
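The SLURM capture described above might look like this sketch; the exact env-var mapping and key names are assumptions, not the real `core/lockfile.py` code:

```python
import os

# mapping from lockfile key to the SLURM environment variable (assumed names)
SLURM_KEYS = {
    "job_id": "SLURM_JOB_ID",
    "account": "SLURM_JOB_ACCOUNT",
    "partition": "SLURM_JOB_PARTITION",
    "nodelist": "SLURM_JOB_NODELIST",
    "user": "SLURM_JOB_USER",
}

def slurm_context(env=None):
    """Collect SLURM job context for the lockfile's _meta.slurm section.

    Only variables that are actually set are included, so the section
    stays sparse when running outside a SLURM allocation.
    """
    env = os.environ if env is None else env
    ctx = {key: env[var] for key, var in SLURM_KEYS.items() if var in env}
    ctx["cwd"] = os.getcwd()
    return ctx
```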
* Store per-worker fingerprints instead of aggregating
Each worker (prefill_w0, decode_w0, etc.) keeps its own fingerprint
in the lockfile rather than being unioned into one blob. Prefill and
decode nodes can have different GPU types, drivers, and packages —
collapsing them hides real differences.
srtctl diff now compares each worker against its counterpart between
runs. srtctl check verifies each worker independently.
Backward compatible: old lockfiles with a single 'fingerprint' key
are loaded as {"worker": fingerprint}.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
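A sketch of that backward-compatible load path; the `fingerprints` key name for the new per-worker format is an assumption:

```python
def load_worker_fingerprints(lockfile: dict) -> dict:
    """Return {worker_name: fingerprint} from a parsed lockfile.

    Old lockfiles stored one aggregated blob under 'fingerprint';
    wrapping it under a generic "worker" key means diff/check code
    only ever sees the per-worker shape.
    """
    if "fingerprints" in lockfile:       # new format: per-worker dict
        return lockfile["fingerprints"]
    if "fingerprint" in lockfile:        # legacy single-blob format
        return {"worker": lockfile["fingerprint"]}
    return {}
```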
* fix: use heredoc for fingerprint capture script to avoid escaping bugs
The inline python3 -c approach produced literal \n characters instead of
newlines when passing through bash → srun → bash → python. This caused a
SyntaxError that was silently swallowed by || true, so fingerprints were
never actually collected.
Fix: write the capture script via a bash heredoc (cat <<'EOF') and pipe
to python3 via process substitution. This is immune to quoting/escaping
issues in the srun chain.
Also add two new tests:
- test_embedded_python_is_syntactically_valid: ast.parse() the extracted
Python source to catch syntax errors at test time
- test_embedded_python_produces_json: actually execute the script in a
subprocess to verify it runs end-to-end
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
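The syntactic-validity test can be sketched like this. The regex, the `PREAMBLE` string, and the helper name are hypothetical stand-ins for the real preamble template:

```python
import ast
import re

def extract_heredoc_python(bash_script: str) -> str:
    """Pull the Python source out of a cat <<'EOF' ... EOF heredoc.

    The quoted delimiter ('EOF') disables all bash expansion, so the
    body survives the bash -> srun -> bash chain byte-for-byte.
    """
    m = re.search(r"<<'EOF'[^\n]*\n(.*?)\nEOF", bash_script, re.DOTALL)
    if m is None:
        raise ValueError("no heredoc found")
    return m.group(1)

# toy stand-in for the generated worker preamble
PREAMBLE = """cat <<'EOF' | python3
import json
print(json.dumps({"ok": True}))
EOF
"""

# ast.parse raises SyntaxError at test time instead of silently at runtime
ast.parse(extract_heredoc_python(PREAMBLE))
```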
* fix: use python3 -m pip freeze for better package capture, resolve log_dir in lockfile
- pip freeze misses system-installed packages in containers; python3 -m pip
freeze is more reliable across environments
- lockfile now records the resolved log_dir path instead of the template
string (./outputs/{job_id}/logs), so the lockfile is self-contained
- Added test for resolved_log_dir in lockfile
* feat: capture framework versions (vllm/sglang/trtllm/dynamo) in fingerprint
Adds a 'frameworks' section to the runtime fingerprint that probes for
vllm, sglang, tensorrt_llm, and dynamo versions inside the container.
Only detected frameworks are included.
Also adds virtual identity fields (name, container_image) to the mocker
recipe as an example of how to document pullable origins for
reproducibility.
* feat: capture container and model identity in runtime fingerprint
Adds two new probes to the fingerprint script:
- container_identity: reads enroot/Pyxis image metadata to capture the
original Docker digest and image env vars from the running container
- model_identity: reads HF download metadata (commit hash, repo ID) and
config.json model ID from the mounted model directory
This enables post-hoc verification that what actually ran matches what
the recipe declared (model.name/revision, container_image/digest).
* refactor: replace torch_version with frameworks dict, drop unverifiable container identity
- Removed container_identity() probe — Pyxis/enroot stores zero provenance
metadata inside containers (confirmed by inspection on ptyche GB200)
- Removed torch_version as a standalone field — torch version is now captured
inside the frameworks dict alongside vllm, sglang, tensorrt_llm, dynamo
- frameworks dict only includes detected frameworks (sparse)
- Added model_identity() probe for HF repo/revision from download metadata
- Updated pip freeze to use python3 -m pip freeze for better container compat
- Updated all tests to use new schema
* fix: probe venv Python and merge multiple pip freeze sources
- Auto-detect container Python venv (/opt/dynamo/venv/bin/python3) for
framework version probes — system python3 misses venv-installed packages
- Merge pip freeze output from venv python, system python, bare pip, and
uv pip freeze — different install methods show different packages
- Deduplicate across sources via set merge
* fix: label pip_packages by source instead of merging
pip_packages is now a dict keyed by source (e.g. '/opt/dynamo/venv/bin/python3',
'python3', 'pip', 'uv') so you can see which packages come from which
environment. Diff/check logic flattens for comparison.
* feat: add identity verification — compare recipe against runtime fingerprint
Adds identity: block to recipe schema with model.repo, model.revision,
and frameworks dict. After health check passes, the orchestrator loads
worker fingerprints and compares against identity declarations. Prints
a verification banner in the sweep log:
- All checks passed (with what was verified)
- WARNING: N mismatch(es) detected (with details)
Mismatches warn but don't fail the job.
* feat: show pass/fail for each identity check, fix HF metadata discovery
- Verification banner now shows OK/!! for each check explicitly
- Fixed model_identity probe to find HF commit hash from
.cache/huggingface/download/*.metadata (hf download --local-dir format)
- Added IdentityCheckResult dataclass for structured pass/fail results
* fix: treat model.repo as unverifiable when HF metadata lacks repo name, add revision to recipe
HF download --local-dir stores commit hashes in .cache/huggingface/download/*.metadata
but not the repo name. model.repo is now treated as 'declared, not verifiable' instead
of a failure when runtime can't determine the repo. model.revision check works correctly
against the cached commit hash.
* fix: use importlib.metadata for all framework probes, add tensorrt_llm to identity
`import tensorrt_llm` loads native CUDA extensions, which crash without a GPU
context. The fingerprint script runs before the worker starts, so the GPU may
not be available yet. `importlib.metadata.version()` only reads package metadata
from dist-info — no native code, no GPU needed. Applied to all framework
probes (vllm, sglang, tensorrt_llm, torch, dynamo).
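A sketch of the metadata-only probe; the constant and function names are illustrative:

```python
from importlib import metadata

# distribution names probed for the frameworks dict
FRAMEWORK_PACKAGES = ("vllm", "sglang", "tensorrt_llm", "dynamo")

def probe_frameworks(packages=FRAMEWORK_PACKAGES):
    """Read framework versions from dist-info metadata only.

    importlib.metadata.version() never imports the package, so probing
    tensorrt_llm cannot trigger its native CUDA extensions. Undetected
    frameworks are simply omitted, keeping the dict sparse.
    """
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return found
```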
* feat: show what's running at submit time, prompt for identity block if missing
After job submission, prints model/container/backend/benchmark summary.
If identity block is present, shows declared identity fields inline.
If missing, prints a yellow tip with example identity block to encourage
runtime verification.
* feat: include verification results in lockfile, right after _meta
The lockfile now has a 'verification' section at the top (after _meta,
before config) showing the identity check results:
verification:
  result: all OK
  passed: 5
  failed: 0
  checks:
    - field: model.repo
      status: OK
      message: nvidia/Kimi-K2.5-NVFP4 (declared, not verifiable at runtime)
    - field: frameworks.tensorrt_llm
      status: OK
      message: 1.3.0rc9
This is the first thing you see when reading the lockfile.
* chore: remove torch from framework probes, keep vllm/sglang/trtllm/dynamo only
* docs: explicit identity tip showing available frameworks and where versions come from
* docs: clarify frameworks is dynamo + one engine, not all three
* feat: show running summary + identity tip in dry-run output too
* docs: add agent instruction to always include identity block in recipes
* docs: explain identity enables result replication
* fix: pre-submit HF validation reads from identity block, not just model config
* cleanup: remove dead container_image/digest/name fields from ModelConfig
- Removed name, revision, container_image, container_digest from ModelConfig
(all moved to identity block)
- Removed Docker image pre-submit validation (can't verify from inside Pyxis)
- HF validation now reads from identity.model.repo
- Updated tests to use IdentityConfig instead of old ModelConfig fields
- Note: background validation thread was effectively dead code — daemon thread
exits before HTTP completes. Left in place but it needs a rethink.
* cleanup: remove dead background validation thread
The daemon thread spawned by run_validations_background() was killed before
completing — srtctl apply exits immediately after sbatch. The real validation
now happens at runtime via identity verification in the orchestrator.
* feat: inline HF model validation at submit time (replaces dead background thread)
Single HTTP HEAD to huggingface.co/api/models/{repo} before sbatch.
Shows green checkmark or yellow warning. Takes <1s, never blocks on failure.
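A sketch of that inline check against the public HF model API; the helper name and return convention are illustrative, not the actual submit-time code:

```python
import urllib.error
import urllib.request

def validate_hf_repo(repo: str, timeout: float = 2.0):
    """HEAD huggingface.co/api/models/{repo} before sbatch.

    Returns True (exists), False (HTTP error, e.g. missing or gated),
    or None when the network is unreachable (air-gapped clusters).
    Never raises: a warning must not block submission.
    """
    req = urllib.request.Request(
        f"https://huggingface.co/api/models/{repo}", method="HEAD"
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # 401/404 etc.: repo missing or gated
    except (urllib.error.URLError, TimeoutError, OSError):
        return None   # offline: indeterminate, proceed anyway
```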
* feat: HF model validation runs in dry-run too, not just submit
* review: harden fingerprint PR after code review
- IdentityCheckResult now frozen (consistency with other dataclasses)
- Extract FRAMEWORK_PACKAGES constant (eliminates hardcoded duplicates)
- Remove hasattr(config, 'identity') checks (field always exists via default_factory)
- Reduce HF validation timeout 5s -> 2s (air-gapped clusters)
- Reduce bash script probe timeout 5s -> 3s (faster worker startup)
- Simplify find_python() (Path.exists() instead of subprocess)
* chore: remove design doc from branch
* feat: capture ML env vars in fingerprint with secret redaction
Captures CUDA_, TORCH_, NCCL_, VLLM_, SGLANG_, TRTLLM_, HF_, DYN_,
NVIDIA_, OMPI_, UCX_, NVSHMEM_ prefixed env vars. Redacts any variable
containing TOKEN, KEY, SECRET, PASSWORD, CREDENTIAL, or AUTH.
Inspired by dynamo's config_dump/environment.py but self-contained.
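The prefix-filter-plus-redaction logic might look like this sketch (helper name illustrative):

```python
import os

ENV_PREFIXES = ("CUDA_", "TORCH_", "NCCL_", "VLLM_", "SGLANG_", "TRTLLM_",
                "HF_", "DYN_", "NVIDIA_", "OMPI_", "UCX_", "NVSHMEM_")
SECRET_MARKERS = ("TOKEN", "KEY", "SECRET", "PASSWORD", "CREDENTIAL", "AUTH")

def capture_env(env=None):
    """Capture ML-relevant env vars, redacting secret-looking values.

    Redaction keys off the variable *name*: HF_TOKEN is still listed
    (so diffs show it was set) but its value is replaced.
    """
    env = os.environ if env is None else env
    captured = {}
    for name, value in sorted(env.items()):  # sorted: deterministic output
        if not name.startswith(ENV_PREFIXES):
            continue
        if any(marker in name.upper() for marker in SECRET_MARKERS):
            value = "<redacted>"
        captured[name] = value
    return captured
```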
* feat: include srt-slurm git commit hash in lockfile metadata
* fix: _parse_pip_packages handles UNAVAILABLE sentinel string gracefully
When all pip freeze commands fail, pip_packages is set to the string
'unavailable'. Previously this was iterated character-by-character,
producing 8 garbage entries ('u': '?', 'n': '?', ...) that corrupted
diff/check output. Now returns empty dict for strings and None.
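A sketch of the normalization; the dict shape follows the labeled-by-source format above, and the helper name is illustrative:

```python
def parse_pip_packages(raw):
    """Normalize a fingerprint's pip_packages field to {source: [lines]}.

    Probes that fail entirely leave the sentinel string 'unavailable';
    iterating a string yields characters, which previously produced
    garbage entries like {'u': '?', 'n': '?'}. Strings and None now
    normalize to an empty dict instead.
    """
    if raw is None or isinstance(raw, str):
        return {}
    if isinstance(raw, list):        # legacy flat list of requirements
        return {"merged": list(raw)}
    return dict(raw)                 # already keyed by source
```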
* fix: skip identity verification banner when no identity fields declared
IdentityConfig() is always truthy (it's a dataclass instance). Check
inner fields (model.repo, model.revision, frameworks) before running
verification, matching the pattern in submit.py.
* fix: address review findings — double walk, N captures, short prefix, bash doc
- validate_local_path: single directory walk instead of two rglob passes
- srtctl check: capture fingerprint once, reuse for all worker comparisons
- model.revision: require >= 7 chars to prevent false prefix matches
- Document bash requirement for process substitution in heredoc script
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>