Follow-up from PR #346.
PR #346 made cleanup.sh run rm -f /host/run/nvidia/validations/toolkit-ready unconditionally on preStop. On a hypothetical node where nvml-mock and the real nvidia-container-toolkit coexist, our cleanup would yank a marker that the real toolkit owns, taking down legit operand pods.
Status: mixed-mode (mock + real toolkit on the same node) is not a supported configuration today, and the existing /run/nvidia/driver symlink cleanup has the same hazard. So this is a defensive measure, not an active bug.
Ask: evaluate a sibling sentinel approach. For example:
- setup.sh writes /host/run/nvidia/validations/toolkit-ready AND /host/run/nvidia/validations/.nvml-mock-owned
- cleanup.sh only removes toolkit-ready when .nvml-mock-owned is present (then removes the sentinel too)
- Either as a dedicated PR, or rolled into a broader mock-vs-real coexistence story
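
A minimal sketch of the sentinel idea, to make the proposal concrete. The function names mock_setup and mock_cleanup and the directory argument are hypothetical (not taken from the current setup.sh / cleanup.sh), and creating the files empty is an assumption about how the marker is written:

```shell
# Hypothetical helpers illustrating the sibling-sentinel approach.
# The directory defaults to the path named in this issue; passing another
# directory as $1 allows a dry run outside the node.

mock_setup() {
    dir="${1:-/host/run/nvidia/validations}"
    mkdir -p "$dir"
    # Record ownership next to the marker (assumption: empty files suffice).
    : > "$dir/.nvml-mock-owned"
    : > "$dir/toolkit-ready"
}

mock_cleanup() {
    dir="${1:-/host/run/nvidia/validations}"
    # Only touch toolkit-ready if the mock's sentinel says we created it.
    if [ -e "$dir/.nvml-mock-owned" ]; then
        rm -f "$dir/toolkit-ready" "$dir/.nvml-mock-owned"
    fi
}
```

With this shape, a toolkit-ready file that exists without the sentinel is left alone on preStop. One open question the sketch does not settle: what setup should do if toolkit-ready already exists when the mock starts.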
Could also apply to the /run/nvidia/driver symlink (cleanup currently checks [ -L ... ] but doesn't verify ownership). Worth deciding whether to fix both consistently.
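
For the symlink, one possible ownership check (an alternative to a sentinel, offered only as a sketch) is to compare the link's target against the target the mock created. The helper name remove_link_if_ours is hypothetical, and the expected target is passed in as a parameter because this issue does not say what the mock points the link at:

```shell
# Hypothetical helper: remove a symlink only if it points where we expect.
# $1 = link path, $2 = target the mock is assumed to have created.
remove_link_if_ours() {
    link="$1"
    expected="$2"
    # [ -L ... ] mirrors the existing check; the readlink comparison is
    # the added ownership test.
    if [ -L "$link" ] && [ "$(readlink "$link")" = "$expected" ]; then
        rm -f "$link"
    fi
}
```

A link that exists but points elsewhere (for example, one created by a real driver install) would be left in place.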
Surfaced by the Principal Engineer reviewer pass on PR #346. Lower priority than the E2E follow-up; only matters once mixed-mode becomes a real deployment shape.