nvml-mock: ownership sentinel for /run/nvidia/validations/toolkit-ready to prevent mixed-mode rm #349

@ArangoGutierrez

Follow-up from PR #346.

PR #346 made cleanup.sh unconditionally run rm -f /host/run/nvidia/validations/toolkit-ready on preStop. On a hypothetical node where both nvml-mock and the real nvidia-container-toolkit coexist, our cleanup would remove a marker that the real toolkit owns, taking down legitimate operand pods.

Status: mixed-mode (mock + real toolkit on the same node) is not a supported configuration today, and the existing /run/nvidia/driver symlink cleanup has the same hazard. So this is a defensive measure, not an active bug.

Ask: evaluate a sibling sentinel approach. For example:

  • setup.sh writes /host/run/nvidia/validations/toolkit-ready AND /host/run/nvidia/validations/.nvml-mock-owned
  • cleanup.sh only removes toolkit-ready when .nvml-mock-owned is present (then removes the sentinel too)
  • Either as a dedicated PR, or rolled into a broader mock-vs-real coexistence story
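A minimal sketch of the sentinel idea from the bullets above, to make the proposal concrete. The function names and the VALIDATIONS_DIR variable are hypothetical and not taken from the real setup.sh or cleanup.sh; only the two file paths come from this issue. It does not try to settle what setup should do if toolkit-ready already exists when the mock starts.

```shell
#!/bin/sh
# Sketch only: mock_setup_marker, mock_cleanup_marker and VALIDATIONS_DIR are
# hypothetical names, not the actual contents of setup.sh / cleanup.sh.
VALIDATIONS_DIR="${VALIDATIONS_DIR:-/host/run/nvidia/validations}"
MARKER="toolkit-ready"
SENTINEL=".nvml-mock-owned"

# setup.sh side: write the marker AND the sibling ownership sentinel.
mock_setup_marker() {
    mkdir -p "$VALIDATIONS_DIR" || return 1
    : > "$VALIDATIONS_DIR/$MARKER" || return 1
    : > "$VALIDATIONS_DIR/$SENTINEL"
}

# cleanup.sh side: remove the marker only when the sentinel says we own it,
# then remove the sentinel too.
mock_cleanup_marker() {
    if [ -e "$VALIDATIONS_DIR/$SENTINEL" ]; then
        rm -f "$VALIDATIONS_DIR/$MARKER" "$VALIDATIONS_DIR/$SENTINEL"
    else
        echo "toolkit-ready is not owned by nvml-mock; leaving it in place" >&2
    fi
}
```

With this shape, a toolkit-ready file written by something other than the mock has no sentinel next to it, so preStop leaves it alone.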

Could also apply to the /run/nvidia/driver symlink (cleanup currently checks [ -L ... ] but doesn't verify ownership). Worth deciding whether to fix both consistently.
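If the same pattern were applied to the driver symlink, it might look roughly like the sketch below. Everything named here is an assumption for illustration: the sentinel file name .nvml-mock-driver-owned, the DRIVER_LINK and DRIVER_SENTINEL variables, the function name, and the /host prefix on the symlink path (assumed by analogy with the marker path). The only part taken from the current cleanup is the [ -L ... ] check.

```shell
#!/bin/sh
# Sketch only: all names below are hypothetical, and the /host prefix is an
# assumption by analogy with /host/run/nvidia/validations.
DRIVER_LINK="${DRIVER_LINK:-/host/run/nvidia/driver}"
DRIVER_SENTINEL="${DRIVER_SENTINEL:-/host/run/nvidia/.nvml-mock-driver-owned}"

mock_cleanup_driver_link() {
    # Same check cleanup does today: only ever act on a symlink.
    [ -L "$DRIVER_LINK" ] || return 0
    # Added ownership check: require a sentinel that setup would have written.
    if [ -e "$DRIVER_SENTINEL" ]; then
        rm -f "$DRIVER_LINK" "$DRIVER_SENTINEL"
    else
        echo "driver symlink is not owned by nvml-mock; leaving it in place" >&2
    fi
}
```

Using one sentinel convention for both the marker and the symlink would keep the two cleanup paths consistent, which is the decision this issue asks about.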

Surfaced by the Principal Engineer reviewer pass on PR #346. Lower priority than the E2E follow-up; only matters once mixed-mode becomes a real deployment shape.

Labels

  • kind/feature: Feature request or enhancement.
  • priority/p2: P2: minor defect or perf implication. No fix-time commitment.
