feat(nvml-mock): nvidia-persistenced + Xid health-remediation loop #378

@giuliocalzo

Description

Priority: Low

Context. Failure injection (#328) emits Xid via the NVML event set, but the downstream remediation loop isn't exercised.

Gap. There is no nvidia-persistenced presence, and no E2E test covers the path from NPD / device-plugin health detection to node cordon/drain on Xid.

Proposed scope.

  • Optional fake nvidia-persistenced socket/presence.
  • E2E: inject a critical Xid and assert that the device-plugin health monitor (or NPD) marks the GPU Unhealthy and/or the node gets cordoned.

Why. Validates the operational response to failures the mock can already inject. Low priority.
