Priority: Low
Context. Failure injection (#328) emits Xid via the NVML event set, but the downstream remediation loop isn't exercised.
Gap. The mock has no nvidia-persistenced presence, and no E2E covers the path from NPD / device-plugin health detection to node cordon/drain on Xid.
Proposed scope.
- Optional fake nvidia-persistenced socket/presence.
- E2E: inject a critical Xid and assert the device-plugin health monitor (or NPD) marks the GPU Unhealthy / node gets cordoned.
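The assertion half of the E2E item could look roughly like the sketch below. It is illustrative only, not the project's test harness. It assumes the test can obtain the Node object as parsed JSON (for example from `kubectl get node <name> -o json`) and treats either `spec.unschedulable` being set or a drop in allocatable `nvidia.com/gpu` as evidence of remediation. The helper names (`remediated`, `wait_for_remediation`, `fetch_node`) are hypothetical, and the Xid injection step itself is left out because it depends on the mechanism from #328.

```python
import time
from typing import Callable

GPU_RESOURCE = "nvidia.com/gpu"


def is_cordoned(node: dict) -> bool:
    """True if the Node object has spec.unschedulable set (i.e. it is cordoned)."""
    return bool(node.get("spec", {}).get("unschedulable", False))


def allocatable_gpus(node: dict) -> int:
    """Number of GPUs the kubelet currently reports as allocatable."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(GPU_RESOURCE, "0"))


def remediated(node: dict, gpus_before: int) -> bool:
    """Assumed success criterion: node cordoned OR fewer allocatable GPUs than before."""
    return is_cordoned(node) or allocatable_gpus(node) < gpus_before


def wait_for_remediation(
    fetch_node: Callable[[], dict],   # hypothetical: returns the Node as parsed JSON
    gpus_before: int,
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the node until it looks remediated or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if remediated(fetch_node(), gpus_before):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A test would record `allocatable_gpus` before injecting the Xid, trigger the injection, then assert `wait_for_remediation(...)` returns `True`. Accepting either signal keeps the check usable whether the device-plugin health monitor or NPD reacts first.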
Why. Validates the operational response to failures the mock can already inject. Low priority.