Bug Description
node-drainer can process stale AlreadyQuarantined health events after fault-quarantine has already unquarantined the node and removed the quarantine annotation.
When this happens, node-drainer logs "No quarantine annotation found for node", but treats the missing annotation as "not already drained" and falls through to normal drain evaluation. If no evictable pods remain, it marks the stale event's userpodsevictionstatus as Succeeded and updates the node state label, even though there is no active quarantine context.
Timeline
2026-06-22T14:57:59Z
Event 6a394d77cb03a8deb6ea786f
checkName=SysLogsXIDError
isHealthy=false
isFatal=true
message="Xid 79, GPU has fallen off the bus"
nodequarantined=Quarantined
userpodsevictionstatus=Cancelled
This was the event that initially quarantined the node.
2026-06-22T15:28:37Z
Event 6a3954a53213d318af75f41e
checkName=GpuDcgmConnectivityFailure
isHealthy=false
isFatal=true
message="Failed to connect to DCGM for health check"
nodequarantined=AlreadyQuarantined
The node was already quarantined, so this event was recorded as AlreadyQuarantined.
2026-06-22T15:39:01Z
Event 6a39571599a8340849547ed2
checkName=GpuDcgmConnectivityFailure
isHealthy=false
isFatal=true
message="Failed to connect to DCGM for health check"
nodequarantined=AlreadyQuarantined
Another fatal DCGM connectivity event was also recorded as AlreadyQuarantined.
2026-06-22T15:40:57Z
Event 6a39578999a8340849547ed3
checkName=SysLogsXIDError
isHealthy=true
isFatal=false
message="No Health Failures"
nodequarantined=""
userpodsevictionstatus=""
A healthy SysLogsXIDError event arrived. fault-quarantine removed recovered entities for SysLogsXIDError, but kept the node quarantined because GpuDcgmConnectivityFailure was still failing:
2026-06-22T15:40:57Z fault-quarantine
Removed recovered entities for check on node
check=SysLogsXIDError
removedCount=2
remainingEntities=1
2026-06-22T15:40:57Z fault-quarantine
Node remains quarantined with failing checks
failingChecksCount=1
checks=[GpuDcgmConnectivityFailure]
2026-06-22T15:41:51Z
Event 6a3957bf99a8340849547ed6
checkName=GpuDcgmConnectivityFailure
isHealthy=true
isFatal=false
message="DCGM connectivity reported no errors"
nodequarantined=UnQuarantined
userpodsevictionstatus=Succeeded
A healthy GpuDcgmConnectivityFailure event arrived and cleared the last remaining failing check. fault-quarantine emitted UnQuarantined.
2026-06-22T15:41:52Z node-drainer
Detected UnQuarantined event, marking all in-progress events for node as cancelled
2026-06-22T16:03:21Z node-drainer
Event was cancelled, performing cleanup
eventID=6a394d77cb03a8deb6ea786f
2026-06-22T16:03:21Z node-drainer
Health event status has been updated
documentID=6a394d77cb03a8deb6ea786f
evictionStatus=Cancelled
node-drainer correctly cancelled the original SysLogsXIDError drain.
Later, after the node was unquarantined and the quarantine annotation was removed, the earlier AlreadyQuarantined DCGM events were processed by node-drainer again:
2026-06-23T23:05:56Z node-drainer
No quarantine annotation found for node
All pods evicted successfully on node
Evaluated action for node action=UpdateStatus
Health event status has been updated documentID=6a3954a53213d318af75f41e evictionStatus=Succeeded
2026-06-23T23:05:57Z node-drainer
No quarantine annotation found for node
All pods evicted successfully on node
Evaluated action for node action=UpdateStatus
Labeling node from=remediation-failed to=drain-succeeded
Invalid state transition from=remediation-failed to=drain-succeeded
Label updated successfully for node
Health event status has been updated documentID=6a39571599a8340849547ed2 evictionStatus=Succeeded
Expected Behavior
AlreadyQuarantined events should not be drained after the quarantine annotation is gone. A missing quarantine annotation on an AlreadyQuarantined event should be treated as a stale, no-active-quarantine condition, and node-drainer should skip or cancel the event instead of evaluating a drain.
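The expected behavior could be sketched as a guard that runs before normal drain evaluation. This is only an illustration under stated assumptions, not node-drainer's actual code: the function decideForEvent, the type drainDecision, and its values are hypothetical names, and the status strings are taken from the nodequarantined values shown in the timeline.

```go
package main

import "fmt"

// Status strings copied from the nodequarantined field in the timeline above.
const (
	statusQuarantined        = "Quarantined"
	statusAlreadyQuarantined = "AlreadyQuarantined"
)

// drainDecision is a hypothetical result type used only for this sketch.
type drainDecision string

const (
	decisionEvaluateDrain drainDecision = "EvaluateDrain" // continue to normal drain evaluation
	decisionSkipStale     drainDecision = "SkipStale"     // skip or cancel the event, leave the node label untouched
)

// decideForEvent is a hypothetical guard. An AlreadyQuarantined event is only
// meaningful while the quarantine it refers to is still active, so a missing
// quarantine annotation marks the event as stale.
func decideForEvent(nodeQuarantined string, hasQuarantineAnnotation bool) drainDecision {
	if nodeQuarantined == statusAlreadyQuarantined && !hasQuarantineAnnotation {
		return decisionSkipStale
	}
	return decisionEvaluateDrain
}

func main() {
	// Stale DCGM event processed after the annotation was removed.
	fmt.Println(decideForEvent(statusAlreadyQuarantined, false))
	// Same kind of event while the node is still quarantined.
	fmt.Println(decideForEvent(statusAlreadyQuarantined, true))
	// Original quarantining event with the annotation present.
	fmt.Println(decideForEvent(statusQuarantined, true))
}
```

With a guard of this shape, the two GpuDcgmConnectivityFailure events from the timeline would be skipped rather than marked Succeeded. Whether a stale event should be recorded as Cancelled or left unchanged is a design decision this sketch does not settle.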
Component
Fault Management
Steps to Reproduce
- Quarantine a node from a fatal health event, for example SysLogsXIDError.
- While the node is quarantined, create additional fatal health events for the same node, for example GpuDcgmConnectivityFailure, so they are marked AlreadyQuarantined.
- Emit healthy events for the tracked failing checks.
- Allow fault-quarantine to remove the quarantine annotation and record UnQuarantined.
- Allow queued or requeued AlreadyQuarantined events to be processed by node-drainer after the annotation is removed.
- Observe that node-drainer logs "No quarantine annotation found for node", then proceeds to mark stale events as Succeeded and updates the node state label.
Environment
- NVSentinel version: v1.9.0
- Kubernetes version:
- Deployment method:
Logs/Output
The key log sequence is:
No quarantine annotation found for node
All pods evicted successfully on node
Evaluated action for node action=UpdateStatus
Health event status has been updated evictionStatus=Succeeded
This happened for stale AlreadyQuarantined events after the node had already been unquarantined.