[Bug]: node-drainer can process stale AlreadyQuarantined events #1415

Description

@XRFXLP

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Code of Conduct

  • I agree to follow NVSentinel's Code of Conduct

Bug Description

node-drainer can process stale AlreadyQuarantined health events after fault-quarantine has already unquarantined the node and removed the quarantine annotation.

When this happens, node-drainer logs "No quarantine annotation found for node" but treats the missing annotation as "not already drained" and falls through to normal drain evaluation. If no evictable pods remain, it marks the stale event's userpodsevictionstatus as Succeeded and updates the node state label, even though there is no active quarantine context.

Timeline

2026-06-22T14:57:59Z
Event 6a394d77cb03a8deb6ea786f
checkName=SysLogsXIDError
isHealthy=false
isFatal=true
message="Xid 79, GPU has fallen off the bus"
nodequarantined=Quarantined
userpodsevictionstatus=Cancelled

This was the event that initially quarantined the node.

2026-06-22T15:28:37Z
Event 6a3954a53213d318af75f41e
checkName=GpuDcgmConnectivityFailure
isHealthy=false
isFatal=true
message="Failed to connect to DCGM for health check"
nodequarantined=AlreadyQuarantined

The node was already quarantined, so this event was recorded as AlreadyQuarantined.

2026-06-22T15:39:01Z
Event 6a39571599a8340849547ed2
checkName=GpuDcgmConnectivityFailure
isHealthy=false
isFatal=true
message="Failed to connect to DCGM for health check"
nodequarantined=AlreadyQuarantined

Another fatal DCGM connectivity event was also recorded as AlreadyQuarantined.

2026-06-22T15:40:57Z
Event 6a39578999a8340849547ed3
checkName=SysLogsXIDError
isHealthy=true
isFatal=false
message="No Health Failures"
nodequarantined=""
userpodsevictionstatus=""

A healthy SysLogsXIDError event arrived. fault-quarantine removed recovered entities for SysLogsXIDError, but kept the node quarantined because GpuDcgmConnectivityFailure was still failing:

2026-06-22T15:40:57Z fault-quarantine
Removed recovered entities for check on node
check=SysLogsXIDError
removedCount=2
remainingEntities=1
2026-06-22T15:40:57Z fault-quarantine
Node remains quarantined with failing checks
failingChecksCount=1
checks=[GpuDcgmConnectivityFailure]

2026-06-22T15:41:51Z
Event 6a3957bf99a8340849547ed6
checkName=GpuDcgmConnectivityFailure
isHealthy=true
isFatal=false
message="DCGM connectivity reported no errors"
nodequarantined=UnQuarantined
userpodsevictionstatus=Succeeded

A healthy GpuDcgmConnectivityFailure event arrived and cleared the last remaining failing check. fault-quarantine emitted UnQuarantined.
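The quarantine status values in the timeline so far can be reproduced with a small model. The following is a minimal sketch, not fault-quarantine's actual code: the checkTracker type and its methods are hypothetical, it tracks failing checks by name only (the real logs count entities per check), and it assumes every unhealthy event is fatal.

```go
package main

import (
	"fmt"
	"sort"
)

// checkTracker is a hypothetical stand-in for the per-node state that
// fault-quarantine appears to keep: the set of checks that are still failing.
type checkTracker struct {
	failing map[string]bool
}

func newCheckTracker() *checkTracker {
	return &checkTracker{failing: make(map[string]bool)}
}

// observe records one health event and returns the nodequarantined value
// that the timeline shows for an event of that kind.
func (t *checkTracker) observe(checkName string, isHealthy bool) string {
	wasQuarantined := len(t.failing) > 0
	if !isHealthy {
		t.failing[checkName] = true
		if wasQuarantined {
			return "AlreadyQuarantined"
		}
		return "Quarantined"
	}
	delete(t.failing, checkName)
	if wasQuarantined && len(t.failing) == 0 {
		// The last failing check recovered, so the node is released.
		return "UnQuarantined"
	}
	// Healthy event that leaves the quarantine state unchanged.
	return ""
}

// remaining lists the checks that are still failing, sorted for stable output.
func (t *checkTracker) remaining() []string {
	names := make([]string, 0, len(t.failing))
	for name := range t.failing {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	t := newCheckTracker()
	// Same order of checks as the timeline above.
	steps := []struct {
		check   string
		healthy bool
	}{
		{"SysLogsXIDError", false},
		{"GpuDcgmConnectivityFailure", false},
		{"GpuDcgmConnectivityFailure", false},
		{"SysLogsXIDError", true},
		{"GpuDcgmConnectivityFailure", true},
	}
	for _, s := range steps {
		status := t.observe(s.check, s.healthy)
		fmt.Printf("check=%s isHealthy=%t nodequarantined=%q remaining=%v\n",
			s.check, s.healthy, status, t.remaining())
	}
}
```

In this model the two DCGM events are only ever labelled AlreadyQuarantined. Nothing in the model ties them to a quarantine that is still active when they are later processed, which is the gap this issue describes.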

2026-06-22T15:41:52Z node-drainer
Detected UnQuarantined event, marking all in-progress events for node as cancelled
2026-06-22T16:03:21Z node-drainer
Event was cancelled, performing cleanup
eventID=6a394d77cb03a8deb6ea786f
2026-06-22T16:03:21Z node-drainer
Health event status has been updated
documentID=6a394d77cb03a8deb6ea786f
evictionStatus=Cancelled

node-drainer correctly cancelled the original SysLogsXIDError drain.

More than a day later, after the node was unquarantined and the quarantine annotation was removed, node-drainer processed the earlier AlreadyQuarantined DCGM events again:

2026-06-23T23:05:56Z node-drainer
No quarantine annotation found for node
All pods evicted successfully on node
Evaluated action for node action=UpdateStatus
Health event status has been updated documentID=6a3954a53213d318af75f41e evictionStatus=Succeeded
2026-06-23T23:05:57Z node-drainer
No quarantine annotation found for node
All pods evicted successfully on node
Evaluated action for node action=UpdateStatus
Labeling node from=remediation-failed to=drain-succeeded
Invalid state transition from=remediation-failed to=drain-succeeded
Label updated successfully for node
Health event status has been updated documentID=6a39571599a8340849547ed2 evictionStatus=Succeeded

Expected Behavior

AlreadyQuarantined events should not be drained after the quarantine annotation is gone. A missing quarantine annotation for an AlreadyQuarantined event should be treated as a stale condition (no active quarantine), and node-drainer should skip or cancel the event instead of evaluating a drain.
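One possible shape for such a guard is sketched below. This is an illustration under stated assumptions, not node-drainer's actual code: the healthEvent type, the decideAction function, the action names, and the annotation key are all hypothetical placeholders. Only the AlreadyQuarantined status value comes from this report.

```go
package main

import "fmt"

// action is a hypothetical result type for the drain decision.
type action string

const (
	actionSkip          action = "Skip"          // stale event, no active quarantine
	actionEvaluateDrain action = "EvaluateDrain" // continue with normal drain evaluation
)

// quarantineAnnotationKey is a placeholder, NOT the real NVSentinel annotation key.
const quarantineAnnotationKey = "example.com/quarantine"

// healthEvent carries only the fields this sketch needs.
type healthEvent struct {
	ID              string
	NodeQuarantined string
}

// decideAction skips AlreadyQuarantined events when the node no longer carries
// a quarantine annotation. Events with any other status are left to the normal
// drain evaluation, so this sketch changes behaviour only for the stale case.
func decideAction(ev healthEvent, nodeAnnotations map[string]string) action {
	_, quarantined := nodeAnnotations[quarantineAnnotationKey]
	if ev.NodeQuarantined == "AlreadyQuarantined" && !quarantined {
		return actionSkip
	}
	return actionEvaluateDrain
}

func main() {
	stale := healthEvent{ID: "6a3954a53213d318af75f41e", NodeQuarantined: "AlreadyQuarantined"}

	// Annotation already removed by fault-quarantine: the event is stale.
	fmt.Println(decideAction(stale, map[string]string{}))

	// Annotation still present: the quarantine is active.
	fmt.Println(decideAction(stale, map[string]string{quarantineAnnotationKey: "{}"}))
}
```

Whether the stale event should be skipped silently or recorded as Cancelled is left open here, as it is in the expected behavior above.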

Component

Fault Management

Steps to Reproduce

  1. Quarantine a node from a fatal health event, for example SysLogsXIDError.
  2. While the node is quarantined, create additional fatal health events for the same node, for example GpuDcgmConnectivityFailure, so they are marked AlreadyQuarantined.
  3. Emit healthy events for the tracked failing checks.
  4. Allow fault-quarantine to remove the quarantine annotation and record UnQuarantined.
  5. Allow queued or requeued AlreadyQuarantined events to be processed by node-drainer after the annotation is removed.
  6. Observe that node-drainer logs "No quarantine annotation found for node", then marks the stale events as Succeeded and updates the node state label.

Environment

  • NVSentinel version: v1.9.0
  • Kubernetes version:
  • Deployment method:

Logs/Output

The key log sequence is:

No quarantine annotation found for node
All pods evicted successfully on node
Evaluated action for node action=UpdateStatus
Health event status has been updated evictionStatus=Succeeded

This happened for stale AlreadyQuarantined events after the node had already been unquarantined.
