[Bug]: remediation-failed node label can remain after the unsupported failing check has recovered #1416

Description

@XRFXLP

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Code of Conduct

  • I agree to follow NVSentinel's Code of Conduct

Bug Description

dgxc.nvidia.com/nvsentinel-state can remain stuck at remediation-failed after the failure that caused that label has recovered.

Scenario:

  1. Event A is fatal and has a supported remediation action, so fault-remediation creates a remediation CR.
  2. Event B is fatal but has an unsupported action, for example CONTACT_SUPPORT, so fault-remediation sets the node label to remediation-failed.
  3. Event C is a healthy recovery event for B. fault-quarantine removes B from the active quarantine annotation.
  4. Event A is still unresolved, so the node remains quarantined and no UnQuarantined event is emitted.
  5. Because cleanup only happens on UnQuarantined / Cancelled, the node label stays remediation-failed even though B is no longer active.
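The scenario above can be replayed with a small in-memory model. This is only a sketch of the behavior as described in this report, not NVSentinel's actual code: the `Node` type, `cleanupOnStatus`, and `replayScenario` are hypothetical names, and the only facts taken from the report are the label key, the `remediation-failed` value, and that cleanup runs on `UnQuarantined` / `Cancelled`.

```go
package main

// QuarantineStatus is a hypothetical stand-in for the status carried on
// events emitted by fault-quarantine.
type QuarantineStatus string

const (
	UnQuarantined QuarantineStatus = "UnQuarantined"
	Cancelled     QuarantineStatus = "Cancelled"
)

const stateLabel = "dgxc.nvidia.com/nvsentinel-state"

// Node models only the labels of a Kubernetes node (hypothetical type).
type Node struct {
	Labels map[string]string
}

// cleanupOnStatus mimics the cleanup described in step 5: the state label
// is removed only when the node is unquarantined or the quarantine is cancelled.
func cleanupOnStatus(n *Node, s QuarantineStatus) {
	switch s {
	case UnQuarantined, Cancelled:
		delete(n.Labels, stateLabel)
	}
}

// replayScenario walks through events A, B, and C and returns the final
// value of the state label.
func replayScenario() string {
	n := &Node{Labels: map[string]string{}}

	// Event A: fatal, supported action. It stays active for the whole run.
	active := map[string]bool{"A": true}

	// Event B: fatal, unsupported action, so the label is set to remediation-failed.
	active["B"] = true
	n.Labels[stateLabel] = "remediation-failed"

	// Event C: healthy recovery for B. B leaves the active set.
	delete(active, "B")

	// UnQuarantined is emitted only when nothing is active; A still is,
	// so cleanup never runs.
	if len(active) == 0 {
		cleanupOnStatus(n, UnQuarantined)
	}
	return n.Labels[stateLabel]
}
```

In this model `replayScenario()` returns `remediation-failed` even though only event A is still active, which matches the behavior reported here.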

Expected behavior:
The node state label should reflect the remaining active quarantine/remediation state. If the unsupported failure has recovered, remediation-failed should be cleared or recomputed from the still-active events.
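One possible shape for the "recompute" option is sketched below. It is an illustration under assumptions, not a proposed patch: `ActiveEvent`, `recomputeState`, and the `remediating` value are hypothetical and do not come from the NVSentinel codebase. Only `remediation-failed` is taken from this report.

```go
package main

// ActiveEvent is a hypothetical view of one entry in the active quarantine
// annotation.
type ActiveEvent struct {
	CheckName string
	Supported bool // true if the recommended action has a supported remediation
}

const (
	stateRemediationFailed = "remediation-failed"
	// stateRemediating is an assumed placeholder for whatever state a
	// supported, in-progress remediation maps to.
	stateRemediating = "remediating"
)

// recomputeState derives the label value from the events that are still
// active. The boolean is false when no events remain, meaning the label
// should be removed.
func recomputeState(active []ActiveEvent) (string, bool) {
	if len(active) == 0 {
		return "", false
	}
	for _, e := range active {
		if !e.Supported {
			// At least one active event has no supported remediation.
			return stateRemediationFailed, true
		}
	}
	return stateRemediating, true
}
```

With events A (supported) and B (unsupported) both active, this returns `remediation-failed`. Once B recovers and only A remains, it returns the assumed `remediating` value instead of leaving the stale label in place.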

Component

Fault Management

Steps to Reproduce

  1. Create a fatal health event for a node with a supported remediation action, for example COMPONENT_RESET, so the node is quarantined and fault-remediation creates a remediation CR.
  2. While the node is still quarantined, create another fatal health event for the same node with an unsupported remediation action, for example CONTACT_SUPPORT.
  3. Verify that fault-remediation sets:
     dgxc.nvidia.com/nvsentinel-state=remediation-failed
  4. Create a healthy recovery event for the unsupported check from step 2, using the same check name/entities and RecommendedAction_NONE.
  5. Verify that fault-quarantine removes the recovered check from the quarantine annotation but keeps the node quarantined because the first event is still unresolved.
  6. Check the node label.

Expected:
dgxc.nvidia.com/nvsentinel-state is recomputed from the remaining active failure and no longer reports remediation-failed from the recovered unsupported check.

Actual:
dgxc.nvidia.com/nvsentinel-state=remediation-failed remains on the node even though the unsupported failure is no longer active.

Environment

  • NVSentinel version: v1.9.0
  • Kubernetes version:
  • Deployment method:

Logs/Output

No response
