Prerequisites
Code of Conduct
Bug Description
dgxc.nvidia.com/nvsentinel-state can remain stuck at remediation-failed after the failure that caused that label has recovered.
Scenario:
- Event A is fatal and has a supported remediation action, so fault-remediation creates a remediation CR.
- Event B is fatal but has an unsupported action, for example CONTACT_SUPPORT, so fault-remediation sets the node label to remediation-failed.
- Event C is a healthy recovery event for B. fault-quarantine removes B from the active quarantine annotation.
- Event A is still unresolved, so the node remains quarantined and no UnQuarantined event is emitted.
- Because cleanup only happens on UnQuarantined / Cancelled, the node label stays remediation-failed even though B is no longer active.
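The scenario can be modeled with a small, self-contained sketch. This is not NVSentinel code: the Node class, the handler names, the check names, the SUPPORTED_ACTIONS set, and the "remediating" label value are hypothetical placeholders. Only the label key, the remediation-failed value, and the two action names come from this report.

```python
from dataclasses import dataclass, field

STATE_LABEL = "dgxc.nvidia.com/nvsentinel-state"

# Assumption: the set of actions fault-remediation can act on.
SUPPORTED_ACTIONS = {"COMPONENT_RESET"}


@dataclass
class Node:
    labels: dict = field(default_factory=dict)
    # Stand-in for the quarantine annotation: check name -> recommended action.
    active_checks: dict = field(default_factory=dict)


def on_fatal_event(node, check, action):
    """Record the failing check; an unsupported action marks the node remediation-failed."""
    node.active_checks[check] = action
    if action in SUPPORTED_ACTIONS:
        # "remediating" is a placeholder value, not a confirmed NVSentinel state.
        node.labels.setdefault(STATE_LABEL, "remediating")
    else:
        node.labels[STATE_LABEL] = "remediation-failed"


def on_healthy_event(node, check):
    """Drop the recovered check; clear the label only once nothing is active."""
    node.active_checks.pop(check, None)
    if not node.active_checks:  # models the UnQuarantined event
        node.labels.pop(STATE_LABEL, None)


node = Node()
on_fatal_event(node, "check-a", "COMPONENT_RESET")  # Event A
on_fatal_event(node, "check-b", "CONTACT_SUPPORT")  # Event B
on_healthy_event(node, "check-b")                   # Event C
print(node.labels[STATE_LABEL])  # prints "remediation-failed"
```

In this model the label is only touched when the last active check clears, which is why the value written for B outlives B.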
Expected behavior:
The node state label should reflect the remaining active quarantine/remediation state. If the unsupported failure has recovered, remediation-failed should be cleared or recomputed from the still-active events.
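One possible shape for the recompute, as a minimal sketch only: derive the label from whatever checks are still active every time the quarantine annotation changes, instead of waiting for UnQuarantined / Cancelled. The function names, the SUPPORTED_ACTIONS set, and the "remediating" value are assumptions made for illustration and do not reflect the actual fault-remediation implementation.

```python
STATE_LABEL = "dgxc.nvidia.com/nvsentinel-state"

# Assumption: the set of actions fault-remediation can act on.
SUPPORTED_ACTIONS = {"COMPONENT_RESET"}


def recompute_state(active_checks):
    """Return the label value for the remaining active checks, or None to clear it.

    active_checks maps a check name to its recommended action.
    """
    if not active_checks:
        return None
    if any(action not in SUPPORTED_ACTIONS for action in active_checks.values()):
        return "remediation-failed"
    # "remediating" is a placeholder value, not a confirmed NVSentinel state.
    return "remediating"


def apply_state(labels, active_checks):
    """Return a copy of labels with the state label recomputed."""
    updated = dict(labels)
    state = recompute_state(active_checks)
    if state is None:
        updated.pop(STATE_LABEL, None)
    else:
        updated[STATE_LABEL] = state
    return updated


# After B recovers, only A (a supported action) is still active.
labels = {STATE_LABEL: "remediation-failed"}
labels = apply_state(labels, {"check-a": "COMPONENT_RESET"})
print(labels[STATE_LABEL])  # prints "remediating"
```

Deriving the value from the active set rather than from the last event handled means a recovered unsupported check cannot leave a stale remediation-failed behind.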
Component
Fault Management
Steps to Reproduce
1. Create a fatal health event for a node with a supported remediation action, for example COMPONENT_RESET, so the node is quarantined and fault-remediation creates a remediation CR.
2. While the node is still quarantined, create another fatal health event for the same node with an unsupported remediation action, for example CONTACT_SUPPORT.
3. Verify that fault-remediation sets:
   dgxc.nvidia.com/nvsentinel-state=remediation-failed
4. Create a healthy recovery event for the unsupported check from step 2, using the same check name/entities and RecommendedAction_NONE.
5. Verify that fault-quarantine removes the recovered check from the quarantine annotation but keeps the node quarantined because the first event is still unresolved.
6. Check the node label.
Expected:
dgxc.nvidia.com/nvsentinel-state is recomputed from the remaining active failure and no longer reports remediation-failed from the recovered unsupported check.
Actual:
dgxc.nvidia.com/nvsentinel-state=remediation-failed remains on the node even though the unsupported failure is no longer active.
Environment
- NVSentinel version: v1.9.0
- Kubernetes version:
- Deployment method:
Logs/Output
No response