feat: fallback to a reboot if a GPU reset fails#1240
feat: fallback to a reboot if a GPU reset fails#1240natherz97 wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughGPU reset script now emits per-UUID syslog lines that include a success boolean and can be disabled via a spec flag. The syslog health-monitor parser and handler extract that success flag to emit conditional health events. The GPUReset CRD, controller, and tests were extended to propagate and validate the new behavior. Changes
Sequence DiagramsequenceDiagram
participant GPU as GPU Device
participant Script as GPU Reset Script
participant Syslog as Syslog
participant Monitor as Health Monitor
participant KubeAPI as Kubernetes API
rect rgba(76, 175, 80, 0.5)
Note over GPU,Script: Successful GPU reset path
GPU->>Script: perform reset
Script->>Script: set SYSLOG_SUCCESS = "true"
Script->>Syslog: logger "GPU reset executed: <UUID>, success: true"
Syslog->>Monitor: deliver log line
Monitor->>Monitor: parse UUID and success=true
Monitor->>KubeAPI: create healthy, non-fatal HealthEvent (RecommendedAction_NONE)
end
rect rgba(244, 67, 54, 0.5)
Note over GPU,Script: Failed GPU reset path
GPU->>Script: reset fails
Script->>Script: set SYSLOG_SUCCESS = "false"
Script->>Syslog: logger "GPU reset executed: <UUID>, success: false"
Syslog->>Monitor: deliver log line
Monitor->>Monitor: parse UUID and success=false
Monitor->>KubeAPI: create fatal/unhealthy HealthEvent (GPU_RESET_FAILURE, RecommendedAction_RESTART_VM)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
🌿 Fern Docs Preview: https://nvidia-preview-pull-request-1240.docs.buildwithfern.com/nvsentinel |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/uat/tests.sh (1)
445-489:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCritical: Variable name mismatches and missing
$expansions will cause the test to fail or behave incorrectly.Several issues in the boot ID verification logic:
- Line 447: Missing
$- logs literal string "initial_boot_id" instead of the variable value- Line 486: Uses undefined
$original_boot_idinstead of the declared$initial_boot_id- Line 487: Missing
$in error message forinitial_boot_idProposed fix
local initial_boot_id initial_boot_id=$(get_boot_id "$gpu_node") - log "Original boot ID: initial_boot_id" + log "Original boot ID: $initial_boot_id" local dcgm_pod ... # If the GPU reset job fails, we will write a syslog event which results in a new unhealthy health event with a # RESTART_VM recommended action. We will confirm the node bootID does not change during the test execution to # ensure that a GPU reset and not a reboot recovered the node. local final_boot_id final_boot_id=$(get_boot_id "$gpu_node") - if [[ "$final_boot_id" != "$original_boot_id" ]]; then - error "Boot ID changed during GPU reset. Original: initial_boot_id, Final: $final_boot_id" + if [[ "$final_boot_id" != "$initial_boot_id" ]]; then + error "Boot ID changed during GPU reset. Original: $initial_boot_id, Final: $final_boot_id" fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/uat/tests.sh` around lines 445 - 489, The boot ID check has variable name and expansion bugs: replace the literal and wrong names so actual values are compared and logged—use initial_boot_id (set by get_boot_id "$gpu_node") consistently (not original_boot_id), expand variables with $ when logging and in the error message, and ensure final_boot_id is compared to $initial_boot_id in the if condition and the error/log calls (references: initial_boot_id, final_boot_id, get_boot_id, error, log).
🧹 Nitpick comments (3)
tests/syslog_health_monitor_test.go (1)
114-135: 💤 Low valueDuplicate test case: identical to "Inject XID error requiring GPU reset" on lines 67-88.
This test case appears to be an exact duplicate of the first assess block (lines 67-88). Both inject the same XID 119 message and verify the same expected sequence patterns. This may be intentional to set up state before the "failed GPU reset" test, but consider adding a comment explaining the purpose or consolidating if unintentional.
If intentional, add clarifying comment
+ // Re-inject XID error to set up state for the failed GPU reset test scenario. + // The previous successful GPU reset cleared the condition, so we need a new error. feature.Assess("Inject XID error requiring GPU reset", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/syslog_health_monitor_test.go` around lines 114 - 135, This Assess block duplicates the earlier "Inject XID error requiring GPU reset" test; either remove the duplicate or explicitly document why it’s repeated: if it’s intentional to prime state for the subsequent "failed GPU reset" test, add a single-line comment before the feature.Assess call clarifying that purpose and reference helpers.InjectSyslogMessages and helpers.VerifyNodeConditionMatchesSequence so reviewers know it is a deliberate state-priming injection; otherwise consolidate by reusing the original Assess or a shared helper function to avoid duplicated assertions.health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
264-268: 💤 Low valueComment contains typos and grammar issues.
The comment block has readability issues: "pikcup" should be "pick up", and the grammar could be improved.
Suggested fix
/* -Flows could be from DCGM + syslog for initial event -the healthy event for reset always from syslog OR unhealthy event needing reboot always from syslog as well -we require that either DCGM will pikcup the reboot +Flows could be from DCGM + syslog for the initial event. +The healthy event for reset always comes from syslog. An unhealthy event needing reboot also comes from syslog. +We require that DCGM will pick up the reboot. */🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go` around lines 264 - 268, Update the block comment in xid_handler.go that currently reads "Flows could be from DCGM + syslog for initial event the healthy event for reset always from syslog OR unhealthy event needing reboot always from syslog as well we require that either DCGM will pikcup the reboot" by fixing typos and improving grammar (replace "pikcup" with "pick up" and rephrase sentences for clarity), e.g., explain flows: initial events may come from DCGM or syslog, healthy/reset and unhealthy/reboot events originate from syslog, and either DCGM or syslog must pick up the reboot; apply this corrected wording to the existing comment block in xid_handler.go.janitor/api/v1alpha1/gpureset_types.go (1)
117-122: 💤 Low valueMinor: Comment slightly misrepresents behavior.
The comment states the syslog entry is written "upon successful reset," but the implementation in
gpu_reset.shwrites to syslog for both successful and failed resets (withsuccess: true|false). Consider updating the comment for accuracy.Suggested documentation fix
- // WriteSyslogEvent controls whether the GPU reset job writes a syslog entry - // upon successful reset, which triggers the syslog-health-monitor. + // WriteSyslogEvent controls whether the GPU reset job writes a syslog entry + // after reset completion (success or failure), which triggers the syslog-health-monitor.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@janitor/api/v1alpha1/gpureset_types.go` around lines 117 - 122, Update the comment for the WriteSyslogEvent field to accurately reflect gpu_reset.sh behavior: state that when enabled the job writes a syslog entry for reset attempts regardless of outcome (including success: true or false), not only on successful resets. Reference the WriteSyslogEvent field and gpu_reset.sh in the comment so readers know the behavior is driven by that script.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@gpu-reset/gpu_reset.sh`:
- Line 172: The if-condition referencing WRITE_SYSLOG_EVENT will fail under set
-u if the variable is unset; change the check in the if that currently reads the
WRITE_SYSLOG_EVENT variable so it uses a safe default (e.g., parameter expansion
like ${WRITE_SYSLOG_EVENT:-false}) to avoid unbound variable errors when
evaluating the condition; update the if that guards syslog writing to use the
safe expansion so the script continues even if WRITE_SYSLOG_EVENT is not
exported.
In `@health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go`:
- Around line 285-306: The GPU reset HealthEvent construction is missing the
ProcessingStrategy field; update the block that builds the event (the event :=
&pb.HealthEvent{...} and the success/else branches in xid_handler.go) to set
ProcessingStrategy: xidHandler.processingStrategy so both success and failure
GPU reset paths include ProcessingStrategy (consistent with
createHealthEventFromResponse).
---
Outside diff comments:
In `@tests/uat/tests.sh`:
- Around line 445-489: The boot ID check has variable name and expansion bugs:
replace the literal and wrong names so actual values are compared and logged—use
initial_boot_id (set by get_boot_id "$gpu_node") consistently (not
original_boot_id), expand variables with $ when logging and in the error
message, and ensure final_boot_id is compared to $initial_boot_id in the if
condition and the error/log calls (references: initial_boot_id, final_boot_id,
get_boot_id, error, log).
---
Nitpick comments:
In `@health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go`:
- Around line 264-268: Update the block comment in xid_handler.go that currently
reads "Flows could be from DCGM + syslog for initial event the healthy event for
reset always from syslog OR unhealthy event needing reboot always from syslog as
well we require that either DCGM will pikcup the reboot" by fixing typos and
improving grammar (replace "pikcup" with "pick up" and rephrase sentences for
clarity), e.g., explain flows: initial events may come from DCGM or syslog,
healthy/reset and unhealthy/reboot events originate from syslog, and either DCGM
or syslog must pick up the reboot; apply this corrected wording to the existing
comment block in xid_handler.go.
In `@janitor/api/v1alpha1/gpureset_types.go`:
- Around line 117-122: Update the comment for the WriteSyslogEvent field to
accurately reflect gpu_reset.sh behavior: state that when enabled the job writes
a syslog entry for reset attempts regardless of outcome (including success: true
or false), not only on successful resets. Reference the WriteSyslogEvent field
and gpu_reset.sh in the comment so readers know the behavior is driven by that
script.
In `@tests/syslog_health_monitor_test.go`:
- Around line 114-135: This Assess block duplicates the earlier "Inject XID
error requiring GPU reset" test; either remove the duplicate or explicitly
document why it’s repeated: if it’s intentional to prime state for the
subsequent "failed GPU reset" test, add a single-line comment before the
feature.Assess call clarifying that purpose and reference
helpers.InjectSyslogMessages and helpers.VerifyNodeConditionMatchesSequence so
reviewers know it is a deliberate state-priming injection; otherwise consolidate
by reusing the original Assess or a shared helper function to avoid duplicated
assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fcf83a02-5280-4148-8ae1-107db633c9c3
📒 Files selected for processing (8)
gpu-reset/gpu_reset.shhealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gojanitor/api/v1alpha1/gpureset_types.gojanitor/pkg/controller/gpureset_controller.gotests/syslog_health_monitor_test.gotests/uat/tests.sh
9fd2612 to
9393a97
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@janitor/api/v1alpha1/gpureset_types.go`:
- Around line 118-122: The field comment for WriteSyslogEvent is out of date:
update the doc comment above the WriteSyslogEvent *bool
`json:"writeSyslogEvent,omitempty"` line to state that the GPU reset job emits
result-aware syslog entries (including success: true/false) rather than only
writing on successful resets; keep the kubebuilder tags and optional annotation
unchanged and ensure the new text clearly states it controls whether a
result-aware syslog entry (indicating success or failure) is written.
In `@tests/uat/tests.sh`:
- Around line 445-447: The test sets initial_boot_id via
initial_boot_id=$(get_boot_id "$gpu_node") but later uses the unset
original_boot_id and logs the literal token; change all references to
original_boot_id to initial_boot_id (including the comparison/assertion and the
log call), and make sure you expand the variable in log and comparisons as
"$initial_boot_id" so the script remains safe under set -u; apply the same fix
to the other occurrence mentioned (the second block around the later
comparison).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b9e5729a-9574-40d7-a5f2-56bb5f310e6c
📒 Files selected for processing (10)
distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yamlgpu-reset/gpu_reset.shhealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gojanitor/api/v1alpha1/gpureset_types.gojanitor/api/v1alpha1/zz_generated.deepcopy.gojanitor/pkg/controller/gpureset_controller.gotests/syslog_health_monitor_test.gotests/uat/tests.sh
✅ Files skipped from review due to trivial changes (2)
- health-monitors/syslog-health-monitor/pkg/xid/types.go
- janitor/api/v1alpha1/zz_generated.deepcopy.go
🚧 Files skipped from review as they are similar to previous changes (3)
- janitor/pkg/controller/gpureset_controller.go
- tests/syslog_health_monitor_test.go
- health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
9393a97 to
5b51a2e
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/syslog_health_monitor_test.go`:
- Around line 169-176: The test constructs a contradictory healthy event by
calling WithHealthy(true) while also setting WithFatal(true); update the
simulated healthy SysLogsXIDError event to use a non-fatal flag (call
WithFatal(false)) so the event represents a true healthy/reset state. Locate the
creation chain starting with helpers.NewHealthEvent(...) and change the
WithFatal invocation on that Healthy event to false, leaving WithHealthy(true)
and the other fields unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ff416b69-879e-41a3-956c-6ce063912b3d
📒 Files selected for processing (13)
distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yamlgpu-reset/gpu_reset.shhealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gojanitor/api/v1alpha1/gpureset_types.gojanitor/api/v1alpha1/gpureset_types_test.gojanitor/api/v1alpha1/zz_generated.deepcopy.gojanitor/pkg/controller/gpureset_controller.gojanitor/pkg/controller/gpureset_controller_test.gotests/gpu_reset_test.gotests/syslog_health_monitor_test.gotests/uat/tests.sh
✅ Files skipped from review due to trivial changes (1)
- health-monitors/syslog-health-monitor/pkg/xid/types.go
🚧 Files skipped from review as they are similar to previous changes (4)
- janitor/api/v1alpha1/zz_generated.deepcopy.go
- distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yaml
- janitor/pkg/controller/gpureset_controller.go
- gpu-reset/gpu_reset.sh
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
5b51a2e to
a866942
Compare
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
a866942 to
029cd74
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go`:
- Around line 286-310: Update the inline example payload in xid_handler.go so
the healthevent.errorcode value matches the current emitted failure token:
replace the numeric '95' with the string "GPU_RESET_FAILURE" in the example
object (look for the healthevent block associated with checkname
'SysLogsXIDError'); also normalize any non-ASCII quote characters in that
example (e.g., around message, GPU_UUID, nodename) to standard ASCII quotes so
the sample is valid and consistent with the code that emits errorcode.
In `@janitor/pkg/controller/gpureset_controller_test.go`:
- Around line 760-768: The DeferCleanup currently deletes entryNode and the
GPUReset (entryResetName) but doesn't remove the per-entry Job, leaving syslog-*
jobs behind; update the cleanup in the table test to also delete the Job created
per entry (use the same entryResetName or the actual Job name pattern used when
creating jobs) by calling k8sClient.Delete(ctx, jobObj) and ignore NotFound
errors similar to the existing deletes (refer to DeferCleanup, entryNode,
entryResetName and k8sClient.Delete to locate where to add the deletion).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e5f47b03-4b0e-4721-857d-84510c2db3fe
📒 Files selected for processing (13)
distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yamlgpu-reset/gpu_reset.shhealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gojanitor/api/v1alpha1/gpureset_types.gojanitor/api/v1alpha1/gpureset_types_test.gojanitor/api/v1alpha1/zz_generated.deepcopy.gojanitor/pkg/controller/gpureset_controller.gojanitor/pkg/controller/gpureset_controller_test.gotests/gpu_reset_test.gotests/syslog_health_monitor_test.gotests/uat/tests.sh
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/gpu_reset_test.go
- tests/syslog_health_monitor_test.go
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
029cd74 to
981d63c
Compare
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
Signed-off-by: Nathan Herz <nherz@nvidia.com>
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
Summary
This PR adds logic to fallback to a reboot if a GPU reset fails.
1. Create a RESTART_VM event when GPU resets fail: currently, a healthy event is emitted by the syslog-health-monitor when it detects that a GPU was successfully reset by consuming a syslog event written by the GPU reset job. Now, we will also start emitting an unhealthy event with a RESTART_VM recommended action by the syslog-health-monitor when it detects that a GPU failed to be reset. The resulting reboot will clear both events for the initial XID needing a reset and subsequent failed reset event needing a reboot and the node will be uncordoned.
2. Add a new writeSyslogEvent option to the Janitor config: this gpuResetController config property defaults to true if not explicitly set. If true, the GPU reset job will write syslog events for successful and failed resets that will be consumed by the syslog-health-monitor. If false, the GPU reset will not write syslog events which will prevent automatic uncordoning for successful resets and reboot fallbacks for failed resets. Opting out of the syslog writing behavior might be desirable for GPUResets triggered outside of NVSentinel so that debugging failed resets can occur without reboots being triggered.
Overview for the reboot fallback
A COMPONENT_RESET remediation will result in the following log line being emitted to syslog and consumed by the syslog-health-monitor (regardless of which health-monitor created the original COMPONENT_RESET unhealthy event):
Example event:
Notes:
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Behavior Changes / Health
Tests