feat: fallback to a reboot if a GPU reset fails by natherz97 · Pull Request #1240 · NVIDIA/NVSentinel

natherz97 · 2026-04-30T18:28:33Z

Summary

This PR adds logic to fall back to a reboot if a GPU reset fails.

1. Create a RESTART_VM event when GPU resets fail: currently, the syslog-health-monitor emits a healthy event when it detects that a GPU was successfully reset, by consuming a syslog event written by the GPU reset job. Now, the syslog-health-monitor will also emit an unhealthy event with a RESTART_VM recommended action when it detects that a GPU reset failed. The resulting reboot will clear both events (the initial XID event needing a reset and the subsequent failed-reset event needing a reboot), and the node will be uncordoned.
2. Add a new writeSyslogEvent option to the Janitor config: this gpuResetController config property defaults to true if not explicitly set. If true, the GPU reset job will write syslog events for successful and failed resets, which will be consumed by the syslog-health-monitor. If false, the GPU reset job will not write syslog events, which prevents both automatic uncordoning for successful resets and reboot fallbacks for failed resets. Opting out of the syslog writing behavior might be desirable for GPUResets triggered outside of NVSentinel, so that failed resets can be debugged without triggering reboots.

  apiVersion: janitor.dgxc.nvidia.com/v1alpha1
  kind: GPUReset
  metadata:
    name: maintenance-my-node-abc123
    annotations:
      nvsentinel.nvidia.com/trace-id: "some-trace-id"
      nvsentinel.nvidia.com/span-id: "some-span-id"
  spec:
    nodeName: my-node
    selector:
      uuids:
        - GPU-455d8f70-2051-db6c-0430-ffc457bff834
    writeSyslogEvent: false

Overview for the reboot fallback

A COMPONENT_RESET remediation will result in the following log line being emitted to syslog and consumed by the syslog-health-monitor (regardless of which health-monitor created the original COMPONENT_RESET unhealthy event):

GPU reset executed: GPU-455d8f70-2051-db6c-0430-ffc457bff834, success: <true/false>
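
To make the log format concrete, here is a minimal shell sketch of extracting the GPU UUID and the success flag from such a line. It is not the monitor's implementation (the PR does this in Go under pkg/xid); the function name `parse_gpu_reset_line` and the exact pattern are assumptions for illustration only.

```shell
# Hypothetical helper: parse a "GPU reset executed" syslog line.
# Prints "<uuid> <success>" and returns 0 on a match; returns 1 otherwise.
parse_gpu_reset_line() {
    # Assumed pattern; the real regex in the syslog-health-monitor may differ.
    parsed=$(printf '%s\n' "$1" |
        sed -n -E 's/^.*GPU reset executed: (GPU-[0-9a-fA-F-]+), success: (true|false).*$/\1 \2/p')
    [ -n "$parsed" ] || return 1
    printf '%s\n' "$parsed"
}
```

Lines that do not carry both the UUID and an explicit `true` or `false` are rejected rather than guessed at, so a malformed line cannot be mistaken for a successful reset.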

Successful resets: these result in a healthy event that can clear XID errors from the SysLogsXIDError check for matching impacted entities (GPU UUID and PCI ID). Note that if a different health-monitor emitted the original event, it would also need to check syslog or implement a different detection mechanism for GPU resets (for example, the gpu-health-monitor relies on the reset fixing the underlying DCGM watch, or on the nvidia-dcgm pod being restarted as part of the GPU reset workflow).
Failed resets: these result in a new unhealthy event from the SysLogsXIDError check with the same impacted entities (GPU UUID and PCI ID) as the original health event and a RESTART_VM recommended action. This flow is triggered regardless of which health-monitor emitted the original COMPONENT_RESET event, and serves as a fallback in which we reboot the node if a GPU reset fails. The node will be cordoned due to 2 events: the original XID event and this subsequent GPU-reset-failed event. The syslog-health-monitor has logic to clear all unhealthy events for each of its checks in response to a reboot by sending a healthy event with empty impacted entities. As a result, we should expect the node to be uncordoned after the reboot completes.

Example event:

	{
	  createdAt: ISODate('2026-04-30T10:46:50.263Z'),
	  healthevent: {
	    agent: 'syslog-health-monitor',
	    componentclass: 'GPU',
	    checkname: 'SysLogsXIDError',
	    isfatal: true,
	    ishealthy: false,
	    message: 'GPU reset failed, proceeding with a node reboot',
	    recommendedaction: RESTART_VM,
	    errorcode: [
	      'GPU_RESET_FAILURE'
	    ],
	    entitiesimpacted: [
	      {
	        entitytype: 'PCI',
	        entityvalue: '000b:00:00'
	      },
	      {
	        entitytype: 'GPU_UUID',
	        entityvalue: 'GPU-123'
	      }
	    ],
	    nodename: 'node-123',
	  }
	}
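
The two outcomes described above can be summarized as a small mapping from the parsed success flag to the event fields. The sketch below is illustrative only: the function name `reset_outcome_fields` and the key=value output format are assumptions, while the field values follow the descriptions in this PR (healthy and non-fatal with no recommended action on success; unhealthy and fatal with `GPU_RESET_FAILURE` and `RESTART_VM` on failure).

```shell
# Hypothetical mapping from the reset outcome to health event fields.
# Field names mirror the example event; the output format is illustrative.
reset_outcome_fields() {
    case "$1" in
        true)
            printf 'ishealthy=true isfatal=false recommendedaction=NONE\n'
            ;;
        false)
            printf 'ishealthy=false isfatal=true recommendedaction=RESTART_VM errorcode=GPU_RESET_FAILURE\n'
            ;;
        *)
            # Unknown outcome: emit nothing so the caller can skip the line.
            return 1
            ;;
    esac
}
```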

Notes:

A follow-up unhealthy event is required to trigger a full drain in node-drainer because the original COMPONENT_RESET event would only have done a partial drain against pods using the GPU needing reset.
If a burst of XIDs occurs with both COMPONENT_RESET and RESTART_VM recommended actions and the GPU reset fails, this logic is not necessary because a reboot would already be triggered. We will rely on the existing fault-remediation de-duplication logic to only process one of the reboots.
This logic is meant as a fallback when there are one or more XIDs with COMPONENT_RESET recommended actions which all result in failed GPU resets. When this logic is triggered, the following steps should be followed:
- Fix the underlying cause for the GPU reset failure.
- If the reset failure is isolated to a given XID, override the recommended action from COMPONENT_RESET to RESTART_VM.
- If the GPU reset failure is not unique to a specific XID error, disable the GPU reset feature to always reboot.

Type of Change

Component(s) Affected

[ ] Core Services
Documentation/CI
Fault Management
[x] Health Monitors
Janitor
Other: ____________

Testing

Tests pass locally
Manual testing completed
No breaking changes (or documented)

Checklist

Self-review completed
Documentation updated (if needed)
Ready for review

Summary by CodeRabbit

New Features
- GPU reset syslog entries now include a success/failure flag and can be disabled (default: enabled).
- GPU reset job exposes a configurable option to control syslog emission.
Behavior Changes / Health
- Health monitor distinguishes successful vs failed GPU resets and emits appropriate health events and recommended actions.
Tests
- Expanded unit, integration, and UAT coverage for success/failure flows and boot-ID verification.

coderabbitai · 2026-04-30T18:28:48Z

Walkthrough

GPU reset script now emits per-UUID syslog lines that include a success boolean and can be disabled via a spec flag. The syslog health-monitor parser and handler extract that success flag to emit conditional health events. The GPUReset CRD, controller, and tests were extended to propagate and validate the new behavior.

Changes

Cohort / File(s)	Summary
GPU Reset Script `gpu-reset/gpu_reset.sh`	Compute `SYSLOG_SUCCESS` from exit status and conditionally write per-UUID syslog messages when `WRITE_SYSLOG_EVENT` is `"true"`; syslog text now includes `, success: <true/false>`.
Syslog Health Monitor Parser & Handler `health-monitors/syslog-health-monitor/pkg/xid/types.go`, `health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go`	Regex extended to capture `success`; parsing returns `(uuid, success)` and health-event creation now emits healthy/non-fatal events on success and fatal/unhealthy events (ErrorCode `GPU_RESET_FAILURE`, `RESTART_VM`) on failure.
Syslog Health Monitor Tests `health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go`, `tests/syslog_health_monitor_test.go`	Tests updated to assert parsing of `success` (true/false) and validate resulting health event contents, error codes, messages, recommended actions, and node-condition sequences.
API, CRD & DeepCopy `janitor/api/v1alpha1/gpureset_types.go`, `janitor/api/v1alpha1/zz_generated.deepcopy.go`, `distros/.../crds/janitor.dgxc.nvidia.com_gpuresets.yaml`	Added `spec.writeSyslogEvent *bool` with kubebuilder default `true` to GPUReset CRD; deepcopy updated to deep-copy the pointer; CRD printer descriptions tweaked.
Controller & Controller Tests `janitor/pkg/controller/gpureset_controller.go`, `janitor/pkg/controller/gpureset_controller_test.go`	Controller injects `WRITE_SYSLOG_EVENT` env var into GPU reset Job based on `spec.writeSyslogEvent` (defaults to `"true"`); tests added to verify env var values for nil/true/false cases.
Integration / UAT Tests & Utility `tests/uat/tests.sh`, `tests/gpu_reset_test.go`	UAT test snapshots `bootID` and fails if node rebooted during reset; integration tests assert `spec.writeSyslogEvent` defaults to `true` and include additional assertions and scenario flows for GPU reset success/failure.
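
As a rough illustration of the GPU Reset Script row above, the reporting step might look like the following sketch. Only `SYSLOG_SUCCESS` and `WRITE_SYSLOG_EVENT` come from the summary; `report_reset_result` and `emit_line` are hypothetical names, and `emit_line` prints to stdout where a real script would presumably call `logger`. Treating an unset `WRITE_SYSLOG_EVENT` as disabled is also an assumption, since the controller is described as always injecting the variable.

```shell
# Stand-in for the real syslog call (for example `logger`), so this sketch
# has no side effects when run.
emit_line() {
    printf '%s\n' "$1"
}

# report_reset_result <reset_exit_status> <uuid>...
report_reset_result() {
    reset_status="$1"
    shift
    # Derive the success flag from the exit status of the reset command.
    if [ "$reset_status" -eq 0 ]; then
        SYSLOG_SUCCESS="true"
    else
        SYSLOG_SUCCESS="false"
    fi
    # Assumption: an unset or empty variable disables syslog writing.
    if [ "${WRITE_SYSLOG_EVENT:-false}" != "true" ]; then
        return 0
    fi
    # One line per GPU UUID, in the format consumed by the health monitor.
    for uuid in "$@"; do
        emit_line "GPU reset executed: ${uuid}, success: ${SYSLOG_SUCCESS}"
    done
}
```

For example, `WRITE_SYSLOG_EVENT=true; report_reset_result 1 GPU-aaa GPU-bbb` emits one `success: false` line per UUID.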

Sequence Diagram

sequenceDiagram
    participant GPU as GPU Device
    participant Script as GPU Reset Script
    participant Syslog as Syslog
    participant Monitor as Health Monitor
    participant KubeAPI as Kubernetes API

    rect rgba(76, 175, 80, 0.5)
    Note over GPU,Script: Successful GPU reset path
    GPU->>Script: perform reset
    Script->>Script: set SYSLOG_SUCCESS = "true"
    Script->>Syslog: logger "GPU reset executed: <UUID>, success: true"
    Syslog->>Monitor: deliver log line
    Monitor->>Monitor: parse UUID and success=true
    Monitor->>KubeAPI: create healthy, non-fatal HealthEvent (RecommendedAction_NONE)
    end

    rect rgba(244, 67, 54, 0.5)
    Note over GPU,Script: Failed GPU reset path
    GPU->>Script: reset fails
    Script->>Script: set SYSLOG_SUCCESS = "false"
    Script->>Syslog: logger "GPU reset executed: <UUID>, success: false"
    Syslog->>Monitor: deliver log line
    Monitor->>Monitor: parse UUID and success=false
    Monitor->>KubeAPI: create fatal/unhealthy HealthEvent (GPU_RESET_FAILURE, RecommendedAction_RESTART_VM)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The PR title 'feat: fallback to a reboot if a GPU reset fails' accurately summarizes the main feature change across all modified files, which implement fallback reboot logic triggered by GPU reset failures.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

github-actions · 2026-04-30T18:33:03Z

🌿 Fern Docs Preview: https://nvidia-preview-pull-request-1240.docs.buildwithfern.com/nvsentinel

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/uat/tests.sh (1)

445-489: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Variable name mismatches and missing $ expansions will cause the test to fail or behave incorrectly.

Several issues in the boot ID verification logic:

Line 447: Missing $ - logs literal string "initial_boot_id" instead of the variable value
Line 486: Uses undefined $original_boot_id instead of the declared $initial_boot_id
Line 487: Missing $ in error message for initial_boot_id

Proposed fix

     local initial_boot_id
     initial_boot_id=$(get_boot_id "$gpu_node")
-    log "Original boot ID: initial_boot_id"
+    log "Original boot ID: $initial_boot_id"

     local dcgm_pod
     ...
     # If the GPU reset job fails, we will write a syslog event which results in a new unhealthy health event with a
     # RESTART_VM recommended action. We will confirm the node bootID does not change during the test execution to
     # ensure that a GPU reset and not a reboot recovered the node.
     local final_boot_id
     final_boot_id=$(get_boot_id "$gpu_node")
-    if [[ "$final_boot_id" != "$original_boot_id" ]]; then
-        error "Boot ID changed during GPU reset. Original: initial_boot_id, Final: $final_boot_id"
+    if [[ "$final_boot_id" != "$initial_boot_id" ]]; then
+        error "Boot ID changed during GPU reset. Original: $initial_boot_id, Final: $final_boot_id"
     fi
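
For reference, the corrected comparison can be isolated into a small standalone helper. This is a sketch under assumptions: `boot_id_unchanged` is a hypothetical name, and it writes to stderr instead of using the test suite's own `error` and `log` helpers, whose behavior is not shown here.

```shell
# Hypothetical helper: succeed only if the boot ID is the same before and
# after the GPU reset, meaning the node was not rebooted in between.
boot_id_unchanged() {
    initial_boot_id="$1"
    final_boot_id="$2"
    if [ "$final_boot_id" != "$initial_boot_id" ]; then
        printf 'Boot ID changed during GPU reset. Original: %s, Final: %s\n' \
            "$initial_boot_id" "$final_boot_id" >&2
        return 1
    fi
    return 0
}
```

How `get_boot_id` obtains the value is not shown in this diff. One plausible source is the node's `.status.nodeInfo.bootID` field, for example via `kubectl get node "$gpu_node" -o jsonpath='{.status.nodeInfo.bootID}'`, but that is an assumption about the helper.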

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` around lines 445 - 489, The boot ID check has variable
name and expansion bugs: replace the literal and wrong names so actual values
are compared and logged—use initial_boot_id (set by get_boot_id "$gpu_node")
consistently (not original_boot_id), expand variables with $ when logging and in
the error message, and ensure final_boot_id is compared to $initial_boot_id in
the if condition and the error/log calls (references: initial_boot_id,
final_boot_id, get_boot_id, error, log).

🧹 Nitpick comments (3)

tests/syslog_health_monitor_test.go (1)

114-135: 💤 Low value

Duplicate test case: identical to "Inject XID error requiring GPU reset" on lines 67-88.

This test case appears to be an exact duplicate of the first assess block (lines 67-88). Both inject the same XID 119 message and verify the same expected sequence patterns. This may be intentional to set up state before the "failed GPU reset" test, but consider adding a comment explaining the purpose or consolidating if unintentional.
If intentional, add clarifying comment
+	// Re-inject XID error to set up state for the failed GPU reset test scenario.
+	// The previous successful GPU reset cleared the condition, so we need a new error.
 	feature.Assess("Inject XID error requiring GPU reset", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/syslog_health_monitor_test.go` around lines 114 - 135, This Assess
block duplicates the earlier "Inject XID error requiring GPU reset" test; either
remove the duplicate or explicitly document why it’s repeated: if it’s
intentional to prime state for the subsequent "failed GPU reset" test, add a
single-line comment before the feature.Assess call clarifying that purpose and
reference helpers.InjectSyslogMessages and
helpers.VerifyNodeConditionMatchesSequence so reviewers know it is a deliberate
state-priming injection; otherwise consolidate by reusing the original Assess or
a shared helper function to avoid duplicated assertions.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)

264-268: 💤 Low value

Comment contains typos and grammar issues.

The comment block has readability issues: "pikcup" should be "pick up", and the grammar could be improved.

Suggested fix

 /*
-Flows could be from DCGM + syslog for initial event
-the healthy event for reset always from syslog OR unhealthy event needing reboot always from syslog as well
-we require that either DCGM will pikcup the reboot
+Flows could be from DCGM + syslog for the initial event.
+The healthy event for reset always comes from syslog. An unhealthy event needing reboot also comes from syslog.
+We require that DCGM will pick up the reboot.
 */

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go` around lines
264 - 268, Update the block comment in xid_handler.go that currently reads
"Flows could be from DCGM + syslog for initial event the healthy event for reset
always from syslog OR unhealthy event needing reboot always from syslog as well
we require that either DCGM will pikcup the reboot" by fixing typos and
improving grammar (replace "pikcup" with "pick up" and rephrase sentences for
clarity), e.g., explain flows: initial events may come from DCGM or syslog,
healthy/reset and unhealthy/reboot events originate from syslog, and either DCGM
or syslog must pick up the reboot; apply this corrected wording to the existing
comment block in xid_handler.go.

janitor/api/v1alpha1/gpureset_types.go (1)

117-122: 💤 Low value

Minor: Comment slightly misrepresents behavior.

The comment states the syslog entry is written "upon successful reset," but the implementation in gpu_reset.sh writes to syslog for both successful and failed resets (with success: true|false). Consider updating the comment for accuracy.

Suggested documentation fix

-	// WriteSyslogEvent controls whether the GPU reset job writes a syslog entry
-	// upon successful reset, which triggers the syslog-health-monitor.
+	// WriteSyslogEvent controls whether the GPU reset job writes a syslog entry
+	// after reset completion (success or failure), which triggers the syslog-health-monitor.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@janitor/api/v1alpha1/gpureset_types.go` around lines 117 - 122, Update the
comment for the WriteSyslogEvent field to accurately reflect gpu_reset.sh
behavior: state that when enabled the job writes a syslog entry for reset
attempts regardless of outcome (including success: true or false), not only on
successful resets. Reference the WriteSyslogEvent field and gpu_reset.sh in the
comment so readers know the behavior is driven by that script.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@gpu-reset/gpu_reset.sh`:
- Line 172: The if-condition referencing WRITE_SYSLOG_EVENT will fail under set
-u if the variable is unset; change the check in the if that currently reads the
WRITE_SYSLOG_EVENT variable so it uses a safe default (e.g., parameter expansion
like ${WRITE_SYSLOG_EVENT:-false}) to avoid unbound variable errors when
evaluating the condition; update the if that guards syslog writing to use the
safe expansion so the script continues even if WRITE_SYSLOG_EVENT is not
exported.

In `@health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go`:
- Around line 285-306: The GPU reset HealthEvent construction is missing the
ProcessingStrategy field; update the block that builds the event (the event :=
&pb.HealthEvent{...} and the success/else branches in xid_handler.go) to set
ProcessingStrategy: xidHandler.processingStrategy so both success and failure
GPU reset paths include ProcessingStrategy (consistent with
createHealthEventFromResponse).
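
The `set -u` concern raised for gpu_reset.sh can be reproduced in isolation. The sketch below contrasts a bare expansion with the default-value form suggested in the review; both function names are hypothetical, and each body runs in a subshell so that `set -u` does not leak into the caller.

```shell
# Unsafe under `set -u`: if WRITE_SYSLOG_EVENT is unset, the expansion itself
# is an error and the subshell aborts with an "unbound variable" message.
syslog_enabled_unsafe() (
    set -u
    [ "$WRITE_SYSLOG_EVENT" = "true" ]
)

# Safe: ${VAR:-false} substitutes "false" when the variable is unset or empty,
# so the condition is simply false instead of aborting.
syslog_enabled_safe() (
    set -u
    [ "${WRITE_SYSLOG_EVENT:-false}" = "true" ]
)
```

Both variants return a non-zero status when the variable is unset, but only the unsafe one does so by aborting, which in a script running with `set -u` at top level would stop the script before any cleanup or reporting runs.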

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fcf83a02-5280-4148-8ae1-107db633c9c3

📥 Commits

Reviewing files that changed from the base of the PR and between 7209217 and 9fd2612.

📒 Files selected for processing (8)

gpu-reset/gpu_reset.sh
health-monitors/syslog-health-monitor/pkg/xid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
janitor/api/v1alpha1/gpureset_types.go
janitor/pkg/controller/gpureset_controller.go
tests/syslog_health_monitor_test.go
tests/uat/tests.sh

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@janitor/api/v1alpha1/gpureset_types.go`:
- Around line 118-122: The field comment for WriteSyslogEvent is out of date:
update the doc comment above the WriteSyslogEvent *bool
`json:"writeSyslogEvent,omitempty"` line to state that the GPU reset job emits
result-aware syslog entries (including success: true/false) rather than only
writing on successful resets; keep the kubebuilder tags and optional annotation
unchanged and ensure the new text clearly states it controls whether a
result-aware syslog entry (indicating success or failure) is written.

In `@tests/uat/tests.sh`:
- Around line 445-447: The test sets initial_boot_id via
initial_boot_id=$(get_boot_id "$gpu_node") but later uses the unset
original_boot_id and logs the literal token; change all references to
original_boot_id to initial_boot_id (including the comparison/assertion and the
log call), and make sure you expand the variable in log and comparisons as
"$initial_boot_id" so the script remains safe under set -u; apply the same fix
to the other occurrence mentioned (the second block around the later
comparison).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9e5729a-9574-40d7-a5f2-56bb5f310e6c

📥 Commits

Reviewing files that changed from the base of the PR and between 9fd2612 and 9393a97.

📒 Files selected for processing (10)

distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yaml
gpu-reset/gpu_reset.sh
health-monitors/syslog-health-monitor/pkg/xid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
janitor/api/v1alpha1/gpureset_types.go
janitor/api/v1alpha1/zz_generated.deepcopy.go
janitor/pkg/controller/gpureset_controller.go
tests/syslog_health_monitor_test.go
tests/uat/tests.sh

✅ Files skipped from review due to trivial changes (2)

health-monitors/syslog-health-monitor/pkg/xid/types.go
janitor/api/v1alpha1/zz_generated.deepcopy.go

🚧 Files skipped from review as they are similar to previous changes (3)

janitor/pkg/controller/gpureset_controller.go
tests/syslog_health_monitor_test.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/syslog_health_monitor_test.go`:
- Around line 169-176: The test constructs a contradictory healthy event by
calling WithHealthy(true) while also setting WithFatal(true); update the
simulated healthy SysLogsXIDError event to use a non-fatal flag (call
WithFatal(false)) so the event represents a true healthy/reset state. Locate the
creation chain starting with helpers.NewHealthEvent(...) and change the
WithFatal invocation on that Healthy event to false, leaving WithHealthy(true)
and the other fields unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ff416b69-879e-41a3-956c-6ce063912b3d

📥 Commits

Reviewing files that changed from the base of the PR and between 9393a97 and 5b51a2e.

📒 Files selected for processing (13)

distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yaml
gpu-reset/gpu_reset.sh
health-monitors/syslog-health-monitor/pkg/xid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
janitor/api/v1alpha1/gpureset_types.go
janitor/api/v1alpha1/gpureset_types_test.go
janitor/api/v1alpha1/zz_generated.deepcopy.go
janitor/pkg/controller/gpureset_controller.go
janitor/pkg/controller/gpureset_controller_test.go
tests/gpu_reset_test.go
tests/syslog_health_monitor_test.go
tests/uat/tests.sh

✅ Files skipped from review due to trivial changes (1)

health-monitors/syslog-health-monitor/pkg/xid/types.go

🚧 Files skipped from review as they are similar to previous changes (4)

janitor/api/v1alpha1/zz_generated.deepcopy.go
distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yaml
janitor/pkg/controller/gpureset_controller.go
gpu-reset/gpu_reset.sh

github-actions · 2026-04-30T20:47:31Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid	0.00% (ø)
github.com/nvidia/nvsentinel/janitor/api/v1alpha1	19.82% (+0.06%)	👍
github.com/nvidia/nvsentinel/janitor/pkg/controller	13.61% (-0.00%)	👎
github.com/nvidia/nvsentinel/tests	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/types.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types.go	66.67% (ø)	9	6	3
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go	16.86% (+0.16%)	1619 (+38)	273 (+9)	1346 (+29)	👍
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go	14.17% (-0.00%)	5259 (+37)	745 (+5)	4514 (+32)	👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types_test.go
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

github-actions · 2026-04-30T21:06:37Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid	0.00% (ø)
github.com/nvidia/nvsentinel/janitor/api/v1alpha1	19.82% (+0.06%)	👍
github.com/nvidia/nvsentinel/janitor/pkg/controller	13.63% (+0.02%)	👍
github.com/nvidia/nvsentinel/tests	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/types.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types.go	66.67% (ø)	9	6	3
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go	16.86% (+0.16%)	1619 (+38)	273 (+9)	1346 (+29)	👍
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go	14.20% (+0.03%)	5259 (+37)	747 (+7)	4512 (+30)	👍

Changed unit test files

github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types_test.go
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go`:
- Around line 286-310: Update the inline example payload in xid_handler.go so
the healthevent.errorcode value matches the current emitted failure token:
replace the numeric '95' with the string "GPU_RESET_FAILURE" in the example
object (look for the healthevent block associated with checkname
'SysLogsXIDError'); also normalize any non-ASCII quote characters in that
example (e.g., around message, GPU_UUID, nodename) to standard ASCII quotes so
the sample is valid and consistent with the code that emits errorcode.

In `@janitor/pkg/controller/gpureset_controller_test.go`:
- Around line 760-768: The DeferCleanup currently deletes entryNode and the
GPUReset (entryResetName) but doesn't remove the per-entry Job, leaving syslog-*
jobs behind; update the cleanup in the table test to also delete the Job created
per entry (use the same entryResetName or the actual Job name pattern used when
creating jobs) by calling k8sClient.Delete(ctx, jobObj) and ignore NotFound
errors similar to the existing deletes (refer to DeferCleanup, entryNode,
entryResetName and k8sClient.Delete to locate where to add the deletion).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e5f47b03-4b0e-4721-857d-84510c2db3fe

📥 Commits

Reviewing files that changed from the base of the PR and between a866942 and 029cd74.

📒 Files selected for processing (13)

distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yaml
gpu-reset/gpu_reset.sh
health-monitors/syslog-health-monitor/pkg/xid/types.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
janitor/api/v1alpha1/gpureset_types.go
janitor/api/v1alpha1/gpureset_types_test.go
janitor/api/v1alpha1/zz_generated.deepcopy.go
janitor/pkg/controller/gpureset_controller.go
janitor/pkg/controller/gpureset_controller_test.go
tests/gpu_reset_test.go
tests/syslog_health_monitor_test.go
tests/uat/tests.sh

🚧 Files skipped from review as they are similar to previous changes (2)

tests/gpu_reset_test.go
tests/syslog_health_monitor_test.go

github-actions · 2026-04-30T21:26:27Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid	0.00% (ø)
github.com/nvidia/nvsentinel/janitor/api/v1alpha1	19.82% (+0.06%)	👍
github.com/nvidia/nvsentinel/janitor/pkg/controller	13.61% (-0.00%)	👎
github.com/nvidia/nvsentinel/tests	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/types.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types.go	66.67% (ø)	9	6	3
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go	16.86% (+0.16%)	1619 (+38)	273 (+9)	1346 (+29)	👍
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go	14.17% (-0.00%)	5259 (+37)	745 (+5)	4514 (+32)	👎

Changed unit test files

github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types_test.go
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

github-actions · 2026-04-30T22:16:24Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid	0.00% (ø)
github.com/nvidia/nvsentinel/janitor/api/v1alpha1	19.82% (+0.06%)	👍
github.com/nvidia/nvsentinel/janitor/pkg/controller	13.61% (-0.00%)	👎
github.com/nvidia/nvsentinel/tests	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/types.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types.go	66.67% (ø)	9	6	3
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go	16.86% (+0.16%)	1619 (+38)	273 (+9)	1346 (+29)	👍
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go	14.17% (-0.00%)	5259 (+37)	745 (+5)	4514 (+32)	👎

Changed unit test files

github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types_test.go
github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

github-actions · 2026-05-01T17:59:14Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid	0.00% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/config	19.19% (+0.24%)	👍
github.com/nvidia/nvsentinel/tests	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/types.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/janitor/pkg/config/config.go	17.30% (+0.85%)	318 (+20)	55 (+6)	263 (+14)	👍
github.com/nvidia/nvsentinel/janitor/pkg/config/default.go	20.16% (ø)	615	124	491

Changed unit test files

github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
github.com/nvidia/nvsentinel/janitor/pkg/config/config_test.go
github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

Signed-off-by: Nathan Herz <nherz@nvidia.com>

github-actions · 2026-05-01T18:22:41Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid	0.00% (ø)
github.com/nvidia/nvsentinel/janitor/pkg/config	19.19% (+0.24%)	👍
github.com/nvidia/nvsentinel/tests	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/types.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/janitor/pkg/config/config.go	17.30% (+0.85%)	318 (+20)	55 (+6)	263 (+14)	👍
github.com/nvidia/nvsentinel/janitor/pkg/config/default.go	20.16% (ø)	615	124	491

Changed unit test files

github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
github.com/nvidia/nvsentinel/janitor/pkg/config/config_test.go
github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

coderabbitai Bot reviewed Apr 30, 2026

Comment thread gpu-reset/gpu_reset.sh Outdated

Comment thread health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go

natherz97 force-pushed the reboot-fallback branch from 9fd2612 to 9393a97 Compare April 30, 2026 20:07

coderabbitai Bot reviewed Apr 30, 2026

Comment thread janitor/api/v1alpha1/gpureset_types.go Outdated

Comment thread tests/uat/tests.sh Outdated

natherz97 force-pushed the reboot-fallback branch from 9393a97 to 5b51a2e Compare April 30, 2026 20:33

coderabbitai Bot reviewed Apr 30, 2026

Comment thread tests/syslog_health_monitor_test.go

natherz97 force-pushed the reboot-fallback branch from 5b51a2e to a866942 Compare April 30, 2026 20:53

natherz97 force-pushed the reboot-fallback branch from a866942 to 029cd74 Compare April 30, 2026 21:11

coderabbitai Bot reviewed Apr 30, 2026

Comment thread health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go

Comment thread janitor/pkg/controller/gpureset_controller_test.go Outdated

natherz97 changed the title ~~[WIP] feat: fallback to a reboot if a GPU reset fails~~ feat: fallback to a reboot if a GPU reset fails Apr 30, 2026

natherz97 force-pushed the reboot-fallback branch from 029cd74 to 981d63c Compare April 30, 2026 22:02

pteranodan self-requested a review May 1, 2026 14:58

pteranodan reviewed May 1, 2026

Comment thread janitor/api/v1alpha1/gpureset_types.go Outdated

natherz97 force-pushed the reboot-fallback branch from 981d63c to d0bf5dd Compare May 1, 2026 17:42

natherz97 force-pushed the reboot-fallback branch from d0bf5dd to 2536b73 Compare May 1, 2026 18:05

feat: fallback to a reboot if a GPU reset fails

49d30aa

Signed-off-by: Nathan Herz <nherz@nvidia.com>

natherz97 force-pushed the reboot-fallback branch from 2536b73 to 49d30aa Compare May 1, 2026 18:08
