
feat: fallback to a reboot if a GPU reset fails#1240

Open
natherz97 wants to merge 1 commit into NVIDIA:main from natherz97:reboot-fallback

Conversation

Contributor

natherz97 commented Apr 30, 2026

Summary

This PR adds logic to fall back to a reboot if a GPU reset fails.

1. Create a RESTART_VM event when GPU resets fail: currently, the syslog-health-monitor emits a healthy event when it detects that a GPU was successfully reset, by consuming a syslog event written by the GPU reset job. Now the syslog-health-monitor will also emit an unhealthy event with a RESTART_VM recommended action when it detects that a GPU reset failed. The resulting reboot will clear both events (the initial XID needing a reset and the subsequent failed-reset event needing a reboot), and the node will be uncordoned.
2. Add a new writeSyslogEvent option to the Janitor config: this gpuResetController config property defaults to true if not explicitly set. When true, the GPU reset job writes syslog events for successful and failed resets, which are consumed by the syslog-health-monitor. When false, the GPU reset job does not write syslog events, which prevents automatic uncordoning after successful resets and the reboot fallback after failed resets. Opting out of syslog writing may be desirable for GPUResets triggered outside of NVSentinel, so that failed resets can be debugged without triggering reboots.

  apiVersion: janitor.dgxc.nvidia.com/v1alpha1
  kind: GPUReset
  metadata:
    name: maintenance-my-node-abc123
    annotations:
      nvsentinel.nvidia.com/trace-id: "some-trace-id"
      nvsentinel.nvidia.com/span-id: "some-span-id"
  spec:
    nodeName: my-node
    selector:
      uuids:
        - GPU-455d8f70-2051-db6c-0430-ffc457bff834
    writeSyslogEvent: false
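Per the CRD and controller changes in this PR, spec.writeSyslogEvent is an optional *bool that defaults to true, and the controller injects it into the GPU reset Job as a WRITE_SYSLOG_EVENT env var. A minimal sketch of that defaulting logic (the function name here is illustrative, not the actual controller code):

```go
package main

import "fmt"

// writeSyslogEventValue sketches how an optional spec.writeSyslogEvent *bool
// could map onto the WRITE_SYSLOG_EVENT env var value: a nil pointer falls
// back to the default of "true".
func writeSyslogEventValue(spec *bool) string {
	if spec == nil || *spec {
		return "true"
	}
	return "false"
}

func main() {
	off := false
	fmt.Println(writeSyslogEventValue(nil))  // field unset: default "true"
	fmt.Println(writeSyslogEventValue(&off)) // explicitly disabled: "false"
}
```

Using a pointer (rather than a plain bool) is what lets the controller distinguish "field omitted" from "explicitly false" when applying the default.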

Overview of the reboot fallback

A COMPONENT_RESET remediation will result in the following log line being emitted to syslog and consumed by the syslog-health-monitor (regardless of which health-monitor created the original COMPONENT_RESET unhealthy event):

GPU reset executed: GPU-455d8f70-2051-db6c-0430-ffc457bff834, success: <true/false>
  1. Successful resets: result in a healthy event that clears XID errors from the SysLogsXIDError check for matching impacted entities (GPU UUID and PCI ID). Note that if a different health-monitor emitted the original event, it would also need to check syslog or implement its own detection mechanism for GPU resets (for example, the gpu-health-monitor relies on resets fixing the underlying DCGM watch, or on the nvidia-dcgm pod being restarted as part of the GPU reset workflow).
  2. Failed resets: result in a new unhealthy event from the SysLogsXIDError check with the same impacted entities (GPU UUID and PCI ID) as the original health event and a RESTART_VM recommended action. This flow is triggered regardless of which health-monitor emitted the original COMPONENT_RESET event, and serves as a fallback: the node is rebooted if a GPU reset fails. The node will be cordoned due to two events, the original XID event and the subsequent failed-reset event. Because the syslog-health-monitor clears all unhealthy events for each of its checks in response to a reboot (by sending a healthy event with empty impacted entities), the node should be uncordoned after the reboot completes.
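For illustration only, parsing such a log line into a (uuid, success) pair could look like the following; the regex and function name are assumptions for this sketch, not the monitor's actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
)

// resetLineRe is an illustrative pattern for lines of the form
// "GPU reset executed: <GPU UUID>, success: <true|false>".
var resetLineRe = regexp.MustCompile(`GPU reset executed: (GPU-[0-9a-fA-F-]+), success: (true|false)`)

// parseResetLine returns the GPU UUID and success flag, plus ok=false
// when the line does not match the expected format.
func parseResetLine(line string) (uuid string, success bool, ok bool) {
	m := resetLineRe.FindStringSubmatch(line)
	if m == nil {
		return "", false, false
	}
	return m[1], m[2] == "true", true
}

func main() {
	uuid, success, ok := parseResetLine(
		"GPU reset executed: GPU-455d8f70-2051-db6c-0430-ffc457bff834, success: false")
	fmt.Println(uuid, success, ok)
}
```

A failed match (ok=false) lets the monitor skip unrelated syslog lines rather than emitting a spurious health event.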

Example event:

	{
	  createdAt: ISODate('2026-04-30T10:46:50.263Z'),
	  healthevent: {
	    agent: 'syslog-health-monitor',
	    componentclass: 'GPU',
	    checkname: 'SysLogsXIDError',
	    isfatal: true,
	    ishealthy: false,
	    message: 'GPU reset failed, proceeding with a node reboot',
	    recommendedaction: 'RESTART_VM',
	    errorcode: [
	      'GPU_RESET_FAILURE'
	    ],
	    entitiesimpacted: [
	      {
	        entitytype: 'PCI',
	        entityvalue: '000b:00:00'
	      },
	      {
	        entitytype: 'GPU_UUID',
	        entityvalue: 'GPU-123'
	      }
	    ],
	    nodename: 'node-123'
	  }
	}
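To make the field shape concrete, the failure event above can be assembled roughly as follows; Entity and HealthEvent here are local stand-ins defined for this sketch, not the monitor's real protobuf types:

```go
package main

import "fmt"

// Entity and HealthEvent are local stand-ins mirroring the example event's
// fields, not the real types used by the syslog-health-monitor.
type Entity struct {
	Type, Value string
}

type HealthEvent struct {
	Agent, CheckName, Message, RecommendedAction, NodeName string
	IsFatal, IsHealthy                                     bool
	ErrorCodes                                             []string
	EntitiesImpacted                                       []Entity
}

func main() {
	// Failure path: fatal, unhealthy, RESTART_VM, GPU_RESET_FAILURE.
	ev := HealthEvent{
		Agent:             "syslog-health-monitor",
		CheckName:         "SysLogsXIDError",
		IsFatal:           true,
		IsHealthy:         false,
		Message:           "GPU reset failed, proceeding with a node reboot",
		RecommendedAction: "RESTART_VM",
		ErrorCodes:        []string{"GPU_RESET_FAILURE"},
		EntitiesImpacted: []Entity{
			{Type: "PCI", Value: "000b:00:00"},
			{Type: "GPU_UUID", Value: "GPU-123"},
		},
		NodeName: "node-123",
	}
	fmt.Println(ev.RecommendedAction, ev.ErrorCodes[0])
}
```

The impacted entities (GPU UUID and PCI ID) must match the original COMPONENT_RESET event so that the reboot can clear both events together.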

Notes:

  • A follow-up unhealthy event is required to trigger a full drain in node-drainer, because the original COMPONENT_RESET event would have performed only a partial drain of pods using the GPU needing reset.
  • If a burst of XIDs occurs with both COMPONENT_RESET and RESTART_VM recommended actions and the GPU reset fails, this logic is not necessary because a reboot would already be triggered. We rely on the existing fault-remediation de-duplication logic to process only one of the reboots.
  • This logic is meant as a fallback for when one or more XIDs with COMPONENT_RESET recommended actions all result in failed GPU resets. When this logic is triggered, the following steps should be followed:
    • Fix the underlying cause for the GPU reset failure.
    • If the reset failure is isolated to a given XID, override the recommended action from COMPONENT_RESET to RESTART_VM.
    • If the GPU reset failure is not unique to a specific XID error, disable the GPU reset feature to always reboot.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • [ ] Core Services
  • [ ] Documentation/CI
  • [ ] Fault Management
  • [x] Health Monitors
  • [ ] Janitor
  • [ ] Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • GPU reset syslog entries now include a success/failure flag and can be disabled (default: enabled).
    • GPU reset job exposes a configurable option to control syslog emission.
  • Behavior Changes / Health

    • Health monitor distinguishes successful vs failed GPU resets and emits appropriate health events and recommended actions.
  • Tests

    • Expanded unit, integration, and UAT coverage for success/failure flows and boot-ID verification.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Walkthrough

GPU reset script now emits per-UUID syslog lines that include a success boolean and can be disabled via a spec flag. The syslog health-monitor parser and handler extract that success flag to emit conditional health events. The GPUReset CRD, controller, and tests were extended to propagate and validate the new behavior.

Changes

Cohort / File(s) Summary
GPU Reset Script
gpu-reset/gpu_reset.sh
Compute SYSLOG_SUCCESS from exit status and conditionally write per-UUID syslog messages when WRITE_SYSLOG_EVENT is "true"; syslog text now includes a `, success: true|false` suffix.
Syslog Health Monitor Parser & Handler
health-monitors/syslog-health-monitor/pkg/xid/types.go, health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
Regex extended to capture success; parsing returns (uuid, success) and health-event creation now emits healthy/non-fatal events on success and fatal/unhealthy events (ErrorCode GPU_RESET_FAILURE, RESTART_VM) on failure.
Syslog Health Monitor Tests
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go, tests/syslog_health_monitor_test.go
Tests updated to assert parsing of success (true/false) and validate resulting health event contents, error codes, messages, recommended actions, and node-condition sequences.
API, CRD & DeepCopy
janitor/api/v1alpha1/gpureset_types.go, janitor/api/v1alpha1/zz_generated.deepcopy.go, distros/.../crds/janitor.dgxc.nvidia.com_gpuresets.yaml
Added spec.writeSyslogEvent *bool with kubebuilder default true to GPUReset CRD; deepcopy updated to deep-copy the pointer; CRD printer descriptions tweaked.
Controller & Controller Tests
janitor/pkg/controller/gpureset_controller.go, janitor/pkg/controller/gpureset_controller_test.go
Controller injects WRITE_SYSLOG_EVENT env var into GPU reset Job based on spec.writeSyslogEvent (defaults to "true"); tests added to verify env var values for nil/true/false cases.
Integration / UAT Tests & Utility
tests/uat/tests.sh, tests/gpu_reset_test.go
UAT test snapshots bootID and fails if node rebooted during reset; integration tests assert spec.writeSyslogEvent defaults to true and include additional assertions and scenario flows for GPU reset success/failure.

Sequence Diagram

sequenceDiagram
    participant GPU as GPU Device
    participant Script as GPU Reset Script
    participant Syslog as Syslog
    participant Monitor as Health Monitor
    participant KubeAPI as Kubernetes API

    rect rgba(76, 175, 80, 0.5)
    Note over GPU,Script: Successful GPU reset path
    GPU->>Script: perform reset
    Script->>Script: set SYSLOG_SUCCESS = "true"
    Script->>Syslog: logger "GPU reset executed: <UUID>, success: true"
    Syslog->>Monitor: deliver log line
    Monitor->>Monitor: parse UUID and success=true
    Monitor->>KubeAPI: create healthy, non-fatal HealthEvent (RecommendedAction_NONE)
    end

    rect rgba(244, 67, 54, 0.5)
    Note over GPU,Script: Failed GPU reset path
    GPU->>Script: reset fails
    Script->>Script: set SYSLOG_SUCCESS = "false"
    Script->>Syslog: logger "GPU reset executed: <UUID>, success: false"
    Syslog->>Monitor: deliver log line
    Monitor->>Monitor: parse UUID and success=false
    Monitor->>KubeAPI: create fatal/unhealthy HealthEvent (GPU_RESET_FAILURE, RecommendedAction_RESTART_VM)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped to check the GPU reset trace,
Logged each UUID and its true/false face.
From script to syslog, parsed with care,
Health events sing what they discover there.
A rabbit nods—decide to reboot or spare.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 16.67%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Linked Issues check — ✅ Passed: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: skipped because no linked issues were found for this pull request.
  • Title check — ✅ Passed: the PR title 'feat: fallback to a reboot if a GPU reset fails' accurately summarizes the main feature change across all modified files, which implement fallback reboot logic triggered by GPU reset failures.
  • Description Check — ✅ Passed: skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/uat/tests.sh (1)

445-489: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: Variable name mismatches and missing $ expansions will cause the test to fail or behave incorrectly.

Several issues in the boot ID verification logic:

  1. Line 447: Missing $ - logs literal string "initial_boot_id" instead of the variable value
  2. Line 486: Uses undefined $original_boot_id instead of the declared $initial_boot_id
  3. Line 487: Missing $ in error message for initial_boot_id
Proposed fix
     local initial_boot_id
     initial_boot_id=$(get_boot_id "$gpu_node")
-    log "Original boot ID: initial_boot_id"
+    log "Original boot ID: $initial_boot_id"

     local dcgm_pod
     ...
     # If the GPU reset job fails, we will write a syslog event which results in a new unhealthy health event with a
     # RESTART_VM recommended action. We will confirm the node bootID does not change during the test execution to
     # ensure that a GPU reset and not a reboot recovered the node.
     local final_boot_id
     final_boot_id=$(get_boot_id "$gpu_node")
-    if [[ "$final_boot_id" != "$original_boot_id" ]]; then
-        error "Boot ID changed during GPU reset. Original: initial_boot_id, Final: $final_boot_id"
+    if [[ "$final_boot_id" != "$initial_boot_id" ]]; then
+        error "Boot ID changed during GPU reset. Original: $initial_boot_id, Final: $final_boot_id"
     fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/uat/tests.sh` around lines 445 - 489, The boot ID check has variable
name and expansion bugs: replace the literal and wrong names so actual values
are compared and logged—use initial_boot_id (set by get_boot_id "$gpu_node")
consistently (not original_boot_id), expand variables with $ when logging and in
the error message, and ensure final_boot_id is compared to $initial_boot_id in
the if condition and the error/log calls (references: initial_boot_id,
final_boot_id, get_boot_id, error, log).
🧹 Nitpick comments (3)
tests/syslog_health_monitor_test.go (1)

114-135: 💤 Low value

Duplicate test case: identical to "Inject XID error requiring GPU reset" on lines 67-88.

This test case appears to be an exact duplicate of the first assess block (lines 67-88). Both inject the same XID 119 message and verify the same expected sequence patterns. This may be intentional to set up state before the "failed GPU reset" test, but consider adding a comment explaining the purpose or consolidating if unintentional.

If intentional, add clarifying comment
+	// Re-inject XID error to set up state for the failed GPU reset test scenario.
+	// The previous successful GPU reset cleared the condition, so we need a new error.
 	feature.Assess("Inject XID error requiring GPU reset", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/syslog_health_monitor_test.go` around lines 114 - 135, This Assess
block duplicates the earlier "Inject XID error requiring GPU reset" test; either
remove the duplicate or explicitly document why it’s repeated: if it’s
intentional to prime state for the subsequent "failed GPU reset" test, add a
single-line comment before the feature.Assess call clarifying that purpose and
reference helpers.InjectSyslogMessages and
helpers.VerifyNodeConditionMatchesSequence so reviewers know it is a deliberate
state-priming injection; otherwise consolidate by reusing the original Assess or
a shared helper function to avoid duplicated assertions.
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)

264-268: 💤 Low value

Comment contains typos and grammar issues.

The comment block has readability issues: "pikcup" should be "pick up", and the grammar could be improved.

Suggested fix
 /*
-Flows could be from DCGM + syslog for initial event
-the healthy event for reset always from syslog OR unhealthy event needing reboot always from syslog as well
-we require that either DCGM will pikcup the reboot
+Flows could be from DCGM + syslog for the initial event.
+The healthy event for reset always comes from syslog. An unhealthy event needing reboot also comes from syslog.
+We require that DCGM will pick up the reboot.
 */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go` around lines
264 - 268, Update the block comment in xid_handler.go that currently reads
"Flows could be from DCGM + syslog for initial event the healthy event for reset
always from syslog OR unhealthy event needing reboot always from syslog as well
we require that either DCGM will pikcup the reboot" by fixing typos and
improving grammar (replace "pikcup" with "pick up" and rephrase sentences for
clarity), e.g., explain flows: initial events may come from DCGM or syslog,
healthy/reset and unhealthy/reboot events originate from syslog, and either DCGM
or syslog must pick up the reboot; apply this corrected wording to the existing
comment block in xid_handler.go.
janitor/api/v1alpha1/gpureset_types.go (1)

117-122: 💤 Low value

Minor: Comment slightly misrepresents behavior.

The comment states the syslog entry is written "upon successful reset," but the implementation in gpu_reset.sh writes to syslog for both successful and failed resets (with success: true|false). Consider updating the comment for accuracy.

Suggested documentation fix
-	// WriteSyslogEvent controls whether the GPU reset job writes a syslog entry
-	// upon successful reset, which triggers the syslog-health-monitor.
+	// WriteSyslogEvent controls whether the GPU reset job writes a syslog entry
+	// after reset completion (success or failure), which triggers the syslog-health-monitor.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@janitor/api/v1alpha1/gpureset_types.go` around lines 117 - 122, Update the
comment for the WriteSyslogEvent field to accurately reflect gpu_reset.sh
behavior: state that when enabled the job writes a syslog entry for reset
attempts regardless of outcome (including success: true or false), not only on
successful resets. Reference the WriteSyslogEvent field and gpu_reset.sh in the
comment so readers know the behavior is driven by that script.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@gpu-reset/gpu_reset.sh`:
- Line 172: The if-condition referencing WRITE_SYSLOG_EVENT will fail under set
-u if the variable is unset; change the check in the if that currently reads the
WRITE_SYSLOG_EVENT variable so it uses a safe default (e.g., parameter expansion
like ${WRITE_SYSLOG_EVENT:-false}) to avoid unbound variable errors when
evaluating the condition; update the if that guards syslog writing to use the
safe expansion so the script continues even if WRITE_SYSLOG_EVENT is not
exported.

In `@health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go`:
- Around line 285-306: The GPU reset HealthEvent construction is missing the
ProcessingStrategy field; update the block that builds the event (the event :=
&pb.HealthEvent{...} and the success/else branches in xid_handler.go) to set
ProcessingStrategy: xidHandler.processingStrategy so both success and failure
GPU reset paths include ProcessingStrategy (consistent with
createHealthEventFromResponse).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fcf83a02-5280-4148-8ae1-107db633c9c3

📥 Commits

Reviewing files that changed from the base of the PR and between 7209217 and 9fd2612.

📒 Files selected for processing (8)
  • gpu-reset/gpu_reset.sh
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • janitor/api/v1alpha1/gpureset_types.go
  • janitor/pkg/controller/gpureset_controller.go
  • tests/syslog_health_monitor_test.go
  • tests/uat/tests.sh


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@janitor/api/v1alpha1/gpureset_types.go`:
- Around line 118-122: The field comment for WriteSyslogEvent is out of date:
update the doc comment above the WriteSyslogEvent *bool
`json:"writeSyslogEvent,omitempty"` line to state that the GPU reset job emits
result-aware syslog entries (including success: true/false) rather than only
writing on successful resets; keep the kubebuilder tags and optional annotation
unchanged and ensure the new text clearly states it controls whether a
result-aware syslog entry (indicating success or failure) is written.

In `@tests/uat/tests.sh`:
- Around line 445-447: The test sets initial_boot_id via
initial_boot_id=$(get_boot_id "$gpu_node") but later uses the unset
original_boot_id and logs the literal token; change all references to
original_boot_id to initial_boot_id (including the comparison/assertion and the
log call), and make sure you expand the variable in log and comparisons as
"$initial_boot_id" so the script remains safe under set -u; apply the same fix
to the other occurrence mentioned (the second block around the later
comparison).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9e5729a-9574-40d7-a5f2-56bb5f310e6c

📥 Commits

Reviewing files that changed from the base of the PR and between 9fd2612 and 9393a97.

📒 Files selected for processing (10)
  • distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yaml
  • gpu-reset/gpu_reset.sh
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • janitor/api/v1alpha1/gpureset_types.go
  • janitor/api/v1alpha1/zz_generated.deepcopy.go
  • janitor/pkg/controller/gpureset_controller.go
  • tests/syslog_health_monitor_test.go
  • tests/uat/tests.sh
✅ Files skipped from review due to trivial changes (2)
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • janitor/api/v1alpha1/zz_generated.deepcopy.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • janitor/pkg/controller/gpureset_controller.go
  • tests/syslog_health_monitor_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/syslog_health_monitor_test.go`:
- Around line 169-176: The test constructs a contradictory healthy event by
calling WithHealthy(true) while also setting WithFatal(true); update the
simulated healthy SysLogsXIDError event to use a non-fatal flag (call
WithFatal(false)) so the event represents a true healthy/reset state. Locate the
creation chain starting with helpers.NewHealthEvent(...) and change the
WithFatal invocation on that Healthy event to false, leaving WithHealthy(true)
and the other fields unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ff416b69-879e-41a3-956c-6ce063912b3d

📥 Commits

Reviewing files that changed from the base of the PR and between 9393a97 and 5b51a2e.

📒 Files selected for processing (13)
  • distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yaml
  • gpu-reset/gpu_reset.sh
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • janitor/api/v1alpha1/gpureset_types.go
  • janitor/api/v1alpha1/gpureset_types_test.go
  • janitor/api/v1alpha1/zz_generated.deepcopy.go
  • janitor/pkg/controller/gpureset_controller.go
  • janitor/pkg/controller/gpureset_controller_test.go
  • tests/gpu_reset_test.go
  • tests/syslog_health_monitor_test.go
  • tests/uat/tests.sh
✅ Files skipped from review due to trivial changes (1)
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • janitor/api/v1alpha1/zz_generated.deepcopy.go
  • distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yaml
  • janitor/pkg/controller/gpureset_controller.go
  • gpu-reset/gpu_reset.sh

@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid 0.00% (ø)
github.com/nvidia/nvsentinel/janitor/api/v1alpha1 19.82% (+0.06%) 👍
github.com/nvidia/nvsentinel/janitor/pkg/controller 13.61% (-0.00%) 👎
github.com/nvidia/nvsentinel/tests 0.00% (ø)

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/types.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types.go | 66.67% (ø) | 9 | 6 | 3 | |
| github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go | 16.86% (+0.16%) | 1619 (+38) | 273 (+9) | 1346 (+29) | 👍 |
| github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go | 14.17% (-0.00%) | 5259 (+37) | 745 (+5) | 4514 (+32) | 👎 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
  • github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

@github-actions

Merging this branch will increase overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/janitor/api/v1alpha1 | 19.82% (+0.06%) | 👍 |
| github.com/nvidia/nvsentinel/janitor/pkg/controller | 13.63% (+0.02%) | 👍 |
| github.com/nvidia/nvsentinel/tests | 0.00% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/types.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types.go | 66.67% (ø) | 9 | 6 | 3 | |
| github.com/nvidia/nvsentinel/janitor/api/v1alpha1/zz_generated.deepcopy.go | 16.86% (+0.16%) | 1619 (+38) | 273 (+9) | 1346 (+29) | 👍 |
| github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller.go | 14.20% (+0.03%) | 5259 (+37) | 747 (+7) | 4512 (+30) | 👍 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • github.com/nvidia/nvsentinel/janitor/api/v1alpha1/gpureset_types_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/controller/gpureset_controller_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
  • github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go`:
- Around line 286-310: Update the inline example payload in xid_handler.go so
the healthevent.errorcode value matches the current emitted failure token:
replace the numeric '95' with the string "GPU_RESET_FAILURE" in the example
object (look for the healthevent block associated with checkname
'SysLogsXIDError'); also normalize any non-ASCII quote characters in that
example (e.g., around message, GPU_UUID, nodename) to standard ASCII quotes so
the sample is valid and consistent with the code that emits errorcode.

In `@janitor/pkg/controller/gpureset_controller_test.go`:
- Around line 760-768: The DeferCleanup currently deletes entryNode and the
GPUReset (entryResetName) but doesn't remove the per-entry Job, leaving syslog-*
jobs behind; update the cleanup in the table test to also delete the Job created
per entry (use the same entryResetName or the actual Job name pattern used when
creating jobs) by calling k8sClient.Delete(ctx, jobObj) and ignore NotFound
errors similar to the existing deletes (refer to DeferCleanup, entryNode,
entryResetName and k8sClient.Delete to locate where to add the deletion).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e5f47b03-4b0e-4721-857d-84510c2db3fe

📥 Commits

Reviewing files that changed from the base of the PR and between a866942 and 029cd74.

📒 Files selected for processing (13)
  • distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_gpuresets.yaml
  • gpu-reset/gpu_reset.sh
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • janitor/api/v1alpha1/gpureset_types.go
  • janitor/api/v1alpha1/gpureset_types_test.go
  • janitor/api/v1alpha1/zz_generated.deepcopy.go
  • janitor/pkg/controller/gpureset_controller.go
  • janitor/pkg/controller/gpureset_controller_test.go
  • tests/gpu_reset_test.go
  • tests/syslog_health_monitor_test.go
  • tests/uat/tests.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/gpu_reset_test.go
  • tests/syslog_health_monitor_test.go

Comment thread health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
Comment thread janitor/pkg/controller/gpureset_controller_test.go Outdated

@natherz97 natherz97 changed the title [WIP] feat: fallback to a reboot if a GPU reset fails feat: fallback to a reboot if a GPU reset fails Apr 30, 2026

@pteranodan pteranodan self-requested a review May 1, 2026 14:58
Comment thread janitor/api/v1alpha1/gpureset_types.go Outdated
@github-actions

github-actions Bot commented May 1, 2026

Merging this branch will increase overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/janitor/pkg/config | 19.19% (+0.24%) | 👍 |
| github.com/nvidia/nvsentinel/tests | 0.00% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/types.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/janitor/pkg/config/config.go | 17.30% (+0.85%) | 318 (+20) | 55 (+6) | 263 (+14) | 👍 |
| github.com/nvidia/nvsentinel/janitor/pkg/config/default.go | 20.16% (ø) | 615 | 124 | 491 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • github.com/nvidia/nvsentinel/janitor/pkg/config/config_test.go
  • github.com/nvidia/nvsentinel/tests/gpu_reset_test.go
  • github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

Signed-off-by: Nathan Herz <nherz@nvidia.com>
