
feat: inject GPU devices with HostPath volumes rather than using NVIDIA_VISIBLE_DEVICES #1243

Open

natherz97 wants to merge 1 commit into NVIDIA:main from natherz97:remove-container-toolkit

Conversation


@natherz97 natherz97 commented May 1, 2026

Summary

This PR includes 3 changes to the GPU reset workflow:

  1. Start injecting GPU devices with HostPath volumes rather than using NVIDIA_VISIBLE_DEVICES. Note that we're keeping runtimeClassName=nvidia; eventually we could remove this runtime class and only set NVIDIA_VISIBLE_DEVICES=void (to account for the case where the default runtime is nvidia).
  2. Remove manual toggling of persistence mode on GPUs being reset. The nvidia-smi --gpu-reset command automatically detects whether persistence mode is enabled on the GPUs being reset and handles disabling and re-enabling it, as long as /run/nvidia-persistenced is available in the container.
  3. Start evicting the gpu-feature-discovery pod in addition to nvidia-device-plugin, nvidia-dcgm, and nvidia-dcgm-exporter. Note that gpu-feature-discovery creates an NVML client in a loop, so a GPU reset could fail if its device handles are still open during the reset. (An illustrative sketch of the node-label toggle that evicts these pods follows this list.)
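
For illustration, the GPU operator deploys each of these components behind a per-service node label, so evicting a pod from one node amounts to flipping that label. The janitor does this in Go; the following kubectl sketch is only a manual equivalent, and the exact disable value is an assumption (only the gpu-feature-discovery label name comes from this PR's service registration):

# Hypothetical manual equivalent of the janitor's service toggling. GPU operator
# daemonsets schedule onto nodes whose nvidia.com/gpu.deploy.<service> label is
# "true", so changing the label removes the pod from this node.
NODE=my-gpu-node   # placeholder node name

# Evict gpu-feature-discovery before the reset (assumed disable value "false"):
kubectl label node "$NODE" nvidia.com/gpu.deploy.gpu-feature-discovery=false --overwrite

# Restore it once the reset completes:
kubectl label node "$NODE" nvidia.com/gpu.deploy.gpu-feature-discovery=true --overwrite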

Why do we need to remove the nvidia-container-toolkit?

We have observed that the privileged GPU reset pod cannot have GPUs injected into the reset container via the NVIDIA_VISIBLE_DEVICES env var when the targeted GPU has certain XIDs requiring a reset.

Our reset job, which relied on the nvidia-container-toolkit to inject the GPUs, was unable to start:

message: 'failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: nvidia-container-cli.real: detection error: nvml error: gpu requires reset: unknown'

How to support GPU reset inside a container

  1. Using container-toolkit (current mode):
  • Set NVIDIA_VISIBLE_DEVICES=all, which injects the GPU devices, nvidia-smi, libnvidia-ml.so.1, and /run/nvidia-persistenced into the container. If /run/nvidia-persistenced is not available, nvidia-smi can neither toggle persistence mode manually via nvidia-smi -pm nor toggle it automatically when running nvidia-smi -r.
  • Create a HostPath volume from /sys to /sys inside the container or set HostNetwork=true.

2a. Using HostPath volumes and relative paths with chroot (new mode):

  • Set NVIDIA_VISIBLE_DEVICES=void
  • Create a HostPath volume from /run/nvidia/driver to /run/nvidia/driver
  • Create a HostPath volume from /sys to /run/nvidia/driver/sys
  • There is no need to manually mount /run/nvidia-persistenced since it is available at /run/nvidia/driver/run/nvidia-persistenced
  • Run nvidia-smi commands using chroot: chroot /run/nvidia/driver nvidia-smi --gpu-reset -i 0 (a minimal sketch of this helper follows the list)
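
A minimal sketch of that helper, based on the nvidia_smi_helper()/DRIVER_ROOT description in the walkthrough below (the real gpu_reset.sh may differ in detail):

#!/usr/bin/env bash
# Sketch of the chroot-based wrapper used in the new mode. DRIVER_ROOT points at
# the driver root mounted via a HostPath volume and defaults to "/" so the script
# still works when the driver is installed directly on the host.
DRIVER_ROOT="${DRIVER_ROOT:-/}"

nvidia_smi_helper() {
    # Run nvidia-smi inside the driver root so it finds libnvidia-ml.so.1,
    # /run/nvidia-persistenced, and the /sys tree mounted at ${DRIVER_ROOT}/sys.
    chroot "$DRIVER_ROOT" nvidia-smi "$@"
}

# Example: reset GPU index 0.
nvidia_smi_helper --gpu-reset -i 0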

2b. Using HostPath volumes and absolute paths (not chosen):

  • Set NVIDIA_VISIBLE_DEVICES=void
  • Create a HostPath volume from /run/nvidia/driver to /run/nvidia/driver
  • Create a HostPath volume from /sys to /sys or use HostNetwork=true
  • Create a HostPath volume from /run/nvidia-persistenced to /run/nvidia-persistenced, OR explore whether it's possible to pass an explicit path for the nvidia-persistenced directory at /run/nvidia/driver/run/nvidia-persistenced (similar to what we're doing below with LD_PRELOAD).
  • Run nvidia-smi commands using absolute paths: env -i LD_PRELOAD=/run/nvidia/driver/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /run/nvidia/driver/usr/bin/nvidia-smi --gpu-reset -i 0
  • Note that we'd have to resolve the library path based on whether the architecture is x86_64 or aarch64, as the DRA driver does (see the sketch after this list): https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/blob/main/hack/kubelet-plugin-prestart.sh#L44
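
A rough sketch of what that invocation could look like with architecture detection (not what this PR implements, since mode 2a was chosen; the aarch64 library directory is an assumption mirroring the x86_64 path above):

#!/usr/bin/env bash
# Sketch of mode 2b: call nvidia-smi by absolute path under the driver root,
# preloading the libnvidia-ml.so.1 that matches the node's CPU architecture.
DRIVER_ROOT=/run/nvidia/driver

case "$(uname -m)" in
    x86_64)  NVML_LIB="$DRIVER_ROOT/lib/x86_64-linux-gnu/libnvidia-ml.so.1" ;;
    aarch64) NVML_LIB="$DRIVER_ROOT/lib/aarch64-linux-gnu/libnvidia-ml.so.1" ;;
    *) echo "unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac

env -i LD_PRELOAD="$NVML_LIB" "$DRIVER_ROOT/usr/bin/nvidia-smi" --gpu-reset -i 0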

Testing

  1. Confirmed that persistence mode is enabled and that nvidia-persistenced is holding GPU device handles (a way to reproduce this handle check is sketched after the testing output):
# nvidia-smi -pm 1
Persistence mode is already Enabled for GPU 00000000:0F:00.0.
Persistence mode is already Enabled for GPU 00000000:2D:00.0.
Persistence mode is already Enabled for GPU 00000000:44:00.0.
Persistence mode is already Enabled for GPU 00000000:5B:00.0.
Persistence mode is already Enabled for GPU 00000000:89:00.0.
Persistence mode is already Enabled for GPU 00000000:A8:00.0.
Persistence mode is already Enabled for GPU 00000000:C0:00.0.
Persistence mode is already Enabled for GPU 00000000:D8:00.0.
All done.

262276 -> nvidia-persistenced --persistence-mode
    /dev/nvidia0
    /dev/nvidia1
    /dev/nvidia2
    /dev/nvidia3
    /dev/nvidia4
    /dev/nvidia5
    /dev/nvidia6
    /dev/nvidia7
    /dev/nvidiactl
  2. Issued a GPU reset:
 % kubectl logs gpureset-10.0.7.36-reset-job-z4826 -n dgxc-janitor-system -f
(2026-04-23 18:49:13) INFO: Using DRIVER_ROOT=/run/nvidia/driver
(2026-04-23 18:49:13) INFO: Testing nvidia-smi invocation: chroot /run/nvidia/driver nvidia-smi --version
NVIDIA-SMI version  : 580.95.05
NVML version        : 580.95
DRIVER version      : 580.95.05
CUDA Version        : 13.0
(2026-04-23 18:49:14) INFO: Starting GPU reset workflow...
(2026-04-23 18:49:14) INFO: Determining target devices...
(2026-04-23 18:49:14) INFO: Targets:
  GPU-4f60ecd4-e5da-28b6-256f-f517bbc5eb6a
(2026-04-23 18:49:15) INFO: Resetting GPUs...
  GPU 00000000:0F:00.0 was successfully reset
(2026-04-23 18:50:01) INFO: GPU reset complete.
(2026-04-23 18:50:01) INFO: Running post-reset health check...
(2026-04-23 18:50:10) INFO: Post-reset health check passed.
(2026-04-23 18:50:10) SUCCESS: GPU reset workflow completed in 56.856s.
(2026-04-23 18:50:10) Writing reset success for GPU-4f60ecd4-e5da-28b6-256f-f517bbc5eb6a to syslog
  3. Confirmed that the expected pods were torn down and recreated:
  • gpu-feature-discovery
  • nvidia-dcgm-exporter
  • nvidia-dcgm
  • nvidia-device-plugin-daemonset
nherz@DP66VX7CLX cluster-access % kubectl get pods -n gpu-operator \
  --field-selector spec.nodeName=$NODE
NAME                                               READY   STATUS      RESTARTS       AGE
gpu-feature-discovery-2pmnt                        1/1     Running     0              45s
gpu-operator-node-feature-discovery-worker-xqwsq   1/1     Running     3              36d
nvidia-container-toolkit-daemonset-648xk           1/1     Running     0              22d
nvidia-cuda-validator-q4vxf                        0/1     Completed   0              22d
nvidia-dcgm-exporter-dnxgw                         1/1     Running     0              45s
nvidia-dcgm-fjbpz                                  1/1     Running     0              45s
nvidia-device-plugin-daemonset-cgh6f               1/1     Running     0              45s
nvidia-driver-daemonset-69j8k                      3/3     Running     18 (22d ago)   36d
nvidia-mig-manager-h8js9                           1/1     Running     0              22d
nvidia-operator-validator-qvsxf                    1/1     Running     0              22d
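
For reference, one way to reproduce the device-handle check from step 1 above (assuming lsof is available where the check is run; the janitor's actual method is not shown in this PR):

# List processes holding NVIDIA device nodes open. While persistence mode is
# enabled, nvidia-persistenced should appear for each /dev/nvidia* device.
lsof /dev/nvidia* 2>/dev/null | awk 'NR > 1 { print $2, $1, $9 }' | sort -u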

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

Release Notes

  • New Features

    • GPU reset commands can now execute within a configurable driver root environment for improved isolation
    • Added support for managing the gpu-feature-discovery service alongside other GPU services
  • Improvements

    • GPU reset jobs now include proper volume mounts for driver root and system filesystem, enabling enhanced GPU resource access
    • Simplified the reset workflow by removing persistence mode state capture and restoration logic

…IA_VISIBLE_DEVICES

Signed-off-by: Nathan Herz <nherz@nvidia.com>

coderabbitai Bot commented May 1, 2026

📝 Walkthrough

The PR enhances GPU reset functionality to support a configurable driver root via chroot, removes persistence mode management, updates the reset job's Kubernetes pod configuration to expose driver paths through volumes and environment variables, and registers GPU feature discovery as a managed service.

Changes

Driver Root Support for GPU Reset

  • Driver Root Wrapper & Shell Configuration (gpu-reset/gpu_reset.sh): Introduces a DRIVER_ROOT environment variable (defaulting to /) and adds an nvidia_smi_helper() function that executes nvidia-smi commands within the configured driver root via chroot. Startup logging validates the setup by querying nvidia-smi --version through this wrapper.
  • GPU Reset Job Wiring (janitor/pkg/config/default.go): Adds exported constants for the driver root and host sysfs volume names and paths. Updates getDefaultGPUResetJobTemplate to define pod volumes for the driver root and host sysfs, mount them into the GPU reset container at the configured paths, and set the DRIVER_ROOT and NVIDIA_VISIBLE_DEVICES environment variables. Removes HostNetwork: true from the pod spec.
  • Reset & Target Discovery (gpu-reset/gpu_reset.sh): Updates GPU UUID discovery (when no targets are specified), reset execution (--gpu-reset), and post-reset health checks (-q queries) to use the new nvidia_smi_helper wrapper instead of calling nvidia-smi directly. Removes the persistence mode pre-reset capture and post-reset restoration logic (including PM_STATES_FILE).
  • Script Cleanup & Exit (gpu-reset/gpu_reset.sh): Adjusts temp-file cleanup to exclude the removed PM-state file. Changes the final exit statement from a quoted to an unquoted variable reference.

GPU Feature Discovery Service Registration

  • Service Registry (janitor/pkg/gpuservices/manager.go): Registers gpu-feature-discovery as a managed GPU service with app selector app: gpu-feature-discovery, node label nvidia.com/gpu.deploy.gpu-feature-discovery, and enable/disable flag values.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 A driver root in chroots we trust,
No host network's muss or fuss,
Reset flows through volumes clean,
GPU features newly seen,
Code springs forward, swift and bright!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning: The title describes using HostPath volumes instead of NVIDIA_VISIBLE_DEVICES, which is the primary architectural change. However, the PR also removes persistence mode toggling and adds gpu-feature-discovery pod eviction, significant changes not reflected in the title. Resolution: consider a more comprehensive title like 'feat: inject GPU devices via HostPath volumes, remove persistence mode toggling, and evict gpu-feature-discovery', or keep the current title if it represents the most important change from the dev's perspective.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.



Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"





@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@gpu-reset/gpu_reset.sh`:
- Line 111: The parameter expansion using a default value is redundant: remove
the ':-1' default from the assignment so the script simply uses the existing
FINAL_EXIT_STATUS variable (i.e., change the assignment to just reference
FINAL_EXIT_STATUS without a default). Update the occurrence in gpu_reset.sh
where FINAL_EXIT_STATUS is set in the finalization branch so it doesn’t supply
':-1', leaving behavior unchanged because FINAL_EXIT_STATUS is already
initialized earlier and guaranteed non-empty by the preceding check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 006c4eb7-9740-4e0b-b1e4-e00d2256e402

📥 Commits

Reviewing files that changed from the base of the PR and between 7209217 and 43feb6c.

📒 Files selected for processing (3)
  • gpu-reset/gpu_reset.sh
  • janitor/pkg/config/default.go
  • janitor/pkg/gpuservices/manager.go

Comment thread: gpu-reset/gpu_reset.sh
log "ERROR: Post-reset health check failed. See details below:"
sed 's/^/ /' "$HEALTH_CHECK_OUTPUT_FILE"
FINAL_EXIT_STATUS=1
FINAL_EXIT_STATUS=${FINAL_EXIT_STATUS:-1}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Redundant parameter expansion default.

FINAL_EXIT_STATUS is always initialized to 0 at line 54 and is guaranteed to be 0 when this branch executes (per the check at line 103). The :-1 default is never used.

Suggested fix
-    FINAL_EXIT_STATUS=${FINAL_EXIT_STATUS:-1}
+    FINAL_EXIT_STATUS=1


github-actions Bot commented May 1, 2026

