
feat: inject GPU devices with HostPath volumes rather than using NVIDIA_VISIBLE_DEVICES #1243

Open

natherz97 wants to merge 1 commit into NVIDIA:main from natherz97:remove-container-toolkit

Conversation


@natherz97 natherz97 commented May 1, 2026

Summary

This PR includes 3 changes to the GPU reset workflow:

  1. Start injecting GPU devices with HostPath volumes rather than using NVIDIA_VISIBLE_DEVICES. Note that we're keeping runtimeClassName=nvidia; eventually we could remove this runtime class and only set NVIDIA_VISIBLE_DEVICES=void (to account for the case where the default runtime is nvidia).
  2. Remove manual toggling of persistence mode on GPUs being reset. The nvidia-smi --gpu-reset command automatically detects whether persistence mode is enabled on the GPUs being reset and handles disabling and re-enabling it, as long as /run/nvidia-persistenced is available in the container.
  3. Start evicting the gpu-feature-discovery pod in addition to nvidia-device-plugin, nvidia-dcgm, and nvidia-dcgm-exporter. Note that gpu-feature-discovery creates an NVML client in a loop, so a GPU reset could fail if its device handles are still open during the reset. (An illustrative sketch of the node-label toggle that evicts these pods follows this list.)
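
For illustration, the GPU operator deploys each of these components behind a per-service node label, so evicting a pod from one node amounts to flipping that label. The janitor does this in Go; the following kubectl sketch is only a manual equivalent, and the exact disable value is an assumption (only the gpu-feature-discovery label name comes from this PR's service registration):

# Hypothetical manual equivalent of the janitor's service toggling. GPU operator
# daemonsets schedule onto nodes whose nvidia.com/gpu.deploy.<service> label is
# "true", so changing the label removes the pod from this node.
NODE=my-gpu-node   # placeholder node name

# Evict gpu-feature-discovery before the reset (assumed disable value "false"):
kubectl label node "$NODE" nvidia.com/gpu.deploy.gpu-feature-discovery=false --overwrite

# Restore it once the reset completes:
kubectl label node "$NODE" nvidia.com/gpu.deploy.gpu-feature-discovery=true --overwrite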

Why do we need to remove the nvidia-container-toolkit?

We have observed that the privileged GPU reset pod cannot have GPUs injected into the reset container via the NVIDIA_VISIBLE_DEVICES env var when the targeted GPU has certain XIDs requiring a reset.

Our reset job, which relied on the nvidia-container-toolkit to inject the GPUs, was unable to start:

message: 'failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: nvidia-container-cli.real: detection error: nvml error: gpu requires reset: unknown'

How to support GPU reset inside a container

  1. Using container-toolkit (current mode):
  • Set NVIDIA_VISIBLE_DEVICES=all, which injects the GPU devices, nvidia-smi, libnvidia-ml.so.1, and /run/nvidia-persistenced into the container. If /run/nvidia-persistenced is not available, nvidia-smi can neither toggle persistence mode manually via nvidia-smi -pm nor toggle it automatically when running nvidia-smi -r.
  • Create a HostPath volume from /sys to /sys inside the container or set HostNetwork=true.

2a. Using HostPath volumes and relative paths with chroot (new mode):

  • Set NVIDIA_VISIBLE_DEVICES=void
  • Create a HostPath volume from /run/nvidia/driver to /run/nvidia/driver
  • Create a HostPath volume from /sys to /run/nvidia/driver/sys
  • There is no need to manually mount /run/nvidia-persistenced since it is available at /run/nvidia/driver/run/nvidia-persistenced
  • Run nvidia-smi commands using chroot: chroot /run/nvidia/driver nvidia-smi --gpu-reset -i 0 (a minimal sketch of this helper follows the list)
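
A minimal sketch of that helper, based on the nvidia_smi_helper()/DRIVER_ROOT description in the walkthrough below (the real gpu_reset.sh may differ in detail):

#!/usr/bin/env bash
# Sketch of the chroot-based wrapper used in the new mode. DRIVER_ROOT points at
# the driver root mounted via a HostPath volume and defaults to "/" so the script
# still works when the driver is installed directly on the host.
DRIVER_ROOT="${DRIVER_ROOT:-/}"

nvidia_smi_helper() {
    # Run nvidia-smi inside the driver root so it finds libnvidia-ml.so.1,
    # /run/nvidia-persistenced, and the /sys tree mounted at ${DRIVER_ROOT}/sys.
    chroot "$DRIVER_ROOT" nvidia-smi "$@"
}

# Example: reset GPU index 0.
nvidia_smi_helper --gpu-reset -i 0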

2b. Using HostPath volumes and absolute paths (not chosen):

  • Set NVIDIA_VISIBLE_DEVICES=void
  • Create a HostPath volume from /run/nvidia/driver to /run/nvidia/driver
  • Create a HostPath volume from /sys to /sys or use HostNetwork=true
  • Create a HostPath volume from /run/nvidia-persistenced to /run/nvidia-persistenced, OR explore whether it's possible to pass an explicit path for the nvidia-persistenced directory at /run/nvidia/driver/run/nvidia-persistenced (similar to what we're doing below with LD_PRELOAD).
  • Run nvidia-smi commands using absolute paths: env -i LD_PRELOAD=/run/nvidia/driver/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /run/nvidia/driver/usr/bin/nvidia-smi --gpu-reset -i 0
  • Note that we'd have to resolve the library path based on whether the architecture is x86_64 or aarch64, as the DRA driver does (see the sketch after this list): https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/blob/main/hack/kubelet-plugin-prestart.sh#L44
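
A rough sketch of what that invocation could look like with architecture detection (not what this PR implements, since mode 2a was chosen; the aarch64 library directory is an assumption mirroring the x86_64 path above):

#!/usr/bin/env bash
# Sketch of mode 2b: call nvidia-smi by absolute path under the driver root,
# preloading the libnvidia-ml.so.1 that matches the node's CPU architecture.
DRIVER_ROOT=/run/nvidia/driver

case "$(uname -m)" in
    x86_64)  NVML_LIB="$DRIVER_ROOT/lib/x86_64-linux-gnu/libnvidia-ml.so.1" ;;
    aarch64) NVML_LIB="$DRIVER_ROOT/lib/aarch64-linux-gnu/libnvidia-ml.so.1" ;;
    *) echo "unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac

env -i LD_PRELOAD="$NVML_LIB" "$DRIVER_ROOT/usr/bin/nvidia-smi" --gpu-reset -i 0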

Testing

  1. Confirmed that persistence mode is enabled and that nvidia-persistenced is holding GPU device handles (a way to reproduce this handle check is sketched after the testing output):
# nvidia-smi -pm 1
Persistence mode is already Enabled for GPU 00000000:0F:00.0.
Persistence mode is already Enabled for GPU 00000000:2D:00.0.
Persistence mode is already Enabled for GPU 00000000:44:00.0.
Persistence mode is already Enabled for GPU 00000000:5B:00.0.
Persistence mode is already Enabled for GPU 00000000:89:00.0.
Persistence mode is already Enabled for GPU 00000000:A8:00.0.
Persistence mode is already Enabled for GPU 00000000:C0:00.0.
Persistence mode is already Enabled for GPU 00000000:D8:00.0.
All done.

262276 -> nvidia-persistenced --persistence-mode
    /dev/nvidia0
    /dev/nvidia1
    /dev/nvidia2
    /dev/nvidia3
    /dev/nvidia4
    /dev/nvidia5
    /dev/nvidia6
    /dev/nvidia7
    /dev/nvidiactl
  2. Issued a GPU reset:
 % kubectl logs gpureset-10.0.7.36-reset-job-z4826 -n dgxc-janitor-system -f
(2026-04-23 18:49:13) INFO: Using DRIVER_ROOT=/run/nvidia/driver
(2026-04-23 18:49:13) INFO: Testing nvidia-smi invocation: chroot /run/nvidia/driver nvidia-smi --version
NVIDIA-SMI version  : 580.95.05
NVML version        : 580.95
DRIVER version      : 580.95.05
CUDA Version        : 13.0
(2026-04-23 18:49:14) INFO: Starting GPU reset workflow...
(2026-04-23 18:49:14) INFO: Determining target devices...
(2026-04-23 18:49:14) INFO: Targets:
  GPU-4f60ecd4-e5da-28b6-256f-f517bbc5eb6a
(2026-04-23 18:49:15) INFO: Resetting GPUs...
  GPU 00000000:0F:00.0 was successfully reset
(2026-04-23 18:50:01) INFO: GPU reset complete.
(2026-04-23 18:50:01) INFO: Running post-reset health check...
(2026-04-23 18:50:10) INFO: Post-reset health check passed.
(2026-04-23 18:50:10) SUCCESS: GPU reset workflow completed in 56.856s.
(2026-04-23 18:50:10) Writing reset success for GPU-4f60ecd4-e5da-28b6-256f-f517bbc5eb6a to syslog
  3. Confirmed that the expected pods were torn down and recreated:
  • gpu-feature-discovery
  • nvidia-dcgm-exporter
  • nvidia-dcgm
  • nvidia-device-plugin-daemonset
nherz@DP66VX7CLX cluster-access % kubectl get pods -n gpu-operator \
  --field-selector spec.nodeName=$NODE
NAME                                               READY   STATUS      RESTARTS       AGE
gpu-feature-discovery-2pmnt                        1/1     Running     0              45s
gpu-operator-node-feature-discovery-worker-xqwsq   1/1     Running     3              36d
nvidia-container-toolkit-daemonset-648xk           1/1     Running     0              22d
nvidia-cuda-validator-q4vxf                        0/1     Completed   0              22d
nvidia-dcgm-exporter-dnxgw                         1/1     Running     0              45s
nvidia-dcgm-fjbpz                                  1/1     Running     0              45s
nvidia-device-plugin-daemonset-cgh6f               1/1     Running     0              45s
nvidia-driver-daemonset-69j8k                      3/3     Running     18 (22d ago)   36d
nvidia-mig-manager-h8js9                           1/1     Running     0              22d
nvidia-operator-validator-qvsxf                    1/1     Running     0              22d
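
For reference, one way to reproduce the device-handle check from step 1 above (assuming lsof is available where the check is run; the janitor's actual method is not shown in this PR):

# List processes holding NVIDIA device nodes open. While persistence mode is
# enabled, nvidia-persistenced should appear for each /dev/nvidia* device.
lsof /dev/nvidia* 2>/dev/null | awk 'NR > 1 { print $2, $1, $9 }' | sort -u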

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

Release Notes

  • New Features

    • GPU reset commands can now execute within a configurable driver root environment for improved isolation
    • Added support for managing the gpu-feature-discovery service alongside other GPU services
  • Improvements

    • GPU reset jobs now include proper volume mounts for driver root and system filesystem, enabling enhanced GPU resource access
    • Simplified the reset workflow by removing persistence mode state capture and restoration logic

…IA_VISIBLE_DEVICES

Signed-off-by: Nathan Herz <nherz@nvidia.com>

coderabbitai Bot commented May 1, 2026

📝 Walkthrough

The PR enhances GPU reset functionality to support a configurable driver root via chroot, removes persistence mode management, updates the reset job's Kubernetes pod configuration to expose driver paths through volumes and environment variables, and registers GPU feature discovery as a managed service.

Changes

Driver Root Support for GPU Reset

  • Driver Root Wrapper & Shell Configuration (gpu-reset/gpu_reset.sh): Introduces a DRIVER_ROOT environment variable (defaulting to /) and adds an nvidia_smi_helper() function that executes nvidia-smi commands within the configured driver root via chroot. Startup logging validates the setup by querying nvidia-smi --version through this wrapper.
  • GPU Reset Job Wiring (janitor/pkg/config/default.go): Adds exported constants for the driver root and host sysfs volume names and paths. Updates getDefaultGPUResetJobTemplate to define pod volumes for the driver root and host sysfs, mount them into the GPU reset container at the configured paths, and set the DRIVER_ROOT and NVIDIA_VISIBLE_DEVICES environment variables. Removes HostNetwork: true from the pod spec.
  • Reset & Target Discovery (gpu-reset/gpu_reset.sh): Updates GPU UUID discovery (when no targets are specified), reset execution (--gpu-reset), and post-reset health checks (-q queries) to use the new nvidia_smi_helper wrapper instead of calling nvidia-smi directly. Removes the persistence mode pre-reset capture and post-reset restoration logic (including PM_STATES_FILE).
  • Script Cleanup & Exit (gpu-reset/gpu_reset.sh): Adjusts temp-file cleanup to exclude the removed PM-state file. Changes the final exit statement from a quoted to an unquoted variable reference.

GPU Feature Discovery Service Registration

  • Service Registry (janitor/pkg/gpuservices/manager.go): Registers gpu-feature-discovery as a managed GPU service with app selector app: gpu-feature-discovery, node label nvidia.com/gpu.deploy.gpu-feature-discovery, and enable/disable flag values.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 A driver root in chroots we trust,
No host network's muss or fuss,
Reset flows through volumes clean,
GPU features newly seen,
Code springs forward, swift and bright!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning: The title describes using HostPath volumes instead of NVIDIA_VISIBLE_DEVICES, which is the primary architectural change. However, the PR also removes persistence mode toggling and adds gpu-feature-discovery pod eviction, significant changes not reflected in the title. Resolution: consider a more comprehensive title like 'feat: inject GPU devices via HostPath volumes, remove persistence mode toggling, and evict gpu-feature-discovery', or keep the current title if it represents the most important change from the dev's perspective.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.



Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"





@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@gpu-reset/gpu_reset.sh`:
- Line 111: The parameter expansion using a default value is redundant: remove
the ':-1' default from the assignment so the script simply uses the existing
FINAL_EXIT_STATUS variable (i.e., change the assignment to just reference
FINAL_EXIT_STATUS without a default). Update the occurrence in gpu_reset.sh
where FINAL_EXIT_STATUS is set in the finalization branch so it doesn’t supply
':-1', leaving behavior unchanged because FINAL_EXIT_STATUS is already
initialized earlier and guaranteed non-empty by the preceding check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 006c4eb7-9740-4e0b-b1e4-e00d2256e402

📥 Commits

Reviewing files that changed from the base of the PR and between 7209217 and 43feb6c.

📒 Files selected for processing (3)
  • gpu-reset/gpu_reset.sh
  • janitor/pkg/config/default.go
  • janitor/pkg/gpuservices/manager.go

Comment thread: gpu-reset/gpu_reset.sh
log "ERROR: Post-reset health check failed. See details below:"
sed 's/^/ /' "$HEALTH_CHECK_OUTPUT_FILE"
FINAL_EXIT_STATUS=1
FINAL_EXIT_STATUS=${FINAL_EXIT_STATUS:-1}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Redundant parameter expansion default.

FINAL_EXIT_STATUS is always initialized to 0 at line 54 and is guaranteed to be 0 when this branch executes (per the check at line 103). The :-1 default is never used.

Suggested fix
-    FINAL_EXIT_STATUS=${FINAL_EXIT_STATUS:-1}
+    FINAL_EXIT_STATUS=1


github-actions Bot commented May 1, 2026

