feat: inject GPU devices with HostPath volumes rather than using NVIDIA_VISIBLE_DEVICES #1243
natherz97 wants to merge 1 commit into NVIDIA:main
Conversation
…IA_VISIBLE_DEVICES

Signed-off-by: Nathan Herz <nherz@nvidia.com>
📝 Walkthrough

The PR enhances GPU reset functionality to support a configurable driver root.

Changes:
- Driver Root Support for GPU Reset
- GPU Feature Discovery Service Registration
Actionable comments posted: 1
📒 Files selected for processing (3)
- gpu-reset/gpu_reset.sh
- janitor/pkg/config/default.go
- janitor/pkg/gpuservices/manager.go
```sh
log "ERROR: Post-reset health check failed. See details below:"
sed 's/^/ /' "$HEALTH_CHECK_OUTPUT_FILE"
FINAL_EXIT_STATUS=1
FINAL_EXIT_STATUS=${FINAL_EXIT_STATUS:-1}
```
Redundant parameter expansion default.
FINAL_EXIT_STATUS is always initialized to 0 at line 54 and is guaranteed to be 0 when this branch executes (per the check at line 103), so the `:-1` default at line 111 is never used.
Suggested fix

```diff
- FINAL_EXIT_STATUS=${FINAL_EXIT_STATUS:-1}
+ FINAL_EXIT_STATUS=1
```
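As a standalone illustration of the expansion semantics behind this finding (not the actual script), `${VAR:-default}` substitutes the default only when the variable is unset or empty:

```shell
#!/bin/sh
# ${VAR:-default} falls back only when VAR is unset or empty.
FINAL_EXIT_STATUS=0                 # mirrors the initialization the review cites
echo "${FINAL_EXIT_STATUS:-1}"      # prints 0: the variable is set, so the default is ignored

unset MAYBE_UNSET
echo "${MAYBE_UNSET:-1}"            # prints 1: unset, so the default applies

# Since FINAL_EXIT_STATUS is guaranteed non-empty at this point,
# a plain assignment is equivalent to the :-1 form and clearer:
FINAL_EXIT_STATUS=1
echo "$FINAL_EXIT_STATUS"           # prints 1
```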
🌿 Fern Docs Preview: https://nvidia-preview-pull-request-1243.docs.buildwithfern.com/nvsentinel |
Summary
This PR includes 3 changes to the GPU reset workflow:
Why do we need to remove the nvidia-container-toolkit?
We have observed that GPUs cannot be injected into the privileged GPU reset container via the NVIDIA_VISIBLE_DEVICES env var when the given GPU has raised certain XIDs that require a reset.
We were unable to start our reset job, which needed GPUs injected by the nvidia-container-toolkit:
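For context, toolkit-based injection is typically requested with an env var on the pod; a minimal, hypothetical sketch (pod name, image, and device UUID are placeholders, not taken from this PR):

```yaml
# Env-var based injection (the path that fails here): the
# nvidia-container-toolkit runtime hook reads this variable and mounts
# the matching GPU device nodes into the container at start-up.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-reset-env          # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: reset
      image: gpu-reset:latest  # placeholder image
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"  # placeholder UUID
```

When the target GPU is in a bad state (certain XIDs), this injection path can fail before the container ever starts, which is what motivates bypassing the toolkit for the reset pod.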
How to support GPU reset inside a container
2a. Using HostPath volumes and relative paths with chroot (new mode):
2b. Using HostPath volumes and absolute paths (not chosen):
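A minimal sketch of what approach 2a could look like (all names, paths, and the image are assumptions for illustration; the actual manifest is not shown in this excerpt):

```yaml
# Approach 2a sketch: mount the host's root filesystem with a HostPath
# volume, then chroot into it so the reset tooling sees the host's own
# device nodes and driver paths without toolkit-based injection.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-reset-hostpath     # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: reset
      image: gpu-reset:latest  # placeholder image
      securityContext:
        privileged: true       # needed to reach host device nodes
      command: ["chroot", "/host", "/usr/local/bin/gpu_reset.sh"]  # illustrative path
      volumeMounts:
        - name: host-root
          mountPath: /host
  volumes:
    - name: host-root
      hostPath:
        path: /
```

Under 2b, the container would instead address host files through the mount point with absolute paths (e.g. `/host/dev/...`, `/host/usr/bin/...`), which spreads the `/host` prefix through every invocation; chroot keeps paths relative to the host root.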
Testing
Type of Change
Component(s) Affected
Checklist
Summary by CodeRabbit
Release Notes
New Features
Improvements