Prevent same physical device for pods in a podgroup with vGPU #5049
goyalankit wants to merge 4 commits into volcano-sh:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of the affected OWNERS files; approvers can indicate their approval by writing the approval command in a comment. The full list of commands accepted by this bot can be found here.
Welcome @goyalankit! It looks like this is your first PR to volcano-sh/volcano 🎉
Summary of Changes (Gemini Code Assist): This pull request enhances vGPU scheduling by preventing multiple pods from the same PodGroup from being assigned to the same physical GPU device. The feature, controlled by a new flag, addresses potential NCCL communication failures in distributed training workloads by ensuring better device spread for co-located pods. The changes update core data structures, add new utility functions, and integrate the spread logic into the scheduler's predicate functions.
Activity
Signed-off-by: Ankit Goyal <ankit3goyal@gmail.com>
Pull request overview
Adds an optional vGPU scheduling constraint (behind deviceshare.VGPUPodGroupDeviceSpread) to avoid assigning the same physical GPUDevice to multiple pods from the same PodGroup, preventing intra-device NCCL communication failures.
Changes:
- Introduces a new scheduler/plugin argument `deviceshare.VGPUPodGroupDeviceSpread` and wires it to the vgpu package feature flag.
- Tracks a per-pod `PodGroupKey` in vGPU device usage and skips devices already used by the same PodGroup during allocation.
- Adds/updates unit tests and documents the new configuration option.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/scheduler/plugins/deviceshare/deviceshare.go | Adds new plugin argument constant and plumbs it into vgpu.VGPUPodGroupDeviceSpread. |
| pkg/scheduler/api/devices/nvidia/vgpu/type.go | Adds global feature flag for podgroup device spreading. |
| pkg/scheduler/api/devices/nvidia/vgpu/utils.go | Implements PodGroup key extraction + device filtering logic gated by the feature flag. |
| pkg/scheduler/api/devices/nvidia/vgpu/device_info.go | Persists PodGroupKey onto GPUUsage entries when pods are added/tracked. |
| pkg/scheduler/api/devices/nvidia/vgpu/utils_test.go | Adds unit tests for new helper functions (currently contains a compile issue). |
| pkg/scheduler/api/devices/nvidia/vgpu/device_info_test.go | Adds test ensuring PodGroupKey is set when adding resources. |
| docs/user-guide/how_to_use_volcano_vgpu.md | Documents the new optional scheduler argument. |
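The documented scheduler argument is wired through the deviceshare plugin's arguments. A hypothetical `volcano-scheduler.conf` fragment is sketched below; the surrounding actions, tiers, and sibling plugins are assumptions for illustration, only the `deviceshare.VGPUEnable` and `deviceshare.VGPUPodGroupDeviceSpread` keys come from this PR and its testing notes.

```yaml
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: predicates
  - name: deviceshare
    arguments:
      deviceshare.VGPUEnable: true
      # New optional flag from this PR: spread pods of one PodGroup
      # across different physical GPU devices.
      deviceshare.VGPUPodGroupDeviceSpread: true
```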
```go
			gd:         &GPUDevice{PodMap: map[string]*GPUUsage{"uid1": {PodGroupKey: "ns/pg", UsedMem: 1000}}},
			currentKey: "",
			want:       false,
		},
		{
			name:       "no pod from same group",
			gd:         &GPUDevice{PodMap: map[string]*GPUUsage{"uid1": {PodGroupKey: "ns/other", UsedMem: 1000}}},
			currentKey: "ns/my-pg",
			want:       false,
		},
		{
			name:       "same group with non-zero usage",
			gd:         &GPUDevice{PodMap: map[string]*GPUUsage{"uid1": {PodGroupKey: "ns/my-pg", UsedMem: 1000, UsedCore: 50}}},
			currentKey: "ns/my-pg",
			want:       true,
		},
		{
			name:       "same group but zero usage (released)",
			gd:         &GPUDevice{PodMap: map[string]*GPUUsage{"uid1": {PodGroupKey: "ns/my-pg", UsedMem: 0, UsedCore: 0}}},
```
This test won't compile: PodMap is a map[string]*GPUUsage, but the literals use struct values ({...}) instead of pointers. Use &GPUUsage{...} (or change the map value type) for each entry to satisfy the declared type.
Suggested change (take the address of each `GPUUsage` literal explicitly):

```go
			gd:         &GPUDevice{PodMap: map[string]*GPUUsage{"uid1": &GPUUsage{PodGroupKey: "ns/pg", UsedMem: 1000}}},
			currentKey: "",
			want:       false,
		},
		{
			name:       "no pod from same group",
			gd:         &GPUDevice{PodMap: map[string]*GPUUsage{"uid1": &GPUUsage{PodGroupKey: "ns/other", UsedMem: 1000}}},
			currentKey: "ns/my-pg",
			want:       false,
		},
		{
			name:       "same group with non-zero usage",
			gd:         &GPUDevice{PodMap: map[string]*GPUUsage{"uid1": &GPUUsage{PodGroupKey: "ns/my-pg", UsedMem: 1000, UsedCore: 50}}},
			currentKey: "ns/my-pg",
			want:       true,
		},
		{
			name:       "same group but zero usage (released)",
			gd:         &GPUDevice{PodMap: map[string]*GPUUsage{"uid1": &GPUUsage{PodGroupKey: "ns/my-pg", UsedMem: 0, UsedCore: 0}}},
```
```go
func deviceHasPodFromSameGroup(gd *GPUDevice, currentKey string) bool {
	if currentKey == "" {
		return false
	}
	for _, usage := range gd.PodMap {
		if usage.PodGroupKey == currentKey && (usage.UsedMem > 0 || usage.UsedCore > 0) {
			return true
		}
	}
	return false
}
```
deviceHasPodFromSameGroup dereferences gd.PodMap entries as *GPUUsage without checking for gd == nil or usage == nil. Since PodMap is a map[string]*GPUUsage, a nil value would panic here. Add defensive checks (e.g., return false when gd == nil, and skip nil usage entries).
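A defensive version along the lines the reviewer asks for might look as follows. The `GPUUsage` and `GPUDevice` stand-ins below include only the fields this function touches; the real types in `pkg/scheduler/api/devices/nvidia/vgpu` carry more state, so this is a sketch, not the PR's implementation.

```go
package main

import "fmt"

// Minimal stand-ins for the vgpu package types (field set is an assumption,
// covering only what deviceHasPodFromSameGroup reads).
type GPUUsage struct {
	PodGroupKey string
	UsedMem     uint
	UsedCore    uint
}

type GPUDevice struct {
	PodMap map[string]*GPUUsage
}

// deviceHasPodFromSameGroup with defensive checks: a nil device or a nil
// usage entry in PodMap no longer causes a panic.
func deviceHasPodFromSameGroup(gd *GPUDevice, currentKey string) bool {
	if gd == nil || currentKey == "" {
		return false
	}
	for _, usage := range gd.PodMap {
		if usage == nil {
			continue // skip nil entries instead of dereferencing them
		}
		if usage.PodGroupKey == currentKey && (usage.UsedMem > 0 || usage.UsedCore > 0) {
			return true
		}
	}
	return false
}

func main() {
	gd := &GPUDevice{PodMap: map[string]*GPUUsage{
		"uid1": nil, // would panic without the nil check
		"uid2": {PodGroupKey: "ns/my-pg", UsedMem: 1000},
	}}
	fmt.Println(deviceHasPodFromSameGroup(nil, "ns/my-pg"))
	fmt.Println(deviceHasPodFromSameGroup(gd, "ns/my-pg"))
}
```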
Code Review
This pull request introduces the VGPUPodGroupDeviceSpread feature, aiming to prevent multiple pods from the same PodGroup from being scheduled onto the same physical GPU device, which helps avoid NCCL communication issues in distributed workloads. However, the implementation introduces a significant race condition in the PodMap structure. This PodMap is shared across scheduling snapshots and is accessed and modified concurrently without synchronization, which can lead to scheduler panics and Denial of Service. Additionally, there is a minor redundancy in the state management logic that could be optimized.
```go
		PodGroupKey: getPodGroupKey(pod),
	}
}
// ...
gsdevice.PodMap[podUID].UsedMem += deviceused.Usedmem
```
The map write to gsdevice.PodMap (lines 230-234) is not synchronized and can occur concurrently with map iterations in deviceHasPodFromSameGroup (introduced in utils.go), leading to a runtime panic and Denial of Service vulnerability. Concurrent map writes and iterations in Go are not thread-safe. While PodGroupKey is initialized within the if !ok block at line 233, the overall access to gsdevice.PodMap requires thread-safe synchronization, especially since GPUDevices is shared and updated asynchronously.
```go
for _, usage := range gd.PodMap {
	if usage.PodGroupKey == currentKey && (usage.UsedMem > 0 || usage.UsedCore > 0) {
		return true
	}
}
```
The deviceHasPodFromSameGroup function iterates over gd.PodMap without any synchronization. This map is shared across snapshots and is concurrently modified by AddResource, SubResource, and Allocate calls (which can be triggered by asynchronous event handlers). Concurrent map iteration and write operations in Go result in a runtime panic, which would crash the scheduler and cause a Denial of Service. To fix this, ensure that PodMap is deep-copied when creating a snapshot in getGPUDeviceSnapShot, or use a mutex to synchronize access.
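One way to address this is the deep-copy-on-snapshot approach the reviewer suggests, guarded by a lock: writers mutate `PodMap` under a mutex, and the scheduler takes a value copy per cycle so predicate-time iteration never races the asynchronous event handlers. The sketch below uses stand-in types; the `snapshot` and `addUsage` helpers are illustrative names, not functions from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

type GPUUsage struct {
	PodGroupKey string
	UsedMem     uint
	UsedCore    uint
}

type GPUDevice struct {
	mu     sync.RWMutex // guards PodMap against concurrent iteration/write
	PodMap map[string]*GPUUsage
}

// snapshot returns a deep copy of PodMap taken under the read lock; the
// predicate can iterate the copy without racing event-handler writes.
func (gd *GPUDevice) snapshot() map[string]GPUUsage {
	gd.mu.RLock()
	defer gd.mu.RUnlock()
	out := make(map[string]GPUUsage, len(gd.PodMap))
	for uid, u := range gd.PodMap {
		if u != nil {
			out[uid] = *u // copy the value, not the shared pointer
		}
	}
	return out
}

// addUsage mutates PodMap under the write lock (mirrors what AddResource
// would need to do).
func (gd *GPUDevice) addUsage(uid string, u GPUUsage) {
	gd.mu.Lock()
	defer gd.mu.Unlock()
	if gd.PodMap == nil {
		gd.PodMap = map[string]*GPUUsage{}
	}
	gd.PodMap[uid] = &u
}

func main() {
	gd := &GPUDevice{}
	gd.addUsage("uid1", GPUUsage{PodGroupKey: "ns/pg", UsedMem: 1000})
	snap := gd.snapshot()
	gd.addUsage("uid2", GPUUsage{PodGroupKey: "ns/pg"}) // later write leaves snap untouched
	fmt.Println(len(snap))
}
```

The trade-off is extra copying per scheduling cycle; a plain mutex around every `PodMap` access avoids the copy but serializes readers.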
Signed-off-by: Ankit Goyal <ankit3goyal@gmail.com>
Replace locally defined podGroupAnnotationKey and volcanoPodGroupAnnotation constants with the canonical KubeGroupNameAnnotationKey and VolcanoGroupNameAnnotationKey from volcano.sh/apis/pkg/apis/scheduling/v1beta1. Signed-off-by: Ankit Goyal <ankit3goyal@gmail.com>
Signed-off-by: Ankit Goyal <ankit3goyal@gmail.com>
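The commit above replaces locally defined annotation constants with the canonical ones from `volcano.sh/apis/pkg/apis/scheduling/v1beta1`. A self-contained sketch of how a PodGroup key could be derived from those annotations follows; the helper name `getPodGroupKey` appears in the diff, but the constant values and the simplified pod struct here are stand-ins, not the real imports.

```go
package main

import "fmt"

// Stand-ins for the canonical constants from
// volcano.sh/apis/pkg/apis/scheduling/v1beta1 (values are assumptions here).
const (
	KubeGroupNameAnnotationKey    = "scheduling.k8s.io/group-name"
	VolcanoGroupNameAnnotationKey = "volcano.sh/group-name"
)

// Simplified stand-in for a Kubernetes pod.
type pod struct {
	Namespace   string
	Annotations map[string]string
}

// getPodGroupKey builds a "namespace/podgroup" key from either annotation,
// returning "" when the pod belongs to no PodGroup.
func getPodGroupKey(p *pod) string {
	if p == nil {
		return ""
	}
	for _, key := range []string{KubeGroupNameAnnotationKey, VolcanoGroupNameAnnotationKey} {
		if name, ok := p.Annotations[key]; ok && name != "" {
			return p.Namespace + "/" + name
		}
	}
	return ""
}

func main() {
	p := &pod{Namespace: "ns", Annotations: map[string]string{KubeGroupNameAnnotationKey: "my-pg"}}
	fmt.Println(getPodGroupKey(p))
	fmt.Println(getPodGroupKey(&pod{Namespace: "ns"}) == "")
}
```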
What type of PR is this?
Feature
What this PR does / why we need it:
This PR adds a feature to prevent the same GPUDevice from being assigned to two pods in the same PodGroup. This is needed to prevent NCCL communication failures between pods colocated on the same physical device.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This feature is behind a feature flag and has been tested with a local minikube instance.
Testing
Built a custom scheduler image from the branch and deployed it with VGPUEnable: true and
VGPUPodGroupDeviceSpread: true. Faked two vGPU devices on the minikube node via
volcano.sh/node-vgpu-register and volcano.sh/node-vgpu-handshake annotations, and patched the node's
allocatable resources accordingly. Created a PodGroup with minMember: 2 and two pods each requesting one
vGPU.
With VGPUPodGroupDeviceSpread: true: the scheduler assigned the two pods to different GPU devices
(GPU-fake-...0001 and GPU-fake-...0002).
With VGPUPodGroupDeviceSpread: false: the scheduler assigned both pods to the same GPU device
(GPU-fake-...0002), confirming the flag correctly gates the behavior.
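The manifests for the setup described above might look like the following. The `volcano.sh/vgpu-number` resource name matches the vGPU user guide; the PodGroup name, namespace, image, and annotation values are illustrative, and a second pod identical to the one shown (apart from its name) completes the `minMember: 2` gang.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: my-pg
  namespace: default
spec:
  minMember: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-pod-1
  namespace: default
  annotations:
    scheduling.k8s.io/group-name: my-pg
spec:
  schedulerName: volcano
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        volcano.sh/vgpu-number: 1
```

With `VGPUPodGroupDeviceSpread: true`, the two gang members should land on different physical devices, matching the observed behavior above.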
Does this PR introduce a user-facing change?