
feat(gpu): implement gpu plugins #1008

Merged
xu282934741 merged 57 commits into kubewharf:main from JustinChengLZ:dev/support-gpu-plugins
Mar 3, 2026

Conversation

@JustinChengLZ (Collaborator) commented Oct 29, 2025

What type of PR is this?

  • Implement a framework for GPU plugins (custom device plugin and resource plugin interfaces)
  • Implement allocation logic for the gpu-memory and gpu resources
  • Support device affinity when collecting device topology
  • Implement a strategy framework for allocating devices

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:
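As a rough sketch of the two plugin roles the PR description mentions, a custom device plugin enumerates devices while a resource plugin answers topology-hint and allocation calls. All interface and type names below are hypothetical illustrations, not the actual katalyst API:

```go
package main

import "fmt"

// Hypothetical sketches of the two plugin interfaces; the real katalyst
// interfaces carry gRPC request/response types and much more context.
type customDevicePlugin interface {
	ResourceName() string
	ListDevices() []string
}

type resourcePlugin interface {
	ResourceName() string
	GetTopologyHints(request int) []int // candidate NUMA nodes
	Allocate(request int) ([]string, error)
}

// gpuPlugin is a toy implementation backing both roles with a static
// device-to-NUMA-node map.
type gpuPlugin struct {
	devices map[string]int // device id -> NUMA node
}

func (g gpuPlugin) ResourceName() string { return "nvidia.com/gpu" }

func (g gpuPlugin) ListDevices() []string {
	out := make([]string, 0, len(g.devices))
	for id := range g.devices {
		out = append(out, id)
	}
	return out
}

// GetTopologyHints reports each NUMA node that hosts at least one device.
func (g gpuPlugin) GetTopologyHints(request int) []int {
	seen := map[int]bool{}
	var hints []int
	for _, numa := range g.devices {
		if !seen[numa] {
			seen[numa] = true
			hints = append(hints, numa)
		}
	}
	return hints
}

// Allocate hands out the first `request` devices, failing if too few exist.
func (g gpuPlugin) Allocate(request int) ([]string, error) {
	ids := g.ListDevices()
	if request > len(ids) {
		return nil, fmt.Errorf("want %d devices, have %d", request, len(ids))
	}
	return ids[:request], nil
}

func main() {
	var p resourcePlugin = gpuPlugin{devices: map[string]int{"gpu-0": 0, "gpu-1": 1}}
	got, _ := p.Allocate(1)
	fmt.Println(p.ResourceName(), len(got)) // nvidia.com/gpu 1
}
```

Splitting the two roles into separate interfaces lets one concrete plugin satisfy both while keeping the device-enumeration and resource-allocation call paths independent.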

@codecov
codecov bot commented Oct 29, 2025

Codecov Report

❌ Patch coverage is 63.29491% with 987 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.82%. Comparing base (716974a) to head (bdd283d).
⚠️ Report is 70 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/agent/qrm-plugins/gpu/staticpolicy/policy.go | 48.78% | 174 Missing and 36 partials ⚠️ |
| ...rm-plugins/gpu/resourceplugin/gpumemory/gpu_mem.go | 64.32% | 113 Missing and 29 partials ⚠️ |
| ...strategy/allocate/strategies/allocation/generic.go | 10.71% | 75 Missing ⚠️ |
| pkg/agent/qrm-plugins/gpu/state/state.go | 62.16% | 54 Missing and 16 partials ⚠️ |
| ...kg/agent/qrm-plugins/gpu/state/state_checkpoint.go | 54.72% | 48 Missing and 19 partials ⚠️ |
| ...nt/qrm-plugins/gpu/baseplugin/reporter/reporter.go | 53.67% | 51 Missing and 12 partials ⚠️ |
| ...gent/qrm-plugins/gpu/customdeviceplugin/gpu/gpu.go | 63.41% | 39 Missing and 6 partials ⚠️ |
| ...m-plugins/gpu/strategy/allocate/manager/manager.go | 45.67% | 40 Missing and 4 partials ⚠️ |
| pkg/util/machine/device.go | 80.00% | 24 Missing and 10 partials ⚠️ |
| ...trategy/allocate/strategies/deviceaffinity/bind.go | 84.47% | 19 Missing and 6 partials ⚠️ |
| ... and 20 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1008      +/-   ##
==========================================
+ Coverage   60.53%   60.82%   +0.29%     
==========================================
  Files         700      737      +37     
  Lines       66807    69521    +2714     
==========================================
+ Hits        40439    42284    +1845     
- Misses      21809    22504     +695     
- Partials     4559     4733     +174     
| Flag | Coverage Δ |
| --- | --- |
| unittest | 60.82% <63.29%> (+0.29%) ⬆️ |

Flags with carried forward coverage won't be shown.


@JustinChengLZ force-pushed the dev/support-gpu-plugins branch 11 times, most recently from e57c066 to 3a1552c on November 5, 2025 01:41
@JustinChengLZ force-pushed the dev/support-gpu-plugins branch 8 times, most recently from 4e39a97 to 969a658 on November 12, 2025 08:12
@luomingmeng force-pushed the dev/support-gpu-plugins branch 5 times, most recently from eac1657 to 95fc12f on November 20, 2025 03:34
JustinChengLZ and others added 22 commits February 26, 2026 17:44
refactor: make allocation recursive to simplify logic
- Add logging for missing device affinity provider
- Fix negative NUMA node handling in GPU memory allocation
- Simplify GPU topology lookup logic
- Ensure proper handling of zero quantity resource requests
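The recursive-allocation refactor and zero-quantity handling above can be sketched roughly as follows; `device` and `allocateRecursive` are illustrative names, not the actual katalyst code:

```go
package main

import "fmt"

// device is a hypothetical stand-in for a GPU device entry.
type device struct {
	id      string
	healthy bool
}

// allocateRecursive picks `want` healthy devices from devs by recursing on
// the remainder; recursion keeps the selection in one place instead of a
// nested loop with manual bookkeeping.
func allocateRecursive(devs []device, want int) []string {
	if want == 0 || len(devs) == 0 {
		return nil // zero-quantity requests allocate nothing
	}
	head, rest := devs[0], devs[1:]
	if !head.healthy {
		return allocateRecursive(rest, want) // skip unhealthy devices
	}
	picked := allocateRecursive(rest, want-1)
	if len(picked) != want-1 {
		return nil // not enough healthy devices left
	}
	return append([]string{head.id}, picked...)
}

func main() {
	devs := []device{{"gpu-0", true}, {"gpu-1", false}, {"gpu-2", true}}
	fmt.Println(allocateRecursive(devs, 2)) // [gpu-0 gpu-2]
}
```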
…ory strategy

implement canonical strategy with separate filter and bind logic
simplify gpu memory strategy by removing numa affinity check
register canonical strategy as default allocation strategy
- Simplify affinity group structure by using sets.String for unallocated devices
- Implement more efficient allocation strategy with priority-based processing
- Remove unused types and consolidate allocation logic
- Add helper methods for group evaluation and sorting
- Improve error handling and early termination conditions
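The separate filter and bind phases mentioned in these commits can be sketched as a two-method strategy interface; the `strategy`/`canonical` names and the plain-map stand-in for `sets.String` are hypothetical, not the actual katalyst types:

```go
package main

import (
	"fmt"
	"sort"
)

// strategy sketches a filter/bind split: Filter narrows the candidate set,
// Bind commits a selection from the filtered set.
type strategy interface {
	Filter(candidates []string) []string
	Bind(filtered []string, want int) []string
}

// canonical is a minimal default strategy: drop already-allocated devices,
// then bind the first `want` in sorted order for determinism.
type canonical struct {
	allocated map[string]struct{} // stand-in for a sets.String of used devices
}

func (c canonical) Filter(candidates []string) []string {
	var out []string
	for _, id := range candidates {
		if _, used := c.allocated[id]; !used {
			out = append(out, id)
		}
	}
	return out
}

func (c canonical) Bind(filtered []string, want int) []string {
	sort.Strings(filtered)
	if want > len(filtered) {
		return nil
	}
	return filtered[:want]
}

func main() {
	var s strategy = canonical{allocated: map[string]struct{}{"gpu-1": {}}}
	free := s.Filter([]string{"gpu-0", "gpu-1", "gpu-2"})
	fmt.Println(s.Bind(free, 2)) // [gpu-0 gpu-2]
}
```

Keeping the two phases separate lets more specialized strategies (e.g. device affinity) reuse a shared filter while swapping only the bind step.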
The nil device request check was moved to after QoS validation to ensure the proper resource allocation sequence. Behavior is unchanged, but the logical flow of the allocation process is clearer.
remove redundant gpuCount and gpuNames logging in GetTopologyHints and Allocate
move GetGPUCount call after initial checks in GetTopologyHints
use deviceReq.DeviceRequest instead of gpuCount for memory calculation
filter out unhealthy gpu devices when calculating topology hints
skip numa binding hints for non-numa binding requests
Set allocatable memory to zero for unhealthy GPU devices and use separate capacity values instead of reusing allocatable values. This ensures accurate resource accounting for both healthy and unhealthy devices.
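The accounting change above, where unhealthy GPUs still count toward capacity but report zero allocatable, can be sketched like this (the `gpuDev` record and `resourceTotals` helper are hypothetical illustrations):

```go
package main

import "fmt"

// gpuDev is a hypothetical device record used to illustrate the accounting
// change: unhealthy devices keep their capacity but report zero allocatable.
type gpuDev struct {
	id          string
	healthy     bool
	capacityMiB int64
}

// resourceTotals keeps capacity and allocatable as separate values so that
// an unhealthy GPU still counts toward capacity but contributes nothing
// allocatable, instead of reusing one value for both.
func resourceTotals(devs []gpuDev) (capacity, allocatable int64) {
	for _, d := range devs {
		capacity += d.capacityMiB
		if d.healthy {
			allocatable += d.capacityMiB
		}
	}
	return
}

func main() {
	devs := []gpuDev{
		{"gpu-0", true, 81920},
		{"gpu-1", false, 81920}, // unhealthy: capacity only
	}
	c, a := resourceTotals(devs)
	fmt.Println(c, a) // 163840 81920
}
```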
Clean up GenericQRMPluginConfiguration by removing unused StateFileDirectory and InMemoryStateFileDirectory fields to simplify the struct.
Return nil instead of error when numa topology is not ready and log the error

fix: handle error gracefully
- Add DeviceName field to AllocationInfo struct to track GPU device names
- Implement GetQuantityAllocatedWithFilter to support filtered allocation queries
- Modify GPU memory plugin to consider device names during allocation
- Remove NUMA binding check and use device name filtering instead
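The filtered allocation query described above can be sketched as a sum over allocation records that pass a caller-supplied predicate; the slimmed-down `allocationInfo` struct and the function name here are illustrative, not the actual katalyst signatures:

```go
package main

import "fmt"

// allocationInfo is a hypothetical slimmed-down record; the real struct
// carries much more. DeviceName ties an allocation to a specific GPU.
type allocationInfo struct {
	PodUID     string
	DeviceName string
	Quantity   float64
}

// getQuantityAllocatedWithFilter sums the quantities of allocations that
// pass the filter, e.g. "only allocations on this device name".
func getQuantityAllocatedWithFilter(infos []allocationInfo, keep func(allocationInfo) bool) float64 {
	var total float64
	for _, ai := range infos {
		if keep(ai) {
			total += ai.Quantity
		}
	}
	return total
}

func main() {
	infos := []allocationInfo{
		{"pod-a", "gpu-0", 8192},
		{"pod-b", "gpu-1", 4096},
		{"pod-c", "gpu-0", 2048},
	}
	onGPU0 := func(ai allocationInfo) bool { return ai.DeviceName == "gpu-0" }
	fmt.Println(getQuantityAllocatedWithFilter(infos, onGPU0)) // 10240
}
```

Passing the predicate in lets the gpu-memory plugin ask per-device questions without baking a NUMA-binding check into the state layer.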
JustinChengLZ and others added 5 commits March 3, 2026 11:18
add ensureState method to handle state initialization and resource registration
update all policy methods to use ensureState for consistent state management
modify tests to verify lazy initialization behavior
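The `ensureState` lazy-initialization pattern described in these commits can be sketched with `sync.Once`; the `policy` type and its fields are hypothetical stand-ins for the real state machinery:

```go
package main

import (
	"fmt"
	"sync"
)

// policy sketches lazy state initialization: every public method calls
// ensureState first, so state setup and resource registration happen exactly
// once, on first use, regardless of which method is hit first.
type policy struct {
	once  sync.Once
	state map[string]string // stand-in for the real State interface
}

func (p *policy) ensureState() {
	p.once.Do(func() {
		p.state = make(map[string]string)
		p.state["registered"] = "gpu-memory" // stand-in resource registration
	})
}

func (p *policy) Allocate(pod string) string {
	p.ensureState() // consistent state management across all policy methods
	p.state[pod] = "gpu-0"
	return p.state[pod]
}

func main() {
	p := &policy{}
	fmt.Println(p.Allocate("pod-a")) // gpu-0
}
```

`sync.Once` makes the initialization safe under concurrent method calls, which is why it is preferable to a bare `if p.state == nil` check.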

Labels

workflow/need-review review: test succeeded, need to review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants