feat(gpu): implement gpu plugins #1008
Merged
xu282934741 merged 57 commits into kubewharf:main on Mar 3, 2026
Codecov Report
@@ Coverage Diff @@
## main #1008 +/- ##
==========================================
+ Coverage 60.53% 60.82% +0.29%
==========================================
Files 700 737 +37
Lines 66807 69521 +2714
==========================================
+ Hits 40439 42284 +1845
- Misses 21809 22504 +695
- Partials 4559 4733 +174
force-pushed from e57c066 to 3a1552c
force-pushed from 4e39a97 to 969a658
force-pushed from eac1657 to 95fc12f
refactor: make allocation recursive to simplify logic
- Add logging for missing device affinity provider
- Fix negative NUMA node handling in GPU memory allocation
- Simplify GPU topology lookup logic
- Ensure proper handling of zero quantity resource requests
…ory strategy
implement canonical strategy with separate filter and bind logic
simplify gpu memory strategy by removing numa affinity check
register canonical strategy as default allocation strategy
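To illustrate the filter/bind split described in this commit, here is a minimal sketch. The `Device` and `Strategy` types, the `strategies` registry, and `Allocate` are hypothetical names made up for this example and are not taken from the PR; the real plugin types are likely richer.

```go
package main

import (
	"errors"
	"sort"
)

// Device is a hypothetical, simplified view of a GPU for this sketch.
type Device struct {
	ID      string
	Healthy bool
	FreeMem int64 // free memory in arbitrary units
}

// Strategy separates candidate selection (Filter) from the final choice (Bind).
type Strategy interface {
	Filter(devices []Device, memPerDevice int64) []Device
	Bind(candidates []Device, count int) ([]string, error)
}

type canonicalStrategy struct{}

// Filter keeps healthy devices that can hold the requested memory.
func (canonicalStrategy) Filter(devices []Device, memPerDevice int64) []Device {
	var out []Device
	for _, d := range devices {
		if d.Healthy && d.FreeMem >= memPerDevice {
			out = append(out, d)
		}
	}
	return out
}

// Bind picks the first count candidates in ID order (an assumption of this sketch).
func (canonicalStrategy) Bind(candidates []Device, count int) ([]string, error) {
	if count < 0 || len(candidates) < count {
		return nil, errors.New("not enough candidate devices")
	}
	sorted := append([]Device(nil), candidates...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })
	ids := make([]string, 0, count)
	for _, d := range sorted[:count] {
		ids = append(ids, d.ID)
	}
	return ids, nil
}

// The canonical strategy is registered and used as the default.
const defaultStrategy = "canonical"

var strategies = map[string]Strategy{defaultStrategy: canonicalStrategy{}}

// Allocate runs the default strategy: filter first, then bind.
func Allocate(devices []Device, memPerDevice int64, count int) ([]string, error) {
	s := strategies[defaultStrategy]
	return s.Bind(s.Filter(devices, memPerDevice), count)
}
```

Keeping the two phases behind one interface means a different strategy could reuse the same `Allocate` entry point by registering under another key.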
- Simplify affinity group structure by using sets.String for unallocated devices
- Implement more efficient allocation strategy with priority-based processing
- Remove unused types and consolidate allocation logic
- Add helper methods for group evaluation and sorting
- Improve error handling and early termination conditions
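The priority-based processing over affinity groups could look roughly like the sketch below. It is an illustration only: `AffinityGroup`, `allocateByPriority`, and the "lower value means higher priority" convention are assumptions, and a plain `map[string]struct{}` stands in for `sets.String` so the example needs only the standard library.

```go
package main

import "sort"

// stringSet stands in for sets.String from k8s.io/apimachinery in this sketch.
type stringSet map[string]struct{}

// AffinityGroup is a hypothetical group of devices sharing some affinity.
type AffinityGroup struct {
	Priority    int       // lower value is processed first (assumption)
	Unallocated stringSet // device IDs still free in this group
}

// sortedIDs returns the set members in deterministic order.
func sortedIDs(s stringSet) []string {
	ids := make([]string, 0, len(s))
	for id := range s {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// allocateByPriority walks groups in priority order and stops at the first
// group that can satisfy the whole request. It returns nil if none can.
func allocateByPriority(groups []AffinityGroup, need int) []string {
	if need <= 0 {
		return nil
	}
	ordered := append([]AffinityGroup(nil), groups...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Priority < ordered[j].Priority
	})
	for _, g := range ordered {
		if len(g.Unallocated) >= need {
			return sortedIDs(g.Unallocated)[:need] // early termination
		}
	}
	return nil
}
```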
The nil device request check was moved to after qos validation to ensure proper resource allocation sequence. This change maintains the same behavior but improves the logical flow of the allocation process.
remove redundant gpuCount and gpuNames logging in GetTopologyHints and Allocate
move GetGPUCount call after initial checks in GetTopologyHints
use deviceReq.DeviceRequest instead of gpuCount for memory calculation
filter out unhealthy gpu devices when calculating topology hints
skip numa binding hints for non-numa binding requests
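A rough sketch of both behaviours in this commit, under stated assumptions: the `GPU` type and `numaHints` function are hypothetical, and a "hint" is reduced here to a list of NUMA node IDs that have enough healthy GPUs. Skipping negative NUMA node IDs is also an assumption of the sketch.

```go
package main

import "sort"

// GPU is a hypothetical, simplified device record.
type GPU struct {
	ID       string
	Healthy  bool
	NUMANode int
}

// numaHints returns the NUMA nodes that have at least count healthy GPUs.
// It returns nil when the request does not ask for NUMA binding.
func numaHints(gpus []GPU, count int, numaBinding bool) []int {
	if !numaBinding {
		return nil // no NUMA hints for non-NUMA-binding requests
	}
	healthyPerNode := map[int]int{}
	for _, g := range gpus {
		// Unhealthy devices are ignored; negative node IDs are skipped (assumption).
		if g.Healthy && g.NUMANode >= 0 {
			healthyPerNode[g.NUMANode]++
		}
	}
	var nodes []int
	for node, n := range healthyPerNode {
		if n >= count {
			nodes = append(nodes, node)
		}
	}
	sort.Ints(nodes)
	return nodes
}
```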
Set allocatable memory to zero for unhealthy GPU devices and use separate capacity values instead of reusing allocatable values. This ensures accurate resource accounting for both healthy and unhealthy devices.
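The accounting rule described here can be sketched as follows. The `GPU` and `Resource` types and the `summarize` helper are hypothetical and exist only to show capacity and allocatable being tracked separately.

```go
package main

// GPU is a hypothetical device record with its total memory.
type GPU struct {
	ID      string
	Healthy bool
	Memory  int64
}

// Resource keeps capacity and allocatable as separate values.
type Resource struct {
	Capacity    int64
	Allocatable int64
}

// summarize counts every device toward capacity, but only healthy
// devices toward allocatable, so unhealthy devices contribute zero.
func summarize(gpus []GPU) Resource {
	var r Resource
	for _, g := range gpus {
		r.Capacity += g.Memory
		if g.Healthy {
			r.Allocatable += g.Memory
		}
	}
	return r
}
```

With two 80-unit GPUs of which one is unhealthy, this yields a capacity of 160 and an allocatable value of 80.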
Clean up GenericQRMPluginConfiguration by removing unused StateFileDirectory and InMemoryStateFileDirectory fields to simplify the struct.
Return nil instead of error when numa topology is not ready and log the error
fix: handle error gracefully
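One way to read "return nil and log the error" is sketched below. The sentinel `errTopologyNotReady`, the `Hints` type, and the `lookup` callback are assumptions for illustration; the PR does not show how the plugin detects the not-ready condition.

```go
package main

import (
	"errors"
	"log"
)

// errTopologyNotReady is a hypothetical sentinel for this sketch.
var errTopologyNotReady = errors.New("numa topology is not ready")

// Hints maps a resource name to candidate NUMA nodes (simplified).
type Hints map[string][]int

// getTopologyHints treats "topology not ready" as a non-fatal condition:
// the error is logged and the caller receives no hints and no error.
// Any other error is still propagated.
func getTopologyHints(lookup func() (Hints, error)) (Hints, error) {
	hints, err := lookup()
	if errors.Is(err, errTopologyNotReady) {
		log.Printf("skipping topology hints: %v", err)
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	return hints, nil
}
```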
- Add DeviceName field to AllocationInfo struct to track GPU device names
- Implement GetQuantityAllocatedWithFilter to support filtered allocation queries
- Modify GPU memory plugin to consider device names during allocation
- Remove NUMA binding check and use device name filtering instead
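A filtered allocation query of this kind might look like the sketch below. Only the names `AllocationInfo`, `DeviceName`, and `GetQuantityAllocatedWithFilter` come from the commit message; the field set, the `AllocationMap` receiver, and the predicate signature are assumptions.

```go
package main

// AllocationInfo is reduced to the fields this sketch needs.
type AllocationInfo struct {
	PodUID     string
	DeviceName string // GPU device name tracked per allocation
	Quantity   float64
}

// AllocationMap is a hypothetical container keyed by pod UID.
type AllocationMap map[string]*AllocationInfo

// GetQuantityAllocatedWithFilter sums the quantity of every allocation
// accepted by filter. A nil filter accepts all allocations.
func (m AllocationMap) GetQuantityAllocatedWithFilter(filter func(*AllocationInfo) bool) float64 {
	var total float64
	for _, info := range m {
		if info == nil {
			continue
		}
		if filter == nil || filter(info) {
			total += info.Quantity
		}
	}
	return total
}
```

Filtering by `DeviceName` then replaces a NUMA binding check: the caller passes a predicate that matches the device names it cares about.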
luomingmeng reviewed Feb 27, 2026
add ensureState method to handle state initialization and resource registration
update all policy methods to use ensureState for consistent state management
modify tests to verify lazy initialization behavior
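The lazy-initialization pattern behind `ensureState` can be sketched as below. Apart from the method name, everything here is hypothetical: the `policy` struct, the `newState` factory, and the `initCount` counter used to observe initialization. Resource registration, which the commit also mentions, is not modeled.

```go
package main

import "sync"

// policy is a hypothetical plugin policy whose state is created on first use.
type policy struct {
	mu        sync.Mutex
	state     map[string]int64 // pod name -> allocated quantity
	initCount int              // how many times state was initialized
	newState  func() (map[string]int64, error)
}

func newPolicy() *policy {
	return &policy{
		newState: func() (map[string]int64, error) {
			return map[string]int64{}, nil
		},
	}
}

// ensureState initializes the state once. It must be called with p.mu held.
// On failure the state stays nil, so the next call retries.
func (p *policy) ensureState() error {
	if p.state != nil {
		return nil
	}
	st, err := p.newState()
	if err != nil {
		return err
	}
	p.state = st
	p.initCount++
	return nil
}

// Allocate is one example of a policy method that goes through ensureState.
func (p *policy) Allocate(pod string, qty int64) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if err := p.ensureState(); err != nil {
		return err
	}
	p.state[pod] += qty
	return nil
}
```

A test for lazy behaviour can then assert that the state is nil right after construction and that repeated calls initialize it only once.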
luomingmeng approved these changes Mar 3, 2026
xu282934741 approved these changes Mar 3, 2026
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for your reviewer: