[WIP] feat(gpu): report gpu device topology to CNR #1042
Open
JustinChengLZ wants to merge 49 commits into kubewharf:main from JustinChengLZ:dev/report-gpu-topology-cnr
Conversation
This commit introduces a new static policy implementation for GPU resource management, including:
- GPU topology provider and state management
- Static policy implementation with allocation and deallocation logic
- Integration with existing QRM framework
- Metrics and health checks for GPU resource management
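The commit message above does not show the code itself, so here is a rough, hypothetical sketch of how a static GPU policy might wire a topology provider into its own allocation state. All type and method names are illustrative stand-ins, not the actual katalyst API:

```go
package gpu

import (
	"fmt"
	"sync"
)

// DeviceInfo is a hypothetical per-GPU record returned by the topology provider.
type DeviceInfo struct {
	ID      string
	Healthy bool
}

// TopologyProvider abstracts discovery of the GPUs present on the node.
type TopologyProvider interface {
	Devices() ([]DeviceInfo, error)
}

// StaticPolicy tracks which owner (pod/container) holds which GPU.
type StaticPolicy struct {
	mu        sync.Mutex
	devices   map[string]DeviceInfo
	allocated map[string]string // deviceID -> owner
}

func NewStaticPolicy(p TopologyProvider) (*StaticPolicy, error) {
	devs, err := p.Devices()
	if err != nil {
		return nil, err
	}
	sp := &StaticPolicy{
		devices:   make(map[string]DeviceInfo, len(devs)),
		allocated: make(map[string]string),
	}
	for _, d := range devs {
		sp.devices[d.ID] = d
	}
	return sp, nil
}

// Allocate hands out the first free, healthy device to the given owner.
func (s *StaticPolicy) Allocate(owner string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, d := range s.devices {
		if d.Healthy && s.allocated[id] == "" {
			s.allocated[id] = owner
			return id, nil
		}
	}
	return "", fmt.Errorf("no free healthy GPU for %s", owner)
}

// Deallocate releases every device held by the given owner.
func (s *StaticPolicy) Deallocate(owner string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, o := range s.allocated {
		if o == owner {
			delete(s.allocated, id)
		}
	}
}
```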
… or numa zone node
- Update GPU memory type from uint64 to float64 for precise allocation
- Implement NUMA-aware GPU topology management and allocation
- Add support for associated device allocation and topology hints
- Introduce new GPU topology provider with NUMA node tracking
- Extend GPU state management with NUMA node information
- Add utility functions for GPU memory hint generation and NUMA calculations
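A minimal sketch of what NUMA-aware GPU memory state and hint filtering could look like with the float64 memory type described above. The struct and function names are assumptions for illustration only:

```go
package gpu

// numaGPUState is a hypothetical per-NUMA-node view of GPU memory; the switch
// from uint64 to float64 allows fractional allocations without rounding.
type numaGPUState struct {
	NUMANode        int
	TotalMemory     float64
	AllocatedMemory float64
}

// memoryHints returns the NUMA nodes that can still satisfy a memory request,
// which is roughly the shape a topology-hint response would take.
func memoryHints(states []numaGPUState, request float64) []int {
	var nodes []int
	for _, s := range states {
		if s.TotalMemory-s.AllocatedMemory >= request {
			nodes = append(nodes, s.NUMANode)
		}
	}
	return nodes
}
```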
The preferredHintIndexes variable was declared but never used in the code. Removing it improves code clarity and maintainability.
Add new functionality to handle associated device allocation requests in the resource plugin stub. This includes:
- Adding new stub function type and default implementation
- Extending the Stub struct with new field
- Adding new methods for associated device operations
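A hypothetical rendering of the stub extension described above, using a replaceable function field so tests can override the associated-device behaviour. Apart from the Stub name, every identifier below is invented for illustration:

```go
package stub

import "fmt"

// AllocateAssociatedDeviceFunc is a hypothetical hook type invoked when an
// associated-device allocation request reaches the stub.
type AllocateAssociatedDeviceFunc func(deviceName string, count int) ([]string, error)

// Stub mimics extending a resource-plugin test stub with the new hook field.
type Stub struct {
	allocateAssociatedDevice AllocateAssociatedDeviceFunc
}

// defaultAllocateAssociatedDevice is used when a test does not override the hook.
func defaultAllocateAssociatedDevice(deviceName string, count int) ([]string, error) {
	ids := make([]string, 0, count)
	for i := 0; i < count; i++ {
		ids = append(ids, fmt.Sprintf("%s-%d", deviceName, i))
	}
	return ids, nil
}

// SetAllocateAssociatedDeviceFunc lets a test replace the default behaviour.
func (s *Stub) SetAllocateAssociatedDeviceFunc(f AllocateAssociatedDeviceFunc) {
	s.allocateAssociatedDevice = f
}

// AllocateAssociatedDevice dispatches to the configured hook, falling back to
// the default implementation.
func (s *Stub) AllocateAssociatedDevice(deviceName string, count int) ([]string, error) {
	if s.allocateAssociatedDevice == nil {
		return defaultAllocateAssociatedDevice(deviceName, count)
	}
	return s.allocateAssociatedDevice(deviceName, count)
}
```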
…icy structs Add pluginapi.UnimplementedResourcePluginServer to all policy structs to ensure forward compatibility with gRPC interface changes
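The forward-compatibility pattern mentioned above is the standard grpc-go "Unimplemented server" embedding. The sketch below reproduces it with local stand-in types rather than importing the real pluginapi package (whose import path is not shown in this PR); the idea is that any RPC added to the service later gets a default handler instead of breaking the build:

```go
package qrm

import "errors"

// resourcePluginServer stands in for the gRPC service interface generated by
// protoc; imagine GetAssociatedDeviceTopologyHints was added in a later revision.
type resourcePluginServer interface {
	Allocate(req string) (string, error)
	GetAssociatedDeviceTopologyHints(req string) (string, error)
}

// unimplementedResourcePluginServer mirrors the generated
// pluginapi.UnimplementedResourcePluginServer: a default body for every RPC.
type unimplementedResourcePluginServer struct{}

func (unimplementedResourcePluginServer) Allocate(string) (string, error) {
	return "", errors.New("method Allocate not implemented")
}

func (unimplementedResourcePluginServer) GetAssociatedDeviceTopologyHints(string) (string, error) {
	return "", errors.New("method GetAssociatedDeviceTopologyHints not implemented")
}

// StaticPolicy embeds the default server, so it keeps satisfying the interface
// even when new RPCs appear, overriding only what it actually implements.
type StaticPolicy struct {
	unimplementedResourcePluginServer
}

func (p *StaticPolicy) Allocate(req string) (string, error) {
	return "allocated:" + req, nil
}

var _ resourcePluginServer = (*StaticPolicy)(nil)
```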
Implement GetAssociatedDeviceTopologyHints method in StaticPolicy and update stub to handle topology hints requests. Also update kubelet dependency version and rename mustIncludeDevices to reusableDevices for clarity.
… allocated memory Add tracking of allocated GPU memory per NUMA node and modify hint calculation to prefer nodes with most allocated memory. This helps balance GPU memory usage across NUMA nodes.
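One plausible way to implement the "prefer nodes with most allocated memory" hint ordering described above; the tracker type and helper function are hypothetical:

```go
package gpu

import "sort"

// numaAllocated is a hypothetical tracker of GPU memory already allocated per NUMA node.
type numaAllocated map[int]float64

// preferredNUMAOrder sorts candidate NUMA nodes so that nodes with the most
// memory already allocated come first, matching the hint preference in the commit.
func preferredNUMAOrder(candidates []int, alloc numaAllocated) []int {
	out := append([]int(nil), candidates...)
	sort.SliceStable(out, func(i, j int) bool {
		return alloc[out[i]] > alloc[out[j]]
	})
	return out
}
```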
chore: add unit tests
…lugins feat: introduce rdma state and allow states to share within gpu sub-plugins
…ompany resource allocation feat: implement rdma custom device plugin and implement logic for accompany resource allocation
feat: implement allocation of accompany resource first before device
- Remove unused ResourcePluginsNames field and related configurations
- Add DefaultAccompanyResourceName method to CustomDevicePlugin interface
- Make registry maps private and add getter functions
- Improve error handling and cleanup in StaticPolicy allocation
- Simplify device topology initialization and allocation logic
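As a sketch of the second and third bullets: a private registry map exposed only through register/get functions, and a CustomDevicePlugin interface carrying DefaultAccompanyResourceName. Everything other than the names taken from the commit message is an assumption:

```go
package registry

import "sync"

// CustomDevicePlugin is a hypothetical version of the interface from the
// commit: a plugin reports which "accompany" resource (e.g. RDMA for GPU)
// should be allocated alongside its primary resource.
type CustomDevicePlugin interface {
	Name() string
	DefaultAccompanyResourceName() string
}

// The registry map is private; callers must use the functions below.
var (
	mu      sync.RWMutex
	plugins = map[string]CustomDevicePlugin{}
)

// RegisterCustomDevicePlugin adds a plugin to the private registry.
func RegisterCustomDevicePlugin(p CustomDevicePlugin) {
	mu.Lock()
	defer mu.Unlock()
	plugins[p.Name()] = p
}

// GetCustomDevicePlugin is the read-only accessor replacing direct map access.
func GetCustomDevicePlugin(name string) (CustomDevicePlugin, bool) {
	mu.RLock()
	defer mu.RUnlock()
	p, ok := plugins[name]
	return p, ok
}
```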
- Introduce a new strategy framework for GPU allocation with filtering, sorting and binding components
- Add helper functions for GPU memory and device allocation
- Remove redundant checks and simplify allocation logic
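A compact sketch of how filtering, sorting and binding components might compose into a generic allocation strategy; it also hints at the private fields with getter/setter access described a few commits below. The interfaces and method signatures are assumptions, not the PR's actual definitions:

```go
package strategy

// Device is a hypothetical candidate passed through the allocation pipeline.
type Device struct {
	ID        string
	NUMANode  int
	FreeBytes float64
}

// The three strategy components: filters drop ineligible devices, sorters
// rank the remainder, and the binder commits the final choice.
type (
	FilterStrategy interface {
		Filter(req float64, devices []Device) []Device
	}
	SortStrategy interface {
		Sort(devices []Device) []Device
	}
	BindStrategy interface {
		Bind(req float64, devices []Device) ([]string, error)
	}
)

// GenericAllocationStrategy chains the three stages; its fields are private.
type GenericAllocationStrategy struct {
	filters []FilterStrategy
	sorter  SortStrategy
	binder  BindStrategy
}

// Allocate runs filter -> sort -> bind and returns the chosen device IDs.
func (g *GenericAllocationStrategy) Allocate(req float64, devices []Device) ([]string, error) {
	for _, f := range g.filters {
		devices = f.Filter(req, devices)
	}
	if g.sorter != nil {
		devices = g.sorter.Sort(devices)
	}
	return g.binder.Bind(req, devices)
}

// Controlled access to the private strategy fields, in the spirit of the
// encapsulation commit.
func (g *GenericAllocationStrategy) SetBinder(b BindStrategy) { g.binder = b }
func (g *GenericAllocationStrategy) Binder() BindStrategy     { return g.binder }
```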
Restructure GPU allocation strategy into separate packages for better maintainability. Move filtering, sorting and binding strategies to dedicated directories and implement a unified generic allocation strategy. Update the manager to use the new strategy structure and rename the default strategy constant.
Convert public strategy fields to private and provide getter/setter methods to maintain encapsulation while allowing controlled access to the strategies
Introduce DeviceAffinityGroup field to DeviceInfo struct to support device affinity grouping with priority levels.
feat: implement device affinity strategy
… allocation feat: when device affinity of first priority is unable to decide allocation, go to next priority to allocate
fix: simplify logic of unallocated devices and change name of field
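A possible shape for the priority fallback described above: try affinity groups at the best priority first, and only fall through to the next priority when no group at the current level can satisfy the request. The types and the helper name are hypothetical:

```go
package strategy

import "sort"

// DeviceAffinityGroup is an illustrative rendering of the field added to
// DeviceInfo: devices that should be co-allocated share a group name, and
// lower Priority values are tried first.
type DeviceAffinityGroup struct {
	Name     string
	Priority int
}

type affinityDevice struct {
	ID     string
	Groups []DeviceAffinityGroup
}

// pickByAffinity walks priorities in ascending order; if no group at the
// current priority holds enough devices, it falls through to the next one.
func pickByAffinity(devices []affinityDevice, need int) []string {
	byPriority := map[int]map[string][]string{}
	var priorities []int
	for _, d := range devices {
		for _, g := range d.Groups {
			if byPriority[g.Priority] == nil {
				byPriority[g.Priority] = map[string][]string{}
				priorities = append(priorities, g.Priority)
			}
			byPriority[g.Priority][g.Name] = append(byPriority[g.Priority][g.Name], d.ID)
		}
	}
	sort.Ints(priorities)
	for _, p := range priorities {
		for _, members := range byPriority[p] {
			if len(members) >= need {
				return members[:need]
			}
		}
	}
	return nil // no affinity group can decide; caller falls back to a plain strategy
}
```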
- Introduce DefaultResourceStateGeneratorRegistry for resource state generation
- Add SetResourceState method to state interface
- Move strategy registry to separate package
- Enhance GenericAllocationStrategy with dynamic strategy selection
- Update device topology registry with thread-safe operations
- Consolidate GPU and RDMA device plugin initialization
- Improve state checkpoint handling with resource state generators
- Add custom strategy configuration options
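A sketch of a thread-safe resource state generator registry along the lines of the first and fifth bullets; the generator signature and the method set are assumptions:

```go
package registry

import "sync"

// ResourceStateGenerator is a hypothetical function that builds the initial
// state for one resource (e.g. gpu-memory, rdma) from the device topology,
// for use when restoring or creating the state checkpoint.
type ResourceStateGenerator func() (interface{}, error)

// DefaultResourceStateGeneratorRegistry maps resource names to generators,
// guarded by a mutex so sub-plugins can register concurrently.
type DefaultResourceStateGeneratorRegistry struct {
	mu         sync.RWMutex
	generators map[string]ResourceStateGenerator
}

func NewDefaultResourceStateGeneratorRegistry() *DefaultResourceStateGeneratorRegistry {
	return &DefaultResourceStateGeneratorRegistry{generators: map[string]ResourceStateGenerator{}}
}

func (r *DefaultResourceStateGeneratorRegistry) Register(resource string, g ResourceStateGenerator) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.generators[resource] = g
}

func (r *DefaultResourceStateGeneratorRegistry) Get(resource string) (ResourceStateGenerator, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	g, ok := r.generators[resource]
	return g, ok
}
```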
chore: fix unit test, format and lint issues
fix: maintain affinity subgroup sequence in larger affinity groups
… function for allocating devices refactor: simplify code by deleting redundant parameters and refactor function for allocating devices
The nil device request check was moved to after QoS validation to ensure the proper resource allocation sequence. This change maintains the same behavior but improves the logical flow of the allocation process.
- Remove redundant gpuCount and gpuNames logging in GetTopologyHints and Allocate
- Move GetGPUCount call after initial checks in GetTopologyHints
- Use deviceReq.DeviceRequest instead of gpuCount for memory calculation
- Filter out unhealthy GPU devices when calculating topology hints
- Skip NUMA binding hints for non-NUMA-binding requests
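Roughly what the unhealthy-device filtering and the NUMA-binding guard could look like; the types below are illustrative, not the plugin's real hint structures:

```go
package gpu

// hintDevice is a hypothetical view of a GPU used during hint calculation.
type hintDevice struct {
	ID       string
	NUMANode int
	Healthy  bool
}

// healthyDevices drops unhealthy GPUs before hint calculation, so hints never
// point at NUMA nodes whose only capacity sits on broken devices.
func healthyDevices(devices []hintDevice) []hintDevice {
	out := make([]hintDevice, 0, len(devices))
	for _, d := range devices {
		if d.Healthy {
			out = append(out, d)
		}
	}
	return out
}

// topologyHints returns per-NUMA hints only when the request actually asks
// for NUMA binding; otherwise it returns nil and the caller skips hints.
func topologyHints(devices []hintDevice, numaBinding bool) map[int][]string {
	if !numaBinding {
		return nil
	}
	hints := map[int][]string{}
	for _, d := range healthyDevices(devices) {
		hints[d.NUMANode] = append(hints[d.NUMANode], d.ID)
	}
	return hints
}
```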
Set allocatable memory to zero for unhealthy GPU devices and use separate capacity values instead of reusing allocatable values. This ensures accurate resource accounting for both healthy and unhealthy devices.
Clean up GenericQRMPluginConfiguration by removing unused StateFileDirectory and InMemoryStateFileDirectory fields to simplify the struct.
Return nil instead of an error when the NUMA topology is not ready and log the error.
fix: handle error gracefully
- Add DeviceName field to AllocationInfo struct to track GPU device names
- Implement GetQuantityAllocatedWithFilter to support filtered allocation queries
- Modify GPU memory plugin to consider device names during allocation
- Remove NUMA binding check and use device name filtering instead
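A hypothetical version of the filtered allocation query described above, using a device-name predicate in place of the removed NUMA-binding check; the struct fields and function signature are assumptions:

```go
package state

// AllocationInfo is a hypothetical container-level allocation record carrying
// the new DeviceName field, so memory can be attributed to a specific GPU.
type AllocationInfo struct {
	PodUID        string
	ContainerName string
	DeviceName    string
	MemoryBytes   float64
}

// GetQuantityAllocatedWithFilter sums allocated memory over records accepted
// by the filter, e.g.:
//
//	onGPU0 := GetQuantityAllocatedWithFilter(infos, func(ai AllocationInfo) bool {
//		return ai.DeviceName == "GPU-0"
//	})
func GetQuantityAllocatedWithFilter(infos []AllocationInfo, filter func(AllocationInfo) bool) float64 {
	var total float64
	for _, ai := range infos {
		if filter == nil || filter(ai) {
			total += ai.MemoryBytes
		}
	}
	return total
}
```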
feat(gpu): add device name tracking and allocation filtering
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for your reviewer: