
Conversation

@JustinChengLZ
Collaborator

What type of PR is this?

  • Reporting of GPU device topology to CNR

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

luomingmeng and others added 30 commits December 16, 2025 10:16
This commit introduces a new static policy implementation for GPU resource management, including:
- GPU topology provider and state management
- Static policy implementation with allocation and deallocation logic
- Integration with existing QRM framework
- Metrics and health checks for GPU resource management
- Update GPU memory type from uint64 to float64 for precise allocation
- Implement NUMA-aware GPU topology management and allocation
- Add support for associated device allocation and topology hints
- Introduce new GPU topology provider with NUMA node tracking
- Extend GPU state management with NUMA node information
- Add utility functions for GPU memory hint generation and NUMA calculations (see the sketch below)
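To make the NUMA-aware hint generation above concrete, here is a minimal, self-contained sketch of the idea. The types and function names are illustrative stand-ins, not the plugin's actual API; GPU memory is tracked as float64 per the type change noted above.

```go
package main

import "fmt"

// numaGPUMemory tracks free GPU memory per NUMA node (float64, matching the
// commit that widened the memory type for precise allocation).
type numaGPUMemory map[int]float64

// candidateNUMANodes returns the NUMA nodes whose free GPU memory can
// individually satisfy the request; a caller would turn these into hints.
func candidateNUMANodes(free numaGPUMemory, request float64) []int {
	var nodes []int
	for node, avail := range free {
		if avail >= request {
			nodes = append(nodes, node)
		}
	}
	return nodes
}

func main() {
	free := numaGPUMemory{0: 32e9, 1: 8e9}
	fmt.Println(candidateNUMANodes(free, 16e9)) // only NUMA node 0 has room
}
```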
The preferredHintIndexes variable was declared but never used in the code. Removing it improves code clarity and maintainability.
Add new functionality to handle associated device allocation requests in the resource plugin stub. This includes:
- Adding new stub function type and default implementation
- Extending the Stub struct with new field
- Adding new methods for associated device operations (see the stub sketch after this list)
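A rough sketch of the stub pattern described in this commit: a pluggable function type with a default implementation, stored as a field on the Stub struct and replaceable by tests. All names below are hypothetical stand-ins, not the real resource plugin stub API.

```go
package main

import "fmt"

// associatedDeviceFunc is the pluggable handler type; tests can override it.
type associatedDeviceFunc func(request string) (string, error)

// defaultAssociatedDeviceFunc is the no-op default used when a test does not
// install its own handler.
func defaultAssociatedDeviceFunc(request string) (string, error) {
	return "", nil
}

// Stub is a minimal resource-plugin test stub carrying the handler field.
type Stub struct {
	allocAssociatedDevice associatedDeviceFunc
}

func NewStub() *Stub {
	return &Stub{allocAssociatedDevice: defaultAssociatedDeviceFunc}
}

// SetAllocAssociatedDeviceFunc lets a test swap in custom behavior.
func (s *Stub) SetAllocAssociatedDeviceFunc(f associatedDeviceFunc) {
	s.allocAssociatedDevice = f
}

func main() {
	s := NewStub()
	s.SetAllocAssociatedDeviceFunc(func(req string) (string, error) {
		return "rdma-0", nil
	})
	dev, _ := s.allocAssociatedDevice("gpu-0")
	fmt.Println(dev)
}
```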
Add pluginapi.UnimplementedResourcePluginServer to all policy structs to ensure forward compatibility with gRPC interface changes
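The pattern behind this change is the standard gRPC forward-compatibility idiom of embedding the generated Unimplemented*Server type. The sketch below uses a local stand-in for pluginapi.UnimplementedResourcePluginServer so it compiles on its own; in the real code the embedded type is the gRPC-generated one.

```go
package main

import "fmt"

// UnimplementedResourcePluginServer mimics a gRPC-generated base type: it
// provides default (error-returning) implementations for every RPC, so
// embedding it keeps a struct satisfying the server interface even when the
// proto later gains new methods.
type UnimplementedResourcePluginServer struct{}

func (UnimplementedResourcePluginServer) GetTopologyHints() error {
	return fmt.Errorf("method GetTopologyHints not implemented")
}

// StaticPolicy embeds the base type and overrides only what it supports.
type StaticPolicy struct {
	UnimplementedResourcePluginServer
}

func main() {
	p := StaticPolicy{}
	fmt.Println(p.GetTopologyHints()) // falls back to the embedded default
}
```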
Implement GetAssociatedDeviceTopologyHints method in StaticPolicy and update stub to handle topology hints requests. Also update kubelet dependency version and rename mustIncludeDevices to reusableDevices for clarity.
Add tracking of allocated GPU memory per NUMA node and modify hint calculation to prefer nodes with most allocated memory. This helps balance GPU memory usage across NUMA nodes.
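A small sketch of the preference rule described above, assuming hypothetical types: among the NUMA nodes that can still fit the request, the one with the most GPU memory already allocated is tried first, which packs workloads onto already-used nodes. The real hint calculation would translate this ordered node list into topology hints with preference flags.

```go
package main

import (
	"fmt"
	"sort"
)

type numaNode struct {
	id        int
	allocated float64 // GPU memory already allocated on this node
	free      float64 // GPU memory still available
}

// preferredOrder returns candidate nodes that fit the request, most-allocated
// first, so new workloads pack onto already-used NUMA nodes.
func preferredOrder(nodes []numaNode, request float64) []int {
	var fit []numaNode
	for _, n := range nodes {
		if n.free >= request {
			fit = append(fit, n)
		}
	}
	sort.Slice(fit, func(i, j int) bool { return fit[i].allocated > fit[j].allocated })
	ids := make([]int, 0, len(fit))
	for _, n := range fit {
		ids = append(ids, n.id)
	}
	return ids
}

func main() {
	nodes := []numaNode{{0, 10e9, 20e9}, {1, 24e9, 8e9}}
	fmt.Println(preferredOrder(nodes, 8e9)) // [1 0]: node 1 is fuller, so preferred
}
```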
chore: add unit tests
feat: introduce rdma state and allow states to share within gpu sub-plugins
feat: implement rdma custom device plugin and implement logic for accompany resource allocation
feat: implement allocation of accompany resource first before device
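The ordering described in this commit, accompanying resource first and device second, can be sketched as follows; the allocator interface and the rollback behavior here are assumptions for illustration, not the plugin's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

type allocator interface {
	Allocate(podUID string) error
	Release(podUID string) error
}

// allocateWithAccompany allocates the accompanying resource first, then the
// device; if the device step fails, the accompanying allocation is released.
func allocateWithAccompany(accompany, device allocator, podUID string) error {
	if err := accompany.Allocate(podUID); err != nil {
		return fmt.Errorf("accompany resource allocation failed: %w", err)
	}
	if err := device.Allocate(podUID); err != nil {
		_ = accompany.Release(podUID) // best-effort rollback
		return fmt.Errorf("device allocation failed: %w", err)
	}
	return nil
}

type fake struct{ fail bool }

func (f fake) Allocate(string) error {
	if f.fail {
		return errors.New("no capacity")
	}
	return nil
}
func (f fake) Release(string) error { return nil }

func main() {
	fmt.Println(allocateWithAccompany(fake{}, fake{fail: true}, "pod-1"))
}
```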
- Remove unused ResourcePluginsNames field and related configurations
- Add DefaultAccompanyResourceName method to CustomDevicePlugin interface
- Make registry maps private and add getter functions (see the sketch after this list)
- Improve error handling and cleanup in StaticPolicy allocation
- Simplify device topology initialization and allocation logic
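A compact sketch of the registry shape implied by this list: a private plugin map behind getter functions, plus a CustomDevicePlugin interface carrying DefaultAccompanyResourceName. Apart from those two names taken from the commit text, everything below is a hypothetical stand-in.

```go
package main

import "fmt"

type CustomDevicePlugin interface {
	Name() string
	// DefaultAccompanyResourceName names the resource that should be
	// allocated alongside this device (empty if none).
	DefaultAccompanyResourceName() string
}

// customDevicePlugins is private; callers go through the getter below.
var customDevicePlugins = map[string]CustomDevicePlugin{}

func RegisterCustomDevicePlugin(p CustomDevicePlugin) {
	customDevicePlugins[p.Name()] = p
}

func GetCustomDevicePlugin(name string) (CustomDevicePlugin, bool) {
	p, ok := customDevicePlugins[name]
	return p, ok
}

type rdmaPlugin struct{}

func (rdmaPlugin) Name() string                         { return "rdma" }
func (rdmaPlugin) DefaultAccompanyResourceName() string { return "" }

func main() {
	RegisterCustomDevicePlugin(rdmaPlugin{})
	p, _ := GetCustomDevicePlugin("rdma")
	fmt.Println(p.Name())
}
```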
introduce a new strategy framework for GPU allocation with filtering, sorting and binding components
add helper functions for GPU memory and device allocation
remove redundant checks and simplify allocation logic
restructure gpu allocation strategy into separate packages for better maintainability. move filtering, sorting and binding strategies to dedicated directories and implement unified generic allocation strategy. update manager to use new strategy structure and rename default strategy constant
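The filter/sort/bind decomposition mentioned above could look roughly like this; the interfaces and the generic strategy that chains them are illustrative stand-ins for the real strategy packages, not their actual definitions.

```go
package main

import (
	"fmt"
	"sort"
)

type device struct {
	id    string
	score int
}

// The three stages of the pipeline, each behind its own small interface so
// concrete strategies can live in separate packages.
type FilterStrategy interface{ Filter(devs []device) []device }
type SortStrategy interface{ Sort(devs []device) []device }
type BindStrategy interface{ Bind(devs []device, count int) []device }

// genericStrategy chains filter -> sort -> bind into one allocation decision.
type genericStrategy struct {
	filter FilterStrategy
	sorter SortStrategy
	binder BindStrategy
}

func (g genericStrategy) Allocate(devs []device, count int) []device {
	return g.binder.Bind(g.sorter.Sort(g.filter.Filter(devs)), count)
}

// Trivial stage implementations, just enough to make the pipeline runnable.
type passThroughFilter struct{}

func (passThroughFilter) Filter(devs []device) []device { return devs }

type byScoreDesc struct{}

func (byScoreDesc) Sort(devs []device) []device {
	sort.Slice(devs, func(i, j int) bool { return devs[i].score > devs[j].score })
	return devs
}

type firstN struct{}

func (firstN) Bind(devs []device, count int) []device {
	if count > len(devs) {
		count = len(devs)
	}
	return devs[:count]
}

func main() {
	s := genericStrategy{passThroughFilter{}, byScoreDesc{}, firstN{}}
	fmt.Println(s.Allocate([]device{{"gpu-0", 1}, {"gpu-1", 3}}, 1)) // picks gpu-1
}
```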
Convert public strategy fields to private and provide getter/setter methods to maintain encapsulation while allowing controlled access to the strategies
Introduce DeviceAffinityGroup field to DeviceInfo struct to support device affinity grouping with priority levels.
feat: implement device affinity strategy
feat: when device affinity of first priority is unable to decide allocation, go to next priority to allocate
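A sketch of the priority fallback described in this commit, under assumed types: affinity groups are tried in ascending priority order, and a priority level that cannot decide the allocation on its own falls through to the next one.

```go
package main

import (
	"fmt"
	"sort"
)

// affinityGroup holds devices that prefer to be allocated together; a lower
// priority value means a stronger preference.
type affinityGroup struct {
	priority int
	devices  []string
}

// chooseByAffinity walks priorities in ascending order and returns the first
// group that can fully satisfy the request on its own.
func chooseByAffinity(groups []affinityGroup, count int) ([]string, bool) {
	sort.Slice(groups, func(i, j int) bool { return groups[i].priority < groups[j].priority })
	for _, g := range groups {
		if len(g.devices) >= count {
			return g.devices[:count], true
		}
		// This priority cannot decide the allocation; fall through to the next.
	}
	return nil, false
}

func main() {
	groups := []affinityGroup{
		{priority: 1, devices: []string{"gpu-0"}},          // too small for the request
		{priority: 2, devices: []string{"gpu-2", "gpu-3"}}, // next priority decides
	}
	fmt.Println(chooseByAffinity(groups, 2))
}
```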

fix: simplify logic of unallocated devices and change name of field
- introduce DefaultResourceStateGeneratorRegistry for resource state generation
- add SetResourceState method to state interface
- move strategy registry to separate package
- enhance GenericAllocationStrategy with dynamic strategy selection
- update device topology registry with thread-safe operations (see the sketch after this list)
- consolidate GPU and RDMA device plugin initialization
- improve state checkpoint handling with resource state generators
- add custom strategy configuration options
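The thread-safe registry and resource state generators mentioned in this list can be sketched with a sync.RWMutex around a plain map; the ResourceStateGenerator signature below is an assumption for illustration, not the real interface.

```go
package main

import (
	"fmt"
	"sync"
)

// ResourceStateGenerator produces the initial state blob for one resource.
type ResourceStateGenerator func() (string, error)

type generatorRegistry struct {
	mu         sync.RWMutex
	generators map[string]ResourceStateGenerator
}

func newGeneratorRegistry() *generatorRegistry {
	return &generatorRegistry{generators: map[string]ResourceStateGenerator{}}
}

// Register adds or replaces a generator under a write lock.
func (r *generatorRegistry) Register(name string, g ResourceStateGenerator) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.generators[name] = g
}

// Get looks a generator up under a read lock, so concurrent readers don't block each other.
func (r *generatorRegistry) Get(name string) (ResourceStateGenerator, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	g, ok := r.generators[name]
	return g, ok
}

func main() {
	reg := newGeneratorRegistry()
	reg.Register("gpu_memory", func() (string, error) { return "{}", nil })
	if g, ok := reg.Get("gpu_memory"); ok {
		s, _ := g()
		fmt.Println(s)
	}
}
```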
chore: fix unit test, format and lint issues
fix: maintain affinity subgroup sequence in larger affinity groups
refactor: simplify code by deleting redundant parameters and refactor function for allocating devices
JustinChengLZ and others added 12 commits December 16, 2025 10:17
The nil device request check was moved to after QoS validation to ensure the proper resource allocation sequence. This change keeps the same behavior but improves the logical flow of the allocation process.
remove redundant gpuCount and gpuNames logging in GetTopologyHints and Allocate
move GetGPUCount call after initial checks in GetTopologyHints
use deviceReq.DeviceRequest instead of gpuCount for memory calculation
filter out unhealthy gpu devices when calculating topology hints
skip numa binding hints for non-numa binding requests (see the sketch below)
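Two of the fixes above, filtering unhealthy devices and skipping NUMA-binding hints for non-NUMA-binding requests, are sketched together below with hypothetical types.

```go
package main

import "fmt"

type gpuDevice struct {
	id      string
	healthy bool
	numa    int
}

// hintNUMANodes returns nil when the request has no NUMA-binding requirement,
// and otherwise only considers healthy devices when collecting NUMA nodes.
func hintNUMANodes(devs []gpuDevice, numaBinding bool) []int {
	if !numaBinding {
		return nil // no NUMA hints needed for non-NUMA-binding requests
	}
	seen := map[int]bool{}
	var nodes []int
	for _, d := range devs {
		if !d.healthy || seen[d.numa] {
			continue
		}
		seen[d.numa] = true
		nodes = append(nodes, d.numa)
	}
	return nodes
}

func main() {
	devs := []gpuDevice{{"gpu-0", true, 0}, {"gpu-1", false, 1}}
	fmt.Println(hintNUMANodes(devs, true)) // only NUMA node 0: gpu-1 is unhealthy
}
```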
Set allocatable memory to zero for unhealthy GPU devices and use separate capacity values instead of reusing allocatable values. This ensures accurate resource accounting for both healthy and unhealthy devices.
Clean up GenericQRMPluginConfiguration by removing unused StateFileDirectory and InMemoryStateFileDirectory fields to simplify the struct.
Return nil instead of error when numa topology is not ready and log the error

fix: handle error gracefully
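The graceful handling described above amounts to treating "topology not ready" as "no hints yet" rather than as a failure. A minimal sketch, with hypothetical names:

```go
package main

import "log"

// getNUMAHints degrades gracefully: if NUMA topology is not ready yet it
// logs the condition and returns nil, nil so the caller sees "no hints yet"
// rather than a hard error.
func getNUMAHints(topologyReady bool) ([]int, error) {
	if !topologyReady {
		log.Println("numa topology not ready, skipping hint generation")
		return nil, nil
	}
	return []int{0, 1}, nil
}

func main() {
	hints, err := getNUMAHints(false)
	log.Printf("hints=%v err=%v", hints, err)
}
```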
- Add DeviceName field to AllocationInfo struct to track GPU device names
- Implement GetQuantityAllocatedWithFilter to support filtered allocation queries (see the sketch after this list)
- Modify GPU memory plugin to consider device names during allocation
- Remove NUMA binding check and use device name filtering instead
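A sketch of the filtered query named in this list: AllocationInfo, DeviceName, and GetQuantityAllocatedWithFilter come from the commit text, but their signatures and the surrounding state type here are assumptions for illustration.

```go
package main

import "fmt"

// AllocationInfo records one allocation, including the GPU device name.
type AllocationInfo struct {
	PodUID     string
	DeviceName string
	Quantity   float64
}

type gpuMemoryState struct {
	allocations []AllocationInfo
}

// GetQuantityAllocatedWithFilter sums allocated GPU memory over the
// allocations accepted by the filter (e.g. "same device name").
func (s gpuMemoryState) GetQuantityAllocatedWithFilter(filter func(AllocationInfo) bool) float64 {
	var total float64
	for _, a := range s.allocations {
		if filter(a) {
			total += a.Quantity
		}
	}
	return total
}

func main() {
	s := gpuMemoryState{allocations: []AllocationInfo{
		{"pod-a", "gpu-0", 8e9},
		{"pod-b", "gpu-1", 4e9},
	}}
	onGPU0 := s.GetQuantityAllocatedWithFilter(func(a AllocationInfo) bool { return a.DeviceName == "gpu-0" })
	fmt.Println(onGPU0)
}
```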
@JustinChengLZ force-pushed the dev/report-gpu-topology-cnr branch 3 times, most recently from 60a6c30 to d11331f on December 22, 2025 05:41
@codecov

codecov bot commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 63.14452% with 1015 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.63%. Comparing base (8892043) to head (1c0255d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/agent/qrm-plugins/gpu/staticpolicy/policy.go | 51.48% | 152 Missing and 28 partials ⚠️ |
| ...rm-plugins/gpu/resourceplugin/gpumemory/gpu_mem.go | 63.90% | 114 Missing and 30 partials ⚠️ |
| ...strategy/allocate/strategies/allocation/generic.go | 10.71% | 75 Missing ⚠️ |
| pkg/agent/qrm-plugins/gpu/state/state.go | 63.63% | 51 Missing and 13 partials ⚠️ |
| ...nt/qrm-plugins/gpu/baseplugin/reporter/reporter.go | 55.72% | 45 Missing and 13 partials ⚠️ |
| ...gent/qrm-plugins/gpu/customdeviceplugin/gpu/gpu.go | 63.11% | 39 Missing and 6 partials ⚠️ |
| ...kg/agent/qrm-plugins/gpu/state/state_checkpoint.go | 64.28% | 35 Missing and 10 partials ⚠️ |
| ...m-plugins/gpu/strategy/allocate/manager/manager.go | 45.67% | 40 Missing and 4 partials ⚠️ |
| ...nt/qrm-plugins/gpu/customdeviceplugin/rdma/rdma.go | 72.61% | 27 Missing and 16 partials ⚠️ |
| pkg/util/machine/device.go | 71.83% | 29 Missing and 11 partials ⚠️ |
| ... and 22 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1042      +/-   ##
==========================================
+ Coverage   60.45%   60.63%   +0.18%     
==========================================
  Files         695      733      +38     
  Lines       65769    68500    +2731     
==========================================
+ Hits        39760    41538    +1778     
- Misses      21501    22282     +781     
- Partials     4508     4680     +172     
| Flag | Coverage Δ |
|---|---|
| unittest | 60.63% <63.14%> (+0.18%) ⬆️ |

Flags with carried forward coverage won't be shown.

@JustinChengLZ force-pushed the dev/report-gpu-topology-cnr branch 6 times, most recently from 713cbeb to 1c0255d on December 23, 2025 07:07
@JustinChengLZ force-pushed the dev/report-gpu-topology-cnr branch from 1c0255d to 3aacb80 on December 24, 2025 06:23