
Conversation

@JustinChengLZ
Collaborator

What type of PR is this?

  • Reporting of GPU device topology to CNR

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

luomingmeng and others added 30 commits December 16, 2025 10:16
This commit introduces a new static policy implementation for GPU resource management, including:
- GPU topology provider and state management
- Static policy implementation with allocation and deallocation logic
- Integration with existing QRM framework
- Metrics and health checks for GPU resource management
- Update GPU memory type from uint64 to float64 for precise allocation
- Implement NUMA-aware GPU topology management and allocation
- Add support for associated device allocation and topology hints
- Introduce new GPU topology provider with NUMA node tracking
- Extend GPU state management with NUMA node information
- Add utility functions for GPU memory hint generation and NUMA calculations (see the sketch below)
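To make the NUMA-aware hint generation above concrete, here is a minimal, self-contained sketch of the idea. The types and function names are illustrative stand-ins, not the plugin's actual API; GPU memory is tracked as float64 per the type change noted above.

```go
package main

import "fmt"

// numaGPUMemory tracks free GPU memory per NUMA node (float64, matching the
// commit that widened the memory type for precise allocation).
type numaGPUMemory map[int]float64

// candidateNUMANodes returns the NUMA nodes whose free GPU memory can
// individually satisfy the request; a caller would turn these into hints.
func candidateNUMANodes(free numaGPUMemory, request float64) []int {
	var nodes []int
	for node, avail := range free {
		if avail >= request {
			nodes = append(nodes, node)
		}
	}
	return nodes
}

func main() {
	free := numaGPUMemory{0: 32e9, 1: 8e9}
	fmt.Println(candidateNUMANodes(free, 16e9)) // only NUMA node 0 has room
}
```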
The preferredHintIndexes variable was declared but never used in the code. Removing it improves code clarity and maintainability.
Add new functionality to handle associated device allocation requests in the resource plugin stub. This includes:
- Adding new stub function type and default implementation
- Extending the Stub struct with new field
- Adding new methods for associated device operations (see the stub sketch after this list)
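A rough sketch of the stub pattern described in this commit: a pluggable function type with a default implementation, stored as a field on the Stub struct and replaceable by tests. All names below are hypothetical stand-ins, not the real resource plugin stub API.

```go
package main

import "fmt"

// associatedDeviceFunc is the pluggable handler type; tests can override it.
type associatedDeviceFunc func(request string) (string, error)

// defaultAssociatedDeviceFunc is the no-op default used when a test does not
// install its own handler.
func defaultAssociatedDeviceFunc(request string) (string, error) {
	return "", nil
}

// Stub is a minimal resource-plugin test stub carrying the handler field.
type Stub struct {
	allocAssociatedDevice associatedDeviceFunc
}

func NewStub() *Stub {
	return &Stub{allocAssociatedDevice: defaultAssociatedDeviceFunc}
}

// SetAllocAssociatedDeviceFunc lets a test swap in custom behavior.
func (s *Stub) SetAllocAssociatedDeviceFunc(f associatedDeviceFunc) {
	s.allocAssociatedDevice = f
}

func main() {
	s := NewStub()
	s.SetAllocAssociatedDeviceFunc(func(req string) (string, error) {
		return "rdma-0", nil
	})
	dev, _ := s.allocAssociatedDevice("gpu-0")
	fmt.Println(dev)
}
```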
Add pluginapi.UnimplementedResourcePluginServer to all policy structs to ensure forward compatibility with gRPC interface changes
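The pattern behind this change is the standard gRPC forward-compatibility idiom of embedding the generated Unimplemented*Server type. The sketch below uses a local stand-in for pluginapi.UnimplementedResourcePluginServer so it compiles on its own; in the real code the embedded type is the gRPC-generated one.

```go
package main

import "fmt"

// UnimplementedResourcePluginServer mimics a gRPC-generated base type: it
// provides default (error-returning) implementations for every RPC, so
// embedding it keeps a struct satisfying the server interface even when the
// proto later gains new methods.
type UnimplementedResourcePluginServer struct{}

func (UnimplementedResourcePluginServer) GetTopologyHints() error {
	return fmt.Errorf("method GetTopologyHints not implemented")
}

// StaticPolicy embeds the base type and overrides only what it supports.
type StaticPolicy struct {
	UnimplementedResourcePluginServer
}

func main() {
	p := StaticPolicy{}
	fmt.Println(p.GetTopologyHints()) // falls back to the embedded default
}
```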
Implement GetAssociatedDeviceTopologyHints method in StaticPolicy and update stub to handle topology hints requests. Also update kubelet dependency version and rename mustIncludeDevices to reusableDevices for clarity.
Add tracking of allocated GPU memory per NUMA node and modify hint calculation to prefer nodes with most allocated memory. This helps balance GPU memory usage across NUMA nodes.
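A small sketch of the preference rule described above, assuming hypothetical types: among the NUMA nodes that can still fit the request, the one with the most GPU memory already allocated is tried first, which packs workloads onto already-used nodes. The real hint calculation would translate this ordered node list into topology hints with preference flags.

```go
package main

import (
	"fmt"
	"sort"
)

type numaNode struct {
	id        int
	allocated float64 // GPU memory already allocated on this node
	free      float64 // GPU memory still available
}

// preferredOrder returns candidate nodes that fit the request, most-allocated
// first, so new workloads pack onto already-used NUMA nodes.
func preferredOrder(nodes []numaNode, request float64) []int {
	var fit []numaNode
	for _, n := range nodes {
		if n.free >= request {
			fit = append(fit, n)
		}
	}
	sort.Slice(fit, func(i, j int) bool { return fit[i].allocated > fit[j].allocated })
	ids := make([]int, 0, len(fit))
	for _, n := range fit {
		ids = append(ids, n.id)
	}
	return ids
}

func main() {
	nodes := []numaNode{{0, 10e9, 20e9}, {1, 24e9, 8e9}}
	fmt.Println(preferredOrder(nodes, 8e9)) // [1 0]: node 1 is fuller, so preferred
}
```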
chore: add unit tests
feat: introduce rdma state and allow states to share within gpu sub-plugins
feat: implement rdma custom device plugin and implement logic for accompany resource allocation
feat: implement allocation of accompany resource first before device
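The ordering described in this commit, accompanying resource first and device second, can be sketched as follows; the allocator interface and the rollback behavior here are assumptions for illustration, not the plugin's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

type allocator interface {
	Allocate(podUID string) error
	Release(podUID string) error
}

// allocateWithAccompany allocates the accompanying resource first, then the
// device; if the device step fails, the accompanying allocation is released.
func allocateWithAccompany(accompany, device allocator, podUID string) error {
	if err := accompany.Allocate(podUID); err != nil {
		return fmt.Errorf("accompany resource allocation failed: %w", err)
	}
	if err := device.Allocate(podUID); err != nil {
		_ = accompany.Release(podUID) // best-effort rollback
		return fmt.Errorf("device allocation failed: %w", err)
	}
	return nil
}

type fake struct{ fail bool }

func (f fake) Allocate(string) error {
	if f.fail {
		return errors.New("no capacity")
	}
	return nil
}
func (f fake) Release(string) error { return nil }

func main() {
	fmt.Println(allocateWithAccompany(fake{}, fake{fail: true}, "pod-1"))
}
```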
- Remove unused ResourcePluginsNames field and related configurations
- Add DefaultAccompanyResourceName method to CustomDevicePlugin interface
- Make registry maps private and add getter functions (see the sketch after this list)
- Improve error handling and cleanup in StaticPolicy allocation
- Simplify device topology initialization and allocation logic
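A compact sketch of the registry shape implied by this list: a private plugin map behind getter functions, plus a CustomDevicePlugin interface carrying DefaultAccompanyResourceName. Apart from those two names taken from the commit text, everything below is a hypothetical stand-in.

```go
package main

import "fmt"

type CustomDevicePlugin interface {
	Name() string
	// DefaultAccompanyResourceName names the resource that should be
	// allocated alongside this device (empty if none).
	DefaultAccompanyResourceName() string
}

// customDevicePlugins is private; callers go through the getter below.
var customDevicePlugins = map[string]CustomDevicePlugin{}

func RegisterCustomDevicePlugin(p CustomDevicePlugin) {
	customDevicePlugins[p.Name()] = p
}

func GetCustomDevicePlugin(name string) (CustomDevicePlugin, bool) {
	p, ok := customDevicePlugins[name]
	return p, ok
}

type rdmaPlugin struct{}

func (rdmaPlugin) Name() string                         { return "rdma" }
func (rdmaPlugin) DefaultAccompanyResourceName() string { return "" }

func main() {
	RegisterCustomDevicePlugin(rdmaPlugin{})
	p, _ := GetCustomDevicePlugin("rdma")
	fmt.Println(p.Name())
}
```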
introduce a new strategy framework for GPU allocation with filtering, sorting and binding components
add helper functions for GPU memory and device allocation
remove redundant checks and simplify allocation logic
restructure gpu allocation strategy into separate packages for better maintainability. move filtering, sorting and binding strategies to dedicated directories and implement unified generic allocation strategy. update manager to use new strategy structure and rename default strategy constant
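The filter/sort/bind decomposition mentioned above could look roughly like this; the interfaces and the generic strategy that chains them are illustrative stand-ins for the real strategy packages, not their actual definitions.

```go
package main

import (
	"fmt"
	"sort"
)

type device struct {
	id    string
	score int
}

// The three stages of the pipeline, each behind its own small interface so
// concrete strategies can live in separate packages.
type FilterStrategy interface{ Filter(devs []device) []device }
type SortStrategy interface{ Sort(devs []device) []device }
type BindStrategy interface{ Bind(devs []device, count int) []device }

// genericStrategy chains filter -> sort -> bind into one allocation decision.
type genericStrategy struct {
	filter FilterStrategy
	sorter SortStrategy
	binder BindStrategy
}

func (g genericStrategy) Allocate(devs []device, count int) []device {
	return g.binder.Bind(g.sorter.Sort(g.filter.Filter(devs)), count)
}

// Trivial stage implementations, just enough to make the pipeline runnable.
type passThroughFilter struct{}

func (passThroughFilter) Filter(devs []device) []device { return devs }

type byScoreDesc struct{}

func (byScoreDesc) Sort(devs []device) []device {
	sort.Slice(devs, func(i, j int) bool { return devs[i].score > devs[j].score })
	return devs
}

type firstN struct{}

func (firstN) Bind(devs []device, count int) []device {
	if count > len(devs) {
		count = len(devs)
	}
	return devs[:count]
}

func main() {
	s := genericStrategy{passThroughFilter{}, byScoreDesc{}, firstN{}}
	fmt.Println(s.Allocate([]device{{"gpu-0", 1}, {"gpu-1", 3}}, 1)) // picks gpu-1
}
```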
Convert public strategy fields to private and provide getter/setter methods to maintain encapsulation while allowing controlled access to the strategies
Introduce DeviceAffinityGroup field to DeviceInfo struct to support device affinity grouping with priority levels.
feat: implement device affinity strategy
feat: when device affinity of first priority is unable to decide allocation, go to next priority to allocate
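A sketch of the priority fallback described in this commit, under assumed types: affinity groups are tried in ascending priority order, and a priority level that cannot decide the allocation on its own falls through to the next one.

```go
package main

import (
	"fmt"
	"sort"
)

// affinityGroup holds devices that prefer to be allocated together; a lower
// priority value means a stronger preference.
type affinityGroup struct {
	priority int
	devices  []string
}

// chooseByAffinity walks priorities in ascending order and returns the first
// group that can fully satisfy the request on its own.
func chooseByAffinity(groups []affinityGroup, count int) ([]string, bool) {
	sort.Slice(groups, func(i, j int) bool { return groups[i].priority < groups[j].priority })
	for _, g := range groups {
		if len(g.devices) >= count {
			return g.devices[:count], true
		}
		// This priority cannot decide the allocation; fall through to the next.
	}
	return nil, false
}

func main() {
	groups := []affinityGroup{
		{priority: 1, devices: []string{"gpu-0"}},          // too small for the request
		{priority: 2, devices: []string{"gpu-2", "gpu-3"}}, // next priority decides
	}
	fmt.Println(chooseByAffinity(groups, 2))
}
```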

fix: simplify logic of unallocated devices and change name of field
- introduce DefaultResourceStateGeneratorRegistry for resource state generation
- add SetResourceState method to state interface
- move strategy registry to separate package
- enhance GenericAllocationStrategy with dynamic strategy selection
- update device topology registry with thread-safe operations (see the sketch after this list)
- consolidate GPU and RDMA device plugin initialization
- improve state checkpoint handling with resource state generators
- add custom strategy configuration options
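The thread-safe registry and resource state generators mentioned in this list can be sketched with a sync.RWMutex around a plain map; the ResourceStateGenerator signature below is an assumption for illustration, not the real interface.

```go
package main

import (
	"fmt"
	"sync"
)

// ResourceStateGenerator produces the initial state blob for one resource.
type ResourceStateGenerator func() (string, error)

type generatorRegistry struct {
	mu         sync.RWMutex
	generators map[string]ResourceStateGenerator
}

func newGeneratorRegistry() *generatorRegistry {
	return &generatorRegistry{generators: map[string]ResourceStateGenerator{}}
}

// Register adds or replaces a generator under a write lock.
func (r *generatorRegistry) Register(name string, g ResourceStateGenerator) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.generators[name] = g
}

// Get looks a generator up under a read lock, so concurrent readers don't block each other.
func (r *generatorRegistry) Get(name string) (ResourceStateGenerator, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	g, ok := r.generators[name]
	return g, ok
}

func main() {
	reg := newGeneratorRegistry()
	reg.Register("gpu_memory", func() (string, error) { return "{}", nil })
	if g, ok := reg.Get("gpu_memory"); ok {
		s, _ := g()
		fmt.Println(s)
	}
}
```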
chore: fix unit test, format and lint issues
fix: maintain affinity subgroup sequence in larger affinity groups
refactor: simplify code by deleting redundant parameters and refactor function for allocating devices
JustinChengLZ and others added 12 commits December 16, 2025 10:17
The nil device request check was moved to after QoS validation to ensure the proper resource allocation sequence. This change keeps the same behavior but improves the logical flow of the allocation process.
remove redundant gpuCount and gpuNames logging in GetTopologyHints and Allocate
move GetGPUCount call after initial checks in GetTopologyHints
use deviceReq.DeviceRequest instead of gpuCount for memory calculation
filter out unhealthy gpu devices when calculating topology hints
skip numa binding hints for non-numa binding requests (see the sketch below)
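Two of the fixes above, filtering unhealthy devices and skipping NUMA-binding hints for non-NUMA-binding requests, are sketched together below with hypothetical types.

```go
package main

import "fmt"

type gpuDevice struct {
	id      string
	healthy bool
	numa    int
}

// hintNUMANodes returns nil when the request has no NUMA-binding requirement,
// and otherwise only considers healthy devices when collecting NUMA nodes.
func hintNUMANodes(devs []gpuDevice, numaBinding bool) []int {
	if !numaBinding {
		return nil // no NUMA hints needed for non-NUMA-binding requests
	}
	seen := map[int]bool{}
	var nodes []int
	for _, d := range devs {
		if !d.healthy || seen[d.numa] {
			continue
		}
		seen[d.numa] = true
		nodes = append(nodes, d.numa)
	}
	return nodes
}

func main() {
	devs := []gpuDevice{{"gpu-0", true, 0}, {"gpu-1", false, 1}}
	fmt.Println(hintNUMANodes(devs, true)) // only NUMA node 0: gpu-1 is unhealthy
}
```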
Set allocatable memory to zero for unhealthy GPU devices and use separate capacity values instead of reusing allocatable values. This ensures accurate resource accounting for both healthy and unhealthy devices.
Clean up GenericQRMPluginConfiguration by removing unused StateFileDirectory and InMemoryStateFileDirectory fields to simplify the struct.
Return nil instead of error when numa topology is not ready and log the error

fix: handle error gracefully
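The graceful handling described above amounts to treating "topology not ready" as "no hints yet" rather than as a failure. A minimal sketch, with hypothetical names:

```go
package main

import "log"

// getNUMAHints degrades gracefully: if NUMA topology is not ready yet it
// logs the condition and returns nil, nil so the caller sees "no hints yet"
// rather than a hard error.
func getNUMAHints(topologyReady bool) ([]int, error) {
	if !topologyReady {
		log.Println("numa topology not ready, skipping hint generation")
		return nil, nil
	}
	return []int{0, 1}, nil
}

func main() {
	hints, err := getNUMAHints(false)
	log.Printf("hints=%v err=%v", hints, err)
}
```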
- Add DeviceName field to AllocationInfo struct to track GPU device names
- Implement GetQuantityAllocatedWithFilter to support filtered allocation queries (see the sketch after this list)
- Modify GPU memory plugin to consider device names during allocation
- Remove NUMA binding check and use device name filtering instead
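A sketch of the filtered query named in this list: AllocationInfo, DeviceName, and GetQuantityAllocatedWithFilter come from the commit text, but their signatures and the surrounding state type here are assumptions for illustration.

```go
package main

import "fmt"

// AllocationInfo records one allocation, including the GPU device name.
type AllocationInfo struct {
	PodUID     string
	DeviceName string
	Quantity   float64
}

type gpuMemoryState struct {
	allocations []AllocationInfo
}

// GetQuantityAllocatedWithFilter sums allocated GPU memory over the
// allocations accepted by the filter (e.g. "same device name").
func (s gpuMemoryState) GetQuantityAllocatedWithFilter(filter func(AllocationInfo) bool) float64 {
	var total float64
	for _, a := range s.allocations {
		if filter(a) {
			total += a.Quantity
		}
	}
	return total
}

func main() {
	s := gpuMemoryState{allocations: []AllocationInfo{
		{"pod-a", "gpu-0", 8e9},
		{"pod-b", "gpu-1", 4e9},
	}}
	onGPU0 := s.GetQuantityAllocatedWithFilter(func(a AllocationInfo) bool { return a.DeviceName == "gpu-0" })
	fmt.Println(onGPU0)
}
```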
@JustinChengLZ force-pushed the dev/report-gpu-topology-cnr branch 3 times, most recently from 60a6c30 to d11331f on December 22, 2025 05:41
@codecov

codecov bot commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 63.14452% with 1015 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.63%. Comparing base (8892043) to head (1c0255d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/agent/qrm-plugins/gpu/staticpolicy/policy.go | 51.48% | 152 Missing and 28 partials ⚠️ |
| ...rm-plugins/gpu/resourceplugin/gpumemory/gpu_mem.go | 63.90% | 114 Missing and 30 partials ⚠️ |
| ...strategy/allocate/strategies/allocation/generic.go | 10.71% | 75 Missing ⚠️ |
| pkg/agent/qrm-plugins/gpu/state/state.go | 63.63% | 51 Missing and 13 partials ⚠️ |
| ...nt/qrm-plugins/gpu/baseplugin/reporter/reporter.go | 55.72% | 45 Missing and 13 partials ⚠️ |
| ...gent/qrm-plugins/gpu/customdeviceplugin/gpu/gpu.go | 63.11% | 39 Missing and 6 partials ⚠️ |
| ...kg/agent/qrm-plugins/gpu/state/state_checkpoint.go | 64.28% | 35 Missing and 10 partials ⚠️ |
| ...m-plugins/gpu/strategy/allocate/manager/manager.go | 45.67% | 40 Missing and 4 partials ⚠️ |
| ...nt/qrm-plugins/gpu/customdeviceplugin/rdma/rdma.go | 72.61% | 27 Missing and 16 partials ⚠️ |
| pkg/util/machine/device.go | 71.83% | 29 Missing and 11 partials ⚠️ |
| ... and 22 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1042      +/-   ##
==========================================
+ Coverage   60.45%   60.63%   +0.18%     
==========================================
  Files         695      733      +38     
  Lines       65769    68500    +2731     
==========================================
+ Hits        39760    41538    +1778     
- Misses      21501    22282     +781     
- Partials     4508     4680     +172     
| Flag | Coverage Δ |
|---|---|
| unittest | 60.63% <63.14%> (+0.18%) ⬆️ |

Flags with carried forward coverage won't be shown.

@JustinChengLZ force-pushed the dev/report-gpu-topology-cnr branch 6 times, most recently from 713cbeb to 1c0255d on December 23, 2025 07:07
@JustinChengLZ force-pushed the dev/report-gpu-topology-cnr branch from 1c0255d to 3aacb80 on December 24, 2025 06:23