
feat(gpu): implement gpu plugins #1008

Merged
xu282934741 merged 57 commits into kubewharf:main from JustinChengLZ:dev/support-gpu-plugins
Mar 3, 2026

Conversation

@JustinChengLZ (Collaborator) commented Oct 29, 2025

What type of PR is this?

  • Implement a framework for GPU plugins (custom device plugin and resource plugin interfaces)
  • Implement allocation logic for the gpu-memory and gpu resources
  • Support device affinity when collecting device topology
  • Implement a strategy framework for allocating devices

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:
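As a rough sketch of the two plugin roles the PR description mentions, a custom device plugin enumerates devices while a resource plugin answers topology-hint and allocation calls. All interface and type names below are hypothetical illustrations, not the actual katalyst API:

```go
package main

import "fmt"

// Hypothetical sketches of the two plugin interfaces; the real katalyst
// interfaces carry gRPC request/response types and much more context.
type customDevicePlugin interface {
	ResourceName() string
	ListDevices() []string
}

type resourcePlugin interface {
	ResourceName() string
	GetTopologyHints(request int) []int // candidate NUMA nodes
	Allocate(request int) ([]string, error)
}

// gpuPlugin is a toy implementation backing both roles with a static
// device-to-NUMA-node map.
type gpuPlugin struct {
	devices map[string]int // device id -> NUMA node
}

func (g gpuPlugin) ResourceName() string { return "nvidia.com/gpu" }

func (g gpuPlugin) ListDevices() []string {
	out := make([]string, 0, len(g.devices))
	for id := range g.devices {
		out = append(out, id)
	}
	return out
}

// GetTopologyHints reports each NUMA node that hosts at least one device.
func (g gpuPlugin) GetTopologyHints(request int) []int {
	seen := map[int]bool{}
	var hints []int
	for _, numa := range g.devices {
		if !seen[numa] {
			seen[numa] = true
			hints = append(hints, numa)
		}
	}
	return hints
}

// Allocate hands out the first `request` devices, failing if too few exist.
func (g gpuPlugin) Allocate(request int) ([]string, error) {
	ids := g.ListDevices()
	if request > len(ids) {
		return nil, fmt.Errorf("want %d devices, have %d", request, len(ids))
	}
	return ids[:request], nil
}

func main() {
	var p resourcePlugin = gpuPlugin{devices: map[string]int{"gpu-0": 0, "gpu-1": 1}}
	got, _ := p.Allocate(1)
	fmt.Println(p.ResourceName(), len(got)) // nvidia.com/gpu 1
}
```

Splitting the two roles into separate interfaces lets one concrete plugin satisfy both while keeping the device-enumeration and resource-allocation call paths independent.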

@codecov
codecov bot commented Oct 29, 2025

Codecov Report

❌ Patch coverage is 63.29491% with 987 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.82%. Comparing base (716974a) to head (bdd283d).
⚠️ Report is 70 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/agent/qrm-plugins/gpu/staticpolicy/policy.go | 48.78% | 174 Missing and 36 partials ⚠️ |
| ...rm-plugins/gpu/resourceplugin/gpumemory/gpu_mem.go | 64.32% | 113 Missing and 29 partials ⚠️ |
| ...strategy/allocate/strategies/allocation/generic.go | 10.71% | 75 Missing ⚠️ |
| pkg/agent/qrm-plugins/gpu/state/state.go | 62.16% | 54 Missing and 16 partials ⚠️ |
| ...kg/agent/qrm-plugins/gpu/state/state_checkpoint.go | 54.72% | 48 Missing and 19 partials ⚠️ |
| ...nt/qrm-plugins/gpu/baseplugin/reporter/reporter.go | 53.67% | 51 Missing and 12 partials ⚠️ |
| ...gent/qrm-plugins/gpu/customdeviceplugin/gpu/gpu.go | 63.41% | 39 Missing and 6 partials ⚠️ |
| ...m-plugins/gpu/strategy/allocate/manager/manager.go | 45.67% | 40 Missing and 4 partials ⚠️ |
| pkg/util/machine/device.go | 80.00% | 24 Missing and 10 partials ⚠️ |
| ...trategy/allocate/strategies/deviceaffinity/bind.go | 84.47% | 19 Missing and 6 partials ⚠️ |
| ... and 20 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1008      +/-   ##
==========================================
+ Coverage   60.53%   60.82%   +0.29%     
==========================================
  Files         700      737      +37     
  Lines       66807    69521    +2714     
==========================================
+ Hits        40439    42284    +1845     
- Misses      21809    22504     +695     
- Partials     4559     4733     +174     
| Flag | Coverage Δ |
| --- | --- |
| unittest | 60.82% <63.29%> (+0.29%) ⬆️ |

Flags with carried forward coverage won't be shown.


@JustinChengLZ force-pushed the dev/support-gpu-plugins branch 11 times, most recently from e57c066 to 3a1552c on November 5, 2025 01:41
@JustinChengLZ force-pushed the dev/support-gpu-plugins branch 8 times, most recently from 4e39a97 to 969a658 on November 12, 2025 08:12
@luomingmeng force-pushed the dev/support-gpu-plugins branch 5 times, most recently from eac1657 to 95fc12f on November 20, 2025 03:34
JustinChengLZ and others added 22 commits February 26, 2026 17:44
refactor: make allocation recursive to simplify logic
- Add logging for missing device affinity provider
- Fix negative NUMA node handling in GPU memory allocation
- Simplify GPU topology lookup logic
- Ensure proper handling of zero quantity resource requests
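The recursive-allocation refactor and zero-quantity handling above can be sketched roughly as follows; `device` and `allocateRecursive` are illustrative names, not the actual katalyst code:

```go
package main

import "fmt"

// device is a hypothetical stand-in for a GPU device entry.
type device struct {
	id      string
	healthy bool
}

// allocateRecursive picks `want` healthy devices from devs by recursing on
// the remainder; recursion keeps the selection in one place instead of a
// nested loop with manual bookkeeping.
func allocateRecursive(devs []device, want int) []string {
	if want == 0 || len(devs) == 0 {
		return nil // zero-quantity requests allocate nothing
	}
	head, rest := devs[0], devs[1:]
	if !head.healthy {
		return allocateRecursive(rest, want) // skip unhealthy devices
	}
	picked := allocateRecursive(rest, want-1)
	if len(picked) != want-1 {
		return nil // not enough healthy devices left
	}
	return append([]string{head.id}, picked...)
}

func main() {
	devs := []device{{"gpu-0", true}, {"gpu-1", false}, {"gpu-2", true}}
	fmt.Println(allocateRecursive(devs, 2)) // [gpu-0 gpu-2]
}
```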
…ory strategy

implement canonical strategy with separate filter and bind logic
simplify gpu memory strategy by removing numa affinity check
register canonical strategy as default allocation strategy
- Simplify affinity group structure by using sets.String for unallocated devices
- Implement more efficient allocation strategy with priority-based processing
- Remove unused types and consolidate allocation logic
- Add helper methods for group evaluation and sorting
- Improve error handling and early termination conditions
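The separate filter and bind phases mentioned in these commits can be sketched as a two-method strategy interface; the `strategy`/`canonical` names and the plain-map stand-in for `sets.String` are hypothetical, not the actual katalyst types:

```go
package main

import (
	"fmt"
	"sort"
)

// strategy sketches a filter/bind split: Filter narrows the candidate set,
// Bind commits a selection from the filtered set.
type strategy interface {
	Filter(candidates []string) []string
	Bind(filtered []string, want int) []string
}

// canonical is a minimal default strategy: drop already-allocated devices,
// then bind the first `want` in sorted order for determinism.
type canonical struct {
	allocated map[string]struct{} // stand-in for a sets.String of used devices
}

func (c canonical) Filter(candidates []string) []string {
	var out []string
	for _, id := range candidates {
		if _, used := c.allocated[id]; !used {
			out = append(out, id)
		}
	}
	return out
}

func (c canonical) Bind(filtered []string, want int) []string {
	sort.Strings(filtered)
	if want > len(filtered) {
		return nil
	}
	return filtered[:want]
}

func main() {
	var s strategy = canonical{allocated: map[string]struct{}{"gpu-1": {}}}
	free := s.Filter([]string{"gpu-0", "gpu-1", "gpu-2"})
	fmt.Println(s.Bind(free, 2)) // [gpu-0 gpu-2]
}
```

Keeping the two phases separate lets more specialized strategies (e.g. device affinity) reuse a shared filter while swapping only the bind step.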
The nil device request check was moved to after QoS validation to ensure the proper resource allocation sequence. Behavior is unchanged, but the logical flow of the allocation process is clearer.
remove redundant gpuCount and gpuNames logging in GetTopologyHints and Allocate
move GetGPUCount call after initial checks in GetTopologyHints
use deviceReq.DeviceRequest instead of gpuCount for memory calculation
filter out unhealthy gpu devices when calculating topology hints
skip numa binding hints for non-numa binding requests
Set allocatable memory to zero for unhealthy GPU devices and use separate capacity values instead of reusing allocatable values. This ensures accurate resource accounting for both healthy and unhealthy devices.
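The accounting change above, where unhealthy GPUs still count toward capacity but report zero allocatable, can be sketched like this (the `gpuDev` record and `resourceTotals` helper are hypothetical illustrations):

```go
package main

import "fmt"

// gpuDev is a hypothetical device record used to illustrate the accounting
// change: unhealthy devices keep their capacity but report zero allocatable.
type gpuDev struct {
	id          string
	healthy     bool
	capacityMiB int64
}

// resourceTotals keeps capacity and allocatable as separate values so that
// an unhealthy GPU still counts toward capacity but contributes nothing
// allocatable, instead of reusing one value for both.
func resourceTotals(devs []gpuDev) (capacity, allocatable int64) {
	for _, d := range devs {
		capacity += d.capacityMiB
		if d.healthy {
			allocatable += d.capacityMiB
		}
	}
	return
}

func main() {
	devs := []gpuDev{
		{"gpu-0", true, 81920},
		{"gpu-1", false, 81920}, // unhealthy: capacity only
	}
	c, a := resourceTotals(devs)
	fmt.Println(c, a) // 163840 81920
}
```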
Clean up GenericQRMPluginConfiguration by removing unused StateFileDirectory and InMemoryStateFileDirectory fields to simplify the struct.
Return nil instead of error when numa topology is not ready and log the error

fix: handle error gracefully
- Add DeviceName field to AllocationInfo struct to track GPU device names
- Implement GetQuantityAllocatedWithFilter to support filtered allocation queries
- Modify GPU memory plugin to consider device names during allocation
- Remove NUMA binding check and use device name filtering instead
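The filtered allocation query described above can be sketched as a sum over allocation records that pass a caller-supplied predicate; the slimmed-down `allocationInfo` struct and the function name here are illustrative, not the actual katalyst signatures:

```go
package main

import "fmt"

// allocationInfo is a hypothetical slimmed-down record; the real struct
// carries much more. DeviceName ties an allocation to a specific GPU.
type allocationInfo struct {
	PodUID     string
	DeviceName string
	Quantity   float64
}

// getQuantityAllocatedWithFilter sums the quantities of allocations that
// pass the filter, e.g. "only allocations on this device name".
func getQuantityAllocatedWithFilter(infos []allocationInfo, keep func(allocationInfo) bool) float64 {
	var total float64
	for _, ai := range infos {
		if keep(ai) {
			total += ai.Quantity
		}
	}
	return total
}

func main() {
	infos := []allocationInfo{
		{"pod-a", "gpu-0", 8192},
		{"pod-b", "gpu-1", 4096},
		{"pod-c", "gpu-0", 2048},
	}
	onGPU0 := func(ai allocationInfo) bool { return ai.DeviceName == "gpu-0" }
	fmt.Println(getQuantityAllocatedWithFilter(infos, onGPU0)) // 10240
}
```

Passing the predicate in lets the gpu-memory plugin ask per-device questions without baking a NUMA-binding check into the state layer.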
JustinChengLZ and others added 5 commits March 3, 2026 11:18
add ensureState method to handle state initialization and resource registration
update all policy methods to use ensureState for consistent state management
modify tests to verify lazy initialization behavior
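The `ensureState` lazy-initialization pattern described in these commits can be sketched with `sync.Once`; the `policy` type and its fields are hypothetical stand-ins for the real state machinery:

```go
package main

import (
	"fmt"
	"sync"
)

// policy sketches lazy state initialization: every public method calls
// ensureState first, so state setup and resource registration happen exactly
// once, on first use, regardless of which method is hit first.
type policy struct {
	once  sync.Once
	state map[string]string // stand-in for the real State interface
}

func (p *policy) ensureState() {
	p.once.Do(func() {
		p.state = make(map[string]string)
		p.state["registered"] = "gpu-memory" // stand-in resource registration
	})
}

func (p *policy) Allocate(pod string) string {
	p.ensureState() // consistent state management across all policy methods
	p.state[pod] = "gpu-0"
	return p.state[pod]
}

func main() {
	p := &policy{}
	fmt.Println(p.Allocate("pod-a")) // gpu-0
}
```

`sync.Once` makes the initialization safe under concurrent method calls, which is why it is preferable to a bare `if p.state == nil` check.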

Labels

workflow/need-review review: test succeeded, need to review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants