@JustinChengLZ JustinChengLZ commented Oct 29, 2025

What type of PR is this?

  • Implement framework for gpu plugins (custom device plugin and resource plugin interfaces)
  • Implement logic for gpu-memory and gpu allocation
  • Support device affinity when collecting device topology
  • Implement strategy framework for allocating devices
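A strategy framework for device allocation could be sketched roughly as below. This is only an illustration under stated assumptions: the `Device` and `Strategy` types, the `healthyOnly` and `onNUMANode` filters, and the `Allocate` function are hypothetical names, not the interfaces this PR adds.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Device is a hypothetical, simplified view of one GPU.
type Device struct {
	ID       string
	NUMANode int
	Healthy  bool
}

// Strategy is a hypothetical filtering step: it returns the devices
// that remain eligible after the step has been applied.
type Strategy interface {
	Name() string
	Filter(candidates []Device) []Device
}

// healthyOnly drops devices that are reported as unhealthy.
type healthyOnly struct{}

func (healthyOnly) Name() string { return "healthy-only" }

func (healthyOnly) Filter(candidates []Device) []Device {
	var out []Device
	for _, d := range candidates {
		if d.Healthy {
			out = append(out, d)
		}
	}
	return out
}

// onNUMANode keeps only devices attached to one NUMA node.
type onNUMANode struct{ node int }

func (s onNUMANode) Name() string { return fmt.Sprintf("numa-%d", s.node) }

func (s onNUMANode) Filter(candidates []Device) []Device {
	var out []Device
	for _, d := range candidates {
		if d.NUMANode == s.node {
			out = append(out, d)
		}
	}
	return out
}

// ErrNotEnoughDevices is returned when fewer than n devices stay eligible.
var ErrNotEnoughDevices = errors.New("not enough eligible devices")

// Allocate runs the strategies in order, then picks the first n
// remaining devices sorted by ID so the result is deterministic.
func Allocate(devices []Device, n int, strategies ...Strategy) ([]string, error) {
	if n < 0 {
		return nil, errors.New("requested device count must not be negative")
	}
	candidates := append([]Device(nil), devices...) // copy, do not mutate the input
	for _, s := range strategies {
		candidates = s.Filter(candidates)
		if len(candidates) < n {
			return nil, fmt.Errorf("after strategy %s: %w", s.Name(), ErrNotEnoughDevices)
		}
	}
	if len(candidates) < n {
		return nil, ErrNotEnoughDevices
	}
	sort.Slice(candidates, func(i, j int) bool { return candidates[i].ID < candidates[j].ID })
	ids := make([]string, 0, n)
	for _, d := range candidates[:n] {
		ids = append(ids, d.ID)
	}
	return ids, nil
}

func main() {
	devices := []Device{
		{ID: "gpu-0", NUMANode: 0, Healthy: true},
		{ID: "gpu-1", NUMANode: 0, Healthy: false},
		{ID: "gpu-2", NUMANode: 1, Healthy: true},
		{ID: "gpu-3", NUMANode: 1, Healthy: true},
	}
	ids, err := Allocate(devices, 2, healthyOnly{}, onNUMANode{node: 1})
	fmt.Println(ids, err) // prints "[gpu-2 gpu-3] <nil>"
}
```

Keeping each step behind one small interface means a new placement rule can be added without touching the code that runs the chain.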

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

codecov bot commented Oct 29, 2025

Codecov Report

❌ Patch coverage is 63.32046% with 950 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.62%. Comparing base (7efab29) to head (2111b92).
⚠️ Report is 19 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/agent/qrm-plugins/gpu/staticpolicy/policy.go | 52.82% | 155 Missing and 29 partials ⚠️ |
| ...rm-plugins/gpu/resourceplugin/gpumemory/gpu_mem.go | 64.32% | 113 Missing and 29 partials ⚠️ |
| ...strategy/allocate/strategies/allocation/generic.go | 10.71% | 75 Missing ⚠️ |
| pkg/agent/qrm-plugins/gpu/state/state.go | 62.16% | 54 Missing and 16 partials ⚠️ |
| ...kg/agent/qrm-plugins/gpu/state/state_checkpoint.go | 58.10% | 44 Missing and 18 partials ⚠️ |
| ...nt/qrm-plugins/gpu/baseplugin/reporter/reporter.go | 55.72% | 45 Missing and 13 partials ⚠️ |
| ...gent/qrm-plugins/gpu/customdeviceplugin/gpu/gpu.go | 63.41% | 39 Missing and 6 partials ⚠️ |
| ...m-plugins/gpu/strategy/allocate/manager/manager.go | 45.67% | 40 Missing and 4 partials ⚠️ |
| pkg/util/machine/device.go | 77.04% | 19 Missing and 9 partials ⚠️ |
| ...trategy/allocate/strategies/deviceaffinity/bind.go | 84.47% | 19 Missing and 6 partials ⚠️ |

... and 20 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1008      +/-   ##
==========================================
+ Coverage   60.48%   60.62%   +0.14%     
==========================================
  Files         698      735      +37     
  Lines       66360    69182    +2822     
==========================================
+ Hits        40138    41942    +1804     
- Misses      21677    22508     +831     
- Partials     4545     4732     +187     
| Flag | Coverage Δ |
|---|---|
| unittest | 60.62% <63.32%> (+0.14%) ⬆️ |

@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 11 times, most recently from e57c066 to 3a1552c Compare November 5, 2025 01:41
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 8 times, most recently from 4e39a97 to 969a658 Compare November 12, 2025 08:12
@luomingmeng luomingmeng force-pushed the dev/support-gpu-plugins branch 5 times, most recently from eac1657 to 95fc12f Compare November 20, 2025 03:34
JustinChengLZ and others added 16 commits January 6, 2026 10:18
The nil device request check was moved to after qos validation to ensure proper resource allocation sequence. This change maintains the same behavior but improves the logical flow of the allocation process.
remove redundant gpuCount and gpuNames logging in GetTopologyHints and Allocate
move GetGPUCount call after initial checks in GetTopologyHints
use deviceReq.DeviceRequest instead of gpuCount for memory calculation
filter out unhealthy gpu devices when calculating topology hints
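Filtering unhealthy devices before computing hints might look like the sketch below. It is a simplification under assumptions: the `GPU` type and `CandidateNUMANodes` function are hypothetical, and only single-NUMA-node candidates are considered, whereas real topology hints can span several nodes.

```go
package main

import (
	"fmt"
	"sort"
)

// GPU is a hypothetical, simplified device record.
type GPU struct {
	Name     string
	NUMANode int
	Healthy  bool
}

// CandidateNUMANodes returns, in ascending order, the NUMA nodes that
// hold at least `requested` healthy GPUs. Unhealthy GPUs are skipped
// before counting, so they never make a node look like a valid hint.
func CandidateNUMANodes(gpus []GPU, requested int) []int {
	healthyPerNode := map[int]int{}
	for _, g := range gpus {
		if !g.Healthy {
			continue
		}
		healthyPerNode[g.NUMANode]++
	}
	nodes := []int{}
	for node, count := range healthyPerNode {
		if count >= requested {
			nodes = append(nodes, node)
		}
	}
	sort.Ints(nodes) // map iteration order is random, so sort for stable output
	return nodes
}

func main() {
	gpus := []GPU{
		{Name: "gpu-0", NUMANode: 0, Healthy: true},
		{Name: "gpu-1", NUMANode: 0, Healthy: false},
		{Name: "gpu-2", NUMANode: 1, Healthy: true},
		{Name: "gpu-3", NUMANode: 1, Healthy: true},
	}
	fmt.Println(CandidateNUMANodes(gpus, 2)) // prints "[1]"
}
```

Without the health check, node 0 would be offered as a hint for a two-GPU request even though only one of its devices is usable.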
skip numa binding hints for non-numa binding requests
Set allocatable memory to zero for unhealthy GPU devices and use separate capacity values instead of reusing allocatable values. This ensures accurate resource accounting for both healthy and unhealthy devices.
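The accounting rule in this commit message can be illustrated with a small sketch. The `GPU` and `MemoryReport` types and both functions are hypothetical, and the sketch assumes that capacity keeps reporting a device's full memory even when the device is unhealthy; the commit message only says that capacity no longer reuses the allocatable value.

```go
package main

import "fmt"

// GPU is a hypothetical device record; MemoryGiB is its total memory.
type GPU struct {
	Name      string
	MemoryGiB int64
	Healthy   bool
}

// MemoryReport keeps capacity and allocatable as separate values.
type MemoryReport struct {
	Capacity    int64
	Allocatable int64
}

// ReportMemory computes capacity from the device's memory and sets
// allocatable to zero when the device is unhealthy.
func ReportMemory(g GPU) MemoryReport {
	r := MemoryReport{Capacity: g.MemoryGiB} // assumption: capacity ignores health
	if g.Healthy {
		r.Allocatable = g.MemoryGiB
	}
	return r
}

// TotalMemory sums the per-device reports.
func TotalMemory(gpus []GPU) MemoryReport {
	var total MemoryReport
	for _, g := range gpus {
		r := ReportMemory(g)
		total.Capacity += r.Capacity
		total.Allocatable += r.Allocatable
	}
	return total
}

func main() {
	gpus := []GPU{
		{Name: "gpu-0", MemoryGiB: 80, Healthy: true},
		{Name: "gpu-1", MemoryGiB: 80, Healthy: false},
	}
	total := TotalMemory(gpus)
	fmt.Println(total.Capacity, total.Allocatable) // prints "160 80"
}
```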
Clean up GenericQRMPluginConfiguration by removing unused StateFileDirectory and InMemoryStateFileDirectory fields to simplify the struct.
Return nil instead of error when numa topology is not ready and log the error
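One way to read this commit is the pattern sketched below: a "not ready" condition is logged and swallowed so the caller can retry later, while other errors still propagate. The names `ErrTopologyNotReady`, `TopologyProvider`, and `RefreshTopology` are hypothetical, and the extra boolean return exists only to make the sketch easy to check.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// ErrTopologyNotReady is a hypothetical sentinel for "try again later".
var ErrTopologyNotReady = errors.New("numa topology is not ready")

// TopologyProvider is a hypothetical source of NUMA node to device names.
type TopologyProvider func() (map[int][]string, error)

// RefreshTopology reports whether the topology was refreshed.
// When the topology is not ready yet, it logs the error and returns
// a nil error so the caller does not treat the cycle as failed.
func RefreshTopology(get TopologyProvider) (bool, error) {
	topo, err := get()
	if errors.Is(err, ErrTopologyNotReady) {
		log.Printf("skipping topology refresh: %v", err)
		return false, nil
	}
	if err != nil {
		return false, fmt.Errorf("refresh topology: %w", err)
	}
	log.Printf("refreshed topology for %d NUMA node(s)", len(topo))
	return true, nil
}

func main() {
	notReady := func() (map[int][]string, error) { return nil, ErrTopologyNotReady }
	ready := func() (map[int][]string, error) {
		return map[int][]string{0: {"gpu-0"}, 1: {"gpu-1"}}, nil
	}
	fmt.Println(RefreshTopology(notReady)) // prints "false <nil>"
	fmt.Println(RefreshTopology(ready))    // prints "true <nil>"
}
```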

fix: handle error gracefully

fix: handle error gracefully
- Add DeviceName field to AllocationInfo struct to track GPU device names
- Implement GetQuantityAllocatedWithFilter to support filtered allocation queries
- Modify GPU memory plugin to consider device names during allocation
- Remove NUMA binding check and use device name filtering instead
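The bullets above can be illustrated with a minimal sketch of a filtered allocation query. Only the `DeviceName` field and the name `GetQuantityAllocatedWithFilter` come from the commit message; the signature, the other fields, the `ByDeviceName` helper, and the device name strings are assumptions made for illustration.

```go
package main

import "fmt"

// AllocationInfo is a simplified, hypothetical version of the struct;
// only the DeviceName field is named in the commit message.
type AllocationInfo struct {
	PodUID     string
	DeviceName string
	Quantity   int64
}

// GetQuantityAllocatedWithFilter sums the quantity of every allocation
// accepted by keep. The signature is an assumption, not the PR's API.
func GetQuantityAllocatedWithFilter(allocs []AllocationInfo, keep func(AllocationInfo) bool) int64 {
	var total int64
	for _, a := range allocs {
		if keep(a) {
			total += a.Quantity
		}
	}
	return total
}

// ByDeviceName is a hypothetical helper that builds a filter matching
// any of the given device names.
func ByDeviceName(names ...string) func(AllocationInfo) bool {
	wanted := make(map[string]struct{}, len(names))
	for _, n := range names {
		wanted[n] = struct{}{}
	}
	return func(a AllocationInfo) bool {
		_, ok := wanted[a.DeviceName]
		return ok
	}
}

func main() {
	allocs := []AllocationInfo{
		{PodUID: "pod-a", DeviceName: "vendor-a/gpu", Quantity: 16},
		{PodUID: "pod-b", DeviceName: "vendor-a/gpu", Quantity: 8},
		{PodUID: "pod-c", DeviceName: "vendor-b/gpu", Quantity: 4},
	}
	fmt.Println(GetQuantityAllocatedWithFilter(allocs, ByDeviceName("vendor-a/gpu"))) // prints "24"
}
```

Filtering by device name lets the memory plugin count only the allocations that belong to the device type being requested, which is the role the removed NUMA binding check appears to have played before.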
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 4 times, most recently from 9221413 to f72aa18 Compare January 6, 2026 07:34
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch 6 times, most recently from 6b4272b to eb87c70 Compare January 7, 2026 07:01
@JustinChengLZ JustinChengLZ force-pushed the dev/support-gpu-plugins branch from eb87c70 to 33b37df Compare January 7, 2026 07:10
workaround for fixing syncNics failure

Signed-off-by: 张浩宇 <[email protected]>
Labels

workflow/need-review review: test succeeded, need to review

4 participants