Conversation

@luomingmeng
Collaborator

What type of PR is this?

Features

What this PR does / why we need it:

This commit introduces a new static policy implementation for GPU resource management, including:

  • GPU topology provider and state management
  • Static policy implementation with allocation and deallocation logic
  • Integration with existing QRM framework
  • Metrics and health checks for GPU resource management

Which issue(s) this PR fixes:

Special notes for your reviewer:

@codecov

codecov bot commented Jun 20, 2025

Codecov Report

❌ Patch coverage is 2.48156% with 1454 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.60%. Comparing base (221a24d) to head (59ceef8).

Files with missing lines | Patch % | Lines
pkg/agent/qrm-plugins/gpu/staticpolicy/policy.go | 0.00% | 820 Missing ⚠️
...kg/agent/qrm-plugins/gpu/state/state_checkpoint.go | 0.00% | 159 Missing ⚠️
pkg/agent/qrm-plugins/gpu/state/state.go | 0.00% | 147 Missing ⚠️
pkg/agent/qrm-plugins/gpu/state/state_mem.go | 0.00% | 74 Missing ⚠️
pkg/util/machine/gpu.go | 0.00% | 62 Missing ⚠️
pkg/agent/qrm-plugins/gpu/util/util.go | 0.00% | 52 Missing ⚠️
pkg/agent/qrm-plugins/gpu/state/util.go | 0.00% | 43 Missing ⚠️
pkg/agent/qrm-plugins/gpu/state/checkpoint.go | 0.00% | 18 Missing ⚠️
pkg/agent/orm/endpoint/resource_plugin_stub.go | 19.04% | 17 Missing ⚠️
cmd/katalyst-agent/app/agent/qrm/gpu_plugin.go | 11.11% | 16 Missing ⚠️
... and 5 more

❌ Your patch check has failed because the patch coverage (1.45%) is below the target coverage (50.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #877      +/-   ##
==========================================
- Coverage   58.61%   57.60%   -1.01%     
==========================================
  Files         678      689      +11     
  Lines       76623    78089    +1466     
==========================================
+ Hits        44909    44983      +74     
- Misses      27385    28770    +1385     
- Partials     4329     4336       +7     
Flag Coverage Δ
unittest 57.60% <2.48%> (-1.01%) ⬇️


@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 16 times, most recently from d7051af to 1030b14 Compare June 21, 2025 18:27
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch from 1030b14 to 1c9d17b Compare July 18, 2025 07:52
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 7 times, most recently from 86cabed to 5780ab1 Compare July 25, 2025 15:15
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch from 5780ab1 to b4e0de9 Compare July 25, 2025 15:43
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 3 times, most recently from 67d2867 to 9ed86ac Compare July 28, 2025 03:23
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 10 times, most recently from ff2101e to 9845f58 Compare August 5, 2025 02:44
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch from 9845f58 to e15fcc5 Compare August 28, 2025 03:59
This commit introduces a new static policy implementation for GPU resource management, including:
- GPU topology provider and state management
- Static policy implementation with allocation and deallocation logic
- Integration with existing QRM framework
- Metrics and health checks for GPU resource management
- Update GPU memory type from uint64 to float64 for precise allocation
- Implement NUMA-aware GPU topology management and allocation
- Add support for associated device allocation and topology hints
- Introduce new GPU topology provider with NUMA node tracking
- Extend GPU state management with NUMA node information
- Add utility functions for GPU memory hint generation and NUMA calculations
The preferredHintIndexes variable was declared but never used in the code. Removing it improves code clarity and maintainability.
Add new functionality to handle associated device allocation requests in the resource plugin stub. This includes:
- Adding new stub function type and default implementation
- Extending the Stub struct with new field
- Adding new methods for associated device operations
…icy structs

Add pluginapi.UnimplementedResourcePluginServer to all policy structs to ensure forward compatibility with gRPC interface changes
Implement GetAssociatedDeviceTopologyHints method in StaticPolicy and update stub to handle topology hints requests. Also update kubelet dependency version and rename mustIncludeDevices to reusableDevices for clarity.
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 2 times, most recently from 6d9068d to 8073741 Compare September 22, 2025 11:13
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 2 times, most recently from 5362e7f to 5d42f05 Compare September 22, 2025 13:05
… allocated memory

Add tracking of allocated GPU memory per NUMA node and modify hint calculation to prefer nodes with most allocated memory. This helps balance GPU memory usage across NUMA nodes.
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch from 5d42f05 to 59ceef8 Compare September 22, 2025 13:26
PreStartRequired: false,
WithTopologyAlignment: true,
NeedReconcile: false,
AssociatedDevices: p.resourceNames.List(),
Collaborator

@JustinChengLZ JustinChengLZ Sep 30, 2025

Since this is referring to associated devices, should we name resourceNames differently? Maybe associatedDeviceNames? resourceNames can be misunderstood as gpu_memory, gpu_core, etc.
