Conversation

@luomingmeng
Collaborator

What type of PR is this?

Features

What this PR does / why we need it:

This commit introduces a new static policy implementation for GPU resource management, including:

  • GPU topology provider and state management
  • Static policy implementation with allocation and deallocation logic
  • Integration with existing QRM framework
  • Metrics and health checks for GPU resource management

Which issue(s) this PR fixes:

Special notes for your reviewer:

@codecov

codecov bot commented Jun 20, 2025

Codecov Report

❌ Patch coverage is 2.48156% with 1454 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.60%. Comparing base (221a24d) to head (59ceef8).

Files with missing lines | Patch % | Lines
pkg/agent/qrm-plugins/gpu/staticpolicy/policy.go | 0.00% | 820 Missing ⚠️
...kg/agent/qrm-plugins/gpu/state/state_checkpoint.go | 0.00% | 159 Missing ⚠️
pkg/agent/qrm-plugins/gpu/state/state.go | 0.00% | 147 Missing ⚠️
pkg/agent/qrm-plugins/gpu/state/state_mem.go | 0.00% | 74 Missing ⚠️
pkg/util/machine/gpu.go | 0.00% | 62 Missing ⚠️
pkg/agent/qrm-plugins/gpu/util/util.go | 0.00% | 52 Missing ⚠️
pkg/agent/qrm-plugins/gpu/state/util.go | 0.00% | 43 Missing ⚠️
pkg/agent/qrm-plugins/gpu/state/checkpoint.go | 0.00% | 18 Missing ⚠️
pkg/agent/orm/endpoint/resource_plugin_stub.go | 19.04% | 17 Missing ⚠️
cmd/katalyst-agent/app/agent/qrm/gpu_plugin.go | 11.11% | 16 Missing ⚠️
... and 5 more

❌ Your patch check has failed because the patch coverage (1.45%) is below the target coverage (50.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #877      +/-   ##
==========================================
- Coverage   58.61%   57.60%   -1.01%     
==========================================
  Files         678      689      +11     
  Lines       76623    78089    +1466     
==========================================
+ Hits        44909    44983      +74     
- Misses      27385    28770    +1385     
- Partials     4329     4336       +7     
Flag Coverage Δ
unittest 57.60% <2.48%> (-1.01%) ⬇️


@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 16 times, most recently from d7051af to 1030b14 Compare June 21, 2025 18:27
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch from 1030b14 to 1c9d17b Compare July 18, 2025 07:52
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 7 times, most recently from 86cabed to 5780ab1 Compare July 25, 2025 15:15
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch from 5780ab1 to b4e0de9 Compare July 25, 2025 15:43
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 3 times, most recently from 67d2867 to 9ed86ac Compare July 28, 2025 03:23
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 10 times, most recently from ff2101e to 9845f58 Compare August 5, 2025 02:44
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch from 9845f58 to e15fcc5 Compare August 28, 2025 03:59
This commit introduces a new static policy implementation for GPU resource management, including:
- GPU topology provider and state management
- Static policy implementation with allocation and deallocation logic
- Integration with existing QRM framework
- Metrics and health checks for GPU resource management
- Update GPU memory type from uint64 to float64 for precise allocation
- Implement NUMA-aware GPU topology management and allocation
- Add support for associated device allocation and topology hints
- Introduce new GPU topology provider with NUMA node tracking
- Extend GPU state management with NUMA node information
- Add utility functions for GPU memory hint generation and NUMA calculations
The preferredHintIndexes variable was declared but never used in the code. Removing it improves code clarity and maintainability.
Add new functionality to handle associated device allocation requests in the resource plugin stub. This includes:
- Adding new stub function type and default implementation
- Extending the Stub struct with new field
- Adding new methods for associated device operations
…icy structs

Add pluginapi.UnimplementedResourcePluginServer to all policy structs to ensure forward compatibility with gRPC interface changes
Implement GetAssociatedDeviceTopologyHints method in StaticPolicy and update stub to handle topology hints requests. Also update kubelet dependency version and rename mustIncludeDevices to reusableDevices for clarity.
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 2 times, most recently from 6d9068d to 8073741 Compare September 22, 2025 11:13
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch 2 times, most recently from 5362e7f to 5d42f05 Compare September 22, 2025 13:05
… allocated memory

Add tracking of allocated GPU memory per NUMA node and modify hint calculation to prefer nodes with most allocated memory. This helps balance GPU memory usage across NUMA nodes.
@luomingmeng luomingmeng force-pushed the dev/support-gpu-memory-qrm-plugin branch from 5d42f05 to 59ceef8 Compare September 22, 2025 13:26
PreStartRequired: false,
WithTopologyAlignment: true,
NeedReconcile: false,
AssociatedDevices: p.resourceNames.List(),
Collaborator

@JustinChengLZ JustinChengLZ Sep 30, 2025

Since this is referring to associated devices, should we name resourceNames differently? Maybe associatedDeviceNames? resourceNames can be misunderstood as gpu_memory, gpu_core, etc.
