feat(gpu_memory): support milligpu allocation logic by JustinChengLZ · Pull Request #1131 · kubewharf/katalyst-core

JustinChengLZ · 2026-04-21T06:09:43Z

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

codecov · 2026-04-21T09:14:14Z

Codecov Report

❌ Patch coverage is 73.59199% with 422 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.87%. Comparing base (d8c28ea) to head (c5a48f7).

Files with missing lines	Patch %	Lines
...ugins/gpu/resourceplugin/virtualgpu/virtual_gpu.go	74.82%	103 Missing and 40 partials ⚠️
...ugins/memory/dynamicpolicy/policy_hint_handlers.go	66.06%	40 Missing and 16 partials ⚠️
pkg/agent/qrm-plugins/memory/dynamicpolicy/util.go	22.22%	48 Missing and 1 partial ⚠️
...g/agent/qrm-plugins/memory/dynamicpolicy/policy.go	66.66%	32 Missing and 9 partials ⚠️
...memory/dynamicpolicy/policy_allocation_handlers.go	79.08%	30 Missing and 11 partials ⚠️
cmd/katalyst-agent/app/options/qrm/gpu_plugin.go	45.94%	19 Missing and 1 partial ⚠️
pkg/agent/qrm-plugins/util/util.go	75.40%	11 Missing and 4 partials ⚠️
...nt/qrm-plugins/memory/dynamicpolicy/state/state.go	52.00%	9 Missing and 3 partials ⚠️
pkg/agent/qrm-plugins/gpu/staticpolicy/policy.go	46.66%	7 Missing and 1 partial ⚠️
...-plugins/gpu/strategy/allocate/manager/defaults.go	14.28%	2 Missing and 4 partials ⚠️
... and 11 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1131      +/-   ##
==========================================
+ Coverage   61.71%   61.87%   +0.16%     
==========================================
  Files         786      788       +2     
  Lines       73689    74377     +688     
==========================================
+ Hits        45476    46020     +544     
- Misses      23191    23288      +97     
- Partials     5022     5069      +47

Flag	Coverage Δ
unittest	`61.87% <73.59%> (+0.16%)`	⬆️

Add DiscoverMemoryTopology function to calculate normal memory capacity by excluding static hugepages. This enables proper memory accounting in NUMA systems with hugepages configured. The function is now used by the existing Discover method to maintain consistent memory topology information.

Update memory calculations to use NormalMemoryCapacity and NormalMemoryDetails instead of total memory capacity to exclude static hugepages. This provides more accurate resource upper bounds and per-NUMA capacity calculations for memory provisioning and headroom estimation.

Add StaticHugePagesDetails and StaticHugePagesCapacity fields to MemoryTopology to track static hugepages allocation per NUMA node and system-wide. Update test cases and dummy topology generation to include these new fields.

Static huge pages should not be included in the total available memory calculation as they are reserved and cannot be used for dynamic allocation. This change updates the policy to properly account for static huge pages and adds corresponding test cases to verify the behavior.
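The accounting described in these commit messages can be sketched as follows. This is a minimal illustration, not the code from this PR: the field names `NormalMemoryDetails`, `NormalMemoryCapacity`, `StaticHugePagesDetails`, and `StaticHugePagesCapacity` come from the commit messages, while `MemoryDetails`, `buildMemoryTopology`, the map-based layout, and the clamping behavior are assumptions made for the example.

```go
package main

import "fmt"

// MemoryTopology is a simplified, hypothetical stand-in for the structure
// named in the commit messages; the real katalyst-core type may differ.
type MemoryTopology struct {
	MemoryDetails           map[int]uint64 // assumed: total bytes per NUMA node
	StaticHugePagesDetails  map[int]uint64 // bytes reserved as static hugepages per NUMA node
	NormalMemoryDetails     map[int]uint64 // total minus static hugepages, per NUMA node
	StaticHugePagesCapacity uint64         // system-wide static hugepages, in bytes
	NormalMemoryCapacity    uint64         // system-wide normal memory, in bytes
}

// buildMemoryTopology (hypothetical helper) derives the "normal" memory view
// by subtracting static hugepages from the total on every NUMA node.
func buildMemoryTopology(total, hugePages map[int]uint64) MemoryTopology {
	t := MemoryTopology{
		MemoryDetails:          total,
		StaticHugePagesDetails: hugePages,
		NormalMemoryDetails:    make(map[int]uint64, len(total)),
	}
	for numaID, totalBytes := range total {
		reserved := hugePages[numaID] // zero when the node has no static hugepages
		if reserved > totalBytes {
			reserved = totalBytes // assumed: clamp to avoid unsigned underflow
		}
		normal := totalBytes - reserved
		t.NormalMemoryDetails[numaID] = normal
		t.StaticHugePagesCapacity += reserved
		t.NormalMemoryCapacity += normal
	}
	return t
}

func main() {
	const GiB = uint64(1) << 30
	t := buildMemoryTopology(
		map[int]uint64{0: 64 * GiB, 1: 64 * GiB},
		map[int]uint64{0: 16 * GiB},
	)
	// Normal capacity excludes the 16 GiB of static hugepages on node 0.
	fmt.Println(t.NormalMemoryCapacity/GiB, t.StaticHugePagesCapacity/GiB) // prints "112 16"
}
```

With this split, upper bounds and headroom estimates can be computed from the normal values alone, so memory that is already pinned as static hugepages is never offered for dynamic allocation.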

…lation

Add GetMemoryTopology interface to ReadonlyState and implement memory topology support in state management. This enables more precise memory capacity calculations by considering NormalMemoryDetails while excluding hugepages. The change affects multiple state generation functions across the codebase to properly handle memory topology information.

…cation logic

- Rename all fractional GPU references to virtual GPU for consistency
- Optimize topology hint calculation with pre-computed resource contexts
- Improve NUMA node selection logic for packing/spreading strategies
- Update environment variable names and config fields to reflect virtual GPU terminology
- Add comprehensive tests for spreading mode and edge cases
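To illustrate what packing and spreading usually mean for fractional device allocation, here is a hedged sketch. It is not taken from this PR: the `device` type, the `pickDevice` function, the `Strategy` constants, and the convention that one full GPU equals 1000 milliGPU are all assumptions for the example. Packing is taken to prefer the fullest device that still fits the request, and spreading the emptiest one.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Strategy selects how a fractional request is placed (hypothetical type).
type Strategy int

const (
	Packing   Strategy = iota // fill already-used devices first
	Spreading                 // prefer the least-used device
)

// device is a hypothetical view of one GPU; FreeMilli assumes that a whole
// GPU is worth 1000 milliGPU.
type device struct {
	ID        string
	FreeMilli int
}

// pickDevice returns the ID of the device chosen for requestMilli under the
// given strategy, or an error when the request is invalid or nothing fits.
func pickDevice(devices []device, requestMilli int, s Strategy) (string, error) {
	if requestMilli <= 0 {
		return "", errors.New("request must be positive")
	}
	// Keep only devices with enough free capacity.
	candidates := make([]device, 0, len(devices))
	for _, d := range devices {
		if d.FreeMilli >= requestMilli {
			candidates = append(candidates, d)
		}
	}
	if len(candidates) == 0 {
		return "", fmt.Errorf("no device can fit %d milliGPU", requestMilli)
	}
	// Packing sorts by least free capacity, spreading by most; ties fall
	// back to the device ID so the choice is deterministic.
	sort.SliceStable(candidates, func(i, j int) bool {
		if candidates[i].FreeMilli != candidates[j].FreeMilli {
			if s == Packing {
				return candidates[i].FreeMilli < candidates[j].FreeMilli
			}
			return candidates[i].FreeMilli > candidates[j].FreeMilli
		}
		return candidates[i].ID < candidates[j].ID
	})
	return candidates[0].ID, nil
}

func main() {
	devs := []device{{"gpu-0", 300}, {"gpu-1", 1000}, {"gpu-2", 500}}
	packed, _ := pickDevice(devs, 250, Packing)
	spread, _ := pickDevice(devs, 250, Spreading)
	fmt.Println(packed, spread) // prints "gpu-0 gpu-1"
}
```

The same ordering idea could be applied one level up, to NUMA nodes instead of devices, which is where the commit message says the selection logic was improved.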

JustinChengLZ requested review from cheney-lin, luomingmeng, sun-yuliang, waynepeking348 and xu282934741 as code owners April 21, 2026 06:09

JustinChengLZ force-pushed the dev/milligpu-allocation branch 2 times, most recently from e639157 to 040bb88 Compare April 21, 2026 06:16

JustinChengLZ force-pushed the dev/milligpu-allocation branch 5 times, most recently from f859476 to a31b04a Compare April 25, 2026 14:00

luomingmeng force-pushed the dev/milligpu-allocation branch 2 times, most recently from f4fb8a0 to 572ddda Compare April 28, 2026 06:25

JustinChengLZ and others added 11 commits April 30, 2026 00:50

feat: implement hugepages allocation

5164653

chore: fix build issues

56268b0

fix: update allocation for distribute evenly across numa

12ad66e

fix: error when getting reserved memory

ec14d5c

feat(qrm-plugin): support reclaimed cores vpa

fix: panic when nil

1a02098

feat: support allocation of extra resources in gpu memory resource pl…

14df430

…ugin

luomingmeng force-pushed the dev/milligpu-allocation branch 2 times, most recently from 493bb3e to 949d240 Compare May 3, 2026 13:09

JustinChengLZ added 2 commits May 3, 2026 22:34

refactor allocation logic to use strategy and simplify topology hints

6f130d6

feat: setting of env variables for weight of gpu memory and milligpu

3ec70d4

JustinChengLZ and others added 3 commits May 3, 2026 22:34

feat: support scheduler gpu allocation

976551a

refactor(gpu): improve device allocation strategy with better error h…

c5a48f7

…andling and logging

Add input validation and detailed logging for device allocation process
Include pod information in logs for better traceability

luomingmeng force-pushed the dev/milligpu-allocation branch from 949d240 to c5a48f7 Compare May 3, 2026 14:34

JustinChengLZ wants to merge 16 commits into kubewharf:main from JustinChengLZ:dev/milligpu-allocation

2 participants
