
feat(gpu_memory): support milligpu allocation logic #1131

Open
JustinChengLZ wants to merge 16 commits into kubewharf:main from JustinChengLZ:dev/milligpu-allocation


@JustinChengLZ
Collaborator

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 73.59199% with 422 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.87%. Comparing base (d8c28ea) to head (c5a48f7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ugins/gpu/resourceplugin/virtualgpu/virtual_gpu.go | 74.82% | 103 Missing and 40 partials ⚠️ |
| ...ugins/memory/dynamicpolicy/policy_hint_handlers.go | 66.06% | 40 Missing and 16 partials ⚠️ |
| pkg/agent/qrm-plugins/memory/dynamicpolicy/util.go | 22.22% | 48 Missing and 1 partial ⚠️ |
| ...g/agent/qrm-plugins/memory/dynamicpolicy/policy.go | 66.66% | 32 Missing and 9 partials ⚠️ |
| ...memory/dynamicpolicy/policy_allocation_handlers.go | 79.08% | 30 Missing and 11 partials ⚠️ |
| cmd/katalyst-agent/app/options/qrm/gpu_plugin.go | 45.94% | 19 Missing and 1 partial ⚠️ |
| pkg/agent/qrm-plugins/util/util.go | 75.40% | 11 Missing and 4 partials ⚠️ |
| ...nt/qrm-plugins/memory/dynamicpolicy/state/state.go | 52.00% | 9 Missing and 3 partials ⚠️ |
| pkg/agent/qrm-plugins/gpu/staticpolicy/policy.go | 46.66% | 7 Missing and 1 partial ⚠️ |
| ...-plugins/gpu/strategy/allocate/manager/defaults.go | 14.28% | 2 Missing and 4 partials ⚠️ |

... and 11 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1131      +/-   ##
==========================================
+ Coverage   61.71%   61.87%   +0.16%     
==========================================
  Files         786      788       +2     
  Lines       73689    74377     +688     
==========================================
+ Hits        45476    46020     +544     
- Misses      23191    23288      +97     
- Partials     5022     5069      +47     
| Flag | Coverage Δ |
|---|---|
| unittest | 61.87% <73.59%> (+0.16%) ⬆️ |

Flags with carried forward coverage won't be shown.


@JustinChengLZ JustinChengLZ force-pushed the dev/milligpu-allocation branch 5 times, most recently from f859476 to a31b04a Compare April 25, 2026 14:00
@luomingmeng luomingmeng force-pushed the dev/milligpu-allocation branch 2 times, most recently from f4fb8a0 to 572ddda Compare April 28, 2026 06:25
JustinChengLZ and others added 11 commits April 30, 2026 00:50
Add DiscoverMemoryTopology function to calculate normal memory capacity by excluding static hugepages. This enables proper memory accounting in NUMA systems with hugepages configured. The function is now used by the existing Discover method to maintain consistent memory topology information.
Update memory calculations to use NormalMemoryCapacity and NormalMemoryDetails instead of total memory capacity to exclude static hugepages. This provides more accurate resource upper bounds and per-NUMA capacity calculations for memory provisioning and headroom estimation.
Add StaticHugePagesDetails and StaticHugePagesCapacity fields to MemoryTopology
to track static hugepages allocation per NUMA node and system-wide. Update test
cases and dummy topology generation to include these new fields.
Static huge pages should not be included in the total available memory calculation as they are reserved and cannot be used for dynamic allocation. This change updates the policy to properly account for static huge pages and adds corresponding test cases to verify the behavior.
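The hugepage commits above describe deriving "normal" memory by subtracting static hugepages from each NUMA node's total. A minimal sketch of that accounting follows. The field names come from the commit messages, but the field types and the helper `buildMemoryTopology` are assumptions made for illustration and are not taken from the PR's code.

```go
package main

import "fmt"

// MemoryTopology reuses the field names from the commit messages.
// The field types are assumptions, not the PR's actual definitions.
type MemoryTopology struct {
	NormalMemoryCapacity    uint64
	NormalMemoryDetails     map[int]uint64 // NUMA ID -> bytes usable for dynamic allocation
	StaticHugePagesCapacity uint64
	StaticHugePagesDetails  map[int]uint64 // NUMA ID -> bytes reserved as static hugepages
}

// buildMemoryTopology is a hypothetical helper: for every NUMA node it
// subtracts the static hugepage reservation from the node's total memory
// and accumulates the system-wide sums.
func buildMemoryTopology(total, hugePages map[int]uint64) (*MemoryTopology, error) {
	topo := &MemoryTopology{
		NormalMemoryDetails:    make(map[int]uint64, len(total)),
		StaticHugePagesDetails: make(map[int]uint64, len(total)),
	}
	for numaID, totalBytes := range total {
		reserved := hugePages[numaID] // zero when the node has no static hugepages
		if reserved > totalBytes {
			return nil, fmt.Errorf("numa %d: hugepages %d exceed total %d", numaID, reserved, totalBytes)
		}
		normal := totalBytes - reserved
		topo.NormalMemoryDetails[numaID] = normal
		topo.StaticHugePagesDetails[numaID] = reserved
		topo.NormalMemoryCapacity += normal
		topo.StaticHugePagesCapacity += reserved
	}
	return topo, nil
}

func main() {
	const gib = uint64(1) << 30
	topo, err := buildMemoryTopology(
		map[int]uint64{0: 64 * gib, 1: 64 * gib}, // total memory per NUMA node
		map[int]uint64{0: 8 * gib},               // static hugepages only on node 0
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(topo.NormalMemoryCapacity/gib, topo.StaticHugePagesCapacity/gib)
}
```

Rejecting a reservation larger than the node total, rather than clamping to zero, is a design choice of this sketch; it avoids an unsigned underflow and surfaces inconsistent inputs early.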
…lation

Add GetMemoryTopology interface to ReadonlyState and implement memory topology support in state management. This enables more precise memory capacity calculations by considering NormalMemoryDetails while excluding hugepages. The change affects multiple state generation functions across the codebase to properly handle memory topology information.
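As a rough illustration of the accessor this commit describes, the sketch below exposes a memory topology through a read-only state interface and reads per-NUMA normal memory from it. Only the names `GetMemoryTopology`, `ReadonlyState`, and `NormalMemoryDetails` come from the commit message. The method signature, the `inMemoryState` type, and the `numaCapacity` helper are hypothetical, and the real interface has other methods that are omitted here.

```go
package main

import (
	"fmt"
	"sync"
)

// MemoryTopology is a trimmed, assumed shape; only the field name is from the commit message.
type MemoryTopology struct {
	NormalMemoryDetails map[int]uint64 // NUMA ID -> bytes excluding static hugepages
}

// ReadonlyState shows only the accessor named in the commit message.
// The signature is an assumption.
type ReadonlyState interface {
	GetMemoryTopology() *MemoryTopology
}

// inMemoryState is a hypothetical implementation used for illustration.
type inMemoryState struct {
	mu   sync.RWMutex
	topo *MemoryTopology
}

// GetMemoryTopology returns a copy so callers cannot mutate the stored state.
func (s *inMemoryState) GetMemoryTopology() *MemoryTopology {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.topo == nil {
		return nil
	}
	details := make(map[int]uint64, len(s.topo.NormalMemoryDetails))
	for id, bytes := range s.topo.NormalMemoryDetails {
		details[id] = bytes
	}
	return &MemoryTopology{NormalMemoryDetails: details}
}

// numaCapacity reads the normal (non-hugepage) capacity of one NUMA node.
func numaCapacity(state ReadonlyState, numaID int) (uint64, bool) {
	topo := state.GetMemoryTopology()
	if topo == nil {
		return 0, false
	}
	bytes, ok := topo.NormalMemoryDetails[numaID]
	return bytes, ok
}

func main() {
	state := &inMemoryState{topo: &MemoryTopology{
		NormalMemoryDetails: map[int]uint64{0: 56 << 30, 1: 64 << 30},
	}}
	bytes, ok := numaCapacity(state, 0)
	fmt.Println(bytes>>30, ok)
}
```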
feat(qrm-plugin): support reclaimed cores vpa
@luomingmeng luomingmeng force-pushed the dev/milligpu-allocation branch 2 times, most recently from 493bb3e to 949d240 Compare May 3, 2026 13:09
JustinChengLZ and others added 3 commits May 3, 2026 22:34
…cation logic

- Rename all fractional GPU references to virtual GPU for consistency
- Optimize topology hint calculation with pre-computed resource contexts
- Improve NUMA node selection logic for packing/spreading strategies
- Update environment variable names and config fields to reflect virtual GPU terminology
- Add comprehensive tests for spreading mode and edge cases
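The commit message mentions packing and spreading strategies for NUMA node selection without showing the logic. One common interpretation, sketched below purely as an assumption, is that packing prefers the feasible node with the least free capacity while spreading prefers the node with the most. The `Strategy` type, the `pickNUMANode` function, and the milli-GPU unit for the free map are all hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Strategy is a hypothetical enum for the two modes named in the commit message.
type Strategy int

const (
	Packing   Strategy = iota // fill the fullest feasible node first
	Spreading                 // use the emptiest feasible node first
)

// pickNUMANode chooses a NUMA node whose free capacity (in milli-GPU, an
// assumed unit) can hold the request. Ties are broken by the lowest node ID
// so the result is deterministic.
func pickNUMANode(free map[int]int64, request int64, s Strategy) (int, bool) {
	if request <= 0 {
		return 0, false
	}
	candidates := make([]int, 0, len(free))
	for id, f := range free {
		if f >= request {
			candidates = append(candidates, id)
		}
	}
	if len(candidates) == 0 {
		return 0, false
	}
	sort.Slice(candidates, func(i, j int) bool {
		a, b := free[candidates[i]], free[candidates[j]]
		if a != b {
			if s == Packing {
				return a < b // least free first
			}
			return a > b // most free first
		}
		return candidates[i] < candidates[j]
	})
	return candidates[0], true
}

func main() {
	free := map[int]int64{0: 300, 1: 800, 2: 500}
	packed, _ := pickNUMANode(free, 400, Packing)
	spread, _ := pickNUMANode(free, 400, Spreading)
	fmt.Println(packed, spread)
}
```

With the free capacities in `main`, node 0 cannot fit a 400 milli-GPU request, so packing selects node 2 (500 free) and spreading selects node 1 (800 free).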
…andling and logging

Add input validation and detailed logging for device allocation process
Include pod information in logs for better traceability
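To illustrate the kind of input validation and pod-aware logging this commit describes, here is a minimal sketch. The `allocationRequest` type, its fields, and the specific checks are assumptions; the PR's actual request type and validation rules are not shown in this page.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// allocationRequest is a hypothetical request shape used only for this sketch.
type allocationRequest struct {
	PodNamespace string
	PodName      string
	DeviceIDs    []string
	MilliGPU     int64
}

// validateRequest rejects malformed requests and puts the pod identity in
// every error so that log lines can be traced back to a pod.
func validateRequest(req *allocationRequest) error {
	if req == nil {
		return errors.New("nil allocation request")
	}
	if req.PodNamespace == "" || req.PodName == "" {
		return errors.New("pod namespace and name are required")
	}
	pod := req.PodNamespace + "/" + req.PodName
	if req.MilliGPU <= 0 {
		return fmt.Errorf("pod %s: requested milligpu must be positive, got %d", pod, req.MilliGPU)
	}
	if len(req.DeviceIDs) == 0 {
		return fmt.Errorf("pod %s: no candidate devices", pod)
	}
	seen := make(map[string]struct{}, len(req.DeviceIDs))
	for _, id := range req.DeviceIDs {
		if id == "" {
			return fmt.Errorf("pod %s: empty device ID", pod)
		}
		if _, dup := seen[id]; dup {
			return fmt.Errorf("pod %s: duplicate device ID %q", pod, id)
		}
		seen[id] = struct{}{}
	}
	return nil
}

func main() {
	req := &allocationRequest{
		PodNamespace: "default",
		PodName:      "train-0",
		DeviceIDs:    []string{"gpu-0", "gpu-1"},
		MilliGPU:     500,
	}
	if err := validateRequest(req); err != nil {
		log.Printf("allocation rejected: %v", err)
		return
	}
	log.Printf("allocating %dm GPU for pod %s/%s on candidates %v",
		req.MilliGPU, req.PodNamespace, req.PodName, req.DeviceIDs)
}
```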
@luomingmeng luomingmeng force-pushed the dev/milligpu-allocation branch from 949d240 to c5a48f7 Compare May 3, 2026 14:34


2 participants