feat(gpu): add static policy implementation for GPU resource management #877
base: main
Conversation
Codecov Report

❌ The patch check failed because the patch coverage (1.45%) is below the target coverage (50.00%). You can increase the patch coverage or adjust the target coverage.

```
@@            Coverage Diff             @@
##             main     #877      +/-   ##
==========================================
- Coverage   58.61%   57.60%   -1.01%
==========================================
  Files         678      689      +11
  Lines       76623    78089    +1466
==========================================
+ Hits        44909    44983      +74
- Misses      27385    28770    +1385
- Partials     4329     4336       +7
```
This commit introduces a new static policy implementation for GPU resource management, including:
- GPU topology provider and state management
- Static policy implementation with allocation and deallocation logic
- Integration with existing QRM framework
- Metrics and health checks for GPU resource management
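The allocation and deallocation logic listed above can be sketched in miniature. This is a hypothetical, simplified illustration of what a "static" GPU policy's state management might look like; the real StaticPolicy in this PR integrates with the QRM framework, and all names here (`GPUState`, `Allocate`, `Deallocate`) are illustrative assumptions, not the PR's actual API.

```go
package main

import "fmt"

// GPUState tracks which GPUs are statically assigned to which pod.
// This is a sketch; the real state object in the PR also carries
// NUMA and topology information.
type GPUState struct {
	allocations map[string][]string // podUID -> allocated GPU device IDs
	free        []string            // unallocated GPU device IDs
}

func NewGPUState(devices []string) *GPUState {
	return &GPUState{
		allocations: make(map[string][]string),
		free:        append([]string(nil), devices...),
	}
}

// Allocate statically assigns n GPUs to a pod, failing if too few remain.
func (s *GPUState) Allocate(podUID string, n int) ([]string, error) {
	if n > len(s.free) {
		return nil, fmt.Errorf("need %d GPUs, only %d free", n, len(s.free))
	}
	granted := s.free[:n]
	s.free = s.free[n:]
	s.allocations[podUID] = granted
	return granted, nil
}

// Deallocate returns a pod's GPUs to the free pool.
func (s *GPUState) Deallocate(podUID string) {
	s.free = append(s.free, s.allocations[podUID]...)
	delete(s.allocations, podUID)
}

func main() {
	st := NewGPUState([]string{"gpu-0", "gpu-1", "gpu-2"})
	granted, _ := st.Allocate("pod-a", 2)
	fmt.Println(granted) // [gpu-0 gpu-1]
	st.Deallocate("pod-a")
	fmt.Println(len(st.free)) // 3
}
```

A static policy in this style makes assignments deterministic and reversible, which is what enables the health checks and metrics the commit mentions.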
… or numa zone node
- Update GPU memory type from uint64 to float64 for precise allocation
- Implement NUMA-aware GPU topology management and allocation
- Add support for associated device allocation and topology hints
- Introduce new GPU topology provider with NUMA node tracking
- Extend GPU state management with NUMA node information
- Add utility functions for GPU memory hint generation and NUMA calculations
The preferredHintIndexes variable was declared but never used in the code. Removing it improves code clarity and maintainability.
Add new functionality to handle associated device allocation requests in the resource plugin stub. This includes:
- Adding new stub function type and default implementation
- Extending the Stub struct with new field
- Adding new methods for associated device operations
…icy structs

Add pluginapi.UnimplementedResourcePluginServer to all policy structs to ensure forward compatibility with gRPC interface changes.
Implement GetAssociatedDeviceTopologyHints method in StaticPolicy and update stub to handle topology hints requests. Also update kubelet dependency version and rename mustIncludeDevices to reusableDevices for clarity.
… allocated memory

Add tracking of allocated GPU memory per NUMA node and modify hint calculation to prefer nodes with most allocated memory. This helps balance GPU memory usage across NUMA nodes.
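The preference rule described above can be sketched as a small scoring function: among candidate NUMA nodes, pick the one with the most GPU memory already allocated, packing work onto busy nodes and leaving others free. The function name and signature are hypothetical; note the float64 memory type, matching the uint64-to-float64 change in an earlier commit.

```go
package main

import "fmt"

// preferredNUMANode returns the candidate node with the highest allocated
// GPU memory; ties break toward the lower node ID for determinism.
// This is an illustrative sketch, not the PR's actual hint calculation.
func preferredNUMANode(allocated map[int]float64, candidates []int) int {
	best := -1
	var bestMem float64
	for _, node := range candidates {
		mem := allocated[node]
		if best == -1 || mem > bestMem || (mem == bestMem && node < best) {
			best, bestMem = node, mem
		}
	}
	return best
}

func main() {
	// Node 1 already holds the most allocated GPU memory (in GiB here),
	// so new requests are hinted toward it.
	allocated := map[int]float64{0: 16.0, 1: 48.5, 2: 0}
	fmt.Println(preferredNUMANode(allocated, []int{0, 1, 2})) // 1
}
```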
```go
PreStartRequired:      false,
WithTopologyAlignment: true,
NeedReconcile:         false,
AssociatedDevices:     p.resourceNames.List(),
```
Since this is referring to associated devices, should we name resourceNames differently? Maybe associatedDeviceNames? As it stands, resourceNames can be misunderstood as gpu_memory, gpu_core, etc.
What type of PR is this?
Features
What this PR does / why we need it:
This PR introduces a new static policy implementation for GPU resource management, including:
- GPU topology provider and state management
- Static policy implementation with allocation and deallocation logic
- Integration with existing QRM framework
- Metrics and health checks for GPU resource management
Which issue(s) this PR fixes:
Special notes for your reviewer: