capacity: compute queue overused metric from final share instead of preemptiveFn#5051
capacity: compute queue overused metric from final share instead of preemptiveFn#5051aadhil2k4 wants to merge 1 commit intovolcano-sh:masterfrom
Conversation
Moves UpdateQueuOverUsed from AddPreemptiveFn to OnSessionClose. Computes queue overused based on final queue share instead of task level preemption result. No scheduling behavior changed. Signed-off-by: Aadhil Ahamed <aadhil2k4@gmail.com>
Summary of ChangesHello @aadhil2k4, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a bug in the capacity scheduler plugin where the queue overuse metric was being incorrectly calculated and updated. The metric's update mechanism has been refactored to ensure it accurately reflects a queue's resource consumption relative to its deserved share, providing a more reliable indicator of system state. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
cc: @hajnalmt Please review this PR and let me know if the approach is correct? |
There was a problem hiding this comment.
Pull request overview
This PR fixes how the queue overused metric is computed in the capacity plugin. Previously, the metric was updated inside AddPreemptiveFn, which is called per-task-per-queue during preemption evaluation, causing the metric to be overwritten multiple times within a single scheduling cycle. The metric also used the wrong semantic (!isPreemptive representing "cannot reclaim" rather than "overused").
The fix moves the metric update to OnSessionClose, where it's computed once per queue at the end of each scheduling session, and changes the logic to use attr.share > 1, which correctly represents whether a queue is using more than its deserved resources.
Changes:
- Removed metric update from
AddPreemptiveFnto prevent multiple overwrites per scheduling cycle - Added metric update in
OnSessionCloseto compute final overused state based onattr.share > 1 - Changed semantic from task-level "cannot reclaim" to queue-level "exceeds deserved resources"
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for _, attr := range cp.queueOpts { | ||
| overused := attr.share > 1 | ||
| metrics.UpdateQueueOverused(attr.name, overused) | ||
| } |
There was a problem hiding this comment.
When hierarchical queues are enabled, ancestor queues have their allocated resources updated in AllocateFunc/DeallocateFunc (lines 368 and 385) but their share is never updated via updateShare(). This means ancestor queues will have outdated share values when the overused metric is computed here. The metric will not accurately reflect whether ancestor queues are overused.
To fix this, after updating ancestor allocated resources in the AllocateFunc and DeallocateFunc event handlers, you should also call cp.updateShare(ancestorAttr) for each ancestor queue.
There was a problem hiding this comment.
This is valid but it's a separate issue I think and it's not a too important bug as we are not really intrested in ancestor queue shares only the leaf ones during scheduling.
There was a problem hiding this comment.
Code Review
This pull request correctly fixes a bug in how the queue_overused metric is calculated for the capacity plugin. Previously, the metric was updated inside AddPreemptiveFn on a per-task basis, leading to incorrect and overwritten values. The logic also incorrectly inferred overuse from preemption eligibility. By moving the metric update to OnSessionClose and calculating overuse based on the final queue share (attr.share > 1), the change ensures the metric accurately reflects whether a queue has exceeded its deserved resources at the end of a scheduling cycle. The implementation is clean and addresses the issue effectively.
hajnalmt
left a comment
There was a problem hiding this comment.
/lgtm
/priority low
/area scheduling
Thanks!
| for _, attr := range cp.queueOpts { | ||
| overused := attr.share > 1 | ||
| metrics.UpdateQueueOverused(attr.name, overused) | ||
| } |
There was a problem hiding this comment.
This is valid but it's a separate issue I think and it's not a too important bug as we are not really intrested in ancestor queue shares only the leaf ones during scheduling.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: hajnalmt The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
What type of PR is this?
bug fix
What this PR does / why we need it:
The queue overused metric (UpdateQueueOverused) was previously updated inside AddPreemptiveFn, which runs per-task-per-queue during preemption. This caused the metric to be overwritten multiple times within a single scheduling cycle and reflect task-level preemption eligibility rather than actual queue overuse.
Additionally, the previous logic derived the metric from
!isPreemptive, which semantically represented "cannot reclaim" instead of whether the queue exceeded its deserved resources.Which issue(s) this PR fixes:
Fixes issue: 5048 - bug2
Special notes for your reviewer:
Does this PR introduce a user-facing change?