capacity: compute queue overused metric from final share instead of preemptiveFn #5051

Open
aadhil2k4 wants to merge 1 commit into volcano-sh:master from aadhil2k4:capacityPlugin

Conversation

@aadhil2k4

What type of PR is this?

bug fix

What this PR does / why we need it:

The queue overused metric (UpdateQueueOverused) was previously updated inside AddPreemptiveFn, which runs per-task-per-queue during preemption. This caused the metric to be overwritten multiple times within a single scheduling cycle and reflect task-level preemption eligibility rather than actual queue overuse.

Additionally, the previous logic derived the metric from !isPreemptive, which semantically represented "cannot reclaim" instead of whether the queue exceeded its deserved resources.
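The overwrite problem described above can be illustrated with a minimal sketch. It assumes the metric behaves like a per-queue gauge where each write replaces the previous value; the type `overusedGauge` and the helpers `recordPerTask` and `recordAtSessionClose` are hypothetical names for illustration, not Volcano APIs.

```go
package main

import "fmt"

// overusedGauge is a hypothetical stand-in for a per-queue gauge metric:
// every write replaces the previous value for that queue.
type overusedGauge struct {
	values map[string]bool
	writes map[string]int
}

func newGauge() *overusedGauge {
	return &overusedGauge{values: map[string]bool{}, writes: map[string]int{}}
}

func (g *overusedGauge) set(queue string, overused bool) {
	g.values[queue] = overused
	g.writes[queue]++
}

// recordPerTask mimics the old pattern: one write per preemption check,
// using the negated per-task result as the metric value.
func recordPerTask(g *overusedGauge, queue string, isPreemptive []bool) {
	for _, p := range isPreemptive {
		g.set(queue, !p)
	}
}

// recordAtSessionClose mimics the new pattern: one write per queue,
// derived from the queue's final share.
func recordAtSessionClose(g *overusedGauge, shares map[string]float64) {
	for queue, share := range shares {
		g.set(queue, share > 1)
	}
}

func main() {
	old := newGauge()
	recordPerTask(old, "q1", []bool{true, false, true})
	fmt.Println("per-task:", old.values["q1"], "writes:", old.writes["q1"])

	fixed := newGauge()
	recordAtSessionClose(fixed, map[string]float64{"q1": 1.5})
	fmt.Println("session close:", fixed.values["q1"], "writes:", fixed.writes["q1"])
}
```

In the per-task variant the final value depends only on whichever task was evaluated last, regardless of how much the queue actually uses.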

Which issue(s) this PR fixes:

Fixes issue: 5048 - bug2

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Moves UpdateQueueOverused from AddPreemptiveFn to OnSessionClose.
Computes queue overused based on final queue share instead of the task-level preemption result.
No scheduling behavior changed.

Signed-off-by: Aadhil Ahamed <aadhil2k4@gmail.com>
Copilot AI review requested due to automatic review settings February 21, 2026 06:50
@gemini-code-assist

Summary of Changes

Hello @aadhil2k4, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a bug in the capacity scheduler plugin where the queue overuse metric was being incorrectly calculated and updated. The metric's update mechanism has been refactored to ensure it accurately reflects a queue's resource consumption relative to its deserved share, providing a more reliable indicator of system state.

Highlights

  • Metric Update Relocation: The UpdateQueueOverused metric call has been moved from the AddPreemptiveFn function to the OnSessionClose function. This ensures the metric is updated once per queue at the end of a scheduling session, rather than multiple times per task during preemption.
  • Overused Metric Logic Change: The logic for determining if a queue is 'overused' has been updated. Previously, it was derived from !isPreemptive, which indicated that a task could not reclaim resources. Now, it is calculated as attr.share > 1, directly reflecting whether the queue's final share exceeds its deserved resources.
  • Bug Fix: This change addresses a bug where the queue overused metric was being overwritten multiple times within a single scheduling cycle, leading to inaccurate reporting of queue overuse.
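As a rough sketch of the queue-level check, the example below evaluates share > 1 once per queue. It assumes share is the largest allocated-to-deserved ratio across resource dimensions, which is an assumption about how the plugin derives share and is not confirmed in this thread. The types `resources` and `queueAttr` and the helpers `dominantShare` and `overusedByQueue` are hypothetical simplifications.

```go
package main

import "fmt"

// resources maps a resource name (for example "cpu") to a quantity.
type resources map[string]float64

// queueAttr is a simplified, hypothetical stand-in for per-queue state.
type queueAttr struct {
	name      string
	allocated resources
	deserved  resources
	share     float64
}

// dominantShare returns the largest allocated/deserved ratio over the
// deserved dimensions. Simplification: dimensions with a zero deserved
// amount are skipped rather than treated specially.
func dominantShare(allocated, deserved resources) float64 {
	share := 0.0
	for name, d := range deserved {
		if d == 0 {
			continue
		}
		if s := allocated[name] / d; s > share {
			share = s
		}
	}
	return share
}

// overusedByQueue evaluates the predicate share > 1 once per queue.
func overusedByQueue(queues []*queueAttr) map[string]bool {
	out := make(map[string]bool, len(queues))
	for _, q := range queues {
		q.share = dominantShare(q.allocated, q.deserved)
		out[q.name] = q.share > 1
	}
	return out
}

func main() {
	queues := []*queueAttr{
		{name: "over", allocated: resources{"cpu": 6}, deserved: resources{"cpu": 4}},
		{name: "exact", allocated: resources{"cpu": 4}, deserved: resources{"cpu": 4}},
	}
	fmt.Println(overusedByQueue(queues))
}
```

Because the comparison is strict, a queue that uses exactly its deserved amount (share equal to 1) is not reported as overused in this sketch.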


Changelog
  • pkg/scheduler/plugins/capacity/capacity.go
    • Removed the overused calculation and metrics.UpdateQueueOverused call from the AddPreemptiveFn function.
    • Added a loop within the OnSessionClose function to iterate through all queues and update their overused metric.
    • Modified the overused metric calculation to use attr.share > 1 for a more accurate representation of queue overuse.
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@volcano-sh-bot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) on Feb 21, 2026
@aadhil2k4
Author

cc: @hajnalmt Please review this PR and let me know whether the approach is correct.

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes how the queue overused metric is computed in the capacity plugin. Previously, the metric was updated inside AddPreemptiveFn, which is called per-task-per-queue during preemption evaluation, causing the metric to be overwritten multiple times within a single scheduling cycle. The metric also used the wrong semantic (!isPreemptive representing "cannot reclaim" rather than "overused").

The fix moves the metric update to OnSessionClose, where it's computed once per queue at the end of each scheduling session, and changes the logic to use attr.share > 1, which correctly represents whether a queue is using more than its deserved resources.

Changes:

  • Removed metric update from AddPreemptiveFn to prevent multiple overwrites per scheduling cycle
  • Added metric update in OnSessionClose to compute final overused state based on attr.share > 1
  • Changed semantic from task-level "cannot reclaim" to queue-level "exceeds deserved resources"

Comment on lines +396 to +399
for _, attr := range cp.queueOpts {
	overused := attr.share > 1
	metrics.UpdateQueueOverused(attr.name, overused)
}

Copilot AI Feb 21, 2026

When hierarchical queues are enabled, ancestor queues have their allocated resources updated in AllocateFunc/DeallocateFunc (lines 368 and 385) but their share is never updated via updateShare(). This means ancestor queues will have outdated share values when the overused metric is computed here. The metric will not accurately reflect whether ancestor queues are overused.

To fix this, after updating ancestor allocated resources in the AllocateFunc and DeallocateFunc event handlers, you should also call cp.updateShare(ancestorAttr) for each ancestor queue.
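The stale-share concern raised in this comment can be sketched as follows. This is a single-resource simplification under stated assumptions: `queueNode`, `updateShare`, and `allocate` are hypothetical names, and the real event handlers and the plugin's updateShare may differ in detail.

```go
package main

import "fmt"

// queueNode is a hypothetical, single-resource simplification of a queue
// in a hierarchy.
type queueNode struct {
	name      string
	parent    *queueNode
	allocated float64
	deserved  float64
	share     float64
}

// updateShare recomputes share from the current allocation.
func updateShare(q *queueNode) {
	if q.deserved > 0 {
		q.share = q.allocated / q.deserved
	} else {
		q.share = 0
	}
}

// allocate adds amount to the leaf and to every ancestor. When
// refreshAncestors is false, only the leaf's share is recomputed, which is
// the situation the comment describes.
func allocate(leaf *queueNode, amount float64, refreshAncestors bool) {
	leaf.allocated += amount
	updateShare(leaf)
	for a := leaf.parent; a != nil; a = a.parent {
		a.allocated += amount
		if refreshAncestors {
			updateShare(a)
		}
	}
}

func main() {
	root := &queueNode{name: "root", deserved: 8}
	leaf := &queueNode{name: "leaf", parent: root, deserved: 4}

	allocate(leaf, 10, false)
	fmt.Println("stale root share:", root.share, "overused:", root.share > 1)

	updateShare(root)
	fmt.Println("fresh root share:", root.share, "overused:", root.share > 1)
}
```

With the refresh skipped, the ancestor's allocation grows while its share stays at the old value, so a share > 1 check on it would report the wrong state.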

Contributor

@hajnalmt hajnalmt Feb 21, 2026

This is valid, but I think it's a separate issue, and it's not a very important bug, since during scheduling we are only really interested in the shares of leaf queues, not ancestor queues.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes a bug in how the queue_overused metric is calculated for the capacity plugin. Previously, the metric was updated inside AddPreemptiveFn on a per-task basis, leading to incorrect and overwritten values. The logic also incorrectly inferred overuse from preemption eligibility. By moving the metric update to OnSessionClose and calculating overuse based on the final queue share (attr.share > 1), the change ensures the metric accurately reflects whether a queue has exceeded its deserved resources at the end of a scheduling cycle. The implementation is clean and addresses the issue effectively.

Contributor

@hajnalmt hajnalmt left a comment

/lgtm
/priority low
/area scheduling

Thanks!

@volcano-sh-bot added the priority/low, area/scheduling, and lgtm (indicates that a PR is ready to be merged) labels on Feb 21, 2026
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hajnalmt
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
