fix(capacity): ensure allocated resources remain above deserved after eviction #4936
Conversation
Summary of Changes: Hello @guoqinwill, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a fix to the capacity scheduler plugin, resolving a cyclic eviction issue that could cause significant scheduling instability. By refining the resource reclamation logic, the system now prevents scenarios where pods are repeatedly evicted and rescheduled due to immediate over-reclamation of resources. This ensures more stable and predictable resource allocation, improving the overall reliability of the scheduler.
Code Review
This pull request provides a solid fix for a cyclic eviction issue within the capacity plugin. The problem, where a queue becomes "starving" post-eviction and gets the same pod rescheduled, is addressed by preventing evictions that would drop a queue's allocation below its deserved resources. The change in ReclaimableFn using exceptReclaimee.LessPartly(attr.deserved, api.Zero) is correct and directly targets the root cause. The PR is well-documented, and the fix should improve scheduling stability. I have one minor suggestion to improve comment clarity.
Nice catch, I hadn't realized this yet. This issue seems valid.
I am refactoring this part of the code now, and we will conflict:
#4919
Can you help me with a review there first? It solves issues as bad as this one.
I am also not sure that this is the right way to solve the issue. I think the pod should be evicted in this case, and we should fix allocate somehow so that if the workload would push the queue above deserved, we prioritize others. I am still thinking about it, but it's hard to fit this into our QueueOrderFn...
If we go with this one-liner, these kinds of pods will never be evicted from the queue, even though they are using resources above deserved... Maybe we should create an issue first and brainstorm in it?
This can help with this too:
Yes. I have looked at your PR, and I understand that our points of modification don't really conflict much. Your PR mainly addresses the issue that there should be an intersection between the reclaimer and the deserved resources (it's just that the modification boundary here changes from guarantee to deserved): volcano/pkg/scheduler/plugins/capacity/capacity.go, lines 139 to 142 in 34ae225
```go
// Skip reclaim in two cases:
// 1. Current allocated <= deserved (queue not over-quota yet)
// 2. Evicting would cause allocated < deserved in any dimension (prevent cyclic eviction)
if allocated.LessEqual(attr.deserved, api.Infinity) || exceptReclaimee.LessPartly(attr.deserved, api.Zero) {
```
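For context, here is a minimal, self-contained sketch of what this condition is meant to express, using a plain map-based resource model instead of volcano's api.Resource. The names lessEqualAll and lessPartly are hypothetical stand-ins for LessEqual/LessPartly, so treat this as an illustration of the intent rather than the plugin's actual code:

```go
package main

import "fmt"

// res is a simplified stand-in for api.Resource: resource name -> quantity.
type res map[string]float64

// lessEqualAll reports whether l <= r in every dimension
// (rough analogue of LessEqual(..., api.Infinity)).
func lessEqualAll(l, r res) bool {
	for name, v := range l {
		if v > r[name] {
			return false
		}
	}
	return true
}

// lessPartly reports whether l < r in at least one dimension
// (rough analogue of LessPartly(..., api.Zero)).
func lessPartly(l, r res) bool {
	for name, v := range r {
		if l[name] < v {
			return true
		}
	}
	return false
}

func main() {
	deserved := res{"cpu": 3000}  // Queue1 deserved = 3c
	allocated := res{"cpu": 4000} // Queue1 currently allocated = 4c (over quota)
	reclaimee := res{"cpu": 4000} // candidate victim pod1 requests 4c
	exceptReclaimee := res{"cpu": allocated["cpu"] - reclaimee["cpu"]} // 0c left after eviction

	// Mirrors the shape of the check above: skip reclaiming when the queue is not
	// over quota, or when evicting this task would push it below deserved.
	skip := lessEqualAll(allocated, deserved) || lessPartly(exceptReclaimee, deserved)
	fmt.Println("skip reclaiming pod1:", skip) // true: eviction would leave 0c < 3c deserved
}
```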
I'm curious about the current implementation, but this will block pod3 from reclaiming from pod1, won't it?
@guoqinwill Please also help review #4919; I think we should unify our efforts to solve all of these issues.
I think we should create an issue for this first. Can you create it, @guoqinwill (since you found the bug, it might be better)? Or shall I create one?
Let me break my point down:
My modifications mainly deal with the boundary issue of the reclaimee. If the current reclaimee being evicted causes the allocated amount of the queue to fall below the deserved value, I understand that it indeed should not be evicted in the first place, because we shouldn’t make other queues fall below their deserved value just to allow the reclaimer to reach its queue’s deserved value.
I understand your change, but I still think we should evict 😄 As I wrote, we should evict, and we should allow the queue to fall below its deserved, since it was consuming above it.
If I take your example:
Example:
Queue1: deserved=3c, has a running pod1 requesting 4c
Queue2: deserved=5c, has a running pod2 requesting 1c, and a pending pod3 requesting 4c
The Cycle:
Reclaim phase: Queue2's pod3 triggers eviction of Queue1's pod1 (since Queue1 is over quota: 4c > 3c)
Post-eviction: Queue1's allocated becomes 0, share = 0/3 = 0
Allocate phase: Queue1 gets higher priority (share=0 < Queue2's share=0.2)
Scheduling: pod1 gets rescheduled to Queue1, allocated = 4c again
Repeat: Queue2's pending pod triggers eviction again → cycle repeats
This is a nice problem description! It is a real problem, so let's continue:
Root Cause
The issue occurs because the ReclaimableFn only checks if the current allocated > deserved, but doesn't verify whether eviction would cause allocated to drop below deserved. This allows over-reclaim, making the queue immediately "starving" and eligible for high-priority scheduling.
This is where I disagree: this is not entirely true. You can attack this problem at several other points in the cycle, not just at this comparison, and I actually think this is not the right point to do it.
This change won't let queues that are consuming above their deserved be reclaimed (which is not right). I think pod1 in Queue1 is an absolutely valid victim candidate.
For example, the problem can be attacked in the reclaim process itself:
- If the reclaim process didn't evict pods whose freed resources are never actually allocated to the reclaimer, this issue wouldn't occur (see my comment: #4936 (comment)). In this case, pod1 in Queue1 in your example is still a valid victim candidate, but since pod3 wouldn't be allocated, the reclaim process should find another victim candidate to fulfill the reclaimer's (pod3's) request. -> This is an actual issue behind other bugs too, not just this one; reclaim currently does not know what allocate does in the end. We should find a way to simulate allocate in reclaim (a rough sketch of this idea is at the end of this comment).
I still have a problem with the latter approach: mainly, I think that between the two pods (pod1 and pod3), pod3 should be scheduled instead of pod1, as Queue2 would stay below deserved if it gets scheduled. I understand that since we are deciding based on attr.share directly, and on that alone, this is not possible now.
- Maybe we should introduce Job and Task ordering in the capacity plugin that actually considers the tasks in the job in queue.attr.share, so that pod3 can somehow be scheduled instead of pod1.
I am wondering whether this is feasible somehow. Shall we continue on Slack or something?
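Purely to make the "simulate allocate in reclaim" idea concrete, here is a very rough sketch under a simplified single-dimension model; qstate and wouldReclaimerWin are hypothetical names that don't exist in volcano, and the real allocate phase is of course much more involved:

```go
package main

import "fmt"

// qstate is a hypothetical, single-dimension view of a queue's capacity attributes.
type qstate struct {
	name      string
	deserved  float64 // cores
	allocated float64 // cores
}

func (q qstate) share() float64 { return q.allocated / q.deserved }

// wouldReclaimerWin simulates a tiny slice of the next allocate phase's queue
// ordering after a hypothetical eviction: evicting only makes sense if the
// reclaimer's queue would then be ordered before the victim's queue; otherwise
// the victim is simply rescheduled and we get the cycle described above.
func wouldReclaimerWin(reclaimerQ, victimQ qstate, victimReq float64) bool {
	victimAfter := victimQ
	victimAfter.allocated -= victimReq
	return reclaimerQ.share() < victimAfter.share()
}

func main() {
	q1 := qstate{name: "Queue1", deserved: 3, allocated: 4} // running pod1 = 4c
	q2 := qstate{name: "Queue2", deserved: 5, allocated: 1} // running pod2 = 1c, pending pod3 = 4c

	// After evicting pod1, Queue1's share would be 0/3 = 0 while Queue2's is 0.2,
	// so Queue1 wins the next allocate phase again -> evicting pod1 doesn't help pod3.
	fmt.Println("eviction of pod1 helps the reclaimer:", wouldReclaimerWin(q2, q1, 4)) // false
}
```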
If you have some time please review these first:
I think the root cause of this PR's issue is that the reclaimed pod1 is prioritized to be scheduled again in the next session due to its queue's priority. Alternatively, we could consider giving the reclaimer a better scheduling opportunity (higher priority) in the next session, or reducing the scheduling opportunity (lower priority) of the reclaimee in the next session. I have asked @guoqinwill to also help review #4919, but she is likely not on Slack, @hajnalmt. BTW, thanks mate for these detailed concerns.
Yes! I am wondering if it makes sense to do
I understand that backfill itself performs scheduling, mainly handling best-effort pods, so it doesn't need to consider the remaining resources of the cluster. However, after reclaim triggers pod eviction in the current session, the evicted pods cannot be deleted immediately, meaning the node resources they use are not released right away (this may require a long wait), so they cannot be allocated directly.
Yes, you are right, this was a silly idea. We shouldn't overcommit node resources without explicit oversubscription enabled. Do we currently have any mechanism that prioritizes
In fact, we have considered other approaches to this issue, but unfortunately there is currently no better solution. Although the current modification will prevent the reclamation of boundary pods that cross the queue's deserved value, it helps prevent more problems, right? Reclaiming is inherently a lazy, best-effort action. From our current perspective, allowing the reclamation of these boundary pods carries a greater cost than not reclaiming them.
Firstly, if we allow the reclamation of these boundary pods, it could result in the target queue's allocated value being less than its deserved value. When new pods arrive in this queue, they can still reclaim from other queues. If that is the case, why not just keep the previously reclaimed pods?
Secondly, there is the issue I mentioned regarding cyclic eviction and reclamation. I have already raised an issue; you can check #4947, and we can discuss it further there. In the scenario of reclaiming boundary pods, a conservative approach is more appropriate than an aggressive one.
Thank you for raising the issue! SubBug 2 is a great catch too. Edit: This is not as bad as I thought in the first place, as
It feels like we are introducing a bug here (at least for me) and solving another one that may never come up for me, or if it does, it will resolve itself after some time as jobs finish on the cluster.
I agree with this, but I think this solution is somewhat aggressively conservative.
Because with that pod we were above deserved, so we should have reclaimed it.
Another idea is that we could fine-tune this to be less conservative: at the very end of the reclaim process, after line 155, we check your comparison. This would require extracting the queue comparison logic from the embedded QueueOrderFn into a reusable function (e.g.,
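As a hedged sketch of that extraction: queueAttr and compareQueueShare below are hypothetical names, and ordering purely by share is a simplification of what the embedded QueueOrderFn actually does, but the shape could be roughly this:

```go
package main

import "fmt"

// queueAttr is a hypothetical, trimmed-down view of the per-queue attributes;
// only the share is needed to illustrate the ordering.
type queueAttr struct {
	name  string
	share float64
}

// compareQueueShare is the hypothetical reusable helper: negative means l is
// ordered before r, positive means after, zero means equal. QueueOrderFn and
// the final step of the reclaim pass could both call it, so reclaim would use
// the same ordering that allocate applies in the next session.
func compareQueueShare(l, r queueAttr) int {
	switch {
	case l.share < r.share:
		return -1
	case l.share > r.share:
		return 1
	default:
		return 0
	}
}

func main() {
	reclaimerQueue := queueAttr{name: "Queue2", share: 0.2}
	victimQueueAfterEviction := queueAttr{name: "Queue1", share: 0.0}

	// A positive result means the reclaimer's queue would NOT come first in the
	// next allocate phase, so the planned eviction is unlikely to stick.
	fmt.Println(compareQueueShare(reclaimerQueue, victimQueueAfterEviction)) // 1
}
```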
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a cyclic eviction issue in the capacity plugin where pods could be repeatedly evicted and rescheduled in a loop, causing scheduling instability.
Which issue(s) this PR fixes:
Fixes #4947, subBug 1:
The capacity plugin could enter a cyclic eviction loop in the following scenario:
Example:
Queue1: deserved=3c, has a running pod1 requesting 4c
Queue2: deserved=5c, has a running pod2 requesting 1c, and a pending pod3 requesting 4c
The Cycle:
Reclaim phase: Queue2's pod3 triggers eviction of Queue1's pod1 (since Queue1 is over quota: 4c > 3c)
Post-eviction: Queue1's allocated becomes 0, share = 0/3 = 0
Allocate phase: Queue1 gets higher priority (share=0 < Queue2's share=0.2)
Scheduling: pod1 gets rescheduled to Queue1, allocated = 4c again
Repeat: Queue2's pending pod triggers eviction again → cycle repeats
Root Cause
The issue occurs because the ReclaimableFn only checks if the current allocated > deserved, but doesn't verify whether eviction would cause allocated to drop below deserved. This allows over-reclaim, making the queue immediately "starving" and eligible for high-priority scheduling.
Solution
Add a check in the capacity plugin's ReclaimableFn to prevent eviction if it would cause the queue's allocated resources to fall below its deserved quota in any resource dimension.
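For illustration only, here is a small self-contained sketch (single CPU dimension, hypothetical types; not the plugin's actual code) that replays the example above and shows how the added check breaks the cycle:

```go
package main

import "fmt"

// queue is a simplified single-dimension model of a capacity queue.
type queue struct {
	name      string
	deserved  float64 // cores
	allocated float64 // cores
}

// share mirrors the idea behind queue ordering: lower allocated/deserved
// means the queue is considered first in the allocate phase.
func (q queue) share() float64 { return q.allocated / q.deserved }

func main() {
	q1 := queue{name: "Queue1", deserved: 3, allocated: 4} // running pod1 = 4c, over quota
	q2 := queue{name: "Queue2", deserved: 5, allocated: 1} // running pod2 = 1c, pending pod3 = 4c
	pod1 := 4.0

	// Without the fix: pod1 is evicted, Queue1's share drops to 0 and it wins the
	// next allocate phase, so pod1 is rescheduled -> the cycle repeats.
	allocatedAfterEviction := q1.allocated - pod1
	fmt.Printf("without fix: Queue1 share=%.2f, Queue2 share=%.2f\n",
		allocatedAfterEviction/q1.deserved, q2.share()) // 0.00 < 0.20 -> Queue1 goes first again

	// With the fix: the eviction is rejected up front, because it would leave
	// Queue1 below its deserved quota (0c < 3c) in the CPU dimension.
	evictionRejected := allocatedAfterEviction < q1.deserved
	fmt.Println("with fix: eviction of pod1 rejected:", evictionRejected) // true
}
```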
Special notes for your reviewer:
Does this PR introduce a user-facing change?