fix(capacity): ensure allocated resources remain above deserved after eviction #4936
Conversation
Summary of Changes: Hello @guoqinwill, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a fix to the capacity scheduler plugin, resolving a cyclic eviction issue that could cause significant scheduling instability. By refining the resource reclamation logic, the system now prevents scenarios where pods are repeatedly evicted and rescheduled due to immediate over-reclamation of resources. This ensures more stable and predictable resource allocation, improving the overall reliability of the scheduler.
Code Review
This pull request provides a solid fix for a cyclic eviction issue within the capacity plugin. The problem, where a queue becomes "starving" post-eviction and gets the same pod rescheduled, is addressed by preventing evictions that would drop a queue's allocation below its deserved resources. The change in ReclaimableFn using exceptReclaimee.LessPartly(attr.deserved, api.Zero) is correct and directly targets the root cause. The PR is well-documented, and the fix should improve scheduling stability. I have one minor suggestion to improve comment clarity.
Nice catch, I hadn't realized this yet. This issue seems valid.
I am refactoring this part of the code now, and we will conflict:
#4919
Can you help me with a review there first? It solves issues as bad as this one.
I am also not sure that this is the right way to solve the issue. I think the pod should be evicted in this case, and we should fix allocate somehow so that if the workload would push the queue above deserved, we prioritize others. I am still thinking about it, but it's hard to fit this into our QueueOrderFn...
If we go with this one-liner, these kinds of pods will never be evicted from the queue, even though they are using resources above deserved... Maybe we should create an issue first and brainstorm in it?
This can help with this too:
Yes. I have looked at your PR, and I understand that our points of modification don't really conflict much. Your PR mainly addresses the issue that there should be an intersection between the reclaimer and the deserved resources (it's just that the modification boundary here changes from guarantee to deserved): volcano/pkg/scheduler/plugins/capacity/capacity.go, lines 139 to 142 in 34ae225
```go
// Skip reclaim in two cases:
// 1. Current allocated <= deserved (queue not over-quota yet)
// 2. Evicting would cause allocated < deserved in any dimension (prevent cyclic eviction)
if allocated.LessEqual(attr.deserved, api.Infinity) || exceptReclaimee.LessPartly(attr.deserved, api.Zero) {
```
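For context, here is a minimal, self-contained sketch of what this condition is meant to express, using a plain map-based resource model instead of volcano's api.Resource. The names lessEqualAll and lessPartly are hypothetical stand-ins for LessEqual/LessPartly, so treat this as an illustration of the intent rather than the plugin's actual code:

```go
package main

import "fmt"

// res is a simplified stand-in for api.Resource: resource name -> quantity.
type res map[string]float64

// lessEqualAll reports whether l <= r in every dimension
// (rough analogue of LessEqual(..., api.Infinity)).
func lessEqualAll(l, r res) bool {
	for name, v := range l {
		if v > r[name] {
			return false
		}
	}
	return true
}

// lessPartly reports whether l < r in at least one dimension
// (rough analogue of LessPartly(..., api.Zero)).
func lessPartly(l, r res) bool {
	for name, v := range r {
		if l[name] < v {
			return true
		}
	}
	return false
}

func main() {
	deserved := res{"cpu": 3000}  // Queue1 deserved = 3c
	allocated := res{"cpu": 4000} // Queue1 currently allocated = 4c (over quota)
	reclaimee := res{"cpu": 4000} // candidate victim pod1 requests 4c
	exceptReclaimee := res{"cpu": allocated["cpu"] - reclaimee["cpu"]} // 0c left after eviction

	// Mirrors the shape of the check above: skip reclaiming when the queue is not
	// over quota, or when evicting this task would push it below deserved.
	skip := lessEqualAll(allocated, deserved) || lessPartly(exceptReclaimee, deserved)
	fmt.Println("skip reclaiming pod1:", skip) // true: eviction would leave 0c < 3c deserved
}
```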
I'm curious about the current implementation, but this will block pod3 from reclaiming from pod1, won't it?
@guoqinwill Please also help review #4919; I think we should unify our efforts to solve all of these issues.
I think we should create an issue for this first. Can you create it, @guoqinwill (since you found the bug, it might be better)? Or shall I create one?
Let me break my point down:
My modifications mainly deal with the boundary issue of the reclaimee. If the current reclaimee being evicted causes the allocated amount of the queue to fall below the deserved value, I understand that it indeed should not be evicted in the first place, because we shouldn’t make other queues fall below their deserved value just to allow the reclaimer to reach its queue’s deserved value.
I understand your change, but I still think we should evict 😄 As I wrote, we should evict, and we should allow the queue to fall below its deserved, since it was consuming above it.
If I take your example:
Example:
Queue1: deserved=3c, has a running pod1 requesting 4c
Queue2: deserved=5c, has a running pod2 requesting 1c, and a pending pod3 requesting 4c
The Cycle:
Reclaim phase: Queue2's pod3 triggers eviction of Queue1's pod1 (since Queue1 is over quota: 4c > 3c)
Post-eviction: Queue1's allocated becomes 0, share = 0/3 = 0
Allocate phase: Queue1 gets higher priority (share=0 < Queue2's share=0.2)
Scheduling: pod1 gets rescheduled to Queue1, allocated = 4c again
Repeat: Queue2's pending pod triggers eviction again → cycle repeats
This is a nice problem description! It is a real problem, so let's continue:
Root Cause
The issue occurs because the ReclaimableFn only checks if the current allocated > deserved, but doesn't verify whether eviction would cause allocated to drop below deserved. This allows over-reclaim, making the queue immediately "starving" and eligible for high-priority scheduling.
This is where I disagree: this is not entirely true. You can attack this problem at several other points in the cycle, not just at this comparison, and I actually think this is not the right point to do it.
This change won't let queues that are consuming above their deserved be reclaimed (which is not right). I think pod1 in Queue1 is an absolutely valid victim candidate.
For example, the problem can be attacked in the reclaim process itself:
- If the reclaim process didn't evict pods whose freed resources are never actually allocated to the reclaimer, this issue wouldn't occur (see my comment: #4936 (comment)). In this case, pod1 in Queue1 in your example is still a valid victim candidate, but since pod3 wouldn't be allocated, the reclaim process should find another victim candidate to fulfill the reclaimer's (pod3's) request. -> This is an actual issue behind other bugs too, not just this one; reclaim currently does not know what allocate does in the end. We should find a way to simulate allocate in reclaim (a rough sketch of this idea is at the end of this comment).
I still have a problem with the latter approach: mainly, I think that between the two pods (pod1 and pod3), pod3 should be scheduled instead of pod1, as Queue2 would stay below deserved if it gets scheduled. I understand that since we are deciding based on attr.share directly, and on that alone, this is not possible now.
- Maybe we should introduce Job and Task ordering in the capacity plugin that actually considers the tasks in the job in queue.attr.share, so that pod3 can somehow be scheduled instead of pod1.
I am wondering whether this is feasible somehow. Shall we continue on Slack or something?
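Purely to make the "simulate allocate in reclaim" idea concrete, here is a very rough sketch under a simplified single-dimension model; qstate and wouldReclaimerWin are hypothetical names that don't exist in volcano, and the real allocate phase is of course much more involved:

```go
package main

import "fmt"

// qstate is a hypothetical, single-dimension view of a queue's capacity attributes.
type qstate struct {
	name      string
	deserved  float64 // cores
	allocated float64 // cores
}

func (q qstate) share() float64 { return q.allocated / q.deserved }

// wouldReclaimerWin simulates a tiny slice of the next allocate phase's queue
// ordering after a hypothetical eviction: evicting only makes sense if the
// reclaimer's queue would then be ordered before the victim's queue; otherwise
// the victim is simply rescheduled and we get the cycle described above.
func wouldReclaimerWin(reclaimerQ, victimQ qstate, victimReq float64) bool {
	victimAfter := victimQ
	victimAfter.allocated -= victimReq
	return reclaimerQ.share() < victimAfter.share()
}

func main() {
	q1 := qstate{name: "Queue1", deserved: 3, allocated: 4} // running pod1 = 4c
	q2 := qstate{name: "Queue2", deserved: 5, allocated: 1} // running pod2 = 1c, pending pod3 = 4c

	// After evicting pod1, Queue1's share would be 0/3 = 0 while Queue2's is 0.2,
	// so Queue1 wins the next allocate phase again -> evicting pod1 doesn't help pod3.
	fmt.Println("eviction of pod1 helps the reclaimer:", wouldReclaimerWin(q2, q1, 4)) // false
}
```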
If you have some time please review these first:
I think the root cause of this PR's issue is that the reclaimed pod1 is prioritized to be scheduled again in the next session due to its queue's priority. Alternatively, we could consider giving the reclaimer a better scheduling opportunity (higher priority) in the next session, or reducing the scheduling opportunity (lower priority) of the reclaimee in the next session. I have asked @guoqinwill to also help review #4919, but she is likely not on Slack, @hajnalmt. BTW, thanks mate for these detailed concerns.
Yes! I am wondering if it makes sense to do
I understand that backfill itself performs scheduling, mainly handling best-effort pods, so it doesn't need to consider the remaining resources of the cluster. However, after reclaim triggers pod eviction in the current session, the evicted pods cannot be deleted immediately, meaning the node resources they use are not released right away (this may require a long wait), so they cannot be allocated directly.
Yes, you are right, this was a silly idea. We shouldn't overcommit node resources without explicit oversubscription enabled. Do we currently have any mechanism that prioritizes
In fact, we have considered other approaches to this issue, but unfortunately there is currently no better solution. Although the current modification will prevent the reclamation of boundary pods that cross the queue's deserved value, it helps prevent more problems, right? Reclaiming is inherently a lazy, best-effort action. From our current perspective, allowing the reclamation of these boundary pods carries a greater cost than not reclaiming them.
Firstly, if we allow the reclamation of these boundary pods, it could result in the target queue's allocated value being less than its deserved value. When new pods arrive in this queue, they can still reclaim from other queues. If that is the case, why not just keep the previously reclaimed pods?
Secondly, there is the issue I mentioned regarding cyclic eviction and reclamation. I have already raised an issue; you can check #4947, and we can discuss it further there. In the scenario of reclaiming boundary pods, a conservative approach is more appropriate than an aggressive one.
Thank you for raising the issue! SubBug 2 is a great catch too. Edit: This is not as bad as I thought in the first place, as
It feels like we are introducing a bug here (at least for me) and solving another one that may never come up for me, or if it does, it will resolve itself after some time as jobs finish on the cluster.
I agree with this, but I think this solution is somewhat aggressively conservative.
Because with that pod we were above deserved, so we should have reclaimed it.
Another idea is that we could fine-tune this to be less conservative: at the very end of the reclaim process, after line 155, we check your comparison. This would require extracting the queue comparison logic from the embedded QueueOrderFn into a reusable function (e.g.,
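As a hedged sketch of that extraction: queueAttr and compareQueueShare below are hypothetical names, and ordering purely by share is a simplification of what the embedded QueueOrderFn actually does, but the shape could be roughly this:

```go
package main

import "fmt"

// queueAttr is a hypothetical, trimmed-down view of the per-queue attributes;
// only the share is needed to illustrate the ordering.
type queueAttr struct {
	name  string
	share float64
}

// compareQueueShare is the hypothetical reusable helper: negative means l is
// ordered before r, positive means after, zero means equal. QueueOrderFn and
// the final step of the reclaim pass could both call it, so reclaim would use
// the same ordering that allocate applies in the next session.
func compareQueueShare(l, r queueAttr) int {
	switch {
	case l.share < r.share:
		return -1
	case l.share > r.share:
		return 1
	default:
		return 0
	}
}

func main() {
	reclaimerQueue := queueAttr{name: "Queue2", share: 0.2}
	victimQueueAfterEviction := queueAttr{name: "Queue1", share: 0.0}

	// A positive result means the reclaimer's queue would NOT come first in the
	// next allocate phase, so the planned eviction is unlikely to stick.
	fmt.Println(compareQueueShare(reclaimerQueue, victimQueueAfterEviction)) // 1
}
```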
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a cyclic eviction issue in the capacity plugin where pods could be repeatedly evicted and rescheduled in a loop, causing scheduling instability.
Which issue(s) this PR fixes:
Fixes #4947, subBug 1:
The capacity plugin could enter a cyclic eviction loop in the following scenario:
Example:
Queue1: deserved=3c, has a running pod1 requesting 4c
Queue2: deserved=5c, has a running pod2 requesting 1c, and a pending pod3 requesting 4c
The Cycle:
Reclaim phase: Queue2's pod3 triggers eviction of Queue1's pod1 (since Queue1 is over quota: 4c > 3c)
Post-eviction: Queue1's allocated becomes 0, share = 0/3 = 0
Allocate phase: Queue1 gets higher priority (share=0 < Queue2's share=0.2)
Scheduling: pod1 gets rescheduled to Queue1, allocated = 4c again
Repeat: Queue2's pending pod triggers eviction again → cycle repeats
Root Cause
The issue occurs because the ReclaimableFn only checks if the current allocated > deserved, but doesn't verify whether eviction would cause allocated to drop below deserved. This allows over-reclaim, making the queue immediately "starving" and eligible for high-priority scheduling.
Solution
Add a check in the capacity plugin's ReclaimableFn to prevent eviction if it would cause the queue's allocated resources to fall below its deserved quota in any resource dimension.
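For illustration only, here is a small self-contained sketch (single CPU dimension, hypothetical types; not the plugin's actual code) that replays the example above and shows how the added check breaks the cycle:

```go
package main

import "fmt"

// queue is a simplified single-dimension model of a capacity queue.
type queue struct {
	name      string
	deserved  float64 // cores
	allocated float64 // cores
}

// share mirrors the idea behind queue ordering: lower allocated/deserved
// means the queue is considered first in the allocate phase.
func (q queue) share() float64 { return q.allocated / q.deserved }

func main() {
	q1 := queue{name: "Queue1", deserved: 3, allocated: 4} // running pod1 = 4c, over quota
	q2 := queue{name: "Queue2", deserved: 5, allocated: 1} // running pod2 = 1c, pending pod3 = 4c
	pod1 := 4.0

	// Without the fix: pod1 is evicted, Queue1's share drops to 0 and it wins the
	// next allocate phase, so pod1 is rescheduled -> the cycle repeats.
	allocatedAfterEviction := q1.allocated - pod1
	fmt.Printf("without fix: Queue1 share=%.2f, Queue2 share=%.2f\n",
		allocatedAfterEviction/q1.deserved, q2.share()) // 0.00 < 0.20 -> Queue1 goes first again

	// With the fix: the eviction is rejected up front, because it would leave
	// Queue1 below its deserved quota (0c < 3c) in the CPU dimension.
	evictionRejected := allocatedAfterEviction < q1.deserved
	fmt.Println("with fix: eviction of pod1 rejected:", evictionRejected) // true
}
```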
Special notes for your reviewer:
Does this PR introduce a user-facing change?