fix: prevent cluster-wide scheduling failure due to queue hierarchy validation #4864
Welcome @zjj2wry! It looks like this is your first PR to volcano-sh/volcano 🎉 |
Summary of Changes
Hello @zjj2wry, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a critical issue where invalid hierarchical queue configurations could lead to scheduler crashes, blocking all scheduling operations. It introduces a two-pronged approach: first, it adds webhook validation to prevent such invalid configurations from being admitted in the first place; second, it makes the scheduler more resilient by logging warnings instead of failing when it encounters hierarchical queue validation issues. Together these prevent cluster-wide scheduling failures.
Code Review
This pull request introduces a critical fix to prevent the scheduler from crashing due to invalid queue configurations. It achieves this by adding a validating webhook to reject invalid queue hierarchies at admission time and by changing the scheduler's behavior to log warnings instead of erroring out on such configurations. The changes are well-implemented and include thorough testing for the new validation logic. I have a few suggestions to improve code clarity and maintainability.
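The difference between erroring out and logging a warning can be sketched as follows. This is only an illustration, not the code from this PR: the `Queue` type, `checkQueue`, `OpenSessionStrict`, and `OpenSessionLenient` are hypothetical names, capability is reduced to a single number, and the standard `log` package stands in for whatever logger the scheduler uses.

```go
package main

import (
	"fmt"
	"log"
)

// Queue is a hypothetical, simplified queue with a single scalar capability.
type Queue struct {
	Name             string
	Capability       int64
	ParentCapability int64
}

// checkQueue reports an error when a child's capability exceeds its parent's.
func checkQueue(q Queue) error {
	if q.Capability > q.ParentCapability {
		return fmt.Errorf("queue %q: capability %d exceeds parent capability %d",
			q.Name, q.Capability, q.ParentCapability)
	}
	return nil
}

// OpenSessionStrict aborts on the first invalid queue, so one bad queue
// stops the whole scheduling cycle.
func OpenSessionStrict(queues []Queue) error {
	for _, q := range queues {
		if err := checkQueue(q); err != nil {
			return err
		}
	}
	return nil
}

// OpenSessionLenient logs every problem through warn and keeps going.
// It returns the number of warnings that were emitted.
func OpenSessionLenient(queues []Queue, warn func(format string, args ...interface{})) int {
	warnings := 0
	for _, q := range queues {
		if err := checkQueue(q); err != nil {
			warn("hierarchy validation failed, continuing: %v", err)
			warnings++
		}
	}
	return warnings
}

func main() {
	queues := []Queue{
		{Name: "team-a", Capability: 8, ParentCapability: 16},
		{Name: "team-b", Capability: 32, ParentCapability: 16},
	}
	fmt.Println("strict:", OpenSessionStrict(queues))
	fmt.Println("lenient warnings:", OpenSessionLenient(queues, log.Printf))
}
```

In the strict variant a single misconfigured queue blocks every other queue; in the lenient variant the problem is surfaced in the logs while scheduling continues.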
|
I think we can add another function here that independently validates the queue's capability, deserved, and guarantee quotas. For example, it could use 'ValidateResourceQuantityValue' for legality checks, reject negative quota values, and require that guarantee <= deserved <= capability.
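A minimal sketch of such an independent check is shown below. It is not the function added in this PR: `Resources`, `QueueQuota`, and `ValidateQuota` are hypothetical names, quantities are plain integers rather than Kubernetes resource quantities, and a resource missing from the upper bound is assumed to be unbounded.

```go
package main

import "fmt"

// Resources is a simplified stand-in for a resource list:
// resource name -> quantity in base units (hypothetical type).
type Resources map[string]int64

// QueueQuota holds the three quota fields of a queue.
type QueueQuota struct {
	Guarantee  Resources
	Deserved   Resources
	Capability Resources
}

// validateNonNegative rejects any negative quantity in a resource list.
func validateNonNegative(field string, r Resources) error {
	for name, v := range r {
		if v < 0 {
			return fmt.Errorf("%s[%s] must not be negative, got %d", field, name, v)
		}
	}
	return nil
}

// validateLessEqual checks lower[name] <= upper[name] for every resource in
// lower. A resource absent from upper is treated as unbounded (assumption).
func validateLessEqual(lowerField string, lower Resources, upperField string, upper Resources) error {
	for name, lo := range lower {
		if hi, ok := upper[name]; ok && lo > hi {
			return fmt.Errorf("%s[%s]=%d exceeds %s[%s]=%d",
				lowerField, name, lo, upperField, name, hi)
		}
	}
	return nil
}

// ValidateQuota enforces non-negative values and
// guarantee <= deserved <= capability.
func ValidateQuota(q QueueQuota) error {
	fields := []struct {
		name string
		res  Resources
	}{
		{"guarantee", q.Guarantee},
		{"deserved", q.Deserved},
		{"capability", q.Capability},
	}
	for _, f := range fields {
		if err := validateNonNegative(f.name, f.res); err != nil {
			return err
		}
	}
	if err := validateLessEqual("guarantee", q.Guarantee, "deserved", q.Deserved); err != nil {
		return err
	}
	if err := validateLessEqual("deserved", q.Deserved, "capability", q.Capability); err != nil {
		return err
	}
	// Also compare guarantee with capability directly, in case deserved
	// does not list a resource that the other two do.
	return validateLessEqual("guarantee", q.Guarantee, "capability", q.Capability)
}

func main() {
	q := QueueQuota{
		Guarantee:  Resources{"cpu": 4},
		Deserved:   Resources{"cpu": 2},
		Capability: Resources{"cpu": 8},
	}
	fmt.Println(ValidateQuota(q))
}
```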
|
/ok-to-test |
|
In the capacity plugin, the root queue should have unlimited default resources, allowing tasks that exceed resource limits to be scheduled normally.
volcano/pkg/scheduler/plugins/capacity/capacity.go, lines 599 to 606 in 2090e3b
This can also resolve the issue you mentioned where reclaim conflicts with enqueue. At the same time, we should reject modifications or updates related to the spec of the root queue, because the root queue is automatically created by Volcano and should not restrict user job submissions based on cluster resources. There has been related discussion on this topic (see #4662 (comment)), but it seems that the author did not submit this change. I think your submission can achieve this. |
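Rejecting spec changes to the root queue at admission time could look roughly like this. It is a sketch under stated assumptions, not Volcano's webhook code: the `Queue` and `QueueSpec` types are simplified, `ValidateRootQueueUpdate` is a hypothetical name, and the root queue is assumed to be named `root`.

```go
package main

import (
	"fmt"
	"reflect"
)

// QueueSpec is a simplified stand-in for the queue spec (hypothetical type).
type QueueSpec struct {
	Capability map[string]int64
	Deserved   map[string]int64
	Guarantee  map[string]int64
}

// Queue pairs a name with its spec.
type Queue struct {
	Name string
	Spec QueueSpec
}

// rootQueueName is assumed here; adjust if the root queue is named differently.
const rootQueueName = "root"

// ValidateRootQueueUpdate rejects any update that changes the spec of the
// root queue and lets every other update through.
func ValidateRootQueueUpdate(oldQ, newQ Queue) error {
	if newQ.Name != rootQueueName {
		return nil
	}
	// Note: reflect.DeepEqual treats a nil map and an empty map as different.
	if !reflect.DeepEqual(oldQ.Spec, newQ.Spec) {
		return fmt.Errorf("spec of queue %q is managed automatically and cannot be modified", rootQueueName)
	}
	return nil
}

func main() {
	oldRoot := Queue{Name: rootQueueName}
	newRoot := Queue{
		Name: rootQueueName,
		Spec: QueueSpec{Capability: map[string]int64{"cpu": 8}},
	}
	fmt.Println(ValidateRootQueueUpdate(oldRoot, newRoot))
}
```

Any component that legitimately writes the root queue's spec would need to be exempted from such a check or stop writing it.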
@guoqinwill Good catch! I've implemented the independent quota validation function as you suggested. |
The scheduler framework actively updates the root queue's
volcano/pkg/scheduler/framework/session.go, lines 546 to 585 in 3275f8a
In my use case, I need:
|
Force-pushed: f59933e to c249a55
The update to root when closing the session here is actually unnecessary because the sub-queues of root do not need to check guarantee, deserved, or capability. So each session performs the update, but in reality, it doesn’t have much effect. |
Force-pushed: 575787d to 3b99f42
Force-pushed: d41c10a to 276ef22
|
@zjj2wry Hi, thanks for your contribution. Please don't forget to clean up and squash your commits; currently there are too many commits in your PR. You can keep 1-3 clean commits.
Sorry, my problem, I forgot to submit it. |
Force-pushed: a621cda to 5a45758
|
@zjj2wry please squash |
I'll squash these commits after the review. Keeping them separate now just makes reviewing easier. |
Force-pushed: 5a45758 to 3b19ee0
Force-pushed: 52895ac to dbcce33
…alidation Signed-off-by: zhengjiajin <[email protected]>
Force-pushed: dbcce33 to 2110b5b
|
Finally, someone is fixing this bug. Looking forward to it being merged into the main branch soon |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: JesseStutler
|
/lgtm |
What type of PR is this?
/kind bug
What this PR does / why we need it:
Scheduler crashes and blocks all scheduling when queue hierarchy validation fails. Invalid queue configurations (child capability > parent, sum of children's guarantee > parent's guarantee) could be created via kubectl, causing cluster-wide scheduling failures.
Which issue(s) this PR fixes:
Fixes #4818 #4819
Special notes for your reviewer:
Does this PR introduce a user-facing change?