Conversation

@zjj2wry
Contributor

@zjj2wry zjj2wry commented Dec 23, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Scheduler crashes and blocks all scheduling when queue hierarchy validation fails. Invalid queue configurations (child capability > parent, sum of children's guarantee > parent's guarantee) could be created via kubectl, causing cluster-wide scheduling failures.

  • Added webhook validation - Validates hierarchical queue constraints at admission time:
    • Child capability ≤ parent capability
    • Sum of siblings/children guarantee ≤ parent guarantee
  • Changed scheduler behavior - checkHierarchicalQueue() now only logs warnings instead of returning errors, preventing cluster-wide scheduling failures
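The two admission-time constraints listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Resources` and `Queue` types and the `validateChildren` function are hypothetical simplifications, and treating a dimension the parent leaves unset as unlimited is an assumption of the sketch.

```go
package main

import "fmt"

// Resources is a simplified stand-in for a Kubernetes ResourceList:
// resource name -> quantity in base units (hypothetical type).
type Resources map[string]int64

// Queue holds only the fields this sketch needs (hypothetical type).
type Queue struct {
	Name       string
	Capability Resources
	Guarantee  Resources
}

// validateChildren checks that no child's capability exceeds the parent's
// and that the children's guarantees do not sum past the parent's guarantee.
// Assumption: a resource the parent does not set is treated as unlimited.
func validateChildren(parent Queue, children []Queue) error {
	sum := Resources{}
	for _, c := range children {
		for name, q := range c.Capability {
			if limit, ok := parent.Capability[name]; ok && q > limit {
				return fmt.Errorf("queue %s: capability %s=%d exceeds parent %s capability %d",
					c.Name, name, q, parent.Name, limit)
			}
		}
		for name, q := range c.Guarantee {
			sum[name] += q
		}
	}
	for name, total := range sum {
		if limit, ok := parent.Guarantee[name]; ok && total > limit {
			return fmt.Errorf("children of %s: guarantee %s sums to %d, exceeding parent guarantee %d",
				parent.Name, name, total, limit)
		}
	}
	return nil
}

func main() {
	parent := Queue{Name: "root", Capability: Resources{"cpu": 10}, Guarantee: Resources{"cpu": 8}}
	children := []Queue{
		{Name: "a", Capability: Resources{"cpu": 10}, Guarantee: Resources{"cpu": 5}},
		{Name: "b", Capability: Resources{"cpu": 4}, Guarantee: Resources{"cpu": 4}},
	}
	// The guarantees sum to 9, which is above the parent's 8, so an error is expected.
	fmt.Println(validateChildren(parent, children))
}
```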

Which issue(s) this PR fixes:

Fixes #4818 #4819

Special notes for your reviewer:

Does this PR introduce a user-facing change?


@volcano-sh-bot volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 23, 2025
@volcano-sh-bot
Contributor

Welcome @zjj2wry! It looks like this is your first PR to volcano-sh/volcano 🎉

@volcano-sh-bot volcano-sh-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Dec 23, 2025
@gemini-code-assist

Summary of Changes

Hello @zjj2wry, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where invalid hierarchical queue configurations could lead to scheduler crashes, blocking all scheduling operations. It introduces a two-pronged approach: first, by adding robust webhook validation to prevent such invalid configurations from being admitted in the first place, and second, by making the scheduler more resilient by changing its behavior to log warnings instead of failing when it encounters hierarchical queue validation issues. This ensures greater stability and prevents cluster-wide scheduling failures.

Highlights

  • Enhanced Queue Validation: Implemented new webhook validation rules to prevent invalid hierarchical queue configurations (e.g., child capability exceeding parent, sum of children's guarantee exceeding parent's guarantee) at admission time.
  • Improved Scheduler Resilience: Modified the scheduler's hierarchical queue validation logic to log warnings instead of returning errors, ensuring that misconfigurations do not halt the entire scheduling process.
  • Consistent Resource String Output: Updated resource string formatting to sort scalar resource names, preventing non-deterministic output that could lead to duplicate Kubernetes events.
  • Queue Event Recording: Added a new utility function to record Kubernetes events specifically for queue objects, improving observability.
  • Build Process Updates: Introduced dedicated Makefile targets for building vc-controller-manager, vc-scheduler, and vc-webhook-manager images.
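The point about sorted scalar resource names can be illustrated with a small sketch. Go randomizes map iteration order, so formatting a map directly can yield a different string on each call; sorting the keys first makes the output stable. The function name `formatScalars` and the output format are assumptions for illustration, not the formatting used in Volcano.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatScalars renders scalar resources in sorted key order so that the
// same map always produces the same string.
func formatScalars(scalars map[string]float64) string {
	names := make([]string, 0, len(scalars))
	for name := range scalars {
		names = append(names, name)
	}
	sort.Strings(names)

	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, fmt.Sprintf("%s %0.2f", name, scalars[name]))
	}
	return strings.Join(parts, ", ")
}

func main() {
	fmt.Println(formatScalars(map[string]float64{
		"nvidia.com/gpu":  2,
		"example.com/foo": 1,
	}))
}
```

A stable string matters here because event messages that differ only in key order would otherwise not be recognized as repeats of the same event.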

@zjj2wry zjj2wry changed the title from "Fix: prevent scheduling failure new" to "fix: prevent cluster-wide scheduling failure due to queue hierarchy validation" Dec 23, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a critical fix to prevent the scheduler from crashing due to invalid queue configurations. It achieves this by adding a validating webhook to reject invalid queue hierarchies at admission time and by changing the scheduler's behavior to log warnings instead of erroring out on such configurations. The changes are well-implemented and include thorough testing for the new validation logic. I have a few suggestions to improve code clarity and maintainability.
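The scheduler-side change (warn instead of fail) follows a common pattern that can be sketched as below. This is only an illustration of the idea: the `violation` type and the `warnOnViolations` and `runCycle` functions are hypothetical, and the standard `log` package stands in for the logging used by the scheduler.

```go
package main

import "log"

// violation describes one failed hierarchy check (hypothetical type).
type violation struct {
	Queue  string
	Reason string
}

// warnOnViolations logs every violation and returns how many were seen.
// It never returns an error, so a bad queue cannot abort the caller.
func warnOnViolations(vs []violation) int {
	for _, v := range vs {
		log.Printf("warning: queue %q fails hierarchy check: %s", v.Queue, v.Reason)
	}
	return len(vs)
}

// runCycle always proceeds to scheduling, regardless of violations.
func runCycle(vs []violation, schedule func()) int {
	n := warnOnViolations(vs)
	schedule()
	return n
}

func main() {
	vs := []violation{{Queue: "child-a", Reason: "capability exceeds parent"}}
	runCycle(vs, func() { log.Println("scheduling continues") })
}
```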

@guoqinwill
Contributor

guoqinwill commented Dec 27, 2025

I think we can add another function here to independently validate the queue's capability, deserved, and guarantee quotas, such as using 'ValidateResourceQuantityValue' for legality checks, for example, not allowing quota values to be negative and requiring that guarantee <= deserved <= capability.
pkg/webhooks/admission/queues/validate/validate_queue.go#L128-L301
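A per-queue check along the lines suggested here might look like the sketch below. It does not call `ValidateResourceQuantityValue`; instead it uses a hypothetical `Resources` map and hypothetical `validateQuota` and `lessEqual` helpers, and it compares only resource names that are set on both sides, which is an assumption of the sketch.

```go
package main

import "fmt"

// Resources maps a resource name to a quantity in base units (hypothetical type).
type Resources map[string]int64

// lessEqual reports an error if any resource in lo exceeds the matching
// entry in hi. Resources missing from hi are not compared (assumption).
func lessEqual(loName string, lo Resources, hiName string, hi Resources) error {
	for name, q := range lo {
		if limit, ok := hi[name]; ok && q > limit {
			return fmt.Errorf("%s.%s (%d) exceeds %s.%s (%d)", loName, name, q, hiName, name, limit)
		}
	}
	return nil
}

// validateQuota rejects negative values and enforces
// guarantee <= deserved <= capability for a single queue.
func validateQuota(guarantee, deserved, capability Resources) error {
	named := []struct {
		label string
		res   Resources
	}{{"guarantee", guarantee}, {"deserved", deserved}, {"capability", capability}}
	for _, n := range named {
		for name, q := range n.res {
			if q < 0 {
				return fmt.Errorf("%s.%s must not be negative, got %d", n.label, name, q)
			}
		}
	}
	if err := lessEqual("guarantee", guarantee, "deserved", deserved); err != nil {
		return err
	}
	if err := lessEqual("deserved", deserved, "capability", capability); err != nil {
		return err
	}
	// Also covers the case where deserved is unset for a resource.
	return lessEqual("guarantee", guarantee, "capability", capability)
}

func main() {
	// guarantee (4) is above deserved (2), so an error is expected.
	fmt.Println(validateQuota(Resources{"cpu": 4}, Resources{"cpu": 2}, Resources{"cpu": 8}))
}
```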

@hzxuzhonghu
Member

/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Dec 27, 2025
@guoqinwill
Contributor

In the capacity plugin, the root queue should have unlimited default resources, allowing tasks that exceed resource limits to be scheduled normally.

rootQueueAttr := cp.queueOpts[api.QueueID(cp.rootQueue)]
if rootQueueAttr.capability.IsEmpty() {
	rootQueueAttr.capability = cp.totalResource
}
if rootQueueAttr.deserved.IsEmpty() {
	rootQueueAttr.deserved = cp.totalResource
}
rootQueueAttr.realCapability = cp.totalResource

This can also resolve the issue you mentioned where reclaim conflicts with enqueue. At the same time, we should reject modifications or updates related to the spec of the root queue, because the root queue is automatically created by Volcano and should not restrict user job submissions based on cluster resources. There has been related discussion on this topic (see #4662 (comment)), but it seems that the author did not submit this change. I think your submission can achieve this.
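The idea of rejecting spec changes to the root queue at admission time could be sketched as follows. This is a hypothetical illustration only: the `Queue` and `QueueSpec` types and the `validateRootUpdate` function are invented for the example, and a real check would also have to account for any component that legitimately writes to the root queue's spec.

```go
package main

import (
	"fmt"
	"reflect"
)

// QueueSpec and Queue carry only the fields this sketch needs (hypothetical types).
type QueueSpec struct {
	Capability map[string]int64
	Deserved   map[string]int64
}

type Queue struct {
	Name string
	Spec QueueSpec
}

const rootQueueName = "root"

// validateRootUpdate rejects any update that changes the root queue's spec.
// Other queues pass through untouched. Note that reflect.DeepEqual treats a
// nil map and an empty map as different values.
func validateRootUpdate(oldQ, newQ Queue) error {
	if newQ.Name != rootQueueName {
		return nil
	}
	if !reflect.DeepEqual(oldQ.Spec, newQ.Spec) {
		return fmt.Errorf("spec of queue %q is managed by the system and cannot be modified", rootQueueName)
	}
	return nil
}

func main() {
	oldQ := Queue{Name: "root", Spec: QueueSpec{Capability: map[string]int64{"cpu": 10}}}
	newQ := Queue{Name: "root", Spec: QueueSpec{Capability: map[string]int64{"cpu": 5}}}
	fmt.Println(validateRootUpdate(oldQ, newQ))
}
```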

@zjj2wry
Contributor Author

zjj2wry commented Dec 28, 2025

@guoqinwill Good catch! I've implemented the independent quota validation function as you suggested.

@zjj2wry
Contributor Author

zjj2wry commented Dec 28, 2025

The scheduler framework actively updates the root queue's Spec.Deserved and Spec.Guarantee.Resource in updateRootQueueResources:

// updateRootQueueResources updates the deserved/guaranteed resource and allocated resource of the root queue
func updateRootQueueResources(ssn *Session, allocated v1.ResourceList) {
	rootQueue := api.QueueID("root")
	totalDeserved := util.ConvertRes2ResList(ssn.TotalDeserved).DeepCopy()
	totalGuarantee := util.ConvertRes2ResList(ssn.TotalGuarantee).DeepCopy()
	if equality.Semantic.DeepEqual(ssn.Queues[rootQueue].Queue.Spec.Deserved, totalDeserved) &&
		equality.Semantic.DeepEqual(ssn.Queues[rootQueue].Queue.Spec.Guarantee.Resource, totalGuarantee) &&
		equality.Semantic.DeepEqual(ssn.Queues[rootQueue].Queue.Status.Allocated, allocated) {
		klog.V(5).Infof("Root queue deserved/guaranteed resource and allocated resource remains the same, no need to update the queue.")
		return
	}
	queue := &vcv1beta1.Queue{}
	err := schedulingscheme.Scheme.Convert(ssn.Queues[rootQueue].Queue, queue, nil)
	if err != nil {
		klog.Errorf("failed to convert scheduling.Queue to v1beta1.Queue: %s", err.Error())
		return
	}
	if !equality.Semantic.DeepEqual(queue.Spec.Deserved, totalDeserved) ||
		!equality.Semantic.DeepEqual(queue.Spec.Guarantee.Resource, totalGuarantee) {
		queue.Spec.Deserved = totalDeserved
		queue.Spec.Guarantee.Resource = totalGuarantee
		queue, err = ssn.VCClient().SchedulingV1beta1().Queues().Update(context.TODO(), queue, metav1.UpdateOptions{})
		if err != nil {
			klog.Errorf("failed to update root queue: %s", err.Error())
			return
		}
	}
	if !equality.Semantic.DeepEqual(queue.Status.Allocated, allocated) {
		queue.Status.Allocated = allocated
		_, err = ssn.VCClient().SchedulingV1beta1().Queues().UpdateStatus(context.TODO(), queue, metav1.UpdateOptions{})
		if err != nil {
			klog.Errorf("failed to update root queue status: %s", err.Error())
			return
		}
	}
}

In my use case, I need:

  • Root queue during scheduling: Unlimited resources (to avoid blocking jobs)
  • Root queue quota (guarantee/deserved/capability): Set to cluster total resources (as a reference limit for child queue creation)

@zjj2wry zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch 5 times, most recently from f59933e to c249a55 Compare December 29, 2025 02:18
@guoqinwill
Contributor

The update to root when closing the session here is actually unnecessary because the sub-queues of root do not need to check guarantee, deserved, or capability. So each session performs the update, but in reality, it doesn’t have much effect.

@zjj2wry zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch from 575787d to 3b99f42 Compare December 29, 2025 02:34
@zjj2wry zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch from d41c10a to 276ef22 Compare December 29, 2025 02:42
@JesseStutler
Member

@zjj2wry Hi, thanks for your contribution. Please don't forget to clean up and squash your commits; currently there are too many commits in your PR. You can keep 1-3 clean commits.

@guoqinwill
Contributor

I raised a few points about the admission validation again this morning, but they don't seem to have been addressed yet. Please take a look at them.

Are you suggesting introducing a prohibition on root updates in this PR? The prerequisite would be to remove the root update code in volcano, but I don't know if any users are currently using information in the root queue, which could break compatibility.

No, it's the other suggestions in validate_queue.go. You can take another look at the few comments of mine that are pending.

@guoqinwill Did you submit your review comments? The contributor can't see pending review comments until you submit them.

Sorry, my mistake. I forgot to submit them.

@volcano-sh-bot volcano-sh-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 30, 2025
@zjj2wry zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch from a621cda to 5a45758 Compare December 30, 2025 07:29
@hzxuzhonghu
Member

@zjj2wry please squash

@zjj2wry
Contributor Author

zjj2wry commented Dec 30, 2025

@zjj2wry please squash

I'll squash these commits after the review. Keeping them separate now just makes reviewing easier.

@zjj2wry zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch from 5a45758 to 3b19ee0 Compare December 30, 2025 10:08
@zjj2wry zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch 5 times, most recently from 52895ac to dbcce33 Compare January 5, 2026 02:56
@zjj2wry zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch from dbcce33 to 2110b5b Compare January 5, 2026 07:58
@zhaizhicheng

Finally, someone is fixing this bug. Looking forward to it being merged into the main branch soon

@JesseStutler
Member

/approve
I'm OK with the current change for now; @guoqinwill can add more fine-grained validation in the future.
/cc @guoqinwill @hzxuzhonghu

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JesseStutler

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2026
@guoqinwill
Contributor

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2026
@volcano-sh-bot volcano-sh-bot merged commit 46f99f8 into volcano-sh:master Jan 8, 2026
22 of 23 checks passed
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Missing Validation for Child Queue Guarantee Sum Causes Cluster-Wide Scheduling Failure

6 participants