fix: prevent cluster-wide scheduling failure due to queue hierarchy validation #4864

zjj2wry · 2025-12-23T14:01:10Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Scheduler crashes and blocks all scheduling when queue hierarchy validation fails. Invalid queue configurations (child capability > parent, sum of children's guarantee > parent's guarantee) could be created via kubectl, causing cluster-wide scheduling failures.

- Added webhook validation: validates hierarchical queue constraints at admission time:
  - Child capability ≤ parent capability
  - Sum of siblings/children guarantee ≤ parent guarantee
- Changed scheduler behavior: checkHierarchicalQueue() now only logs warnings instead of returning errors, preventing cluster-wide scheduling failures
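
The two admission-time constraints listed above can be illustrated with a minimal sketch. This is not the code from this PR: the `Resources` map, the `Queue` struct, and both function names are hypothetical simplifications, and the treatment of a resource missing from the parent (unlimited for capability, zero for guarantee) is an assumption made for the example.

```go
package main

import "fmt"

// Resources is a simplified stand-in for a Kubernetes ResourceList:
// resource name -> quantity in base units (assumption for this sketch).
type Resources map[string]int64

// Queue is a hypothetical, trimmed-down view of a queue spec.
type Queue struct {
	Name       string
	Capability Resources
	Guarantee  Resources
}

// validateChildCapability checks child capability <= parent capability.
// A resource the parent does not list is treated as unlimited (assumption).
func validateChildCapability(parent, child Queue) error {
	for name, childQty := range child.Capability {
		parentQty, limited := parent.Capability[name]
		if limited && childQty > parentQty {
			return fmt.Errorf("queue %s: capability %s=%d exceeds parent %s capability %d",
				child.Name, name, childQty, parent.Name, parentQty)
		}
	}
	return nil
}

// validateGuaranteeSum checks sum(children guarantee) <= parent guarantee.
// A resource the parent does not list counts as zero (assumption).
func validateGuaranteeSum(parent Queue, children []Queue) error {
	sum := Resources{}
	for _, c := range children {
		for name, qty := range c.Guarantee {
			sum[name] += qty
		}
	}
	for name, total := range sum {
		if total > parent.Guarantee[name] {
			return fmt.Errorf("children of %s: total guarantee %s=%d exceeds parent guarantee %d",
				parent.Name, name, total, parent.Guarantee[name])
		}
	}
	return nil
}

func main() {
	parent := Queue{Name: "root", Capability: Resources{"cpu": 10}, Guarantee: Resources{"cpu": 8}}
	a := Queue{Name: "a", Capability: Resources{"cpu": 4}, Guarantee: Resources{"cpu": 3}}
	b := Queue{Name: "b", Capability: Resources{"cpu": 12}, Guarantee: Resources{"cpu": 6}}

	fmt.Println(validateChildCapability(parent, a))
	fmt.Println(validateChildCapability(parent, b))
	fmt.Println(validateGuaranteeSum(parent, []Queue{a, b}))
}
```

In a real webhook these checks would run against the stored parent and sibling queues when a queue is created or updated, so that an invalid hierarchy is rejected before the scheduler ever sees it.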

Which issue(s) this PR fixes:

Fixes #4818 #4819

Special notes for your reviewer:

Does this PR introduce a user-facing change?

volcano-sh-bot · 2025-12-23T14:01:22Z

Welcome @zjj2wry! It looks like this is your first PR to volcano-sh/volcano 🎉

gemini-code-assist · 2025-12-23T14:01:32Z

Summary of Changes

Hello @zjj2wry, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where invalid hierarchical queue configurations could lead to scheduler crashes, blocking all scheduling operations. It introduces a two-pronged approach: first, by adding robust webhook validation to prevent such invalid configurations from being admitted in the first place, and second, by making the scheduler more resilient by changing its behavior to log warnings instead of failing when it encounters hierarchical queue validation issues. This ensures greater stability and prevents cluster-wide scheduling failures.
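
The second prong, warning instead of failing, can be sketched as follows. This is an illustrative contrast only, not the scheduler's actual code: `violation`, `findViolations`, `strictOpen`, and `lenientOpen` are hypothetical names, and capability is reduced to a single number per queue.

```go
package main

import (
	"fmt"
	"log"
)

// violation describes one hierarchy problem found for a queue.
type violation struct {
	Queue  string
	Reason string
}

// findViolations reports every child whose capability exceeds its parent's.
// capability maps queue name -> a single quantity; parentOf maps child -> parent.
func findViolations(capability map[string]int64, parentOf map[string]string) []violation {
	var out []violation
	for child, parent := range parentOf {
		if capability[child] > capability[parent] {
			out = append(out, violation{
				Queue: child,
				Reason: fmt.Sprintf("capability %d exceeds parent %q capability %d",
					capability[child], parent, capability[parent]),
			})
		}
	}
	return out
}

// strictOpen models the old behavior: any violation aborts the whole session.
func strictOpen(vs []violation) error {
	if len(vs) > 0 {
		return fmt.Errorf("%d invalid queue(s), first: %s: %s", len(vs), vs[0].Queue, vs[0].Reason)
	}
	return nil
}

// lenientOpen models the new behavior: log each problem and keep scheduling.
func lenientOpen(vs []violation) error {
	for _, v := range vs {
		log.Printf("warning: queue %s: %s", v.Queue, v.Reason)
	}
	return nil
}

func main() {
	capability := map[string]int64{"root": 10, "a": 12, "b": 5}
	parentOf := map[string]string{"a": "root", "b": "root"}
	vs := findViolations(capability, parentOf)

	fmt.Println("strict:", strictOpen(vs))
	fmt.Println("lenient:", lenientOpen(vs))
}
```

With the strict variant a single misconfigured queue stops scheduling for every queue; with the lenient variant the problem is only surfaced in logs, which is why the admission-time check matters as the primary defense.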

Highlights

- Enhanced Queue Validation: Implemented new webhook validation rules to prevent invalid hierarchical queue configurations (e.g., child capability exceeding parent, sum of children's guarantee exceeding parent's guarantee) at admission time.
- Improved Scheduler Resilience: Modified the scheduler's hierarchical queue validation logic to log warnings instead of returning errors, ensuring that misconfigurations do not halt the entire scheduling process.
- Consistent Resource String Output: Updated resource string formatting to sort scalar resource names, preventing non-deterministic output that could lead to duplicate Kubernetes events.
- Queue Event Recording: Added a new utility function to record Kubernetes events specifically for queue objects, improving observability.
- Build Process Updates: Introduced dedicated Makefile targets for building vc-controller-manager, vc-scheduler, and vc-webhook-manager images.
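
The point about sorted scalar resource names follows from Go's randomized map iteration order: formatting a map directly can yield a different string on each call, and two different strings for the same resources look like two different event messages. A minimal sketch of the sorting approach, using a hypothetical `resourceString` helper and an assumed output format rather than the project's real one:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// resourceString formats scalar resources in sorted name order, so equal
// inputs always produce the same string regardless of map iteration order.
func resourceString(scalars map[string]float64) string {
	names := make([]string, 0, len(scalars))
	for name := range scalars {
		names = append(names, name)
	}
	sort.Strings(names)

	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, fmt.Sprintf("%s %.2f", name, scalars[name]))
	}
	return strings.Join(parts, ", ")
}

func main() {
	scalars := map[string]float64{
		"nvidia.com/gpu":  2,
		"example.com/foo": 1,
	}
	fmt.Println(resourceString(scalars))
}
```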

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a critical fix to prevent the scheduler from crashing due to invalid queue configurations. It achieves this by adding a validating webhook to reject invalid queue hierarchies at admission time and by changing the scheduler's behavior to log warnings instead of erroring out on such configurations. The changes are well-implemented and include thorough testing for the new validation logic. I have a few suggestions to improve code clarity and maintainability.

Makefile

pkg/scheduler/plugins/capacity/capacity.go

pkg/webhooks/admission/queues/validate/validate_queue.go

guoqinwill · 2025-12-27T07:08:44Z

I think we can add another function here to independently validate the queue's capability, deserved, and guarantee quotas, such as using 'ValidateResourceQuantityValue' for legality checks, for example, not allowing quota values to be negative and requiring that guarantee <= deserved <= capability.
pkg/webhooks/admission/queues/validate/validate_queue.go#L128-L301
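
A rough sketch of the per-queue legality check suggested here, independent of any parent or sibling queue. It uses only the standard library; `Resources`, `checkNonNegative`, `checkNotAbove`, and `validateQuota` are hypothetical names, and treating a resource that is missing from the upper bound as unbounded is an assumption of this example.

```go
package main

import "fmt"

// Resources is a simplified stand-in for a Kubernetes ResourceList.
type Resources map[string]int64

// checkNonNegative reports every negative quantity in one quota field.
func checkNonNegative(field string, res Resources) []error {
	var errs []error
	for name, qty := range res {
		if qty < 0 {
			errs = append(errs, fmt.Errorf("%s[%s] must not be negative, got %d", field, name, qty))
		}
	}
	return errs
}

// checkNotAbove reports resources where low exceeds high. A resource that is
// missing from high is treated as unbounded (assumption for this sketch).
func checkNotAbove(lowField, highField string, low, high Resources) []error {
	var errs []error
	for name, l := range low {
		if h, ok := high[name]; ok && l > h {
			errs = append(errs, fmt.Errorf("%s[%s]=%d exceeds %s[%s]=%d",
				lowField, name, l, highField, name, h))
		}
	}
	return errs
}

// validateQuota enforces non-negative values and guarantee <= deserved <= capability.
func validateQuota(guarantee, deserved, capability Resources) []error {
	var errs []error
	errs = append(errs, checkNonNegative("guarantee", guarantee)...)
	errs = append(errs, checkNonNegative("deserved", deserved)...)
	errs = append(errs, checkNonNegative("capability", capability)...)
	errs = append(errs, checkNotAbove("guarantee", "deserved", guarantee, deserved)...)
	errs = append(errs, checkNotAbove("deserved", "capability", deserved, capability)...)
	// Checked directly as well, because deserved may be unset for a resource.
	errs = append(errs, checkNotAbove("guarantee", "capability", guarantee, capability)...)
	return errs
}

func main() {
	errs := validateQuota(Resources{"cpu": 5}, Resources{"cpu": 4}, Resources{"cpu": 8})
	for _, err := range errs {
		fmt.Println(err)
	}
}
```

Returning all errors rather than the first one lets the admission response tell the user about every problem in a single round trip.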

hzxuzhonghu · 2025-12-27T08:50:00Z

/ok-to-test

guoqinwill · 2025-12-27T08:51:15Z

In the capacity plugin, the root queue should have unlimited default resources, allowing tasks that exceed resource limits to be scheduled normally.

volcano/pkg/scheduler/plugins/capacity/capacity.go

Lines 599 to 606 in 2090e3b

    
    rootQueueAttr := cp.queueOpts[api.QueueID(cp.rootQueue)]
    if rootQueueAttr.capability.IsEmpty() {
    	rootQueueAttr.capability = cp.totalResource
    }
    if rootQueueAttr.deserved.IsEmpty() {
    	rootQueueAttr.deserved = cp.totalResource
    }
    rootQueueAttr.realCapability = cp.totalResource

This can also resolve the issue you mentioned where reclaim conflicts with enqueue. At the same time, we should reject modifications or updates related to the spec of the root queue, because the root queue is automatically created by Volcano and should not restrict user job submissions based on cluster resources. There has been related discussion on this topic (see #4662 (comment)), but it seems that the author did not submit this change. I think your submission can achieve this.
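
The suggestion to reject changes to the root queue's spec could look roughly like the sketch below. This illustrates the idea only and is not part of the PR: `QueueSpec`, `rootQueueName`, and `validateRootUpdate` are hypothetical, and the spec is reduced to three plain maps.

```go
package main

import (
	"fmt"
	"reflect"
)

// QueueSpec is a hypothetical, trimmed-down queue spec.
type QueueSpec struct {
	Capability map[string]int64
	Deserved   map[string]int64
	Guarantee  map[string]int64
}

// rootQueueName is the name assumed for the automatically created root queue.
const rootQueueName = "root"

// validateRootUpdate rejects any spec change on the root queue and allows
// everything else. Note that reflect.DeepEqual treats a nil map and an empty
// map as different, so callers may want to normalize specs first.
func validateRootUpdate(name string, oldSpec, newSpec QueueSpec) error {
	if name != rootQueueName {
		return nil
	}
	if !reflect.DeepEqual(oldSpec, newSpec) {
		return fmt.Errorf("spec of queue %q is managed by the system and cannot be changed", name)
	}
	return nil
}

func main() {
	oldSpec := QueueSpec{Capability: map[string]int64{"cpu": 10}}
	newSpec := QueueSpec{Capability: map[string]int64{"cpu": 20}}

	fmt.Println(validateRootUpdate("root", oldSpec, newSpec))
	fmt.Println(validateRootUpdate("team-a", oldSpec, newSpec))
}
```

A blanket rejection like this would also block any component that currently writes to the root queue's spec, so it would need to be paired with removing or exempting those writers.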

pkg/webhooks/admission/queues/validate/validate_queue.go

pkg/scheduler/plugins/capacity/capacity.go

zjj2wry · 2025-12-28T07:05:48Z

I think we can add another function here to independently validate the queue's capability, deserved, and guarantee quotas, such as using 'ValidateResourceQuantityValue' for legality checks, for example, not allowing quota values to be negative and requiring that guarantee <= deserved <= capability. pkg/webhooks/admission/queues/validate/validate_queue.go#L128-L301

@guoqinwill Good catch! I've implemented the independent quota validation function as you suggested.

zjj2wry · 2025-12-28T07:31:32Z

In the capacity plugin, the root queue should have unlimited default resources, allowing tasks that exceed resource limits to be scheduled normally.

volcano/pkg/scheduler/plugins/capacity/capacity.go

Lines 599 to 606 in 2090e3b

    rootQueueAttr := cp.queueOpts[api.QueueID(cp.rootQueue)]
    if rootQueueAttr.capability.IsEmpty() {
    	rootQueueAttr.capability = cp.totalResource
    }
    if rootQueueAttr.deserved.IsEmpty() {
    	rootQueueAttr.deserved = cp.totalResource
    }
    rootQueueAttr.realCapability = cp.totalResource

This can also resolve the issue you mentioned where reclaim conflicts with enqueue. At the same time, we should reject modifications or updates related to the spec of the root queue, because the root queue is automatically created by Volcano and should not restrict user job submissions based on cluster resources. There has been related discussion on this topic (see #4662 (comment)), but it seems that the author did not submit this change. I think your submission can achieve this.

The scheduler framework actively updates the root queue's Spec.Deserved and Spec.Guarantee.Resource in

volcano/pkg/scheduler/framework/session.go

Lines 546 to 585 in 3275f8a

    
    // updateRootQueueResources updates the deserved/guaranteed resource and allocated resource of the root queue
    func updateRootQueueResources(ssn *Session, allocated v1.ResourceList) {
    	rootQueue := api.QueueID("root")
    	totalDeserved := util.ConvertRes2ResList(ssn.TotalDeserved).DeepCopy()
    	totalGuarantee := util.ConvertRes2ResList(ssn.TotalGuarantee).DeepCopy()
    	if equality.Semantic.DeepEqual(ssn.Queues[rootQueue].Queue.Spec.Deserved, totalDeserved) &&
    		equality.Semantic.DeepEqual(ssn.Queues[rootQueue].Queue.Spec.Guarantee.Resource, totalGuarantee) &&
    		equality.Semantic.DeepEqual(ssn.Queues[rootQueue].Queue.Status.Allocated, allocated) {
    		klog.V(5).Infof("Root queue deserved/guaranteed resource and allocated resource remains the same, no need to update the queue.")
    		return
    	}
    	queue := &vcv1beta1.Queue{}
    	err := schedulingscheme.Scheme.Convert(ssn.Queues[rootQueue].Queue, queue, nil)
    	if err != nil {
    		klog.Errorf("failed to convert scheduling.Queue to v1beta1.Queue: %s", err.Error())
    		return
    	}
    	if !equality.Semantic.DeepEqual(queue.Spec.Deserved, totalDeserved) ||
    		!equality.Semantic.DeepEqual(queue.Spec.Guarantee.Resource, totalGuarantee) {
    		queue.Spec.Deserved = totalDeserved
    		queue.Spec.Guarantee.Resource = totalGuarantee
    		queue, err = ssn.VCClient().SchedulingV1beta1().Queues().Update(context.TODO(), queue, metav1.UpdateOptions{})
    		if err != nil {
    			klog.Errorf("failed to update root queue: %s", err.Error())
    			return
    		}
    	}
    	if !equality.Semantic.DeepEqual(queue.Status.Allocated, allocated) {
    		queue.Status.Allocated = allocated
    		_, err = ssn.VCClient().SchedulingV1beta1().Queues().UpdateStatus(context.TODO(), queue, metav1.UpdateOptions{})
    		if err != nil {
    			klog.Errorf("failed to update root queue status: %s", err.Error())
    			return
    		}
    	}
    }

In my use case, I need:

- Root queue during scheduling: Unlimited resources (to avoid blocking jobs)
- Root queue quota (guarantee/deserved/capability): Set to cluster total resources (as a reference limit for child queue creation)

guoqinwill · 2025-12-29T02:20:30Z

The update to root when closing the session here is actually unnecessary because the sub-queues of root do not need to check guarantee, deserved, or capability. So each session performs the update, but in reality, it doesn’t have much effect.

JesseStutler · 2025-12-29T12:15:30Z

@zjj2wry Hi, thanks for your contribution. Please don't forget to clean up and squash your commits; currently there are too many commits in your PR. You can keep 1-3 clean commits.

pkg/webhooks/admission/queues/validate/validate_queue.go

pkg/scheduler/plugins/capacity/capacity.go

pkg/webhooks/admission/queues/validate/validate_queue.go

pkg/scheduler/plugins/capacity/capacity.go

pkg/webhooks/admission/queues/validate/validate_queue.go

guoqinwill · 2025-12-29T12:22:28Z

I mentioned a few points about verification interception again this morning, but they don't seem to have been modified yet. Please take a look at these.

Are you suggesting introducing a prohibition on root updates in this PR? The prerequisite would be to remove the root update code in volcano, but I don't know if any users are currently using information in the root queue, which could break compatibility.

No, it's the other suggestions in validate_queue.go. You can take another look at the few comments of mine that are pending.

@guoqinwill Did you submit your review comments? The contributor can't see pending review comments if you didn't submit them.

Sorry, my problem, I forgot to submit it.

pkg/webhooks/admission/queues/validate/validate_queue.go

hzxuzhonghu · 2025-12-30T08:51:10Z

@zjj2wry please squash

Makefile

pkg/scheduler/plugins/capacity/capacity.go

pkg/webhooks/router/indexer.go

pkg/webhooks/admission/queues/validate/validate_queue.go

zjj2wry · 2025-12-30T09:39:41Z

@zjj2wry please squash

I'll squash these commits after the review. Keeping them separate now just makes reviewing easier.

pkg/webhooks/admission/queues/validate/validate_queue.go

zhaizhicheng · 2026-01-06T08:10:45Z

Finally, someone is fixing this bug. Looking forward to it being merged into the main branch soon

JesseStutler · 2026-01-08T02:26:44Z

/approve
I'm OK with the current change for now; @guoqinwill can add more fine-grained validation in the future.
/cc @guoqinwill @hzxuzhonghu

volcano-sh-bot · 2026-01-08T02:26:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JesseStutler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [JesseStutler]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

guoqinwill · 2026-01-08T02:29:05Z

/lgtm

volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 23, 2025

volcano-sh-bot requested review from alcorj-mizar, hzxuzhonghu, wangyang0616 and wpeng102 December 23, 2025 14:01

volcano-sh-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Dec 23, 2025

zjj2wry changed the title ~~Fix: prevent scheduling failure new~~ fix: prevent cluster-wide scheduling failure due to queue hierarchy validation Dec 23, 2025

gemini-code-assist bot reviewed Dec 23, 2025

Makefile (outdated, resolved)

pkg/scheduler/plugins/capacity/capacity.go (outdated, resolved)

pkg/webhooks/admission/queues/validate/validate_queue.go (outdated, resolved)

JesseStutler reviewed Dec 26, 2025

pkg/webhooks/admission/queues/validate/validate_queue.go (outdated, resolved)

JesseStutler reviewed Dec 26, 2025

pkg/webhooks/admission/queues/validate/validate_queue.go (outdated, resolved)

volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Dec 27, 2025

hzxuzhonghu reviewed Dec 27, 2025

pkg/webhooks/admission/queues/validate/validate_queue.go (outdated, resolved)

hzxuzhonghu reviewed Dec 27, 2025

pkg/webhooks/admission/queues/validate/validate_queue.go (outdated, resolved)

hzxuzhonghu reviewed Dec 27, 2025

pkg/scheduler/plugins/capacity/capacity.go (outdated, resolved)

zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch 5 times, most recently from f59933e to c249a55 Compare December 29, 2025 02:18

zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch from 575787d to 3b99f42 Compare December 29, 2025 02:34

volcano-sh-bot added the do-not-merge/contains-merge-commits label Dec 29, 2025

zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch from d41c10a to 276ef22 Compare December 29, 2025 02:42

guoqinwill reviewed Dec 29, 2025

guoqinwill reviewed Dec 30, 2025

pkg/webhooks/admission/queues/validate/validate_queue.go (outdated, resolved)

volcano-sh-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 30, 2025

zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch from a621cda to 5a45758 Compare December 30, 2025 07:29

hzxuzhonghu reviewed Dec 30, 2025

Makefile (resolved)

pkg/scheduler/plugins/capacity/capacity.go (resolved)

pkg/webhooks/router/indexer.go (outdated, resolved)

pkg/webhooks/admission/queues/validate/validate_queue.go (outdated, resolved)

zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch from 5a45758 to 3b19ee0 Compare December 30, 2025 10:08

guoqinwill reviewed Dec 30, 2025

pkg/webhooks/admission/queues/validate/validate_queue.go (outdated, resolved)

zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch 5 times, most recently from 52895ac to dbcce33 Compare January 5, 2026 02:56

fix: prevent cluster-wide scheduling failure due to queue hierarchy v…

2110b5b

…alidation Signed-off-by: zhengjiajin <[email protected]>

zjj2wry force-pushed the fix/prevent_scheduling_failure-new branch from dbcce33 to 2110b5b Compare January 5, 2026 07:58

volcano-sh-bot requested review from guoqinwill and hzxuzhonghu January 8, 2026 02:26

volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2026

volcano-sh-bot assigned guoqinwill Jan 8, 2026

volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 8, 2026

volcano-sh-bot merged commit 46f99f8 into volcano-sh:master Jan 8, 2026
22 of 23 checks passed

JesseStutler mentioned this pull request Jan 8, 2026

Hierarchical Queue Capability Can Exceed Cluster Resources, Causing Validation Failures #4819

Closed
