Skip to content

Conversation

@zhifei92
Copy link
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

The LessEqual and Less(even though it's no longer used anywhere.) functions in pkg/scheduler/api/resource_info.go have an incorrect logic when using Infinity mode. When the left resource (r) is missing a scalar resource dimension (e.g., GPU) but the right resource (rr) has it, the function incorrectly returns false.

Expected Behavior

When r doesn't have a dimension that rr has, it should be treated as r[dimension] = 0, and the comparison should continue (e.g., 0 <= rr[dimension] returns true).

Actual Behavior (Before Fix)

The function returns false immediately, treating the missing dimension as if r exceeds rr, which is semantically incorrect.

Root Cause

The problematic code :

if defaultValue == Infinity {
for name := range rr.ScalarResources {
if _, ok := r.ScalarResources[name]; !ok {
return false
}
}
}

This logic enforces that r must contain all dimensions present in rr.

Impact and Necessity of Fix
1. Incorrect Eviction in Capacity Plugin

The most critical impact is in the capacity plugin's reclaim logic:

// When scalar resource not specified in deserved such as "pods", we should skip it and consider it as infinity,
// so the following first condition will be true and the current queue will not be reclaimed.
if allocated.LessEqual(attr.deserved, api.Infinity) || !attr.guarantee.LessEqual(exceptReclaimee, api.Zero) {
continue
}

Bug Scenario:

allocated := {CPU: 2000m, Memory: 4Gi}           // No GPU requested
deserved  := {CPU: 4000m, Memory: 8Gi, GPU: 2}  // GPU configured

// Before fix:
allocated.LessEqual(deserved, Infinity) = false  // Wrong!Continue to guarantee checkMay incorrectly reclaim resources**Causes unexpected pod eviction**

// After fix:
allocated.LessEqual(deserved, Infinity) = true   // Correct!Skip reclamation**Avoids incorrect eviction**

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed unexpected pod evictions caused by unused resource types in queue quota configuration.

@volcano-sh-bot volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 30, 2025
@gemini-code-assist
Copy link

Summary of Changes

Hello @zhifei92, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the resource comparison logic within the scheduler's API, specifically affecting the LessEqual and Less functions when operating in Infinity mode. Previously, these functions would incorrectly return false if a resource being compared lacked a scalar dimension present in the other resource, leading to flawed comparisons. The corrected logic now treats missing dimensions as zero, ensuring accurate comparisons and, crucially, preventing erroneous pod evictions that were occurring in the capacity plugin due to this flaw.

Highlights

  • Corrected Infinity Mode Logic: The LessEqual and Less functions in pkg/scheduler/api/resource_info.go now correctly handle Infinity mode when a scalar resource dimension is missing from the left-hand resource. Previously, this scenario led to incorrect comparisons.
  • Prevented Incorrect Pod Evictions: This fix resolves a critical bug in the capacity plugin's reclaim logic, which previously caused unexpected pod evictions due to the flawed resource comparison logic. Missing resource dimensions are now treated as zero, preventing erroneous reclamation.
  • Updated Test Cases: Corresponding test cases in pkg/scheduler/api/resource_info_test.go have been updated to reflect the new, correct behavior of resource comparison functions, changing expected outcomes from false to true for specific scenarios.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@volcano-sh-bot volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 30, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly identifies and removes a logic bug in Less and LessEqual functions where they would incorrectly return false in Infinity mode if the left-hand resource was missing a scalar dimension present in the right-hand one. This is a crucial fix to prevent incorrect pod evictions.

However, while the incorrect logic is removed, the implementation is still incomplete. The functions now fail to check for resource dimensions that are present in the right-hand resource (rr) but not in the left-hand one (r). According to the intended logic, a missing dimension in r should be treated as zero. I've added suggestions to both functions to complete the logic by iterating over rr's resources as well. This is critical for ensuring the correctness of these comparison functions.

@zhifei92
Copy link
Contributor Author

The LessEqual method that propagates the infinity parameter throughout the project is only used in the capacity plugin.

@JesseStutler
Copy link
Member

/cc

@JesseStutler
Copy link
Member

Looks like consider not set as zero is more reasonable, don't know why #2637 is more willing to consider not set as inifity. /cc @kingeasternsun

@JesseStutler
Copy link
Member

@zhifei92 Hi, please don't forget to sign off your commit using git commit -s

@zhifei92 zhifei92 force-pushed the fix-capacity-reclaim-func branch from d968f53 to c05ae7b Compare January 6, 2026 06:00
@zhifei92
Copy link
Contributor Author

zhifei92 commented Jan 6, 2026

@zhifei92 Hi, please don't forget to sign off your commit using git commit -s

done

@zhifei92 zhifei92 force-pushed the fix-capacity-reclaim-func branch from c05ae7b to 5dd978a Compare January 6, 2026 06:25
@zhifei92
Copy link
Contributor Author

zhifei92 commented Jan 6, 2026

The CI failure doesn't seem to be caused by this PR. I'll update it after https://github.com/volcano-sh/volcano/pull/4914/changes is merged.

@JesseStutler
Copy link
Member

/retest

@JesseStutler
Copy link
Member

@zhifei92 I updated the code first and give CIs a try, but I don't hope we should keep a merge commit, you can rebase the latest code and push again

@zhifei92
Copy link
Contributor Author

zhifei92 commented Jan 6, 2026

you can rebase the latest code and push again

done

@JesseStutler
Copy link
Member

/approve
/cc @hajnalmt

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JesseStutler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 6, 2026
Copy link
Contributor

@hajnalmt hajnalmt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold
/cc @kingeasternsun
Sry I missed this PR.
Let's start a discussion here, I think the function works correctly and by that I mean it's working as designed.

It was added in this PR:
#2637
(I added kingeasternsun due to this.)

The problem comes from misusing the function in the capacity plugin in the comparision you mentioned. This is the only place where we compare with api.Infinity hence the problem arose. But design-wise I think, if the left value is not defined in the resource comparsion we should return False when you provide api.Infinity to the function.

The bug you mentioned is actually pretty serious. We are overreclaiming now below deserved.

@volcano-sh-bot volcano-sh-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 6, 2026
@hajnalmt
Copy link
Contributor

hajnalmt commented Jan 6, 2026

Okay, I checked the code on my prod env about what did I do here.
It was not a one liner problem to solve unfortunately, and I was reluctant to raise a PR on this yet, because I don't like my code here. I want to mention that I have tried to solve a diffent issue here. I needed to create queues without cpu and memory on deserved and it wasn't working properly due to this same comparision.

It feels hacky still, but it actually solves this issue too so I have raised a PR in the end, based on this change and tried to summerize the issues I have met during the development of this here:
#4918

@JesseStutler JesseStutler removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 7, 2026
@JesseStutler
Copy link
Member

/hold /cc @kingeasternsun Sry I missed this PR. Let's start a discussion here, I think the function works correctly and by that I mean it's working as designed.

It was added in this PR: #2637 (I added kingeasternsun due to this.)

The problem comes from misusing the function in the capacity plugin in the comparision you mentioned. This is the only place where we compare with api.Infinity hence the problem arose. But design-wise I think, if the left value is not defined in the resource comparsion we should return False when you provide api.Infinity to the function.

The bug you mentioned is actually pretty serious. We are overreclaiming now below deserved.

Thanks mate you reminded me, I think we'd rather use LessEqualPartly in the code rather than delete codes in the LessEqual, Although I think the logic of LessEqual is not quite right here, because I think the non-existence of resources in r should not be regarded as infinity.

@JesseStutler
Copy link
Member

Okay, I checked the code on my prod env about what did I do here. It was not a one liner problem to solve unfortunately, and I was reluctant to raise a PR on this yet, because I don't like my code here. I want to mention that I have tried to solve a diffent issue here. I needed to create queues without cpu and memory on deserved and it wasn't working properly due to this same comparision.

It feels hacky still, but it actually solves this issue too so I have raised a PR in the end, based on this change and tried to summerize the issues I have met during the development of this here: #4918

As @hajnalmt mentioned in #4918 bug1, I do think we wrongfully didn't consider the resources reclaimer request, only compares the allocated and deserverd(but we still need to compare the allocated and deserved), we just need to add more judgements that we need to also judge that whether the scheduler has already reclaimed enough resources for the reclaimer

@hajnalmt
Copy link
Contributor

hajnalmt commented Jan 9, 2026

Okay, I think I managed to bring this to a much more acceptable form:
#4927
#4919
Any feedbacks are welcomed!
@zhifei92 What do you think can we have these modifications instead of this? I still think that the function signature here works correctly, since the blank dimensions in general should be handled as api.Infinity so if anything is present on the right but not on the left than the left is bigger.

@JesseStutler
Copy link
Member

Okay, I think I managed to bring this to a much more acceptable form: #4927 #4919 Any feedbacks are welcomed! @zhifei92 What do you think can we have these modifications instead of this? I still think that the function signature here works correctly, since the blank dimensions in general should be handled as api.Infinity so if anything is present on the right but not on the left than the left is bigger.

Thanks! I'd prefer to follow the way in #4919

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants