
CountIn() does not assume the requests has 2 or more pods count #4355

Open

wants to merge 1 commit into base: main
Conversation

@tenzen-y (Member) commented Feb 23, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

I refined CountIn() in the resources package since it does not take the Pods count into account even though it is exposed to external packages.
Additionally, I renamed the req receiver to r so that all Requests methods use the same receiver name.

As we discussed in this PR, we decided not to handle requests that have 2 or more Pods.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This is a follow-up PR of #4271.

#4271 (comment)

But if we pass two or more Pods as requests, like {"pods": 2}, to the CountIn() function, it will be broken.
The CountIn() function is exposed, and the current implementation cannot enforce that only a 1-Pod request is passed as the receiver. So, I would like to add a guard to CountIn(). But we can do it in a separate PR since the refinement does not directly contribute to this bug fix. @mimowo @PBundyra

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot added the release-note-none (Denotes a PR that doesn't merit a release note) and kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt) labels on Feb 23, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files) labels on Feb 23, 2025
netlify bot commented Feb 23, 2025

Deploy Preview for kubernetes-sigs-kueue ready!

🔨 Latest commit: fad17dd
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67bf36ecc48d6f00087fb038
😎 Deploy Preview: https://deploy-preview-4355--kubernetes-sigs-kueue.netlify.app

@mimowo (Contributor) left a comment

/hold
I'm not sure why we need special handling for ResourcePods in this function, since both the Requests and the capacity contain the constraint, so it can be treated like any other resource.

Also, I think the semantics of an unlimited Pods-count capacity is not what kube-scheduler does. Can you double-check? From my brief look earlier, I thought kube-scheduler assumes 0 in that case.

If there is something confusing about the function, maybe it can be clarified with a comment?

@k8s-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command) label on Feb 24, 2025
@tenzen-y (Member Author)

I'm not sure why we need special handling for ResourcePods in this function, since both the Requests and the capacity contain the constraint, so it can be treated like any other resource.

The receiver and argument of func (r Requests) CountIn(capacity Requests) int32 are both typed as Requests, so the current implementation cannot enforce whether they contain ResourcePods. That is why I added checks for ResourcePods.

I think we have 3 options to stabilize this function for ResourcePods, as follows (see the sketch below the list):

  1. (current): Check whether the incoming resources have ResourcePods. If capacity does not have it, we can consider it infinite.
  2. If the receiver request has ResourcePods but capacity does not:
    a. we ignore the receiver request's ResourcePods.
    b. we consider the capacity for ResourcePods to be 0, the same as kube-scheduler.

@mimowo Either approach is valid, IMO. Let me know what you think.
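
For illustration, a minimal sketch of how options 1 and 2b differ for a capacity that lacks a "pods" entry (the helper name and the map shape are hypothetical, not the actual Kueue code):

```go
package main

import "fmt"

// podsCapacity returns the effective Pods capacity and whether the Pods
// constraint should be applied at all. Hypothetical sketch of the options above.
func podsCapacity(capacity map[string]int64, kubeSchedulerSemantics bool) (int64, bool) {
	if c, ok := capacity["pods"]; ok {
		return c, true // an explicit entry always bounds the Pods count
	}
	if kubeSchedulerSemantics {
		return 0, true // option 2b: a missing entry counts as zero capacity
	}
	return 0, false // option 1: a missing entry means infinite; skip the check
}

func main() {
	capacity := map[string]int64{"cpu": 10} // no "pods" entry
	fmt.Println(podsCapacity(capacity, true))  // 0 true (kube-scheduler semantics)
	fmt.Println(podsCapacity(capacity, false)) // 0 false (treated as infinite)
}
```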

@mimowo (Contributor) commented Feb 24, 2025

TBH, I'm not sure what problem we are trying to solve, and the modification to the function makes it more complex, which risks subtle bugs, so I'm cautious.

Let me ask some questions:

  1. what is the role of the PR: a fix for some rare bug, or just a cleanup?
  2. why is the result here 4 and not 5? I believe prior to the PR it would be 5, so the change is user-facing; the question is whether for better or worse :).

If we need to choose one of the semantics, I would definitely be in favor of following kube-scheduler, meaning handling it as 0. But again, I'm not sure this function needs to be aware of Pods; for example, the other functions in the package aren't aware of Pods.

@mimowo (Contributor) commented Feb 26, 2025

/assign

@tenzen-y (Member Author) commented Feb 26, 2025

TBH, I'm not sure what problem we are trying to solve, and the modification to the function makes it more complex, which risks subtle bugs, so I'm cautious.

Let me ask some questions:

  1. what is the role of the PR: a fix for some rare bug, or just a cleanup?
  2. why is the result here 4 and not 5? I believe prior to the PR it would be 5, so the change is user-facing; the question is whether for better or worse :).

  1. I think this is a cleanup for now, since the external package passes only a 1-Pod resource as the receiver. In the future, if it passes multiple-Pod resources, this will return an unintended value.
  2. Ah, that makes sense. This is actually my mistake. This function should return 5.

If we need to choose one of the semantics, I would definitely be in favor of following kube-scheduler, meaning handling it as 0. But again, I'm not sure this function needs to be aware of Pods; for example, the other functions in the package aren't aware of Pods.

That makes sense. In that case, let me align with kube-scheduler and add edge-case handling for when the requests contain multiple Pods. Let's say the current request (receiver) is "pods": 3 and the capacity is "pods": 4. In that case, it returns 0, but the expected result is 3.

EDIT: I refined the example to clarify the problem

@mimowo (Contributor) commented Feb 26, 2025

the external package passes only a 1-Pod resource as the receiver. In the future, if it passes multiple-Pod resources, this will return an unintended value.

Can you elaborate on why the result would be invalid if the receiver has ResourcePods=2?

Assuming the result is indeed wrong, could we just return an error / panic rather than support it? We can extend the support if needed, I think.

@tenzen-y (Member Author)

the external package passes only a 1-Pod resource as the receiver. In the future, if it passes multiple-Pod resources, this will return an unintended value.

Can you elaborate on why the result would be invalid if the receiver has ResourcePods=2?

We can consider this case:

Let's say the current request (receiver) is "pods": 3 and the capacity is "pods": 4. In that case, it returns 0, but the expected result is 3.

Assuming the result is indeed wrong, could we just return an error / panic rather than support it? We can extend the support if needed, I think.

I'm OK with returning an error if the requests (receiver) have a Pods count of 2 or more. My point is just that I want to remove the unhandled case. Is it OK to return an error in that situation (the Pods request is 2 or more)?

@mimowo (Contributor) commented Feb 26, 2025

In that case, it returns 0, but the expected result is 3.

I thought the current result would be 1, no? We basically count how many Pods with these requests will fit into the capacity. So there would be one pod which requests 3 pods inside 4 pods of capacity. Yeah, this gets tricky.

I'm OK with just returning an error whenever the value is different from 1, yes.

@tenzen-y (Member Author) commented Feb 26, 2025

I thought the current result would be 1, no? We basically count how many Pods with these requests will fit into the capacity. So there would be one pod which requests 3 pods inside 4 pods of capacity. Yeah, this gets tricky.

We always calculate MIN(capacity/requests, result) regardless of the kind of resource.
So, when the request is {"pods": 3, "cpu": 1} and the capacity is {"pods": 4, "cpu": 10}, CountIn returns 1...

I'm OK with just returning an error whenever the value is different from 1, yes.

Alright, let's just treat it as an error.
Sorry for the confusion. Yes, the result in the above case is actually 1, but the capacity can actually accommodate 3 Pods, right? Anyway, I agree with not handling that situation (a request Pods count > 1).

EDIT: I fixed the example case.

@mimowo (Contributor) commented Feb 26, 2025

Sorry for the confusion. Yes, the result in the above case is actually 1, but the capacity can actually accommodate 3 Pods, right?

Yes, the capacity can accommodate 3 pods, but when I introduced this function its aim was to be just a math helper computing min_r(floor(capacity_r / request_r)). So, I think of this function as similar to the Sub or Add functions.
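
For reference, a minimal runnable sketch of that math helper over a simplified Requests type (a plain map here; not the actual Kueue implementation):

```go
package main

import (
	"fmt"
	"math"
)

// Requests maps a resource name to a requested quantity; a simplified
// stand-in for the Requests type in Kueue's resources package.
type Requests map[string]int64

// countIn computes min over all requested resources r of
// floor(capacity[r] / requests[r]), treating every resource uniformly.
func countIn(requests, capacity Requests) int32 {
	result := int32(math.MaxInt32)
	for name, requested := range requests {
		if requested == 0 {
			continue // a zero request never constrains the count
		}
		// Integer division floors; a missing capacity entry reads as 0.
		if fits := int32(capacity[name] / requested); fits < result {
			result = fits
		}
	}
	return result
}

func main() {
	// The example from the discussion: min(floor(4/3), floor(10/1)) = 1.
	fmt.Println(countIn(
		Requests{"pods": 3, "cpu": 1},
		Requests{"pods": 4, "cpu": 10},
	)) // 1
}
```

Under this reading, the function answers "how many copies of this request fit into the capacity", which is why the result above is 1 rather than 3.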

@tenzen-y (Member Author) commented Feb 26, 2025

Sorry for the confusion. Yes, the result in the above case is actually 1, but the capacity can actually accommodate 3 Pods, right?

Yes, the capacity can accommodate 3 pods, but when I introduced this function its aim was to be just a math helper computing min_r(floor(capacity_r / request_r)). So, I think of this function as similar to the Sub or Add functions.

That makes sense as the objective.
But for a math helper, it would be better to handle edge cases, especially for multiplication and division (~= CountIn).
OTOH, I agree with not handling such edge cases here.

@tenzen-y (Member Author)

I changed this PR to return an error when the requests have 2 or more Pods ({"pods": 2}). See the sketch below.
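
A minimal sketch of that guard (an assumed shape; the actual diff in this PR may differ), reusing the simplified Requests type from the earlier sketch:

```go
package main

import (
	"errors"
	"fmt"
)

// Requests is the same simplified stand-in as in the earlier sketch.
type Requests map[string]int64

// singlePodGuard is a hypothetical helper: CountIn would run this first and
// propagate the error instead of computing a count for multi-Pod requests.
func singlePodGuard(requests Requests) error {
	if pods := requests["pods"]; pods >= 2 {
		return errors.New("CountIn supports only single-Pod requests")
	}
	return nil
}

func main() {
	fmt.Println(singlePodGuard(Requests{"pods": 1, "cpu": 1})) // <nil>
	fmt.Println(singlePodGuard(Requests{"pods": 2, "cpu": 2})) // error
}
```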

@tenzen-y (Member Author)

/retitle CountIn() does not assume the requests has 2 or more pods count

@k8s-ci-robot changed the title from "Consider Pods count in CountIn()" to "CountIn() does not assume the requests has 2 or more pods count" on Feb 26, 2025
@k8s-ci-robot (Contributor)

@tenzen-y: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-kueue-test-integration-main | fad17dd | link | true | /test pull-kueue-test-integration-main
pull-kueue-test-unit-main | fad17dd | link | true | /test pull-kueue-test-unit-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@mimowo (Contributor) commented Feb 26, 2025

/hold
Actually, some semantics for ResourcePods >= 2 could be useful for me in #4418

EDIT: the semantics would be as currently implemented. I think this makes some sense.

Think of the receiver as a PodGroup that requests 3 pods. The total capacity is 4 pods. So the question is "how many groups of pods can I fit into the capacity?" The answer is 1, so I think this is correct. Maybe just a better comment is needed.

@tenzen-y (Member Author) commented Feb 26, 2025

/hold Actually, some semantics for ResourcePods >= 2 could be useful for me in #4418

@mimowo As I debugged the failing tests, I found that in the preemption case we pass multiple Pods to the CountIn function.
So, I think we should handle the 2-or-more-Pods case appropriately in CountIn.

@tenzen-y (Member Author)

Think of the receiver as a PodGroup that requests 3 pods. The total capacity is 4 pods. So the question is "how many groups of pods can I fit into the capacity?" The answer is 1, so I think this is correct. Maybe just a better comment is needed.

That makes sense (I saw the older comments after leaving my comment above).
In that case, I think it would be better to rename the receiver struct to PodGroupRequests. Currently, it is not obvious which types of requests are assumed.

@tenzen-y (Member Author)

In any case, I think this PR is not needed.
If we need to add comments or rename the struct, it would be better to open a new PR to keep a record of this discussion.

@mimowo (Contributor) commented Feb 26, 2025

In any case, I think this PR is not needed.

Maybe rename the receiver or add docs instead? I think the "mental model" of PodGroupRequests is quite useful to keep in mind; basically, "count how many pod groups like this can fit into the capacity" is the question answered by the function.

@mimowo (Contributor) commented Feb 26, 2025

As I debugged the failing tests, I found that in the preemption case we pass multiple Pods to the CountIn function.

Interesting, I didn't expect that. Do you have more details? Is this expected?

@tenzen-y (Member Author)

In any case, I think this PR is not needed.

Maybe rename the receiver or add docs instead? I think the "mental model" of PodGroupRequests is quite useful to keep in mind; basically, "count how many pod groups like this can fit into the capacity" is the question answered by the function.

Yeah, if you prefer to keep that work in this PR, I'm fine with that. I just assumed we would work on it in a separate PR.

@tenzen-y (Member Author)

As I debugged the failing tests, I found that in the preemption case we pass multiple Pods to the CountIn function.

Interesting, I didn't expect that. Do you have more details? Is this expected?

Oh, it might be a bug...

When I added a logger (s.log.Info("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "requests", requests)) at

if leaf.state, err = requests.CountIn(remainingCapacity); err != nil {
	return err
}

I got the tas_flavor_snapshot.go:537: "level"=0 "msg"="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" "schedulingCycle"=1 "requests"={"cpu"=2000 "pods"=2} log message in the following cases:

@tenzen-y (Member Author)

Ah, I found the root cause. This is a bug: SinglePodRequests is being reused unintentionally...
I will open another PR to explain the problem and fix the bug...
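
A hypothetical illustration of the kind of reuse bug described here, and of how it could produce the "requests"={"cpu"=2000 "pods"=2} value in the log above (assumed shapes, not the actual Kueue code):

```go
package main

import "fmt"

// Requests is the same simplified stand-in as in the earlier sketches.
type Requests map[string]int64

// Add accumulates other into r in place, mirroring an Add-style helper.
func (r Requests) Add(other Requests) {
	for name, v := range other {
		r[name] += v
	}
}

func main() {
	singlePodRequests := Requests{"cpu": 1000, "pods": 1}

	// First pass: a caller totals requests into what it thinks is a copy.
	total := singlePodRequests // maps are references in Go; this does NOT copy
	total.Add(Requests{"cpu": 1000, "pods": 1})

	// A second pass (e.g. preemption) then sees the mutated value:
	fmt.Println(singlePodRequests) // map[cpu:2000 pods:2], no longer a single Pod
}
```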

@tenzen-y (Member Author)

I feel like I have been fixing bugs every week recently...

@mimowo (Contributor) commented Feb 26, 2025

Ah, I found the root cause. This is a bug: SinglePodRequests is being reused unintentionally...
I will open another PR to explain the problem and fix the bug...

Not good; since we cherry-picked the "fix", should we consider 0.10.2 buggy?

@mimowo (Contributor) commented Feb 26, 2025

Yeah, if you prefer to keep that work in this PR, I'm fine with that. I just assumed we would work on it in a separate PR.

separate is fine, for sure

@tenzen-y (Member Author)

Ah, I found the root cause. This is a bug: SinglePodRequests is being reused unintentionally...
I will open another PR to explain the problem and fix the bug...

Not good; since we cherry-picked the "fix", should we consider 0.10.2 buggy?

Technically, yes. But it happens only with TAS plus preemption, since for now only preemption calls caSnapshot.FindTopologyAssignmentsForWorkload() twice.

@mimowo (Contributor) commented Feb 26, 2025

Technically, yes. But it happens only with TAS plus preemption, since for now only preemption calls caSnapshot.FindTopologyAssignmentsForWorkload() twice.

Ok, so it is OK from the user's PoV; we only release preemptions in 0.11

@tenzen-y (Member Author)

Technically, yes. But it happens only with TAS plus preemption, since for now only preemption calls caSnapshot.FindTopologyAssignmentsForWorkload() twice.

Ok, so it is OK from the user's PoV; we only release preemptions in 0.11

Yeah, we do not need to rush preparing the next patch release.
I think the fix can be cherry-picked to the release branch.

@tenzen-y (Member Author) commented Feb 26, 2025

Yeah, if you prefer to keep that work in this PR, I'm fine with that. I just assumed we would work on it in a separate PR.

separate is fine, for sure

Thanks. So I will extract 2 PRs from this PR: one for the struct name change and one for the bug fix.
After I open those PRs, I will close this one.

Labels
approved (Indicates a PR has been approved by an approver from all required OWNERS files)
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA)
do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command)
kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt)
release-note-none (Denotes a PR that doesn't merit a release note)
size/M (Denotes a PR that changes 30-99 lines, ignoring generated files)