
Conversation

@mszadkow
Contributor

@mszadkow mszadkow commented Jan 8, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #8302

Special notes for your reviewer:

Still working on the E2E: I want to create a situation where the workload cannot be re-admitted on the worker that evicted it in the first place, but we don't know in advance which worker that will be.
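A minimal sketch of how the admitting worker could be discovered at runtime (hedged; k8sManagerClient and wlKey are the suite-level client and key used later in the diff, and util.Timeout/util.Interval are the usual kueue e2e constants):

admittedOn := ""
gomega.Eventually(func(g gomega.Gomega) {
	managerWl := &kueue.Workload{}
	g.Expect(k8sManagerClient.Get(ctx, wlKey, managerWl)).To(gomega.Succeed())
	// Status.ClusterName records which worker admitted the workload.
	g.Expect(managerWl.Status.ClusterName).NotTo(gomega.BeNil())
	admittedOn = *managerWl.Status.ClusterName
}, util.Timeout, util.Interval).Should(gomega.Succeed())
// admittedOn now names the worker whose quota can be shrunk to block re-admission.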

Does this PR introduce a user-facing change?


@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.


@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress, kind/feature, and do-not-merge/release-note-label-needed labels Jan 8, 2026
@netlify

netlify bot commented Jan 8, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 16cb9d9
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6960b6f7af02f80008c826dc

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes label Jan 8, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mszadkow
Once this PR has been reviewed and has the lgtm label, please assign gabesaba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL label Jan 8, 2026
@mszadkow
Contributor Author

mszadkow commented Jan 8, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test label Jan 8, 2026
@mszadkow mszadkow changed the title from "Fix/8302 reset mk admission cluster" to "WIP Fix/8302 reset mk admission cluster" Jan 8, 2026
@mszadkow mszadkow force-pushed the fix/8302-reset-mk-admission-cluster branch from 7d879a3 to 16cb9d9 on January 9, 2026 08:06
@k8s-ci-robot k8s-ci-robot added the size/L label and removed the size/XXL label Jan 9, 2026
@mszadkow
Contributor Author

mszadkow commented Jan 9, 2026

/retest

@mszadkow mszadkow changed the title from "WIP Fix/8302 reset mk admission cluster" to "[WIP] [Feat] Re-do mk admission after eviction in worker cluster" Jan 9, 2026
@mszadkow mszadkow marked this pull request as ready for review January 9, 2026 14:25
@k8s-ci-robot k8s-ci-robot requested a review from kannon92 January 9, 2026 14:25
@k8s-ci-robot
Contributor

@mszadkow: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kueue-test-e2e-main-1-35
Commit: 16cb9d9
Required: true
Rerun command: /test pull-kueue-test-e2e-main-1-35

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Contributor

@olekzabl olekzabl left a comment


Looks good! Just a few minor comments.

}

// workload eviction on worker cluster
log.V(5).Info("Workload gets evicted in the remote cluster", "cluster", evictedRemote)
Contributor

Nit: this present tense feels slightly confusing; IIUC the workload already got evicted.

acs.LastTransitionTime = metav1.NewTime(w.clock.Now())
workload.SetAdmissionCheckState(&wl.Status.AdmissionChecks, *acs, w.clock)
wl.Status.ClusterName = nil
wl.Status.NominatedClusterNames = nil
Contributor

Out of curiosity - can this have an effect, given this rule?
(or is this intended to replace an empty list with nil in some edge cases?)

}

for cluster := range group.remotes {
if err := client.IgnoreNotFound(group.RemoveRemoteObjects(ctx, cluster)); err != nil {
Contributor

In most cases, other calls to RemoveRemoteObjects in this file are followed by group.remotes[cluster] = nil.
Should that be done here as well?
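For concreteness, a sketch of what that would look like here (mirroring the other call sites, with error handling assumed; not an authoritative fix):

for cluster := range group.remotes {
	if err := client.IgnoreNotFound(group.RemoveRemoteObjects(ctx, cluster)); err != nil {
		return err
	}
	// Convention used by the other RemoveRemoteObjects call sites in this file.
	group.remotes[cluster] = nil
}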


createdAtWorker := ""

ginkgo.By("Checking that the workload is created in one of the workers", func() {
Contributor

Below, you'll modify the worker's CQ limits to control what it can fit.
If so, how about using that trick once more, to control where the workload lands initially?

AFAICS the initial CPU quotas are 2 at worker1 and 1 at worker2.
So if you request 1.5 for your workload, it will certainly land on worker1.
Then you could swap the quotas (say, first set 2 at w2, then set 1 at w1) and verify that the workload moved to w2; see the sketch below.

Compared to what you have now, that's +1 CQ update but -4 if blocks (if I'm not mistaken).
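Roughly, the swap could look like this (a sketch; the ClusterQueue name, worker clients, and single-resource-group layout are assumptions about the suite):

updateCPUQuota := func(c client.Client, cqName, cpu string) {
	gomega.Eventually(func(g gomega.Gomega) {
		cq := &kueue.ClusterQueue{}
		g.Expect(c.Get(ctx, client.ObjectKey{Name: cqName}, cq)).To(gomega.Succeed())
		cq.Spec.ResourceGroups[0].Flavors[0].Resources[0].NominalQuota = resource.MustParse(cpu)
		g.Expect(c.Update(ctx, cq)).To(gomega.Succeed())
	}, util.Timeout, util.Interval).Should(gomega.Succeed())
}
// With quotas 2 (w1) and 1 (w2), a 1.5-CPU workload must land on worker1.
updateCPUQuota(k8sWorker2Client, "cluster-queue", "2") // first make room at w2
updateCPUQuota(k8sWorker1Client, "cluster-queue", "1") // then squeeze w1; the workload should move to w2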

ginkgo.By("Checking that the workload is re-admitted in the other worker cluster", func() {
gomega.Eventually(func(g gomega.Gomega) {
g.Expect(k8sManagerClient.Get(ctx, wlKey, managerWl)).To(gomega.Succeed())
g.Expect(managerWl.Status.ClusterName).NotTo(gomega.HaveValue(gomega.Equal(createdAtWorker)))
Contributor

You could also check that it's not empty.
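e.g. (a sketch, using the matchers already imported above):

g.Expect(managerWl.Status.ClusterName).To(gomega.HaveValue(gomega.Not(gomega.BeEmpty())))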

}, gomega.Equal(completedJobCondition))))
})
})
ginkgo.It("Should redo the admission process once the workload looses Admission in the worker cluster", func() {
Contributor

Suggested change
ginkgo.It("Should redo the admission process once the workload looses Admission in the worker cluster", func() {
ginkgo.It("Should redo the admission process once the workload loses Admission in the worker cluster", func() {

ginkgo.By("check manager's workload ClusterName reset", func() {
gomega.Eventually(func(g gomega.Gomega) {
managerWl := &kueue.Workload{}
g.Expect(managerTestCluster.client.Get(worker1TestCluster.ctx, wlLookupKey, managerWl)).To(gomega.Succeed())
Contributor

Suggested change
g.Expect(managerTestCluster.client.Get(worker1TestCluster.ctx, wlLookupKey, managerWl)).To(gomega.Succeed())
g.Expect(managerTestCluster.client.Get(managerTestCluster.ctx, wlLookupKey, managerWl)).To(gomega.Succeed())

(or are there reasons to use the other context here?)

// workload eviction on worker cluster
log.V(5).Info("Workload gets evicted in the remote cluster", "cluster", evictedRemote)
needsACUpdate := acs.State == kueue.CheckStateReady
if err := workload.PatchAdmissionStatus(ctx, w.client, group.local, w.clock, func(wl *kueue.Workload) (bool, error) {
Contributor

Why would we do it when needsACUpdate is false?
(Currently, the update func does nothing in this case, but it still returns true - looks like we'd send an empty patch request to the apiserver?)
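One hedged way to express that, assuming PatchAdmissionStatus skips the API call when the mutation func returns false (my reading of its contract, not confirmed here; error handling sketched):

if err := workload.PatchAdmissionStatus(ctx, w.client, group.local, w.clock, func(wl *kueue.Workload) (bool, error) {
	if !needsACUpdate {
		// Nothing to change; returning false should avoid an empty patch request.
		return false, nil
	}
	acs.LastTransitionTime = metav1.NewTime(w.clock.Now())
	workload.SetAdmissionCheckState(&wl.Status.AdmissionChecks, *acs, w.clock)
	wl.Status.ClusterName = nil
	wl.Status.NominatedClusterNames = nil
	return true, nil
}); err != nil {
	return err
}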


Labels

cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels.)
do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.)
kind/feature (Categorizes issue or PR as related to a new feature.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MultiKueue should redo the admission process once the workload looses Admission in the worker cluster

3 participants