-
Notifications
You must be signed in to change notification settings - Fork 481
feat(KEP-3258): implement delayed admission check retries #7620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
5cd940c to
5b6df07
Compare
mimowo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, thank you 👍
test/integration/singlecluster/controller/admissionchecks/provisioning/provisioning_test.go
Outdated
Show resolved
Hide resolved
eb9934d to
1cd7565
Compare
mimowo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, left a couple comments, once this is done please follow up with refactoring of ProvisioningRequest to use the new API, this is already started by @dhenkel92 here: https://github.com/kubernetes-sigs/kueue/pull/7464/files#diff-d5b88e9a8af6b97ce61788f1307ec9ba1f4e3581a9c0634a151ce310c9ca3d91
| if wl.Status.RequeueState != nil { | ||
| wl.Status.RequeueState.RequeueAt = nil | ||
| } | ||
| return &wl, true, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we can also add workload.SetRequeuedCondition(&wl, kueue.WorkloadBackoffFinished, "The workload backoff was finished", true)
It is probably a follow up for later, but I would like to consolidate the two requests (here & pre-existing below) into one. They are really the same thing conceptually.
Still, this would mean a bigger diff so I'm ok as is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also extend one of the test asserting to check that the Requeued=true condition is set.
test/integration/singlecluster/scheduler/delayedadmission/delayed_admission_test.go
Outdated
Show resolved
Hide resolved
540f431 to
92e8f4b
Compare
d59062d to
a77160a
Compare
a77160a to
943629b
Compare
| if wl.Status.RequeueState != nil { | ||
| wl.Status.RequeueState = nil | ||
| if wl.Status.RequeueState != nil && wl.Status.RequeueState.RequeueAt != nil { | ||
| // Clear the requeue schedule while preserving the Count for historical tracking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also want to clear the Count as before. This is because we re-activation is a manual process in spec, and if we don't clear the Count, then such a workload will be automatically put back as Deactivated, which is not good experience. So the user would need to clear the Count on their own.
This makes me think we should also clear admissionChecks.RetryCount fields on deactivation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good catch! Just to sum up here-> we need to clear: RequeueState.Count, RequeueState.RequeueAt, all AdmissionChecks[].RetryCount fields, and all AdmissionChecks[].RequeueAfterSeconds fields.
Having said that, we can't set RequeueState = nil because that causes the validation error with SSA. We need to
clear the fields individually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having said that, we can't set RequeueState = nil because that causes the validation error with SSA. We need to
clear the fields individually.
Hm, ok, but this is surprising because the pre-existing code would do
if wl.Status.RequeueState != nil {
wl.Status.RequeueState = nil
}
didn't work on the current releases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mimowo Yeah, I made minimal changes to make this work. Let me know how you feel about this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks ok. Could you please also prepare a small PR with just this fix, and dedicated test.
I think if this didn't work in the past it is something where we should cherrypick the fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, actually after you modified the ProvisioningRequest controller the fix is no longer needed. I think the fix was needed because the ProvRequest controller was changing the RequeueState using another admission manager.
51b4465 to
0f11ac7
Compare
Implements delayed retry mechanism for admission checks to prevent overwhelming external controllers and reduce control plane churn. Problem: Previously, when admission checks transitioned to Retry state, Kueue would immediately evict workloads and requeue them, causing excessive load on admission check controllers and unnecessary API server churn, particularly when retry conditions persisted predictably. Solution: This implementation adds two new fields to AdmissionCheckState: - requeueAfterSeconds: Specifies minimum wait time before retry - retryCount: Tracks retry attempts per admission check Key Changes: API Changes - Added requeueAfterSeconds and retryCount fields to AdmissionCheckState - Added +kubebuilder:validation:Minimum=0 to both fields (v1beta1 and v1beta2) Controller Changes - Auto-increments retryCount on transition to Retry state - Calculates maximum retry time across all admission checks - Updates workload.status.requeueState.requeueAt with the maximum delay - Workload controller respects delayed retry times - RequeueAfterSeconds values persist across evictions - Refactored backoff calculation to use wait.NewBackoff() pattern Behavior: When multiple admission checks specify different delays, Kueue uses the maximum delay across all checks. Workloads are evicted immediately to release quota, but admission checks maintain their requeueAfterSeconds values, preventing race conditions where fast-responding checks could block slower ones from registering their delays. Refs: KEP-3258 https://github.com/DataDog/kueue/blob/main/keps/3258-delayed-admission-check-retries/README.md Signed-off-by: Sohan Kunkerkar <[email protected]> Co-authored-by: Daniel Henkel <[email protected]>
This change replaces the use of the shared RequeueState field with the new mechanism in the preprovision request admission check. Signed-off-by: Sohan Kunkerkar <[email protected]> Co-authored-by: Daniel Henkel <[email protected]>
0f11ac7 to
5a7d9a4
Compare
| backoff := wait.NewBackoff(time.Duration(backoffBaseSeconds)*time.Second, time.Duration(backoffMaxSeconds)*time.Second, 2, 0.0001) | ||
| waitDuration := backoff.WaitTime(int(requeuingCount)) | ||
|
|
||
| acState.RequeueAfterSeconds = ptr.To(int32(waitDuration.Truncate(time.Second).Seconds())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also increment the RetryCount for ProvisioningRequest for informational purposes, even if we are not using that information directly.
EDIT: Ah I see it is already bumped here: https://github.com/kubernetes-sigs/kueue/pull/7620/files#diff-c703a3b5f20a31c414c5cafc7df771810f8e39ba1e8658d8b9f897bcb8af659cR95-R107
And this assert shows it gets set: https://github.com/kubernetes-sigs/kueue/pull/7620/files#diff-b524dd7e94e695c9007bf094b1ff9cd5a94952214c0db7de004c7ce5bea7ed1bR924-L924
And then incremented: https://github.com/kubernetes-sigs/kueue/pull/7620/files#diff-b524dd7e94e695c9007bf094b1ff9cd5a94952214c0db7de004c7ce5bea7ed1bL1351-R1368
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
Please also address the small follow ups:
|
LGTM label has been added. DetailsGit tree hash: 8081cc7c81df3b68b2123ffc41c4a3c9b508d39d |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, sohankunkerkar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…-sigs#7620) * feat(KEP-3258): implement delayed admission check retries Implements delayed retry mechanism for admission checks to prevent overwhelming external controllers and reduce control plane churn. Problem: Previously, when admission checks transitioned to Retry state, Kueue would immediately evict workloads and requeue them, causing excessive load on admission check controllers and unnecessary API server churn, particularly when retry conditions persisted predictably. Solution: This implementation adds two new fields to AdmissionCheckState: - requeueAfterSeconds: Specifies minimum wait time before retry - retryCount: Tracks retry attempts per admission check Key Changes: API Changes - Added requeueAfterSeconds and retryCount fields to AdmissionCheckState - Added +kubebuilder:validation:Minimum=0 to both fields (v1beta1 and v1beta2) Controller Changes - Auto-increments retryCount on transition to Retry state - Calculates maximum retry time across all admission checks - Updates workload.status.requeueState.requeueAt with the maximum delay - Workload controller respects delayed retry times - RequeueAfterSeconds values persist across evictions - Refactored backoff calculation to use wait.NewBackoff() pattern Behavior: When multiple admission checks specify different delays, Kueue uses the maximum delay across all checks. Workloads are evicted immediately to release quota, but admission checks maintain their requeueAfterSeconds values, preventing race conditions where fast-responding checks could block slower ones from registering their delays. Refs: KEP-3258 https://github.com/DataDog/kueue/blob/main/keps/3258-delayed-admission-check-retries/README.md Signed-off-by: Sohan Kunkerkar <[email protected]> Co-authored-by: Daniel Henkel <[email protected]> * pkg: use delayed ACs in preprovisioning check This change replaces the use of the shared RequeueState field with the new mechanism in the preprovision request admission check. Signed-off-by: Sohan Kunkerkar <[email protected]> Co-authored-by: Daniel Henkel <[email protected]> --------- Signed-off-by: Sohan Kunkerkar <[email protected]> Co-authored-by: Daniel Henkel <[email protected]>
|
/remove-kind api-change /release-note-edit |
Implements delayed retry mechanism for admission checks to prevent overwhelming external controllers and reduce control plane churn.
Problem:
Previously, when admission checks transitioned to Retry state, Kueue would immediately evict workloads and requeue them, causing excessive load on admission check controllers and unnecessary API server churn, particularly when retry conditions persisted predictably.
Solution:
This implementation adds two new fields to AdmissionCheckState:
Key Changes:
API Changes
Controller Changes
Behavior:
When multiple admission checks specify different delays, Kueue uses the maximum delay across all checks. Workloads are evicted immediately to release quota, but admission checks maintain their requeueAfterSeconds values, preventing race conditions where fast-responding checks could block slower ones from registering their delays.
Refs: KEP-3258
https://github.com/kubernetes-sigs/kueue/blob/main/keps/3258-delayed-admission-check-retries/README.md
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #3258
Special notes for your reviewer:
After discussing with @mimowo, taking over #7370 as the original author is on PTO.
Does this PR introduce a user-facing change?