KEP-3322: add a new field maxRestartTimesOnFailure to podSpec #3339
base: master
Conversation
cc @wojtek-t PTAL, thanks a lot.
We're generally doing PRR once you already have SIG approval.
cc @dchen1107 for sig-node side review, also cc @hex108
This KEP is especially helpful for pods that hold a large resource set, such as JVM-based pods. We give these kinds of pods a high resource limit to speed up their startup; an Always restart policy makes this worse and can even crash the node. In the past, daemon control tools like supervisorctl had a startretries mechanism to limit the maximum number of startup retries, but for Kubernetes deployments there is no replacement for it.
It seems there is no SIG-level agreement on it - I made a quick pass and added some comments, but please ping me only once you have SIG-level approval.
Pros:
* BackoffLimitPerIndex can reuse this functionality and no longer needs to consider the restart times per index.
Specifically, it can avoid using the annotation anymore, and works as a higher-level control by watching the pod status.
I actually agree with Aldo.
It's not an implementation detail - it's a fundamental thing of "pod recreations". If we want to track something across pod recreations (which is the case for jobs), maxRestarts won't solve it on its own - but actually may help with it.
Add a new field `maxRestartTimesOnFailure` to the `pod.spec`. `maxRestartTimesOnFailure` only applies
+1 to Tim - I think that generalizing it to "Always" is natural and instead of making the API narrow, let's make it more generic.
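For concreteness, here is a rough sketch of what the generalized field could look like on `PodSpec`. The field name comes from this KEP, but the Go doc comments and the extension to `Always` are assumptions based on the discussion above, not merged API:

```go
// Sketch only: not the actual core/v1 types, just an illustration of the
// proposed field. Applying the cap to the Always policy reflects the review
// discussion above and is an assumption, not decided API.
package v1sketch

// RestartPolicy mirrors the core/v1 enum for illustration.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

// PodSpec shows only the fields relevant to this KEP; everything else is elided.
type PodSpec struct {
	// RestartPolicy for all containers within the pod.
	RestartPolicy RestartPolicy `json:"restartPolicy,omitempty"`

	// MaxRestartTimesOnFailure caps how many times the kubelet restarts failed
	// containers in this pod. nil preserves today's unlimited behavior.
	// +optional
	MaxRestartTimesOnFailure *int32 `json:"maxRestartTimesOnFailure,omitempty"`
}
```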
Because the kubelet will not upgrade/downgrade until the api-server is ready, this will not impact
I don't understand it.
FWIW - this section isn't necessary for Alpha, so given time bounds you may want to delete your answers from the rollout and monitoring sections, as their answers are controversial...
What I mean here is that when upgrading api-servers, we'll wait until all the apiservers are ready, then upgrade the kubelet. So if feature gates are enabled only on some apiservers, we'll do nothing. Is this reasonable? Or is what we want here all the possibilities rather than the best practices, since it says "as paranoid as possible"? cc @wojtek-t
Removed for alpha
I would just comment out this section.
reviewers:
- TBD
approvers:
- TBD
We need to find an approver. Without an approver defined we are unlikely to be able to take it.
@mrunalp @derekwaynecarr @dchen1107 any of you want to take it?
- If Pod's `RestartPolicy` is `Always` or `Never`, `maxRestartTimesOnFailure` defaults to nil, and will not apply.
- If Pod's `RestartPolicy` is `OnFailure`, `maxRestartTimesOnFailure` also defaults to nil, which means infinite restart times for backwards compatibility.
Two questions:
- if Pod's `RestartPolicy` is `OnFailure` and `maxRestartTimesOnFailure` is 0, is it invalid? or does it mean `Never`?
- is the `maxRestartTimesOnFailure` editable for the pod?
IMO:
- if `restartPolicy` is "OnFailure" and `maxRestarts` is 0, it is effectively "Never". I don't think we need to special-case 0 to be a failure, but I don't feel strongly and it could be argued either way.
- let's start with "no" and see if there's really a need?
Yeah, it feels like never to me too.
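To make the semantics being converged on here concrete (nil = unlimited, 0 effectively `Never`), here is a minimal sketch. The helper name is hypothetical, and applying the cap to `Always` follows the earlier suggestion rather than the current KEP text:

```go
package restartsketch

// allowRestart is a hypothetical helper illustrating the discussed semantics:
// nil keeps today's unlimited restarts, and 0 behaves like restartPolicy=Never
// rather than being rejected by validation.
func allowRestart(restartPolicy string, maxRestartTimesOnFailure *int32, restartsSoFar int32) bool {
	switch restartPolicy {
	case "Never":
		return false
	case "Always", "OnFailure":
		if maxRestartTimesOnFailure == nil {
			// nil: unlimited restarts, backwards compatible.
			return true
		}
		// A value of 0 never allows a restart, i.e. effectively "Never".
		return restartsSoFar < *maxRestartTimesOnFailure
	default:
		return false
	}
}
```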
Deadline is in ~8 hours -- is this still hoping to land?
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: drinktee, kerthcet. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Otherwise we could end up in another scenario where features work independently, but when paired together are mostly useless or have very strange semantics that are hard for users to understand, and hard to maintain.
I like this analysis.
Pros:
* Reduce the maintenance cost of Job API

Cons:
But it's not the same pod. Job is a higher-level construct.
Yes, it feels like a conflation of a pod-level `maxRestarts` and a job-level `maxRecreates` or something.
In runtime, we'll check the sum of `RestartCount` of all containers [`Status`](https://github.com/kubernetes/kubernetes/blob/451e1fa8bcff38b87504eebd414948e505526d01/pkg/kubelet/container/runtime.go#L306-L335)
`Pod.spec.containers[].maxRestarts` reads well to me.
But it's a little weird because Containers also have a `RestartPolicy` and the only allowed value is `Always`. It would stop being intuitive how the attribute interactions actually work, because now we're intermingling pod-level attributes with container-level attributes.
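To make the "sum of `RestartCount`" check from the quoted text concrete, here is a minimal sketch over the public `PodStatus` type (the kubelet-internal runtime `Status` linked above carries an analogous field); the helper names are made up:

```go
package restartsketch

import corev1 "k8s.io/api/core/v1"

// totalRestarts sums RestartCount across the regular containers in a pod's
// status, mirroring at the API level the runtime-side check described in the
// quoted text. Init and ephemeral containers are ignored for simplicity.
func totalRestarts(status corev1.PodStatus) int32 {
	var sum int32
	for _, cs := range status.ContainerStatuses {
		sum += cs.RestartCount
	}
	return sum
}

// exceeded reports whether the summed restarts have reached the configured
// maxRestartTimesOnFailure (nil means no cap).
func exceeded(status corev1.PodStatus, maxRestartTimesOnFailure *int32) bool {
	return maxRestartTimesOnFailure != nil && totalRestarts(status) >= *maxRestartTimesOnFailure
}
```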
The kubelet version should be consistent with the api-server version, or this feature
Does this enhancement involve coordinating behavior in the control plane and in the kubelet? How does an n-2 kubelet without this feature available behave when this feature is used?
In other words, if we have a kubelet at 1.26 and a kube-apiserver at 1.29 with the feature enabled, what is the expected behavior?
Will any other components on the node change? For example, changes to CSI, CRI or CNI may require updating that component before the kubelet.
I believe the answer to this should be no.
When we set restartPolicy=OnFailure and set a specific maxRestartTimesOnFailure number, but the Pod's restart count is not equal to that number.
Or we can refer to the metric `pod_exceed_restart_times_size` for comparison.
Is this a new or existing metric?
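(It would be a new metric.) As a sketch of how it might be registered in the kubelet, using the component-base metrics helpers the kubelet already uses; the metric type, help text, and subsystem are assumptions since the KEP only names the metric:

```go
package metricsketch

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// podExceedRestartTimes is a sketch of the proposed metric. A counter is
// assumed here (incremented each time a pod hits its restart cap); the KEP
// does not spell out the type or help text, so these are illustrative.
var podExceedRestartTimes = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem:      "kubelet",
		Name:           "pod_exceed_restart_times_size",
		Help:           "Number of pods whose containers reached maxRestartTimesOnFailure.",
		StabilityLevel: metrics.ALPHA,
	},
)

func init() {
	legacyregistry.MustRegister(podExceedRestartTimes)
}
```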
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
hey, it's 2024, any update? /lifecycle frozen
@adampl: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
@kerthcet Are you going to work on this?
FYI on a somewhat related KEP #4603
@alculquicondor I'm trying to see if this is a KEP that sig-node should help review for 1.32.
@lauralorenz presented the plan for CrashLoopBackOff. I think that any kind of capping of max restart times is out of scope for #4603.
Not yet; if you're interested, you can take it over. I have other work keeping me busy right now.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
I'm not convinced that this should be a kubelet & pod-level concern, but whether or not we eventually want it on the pod, it seems like a good candidate for prototyping out-of-tree. The high-level idea would be a controller that watches containers for restarts and deletes the pod when a policy deems it necessary (is deletion sufficient to mark a job as failed?). The API could use annotations and/or a CRD.
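For concreteness, here is a minimal sketch of such an out-of-tree prototype. The annotation name, the naive polling loop (instead of an informer), and the delete-on-exceed policy are all illustrative assumptions, not part of any proposal:

```go
// Sketch of the out-of-tree prototype suggested above: poll pods that carry a
// (hypothetical) annotation, sum container restarts, and delete the pod once
// the annotated limit is reached. A real prototype would use informers and a
// proper policy API (annotations or a CRD), as noted in the comment.
package main

import (
	"context"
	"strconv"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/klog/v2"
)

// Hypothetical annotation used only for this sketch.
const maxRestartsAnnotation = "example.com/max-restart-times-on-failure"

func totalRestarts(status corev1.PodStatus) int32 {
	var sum int32
	for _, cs := range status.ContainerStatuses {
		sum += cs.RestartCount
	}
	return sum
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			klog.Error(err)
			time.Sleep(30 * time.Second)
			continue
		}
		for _, pod := range pods.Items {
			limitStr, ok := pod.Annotations[maxRestartsAnnotation]
			if !ok {
				continue
			}
			limit, err := strconv.Atoi(limitStr)
			if err != nil || limit < 0 {
				continue
			}
			if totalRestarts(pod.Status) >= int32(limit) {
				klog.Infof("deleting pod %s/%s: restart limit %d reached", pod.Namespace, pod.Name, limit)
				if err := client.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{}); err != nil {
					klog.Error(err)
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```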
Signed-off-by: kerthcet [email protected]