feat: add configurable grace period for transient CreateContainerError #10326
ab-ghosh wants to merge 1 commit into
[APPROVALNOTIFIER] This PR is NOT APPROVED.
When CRI-O is under heavy load, it may fail to create a container within the kubelet's RuntimeRequestTimeout, resulting in CreateContainerError or CreateContainerConfigError with 'context deadline exceeded'. This is a transient error that resolves once CRI-O's queue clears.

Add a new config option 'default-create-container-error-timeout' that allows the TaskRun controller to wait for the container runtime to recover instead of failing immediately. Default is 0 (fail fast), preserving existing behavior.

fix: use %w to wrap error in CreateContainerError config parsing
fd3a620 to 8619dfc
/kind feature
vdemeester
left a comment
Looks good. One note: it's not super clear in the docs that the default value is 0m, and that 0m means the TaskRun fails instantly. We may want to highlight this.
DefaultImagePullBackOffTimeout = 0 * time.Minute
DefaultCreateContainerErrorTimeout = 0 * time.Minute
So by default the timeout is 0, meaning there is no grace period, and thus we keep the current behavior (the TaskRun fails instantly).
Pull request overview
Adds a new config-defaults option to let the TaskRun reconciler tolerate transient container-runtime timeouts (CreateContainerError / CreateContainerConfigError with "context deadline exceeded") by waiting up to a configurable grace period before failing the TaskRun—intended to help when runtimes like CRI-O are under heavy load.
Changes:
- Add `default-create-container-error-timeout` to defaults config parsing and defaults struct.
- Extend TaskRun pod-failure handling to apply a grace period for transient `context deadline exceeded` container create errors.
- Add reconciler tests plus user-facing documentation and example config comments.
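The change list above hinges on telling a transient runtime timeout apart from a permanent configuration error. The following is a rough sketch of that classification, assuming it is done by checking the waiting reason and matching the message text; `waitingState` and `isTransientCreateError` are hypothetical names, not identifiers from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// waitingState is a hypothetical stand-in for the Reason and Message
// fields of a container's waiting state.
type waitingState struct {
	Reason  string
	Message string
}

// isTransientCreateError reports whether the failure looks like a runtime
// timeout rather than a permanent error such as a missing ConfigMap.
func isTransientCreateError(w waitingState) bool {
	switch w.Reason {
	case "CreateContainerError", "CreateContainerConfigError":
		return strings.Contains(w.Message, "context deadline exceeded")
	default:
		return false
	}
}

func main() {
	cases := []waitingState{
		{Reason: "CreateContainerError", Message: "context deadline exceeded"},
		{Reason: "CreateContainerConfigError", Message: `configmap "settings" not found`},
		{Reason: "ImagePullBackOff", Message: "context deadline exceeded"},
	}
	for _, c := range cases {
		fmt.Printf("%s: transient=%t\n", c.Reason, isTransientCreateError(c))
	}
}
```

Matching on message text is brittle if the runtime changes its wording, which may be worth keeping in mind when reviewing the real implementation.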
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/reconciler/taskrun/taskrun.go | Adds grace-period logic when container creation fails with context deadline exceeded. |
| pkg/reconciler/taskrun/taskrun_test.go | Adds unit coverage for the new grace-period behavior. |
| pkg/apis/config/default.go | Introduces DefaultCreateContainerErrorTimeout and parses the new config-defaults key. |
| pkg/apis/config/default_test.go | Updates defaults/equals expectations to include the new field. |
| docs/additional-configs.md | Documents the new default-create-container-error-timeout option. |
| config/config-defaults.yaml | Adds commented example/docs for the new config-defaults key. |
createContainerErrorTimeout := config.FromContextOrDefaults(ctx).Defaults.DefaultCreateContainerErrorTimeout
if createContainerErrorTimeout != 0 {
	p, err := c.podLister.Pods(tr.Namespace).Get(tr.Status.PodName)
	if err != nil {
podConditions := []string{string(corev1.PodInitialized), "PodReadyToStartContainers"}
for _, condition := range p.Status.Conditions {
	if slices.Contains(podConditions, string(condition.Type)) {
		if c.Clock.Since(condition.LastTransitionTime.Time) < createContainerErrorTimeout {
			return false, "", ""
err := testAssets.Controller.Reconciler.Reconcile(testAssets.Ctx, getRunName(taskRun))
if err == nil {
	t.Errorf("expected error when reconciling TaskRun with transient container error: %v", err)
}
Changes
When CRI-O is under heavy load, it may fail to create a container within the kubelet's RuntimeRequestTimeout (default 2 minutes), resulting in `CreateContainerError` or `CreateContainerConfigError` with `"context deadline exceeded"`. This is a transient error; the container runtime will succeed once its queue clears. Currently, the TaskRun controller fails the TaskRun immediately on these errors.

This PR adds a new configurable grace period, `default-create-container-error-timeout` in `config-defaults`, that allows the TaskRun controller to wait for the container runtime to recover instead of failing immediately. The grace period is measured from the pod's `PodInitialized`/`PodReadyToStartContainers` condition timestamp.

- Default is `0` (fail fast), preserving existing behavior
- Only applies to `"context deadline exceeded"`; other `CreateContainerConfigError` cases (missing ConfigMap, bad env vars) still fail immediately

Submitter Checklist
As the author of this PR, please check off the items in this checklist:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep

Release Notes