
feat: add configurable grace period for transient CreateContainerError#10326

Open
ab-ghosh wants to merge 1 commit into
tektoncd:mainfrom
ab-ghosh:feat/create-container-error-timeout

Conversation

@ab-ghosh

Member

Changes

When CRI-O is under heavy load, it may fail to create a container within the kubelet's RuntimeRequestTimeout (default 2 minutes), resulting in CreateContainerError or CreateContainerConfigError with "context deadline exceeded". This error is transient: the container runtime succeeds once its queue clears. Currently, the TaskRun controller fails the TaskRun immediately on these errors.
This PR adds a new configurable grace period, `default-create-container-error-timeout`, in `config-defaults`. It allows the TaskRun controller to wait for the container runtime to recover instead of failing immediately. The grace period is measured from the pod's PodInitialized / PodReadyToStartContainers condition timestamp.

  • Default is 0 (fail fast), preserving existing behavior
  • Only applies when the error message contains "context deadline exceeded" — other CreateContainerConfigError cases (missing ConfigMap, bad env vars) still fail immediately
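
The decision described above can be sketched in isolation. This is a minimal illustration rather than the PR's actual code: `podCondition` and `withinGracePeriod` are hypothetical names, the pod condition is reduced to a plain struct so the example needs only the standard library, and the current time is passed in explicitly instead of coming from the reconciler's clock.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// podCondition is a hypothetical stand-in for corev1.PodCondition,
// reduced to the two fields this sketch needs.
type podCondition struct {
	Type               string
	LastTransitionTime time.Time
}

// graceConditions lists the condition types the grace period is measured from.
var graceConditions = map[string]bool{
	"Initialized":               true, // string value of corev1.PodInitialized
	"PodReadyToStartContainers": true,
}

// withinGracePeriod reports whether the controller should keep waiting
// instead of failing the TaskRun. It returns true only when a grace period
// is configured, the error message is the transient "context deadline
// exceeded" case, and a relevant condition transitioned less than grace ago.
func withinGracePeriod(message string, conds []podCondition, now time.Time, grace time.Duration) bool {
	if grace == 0 {
		return false // default: fail fast
	}
	if !strings.Contains(message, "context deadline exceeded") {
		return false // e.g. missing ConfigMap: still fails immediately
	}
	for _, c := range conds {
		if graceConditions[c.Type] && now.Sub(c.LastTransitionTime) < grace {
			return true
		}
	}
	return false
}

func main() {
	now := time.Now()
	conds := []podCondition{
		{Type: "Initialized", LastTransitionTime: now.Add(-30 * time.Second)},
	}
	fmt.Println(withinGracePeriod("context deadline exceeded", conds, now, 2*time.Minute)) // true
	fmt.Println(withinGracePeriod(`configmap "x" not found`, conds, now, 2*time.Minute))   // false
	fmt.Println(withinGracePeriod("context deadline exceeded", conds, now, 0))             // false
}
```

Passing `now` as a parameter keeps the check deterministic in tests, which is the same reason a reconciler would use an injectable clock.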

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Add `default-create-container-error-timeout` configuration option in `config-defaults` to provide a grace period before failing TaskRuns on transient `CreateContainerError`/`CreateContainerConfigError` with "context deadline exceeded". Default is 0 (fail fast, preserving existing behavior).

@tekton-robot tekton-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jun 22, 2026
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 22, 2026
@tekton-robot

Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign khrm after the PR has been reviewed.
You can assign the PR to them by writing /assign @khrm in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

  When CRI-O is under heavy load, it may fail to create a container within
  the kubelet's RuntimeRequestTimeout, resulting in CreateContainerError or
  CreateContainerConfigError with 'context deadline exceeded'. This is a
  transient error that resolves once CRI-O's queue clears.

  Add a new config option 'default-create-container-error-timeout' that
  allows the TaskRun controller to wait for the container runtime to
  recover instead of failing immediately. Default is 0 (fail fast),
  preserving existing behavior.

fix: use %w to wrap error in CreateContainerError config parsing
@ab-ghosh ab-ghosh force-pushed the feat/create-container-error-timeout branch from fd3a620 to 8619dfc Compare June 22, 2026 14:17
@ab-ghosh

Member Author

/kind feature

@tekton-robot tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 22, 2026
@vdemeester vdemeester self-assigned this Jun 22, 2026
@vdemeester vdemeester requested a review from Copilot June 22, 2026 14:20

@vdemeester vdemeester left a comment

Member

Looks good. One note: the docs don't make it clear that the default value is 0m, and that 0m means the TaskRun fails instantly. We may want to highlight this.


DefaultImagePullBackOffTimeout = 0 * time.Minute

DefaultCreateContainerErrorTimeout = 0 * time.Minute

Member

So by default we have no grace period, and thus we keep the current behavior (fail instantly).
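
To make the zero-default behavior concrete, here is a hedged sketch of how such a key could be read from the `config-defaults` data. `parseCreateContainerErrorTimeout` is a hypothetical helper, not the function in `pkg/apis/config/default.go`, and it assumes the value is a Go duration string such as `5m`. A missing key yields 0 (fail instantly), and a parse failure is wrapped with `%w` so callers can still inspect the underlying error.

```go
package main

import (
	"fmt"
	"time"
)

// createContainerErrorTimeoutKey is the config-defaults key named in this PR.
const createContainerErrorTimeoutKey = "default-create-container-error-timeout"

// parseCreateContainerErrorTimeout is a hypothetical helper: it returns the
// configured grace period, or 0 (fail instantly) when the key is absent.
func parseCreateContainerErrorTimeout(data map[string]string) (time.Duration, error) {
	raw, ok := data[createContainerErrorTimeoutKey]
	if !ok {
		return 0, nil // no grace period: existing fail-fast behavior
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		// %w keeps the original error available to errors.Is / errors.As.
		return 0, fmt.Errorf("failed parsing %s %q: %w", createContainerErrorTimeoutKey, raw, err)
	}
	return d, nil
}

func main() {
	for _, data := range []map[string]string{
		{}, // key not set
		{createContainerErrorTimeoutKey: "5m"},
		{createContainerErrorTimeoutKey: "soon"}, // not a valid duration
	} {
		d, err := parseCreateContainerErrorTimeout(data)
		fmt.Println(d, err)
	}
}
```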

Copilot AI left a comment


Pull request overview

Adds a new config-defaults option to let the TaskRun reconciler tolerate transient container-runtime timeouts (CreateContainerError / CreateContainerConfigError with "context deadline exceeded") by waiting up to a configurable grace period before failing the TaskRun—intended to help when runtimes like CRI-O are under heavy load.

Changes:

  • Add default-create-container-error-timeout to defaults config parsing and defaults struct.
  • Extend TaskRun pod-failure handling to apply a grace period for transient context deadline exceeded container create errors.
  • Add reconciler tests plus user-facing documentation and example config comments.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pkg/reconciler/taskrun/taskrun.go | Adds grace-period logic when container creation fails with context deadline exceeded. |
| pkg/reconciler/taskrun/taskrun_test.go | Adds unit coverage for the new grace-period behavior. |
| pkg/apis/config/default.go | Introduces DefaultCreateContainerErrorTimeout and parses the new config-defaults key. |
| pkg/apis/config/default_test.go | Updates defaults/equals expectations to include the new field. |
| docs/additional-configs.md | Documents the new default-create-container-error-timeout option. |
| config/config-defaults.yaml | Adds commented example/docs for the new config-defaults key. |


Comment on lines +361 to +364

	createContainerErrorTimeout := config.FromContextOrDefaults(ctx).Defaults.DefaultCreateContainerErrorTimeout
	if createContainerErrorTimeout != 0 {
		p, err := c.podLister.Pods(tr.Namespace).Get(tr.Status.PodName)
		if err != nil {

Comment on lines +368 to +372

	podConditions := []string{string(corev1.PodInitialized), "PodReadyToStartContainers"}
	for _, condition := range p.Status.Conditions {
		if slices.Contains(podConditions, string(condition.Type)) {
			if c.Clock.Since(condition.LastTransitionTime.Time) < createContainerErrorTimeout {
				return false, "", ""

Comment on lines +3402 to +3405

	err := testAssets.Controller.Reconciler.Reconcile(testAssets.Ctx, getRunName(taskRun))
	if err == nil {
		t.Errorf("expected error when reconciling TaskRun with transient container error: %v", err)
	}

Labels

kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.


4 participants