Patch release automation fails under concurrency: publish-images tasks OOM / hit 60m timeout #3480

Description

@vdemeester

Summary

When multiple release-v* patch releases are triggered close together (e.g. a batched patch-release.yaml / payload bump across all maintained branches), the operator release automation reliably fails. On 2026-06-04, all 5 patch releases (v0.74.2, v0.76.1, v0.77.1, v0.78.2, v0.79.2) failed.

Symptoms

All 5 runs passed precheck, unit-tests, and build-test, then failed on the two publish-images tasks:

  • 9/10 publish-images-platform-{kubernetes,openshift} tasks → TaskRunTimeout at exactly 60 min
  • 1/10 → InitContainerOOM (injected prepare init container, exit 137 — node-level memory pressure)

Root cause

  1. No concurrency_limit on the Repository CR (tektoncd-operator, in plumbing), so PAC runs all matched patch PipelineRuns simultaneously.
  2. Each run launches 2 multi-arch ko builds (linux/amd64,arm64,s390x,ppc64le, non-amd64 via qemu emulation) → 10 heavy emulated builds in parallel on a 4-node (~48 vCPU / ~96 GB) cluster → CPU/memory saturation.
  3. The PipelineRun sets timeouts.pipeline: 3h but no per-task timeout, so each TaskRun inherits the cluster default default-timeout-minutes: 60 — too short for a starved emulated build. Hence the uniform 1h timeouts.

Proposed fix

  • plumbing — add concurrency_limit to the operator Repository CR (repositories/operator.yaml) to serialize patch releases (e.g. 1 or 2). Tracked separately in plumbing; will cross-link.
  • operator — add a per-task timeouts.tasks (or a taskRunSpecs timeout) for the publish tasks in .tekton/release-patch.yaml (e.g. 90m–2h) so a single build fits within the 3h pipeline budget.
  • (optional) stagger the .github/workflows/patch-release.yaml trigger; add memory requests to the run-kustomize-ko step in tekton/build-publish-images-manifests.yaml so the scheduler spreads builds across nodes.
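
A minimal sketch of the first two changes, not the actual manifests. The Repository name, namespace, and the publish task names come from this issue; the repository URL, the taskRef name, the metadata name, and the chosen values (limit of 1, 2h timeout) are assumptions for illustration. The second document uses the timeout field on the pipeline task entry, since timeouts.tasks on a PipelineRun bounds the tasks section as a whole rather than each task individually.

```yaml
# plumbing: repositories/operator.yaml (sketch)
apiVersion: pipelinesascode.tekton.dev/v1alpha1
kind: Repository
metadata:
  name: tektoncd-operator
  namespace: releases-operator                # assumed: namespace where the PipelineRuns ran
spec:
  url: https://github.com/tektoncd/operator   # assumed URL
  concurrency_limit: 1                        # PAC queues further matched runs instead of starting them all
---
# operator: .tekton/release-patch.yaml (sketch, only the relevant fields)
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  name: release-patch                         # hypothetical name
spec:
  timeouts:
    pipeline: 3h                              # existing overall budget
  pipelineSpec:
    tasks:
      - name: publish-images-platform-kubernetes
        timeout: 2h                           # explicit per-task timeout instead of the 60m cluster default
        taskRef:
          name: build-publish-images-manifests   # hypothetical task name
      - name: publish-images-platform-openshift
        timeout: 2h
        taskRef:
          name: build-publish-images-manifests   # hypothetical task name
```

With concurrency_limit: 1, the five patch releases would run one after another, so the cluster would carry at most 2 emulated builds at a time instead of 10. A limit of 2 would allow 4.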

Evidence

  • PipelineRuns release-patch-{hxmtj,l6d9v,nbjmp,v6b8q,gr2tm} in namespace releases-operator, 2026-06-04 ~13:00–14:00.
  • 9/10 publish TaskRuns failed TaskRunTimeout exactly 60 min after start; 1 failed InitContainerOOM (prepare init container, exit 137).

Notes

This is independent of the separate versionTag CEL bug affecting the minor/initial release path (.tekton/release.yaml), which requires PAC ≥ v0.47.0 for the cel: ... .replace(...) expression to evaluate.
