
CI flakiness: get.helm.sh download timeout + ephemeral-cluster TLS (ERR_CERT_AUTHORITY_INVALID) fail merge queue on green code #6348

@eamonnmoloney

Summary

The Test - Chart Version workflow is non-deterministic on GKE CI: the same commit passed on the PR run and then failed on the merge-queue run of #6324, with zero code changes between them. #6324 (chore(deploy-camunda): strip digest overlay pins shadowed by --extra-values image overrides) touches only Go deploy-camunda logic + unit tests — it cannot affect helm-binary download or ephemeral-cluster TLS issuance. Both failures below are infrastructure/environment flakiness, not regressions.

| Run | Event | Result | Link |
| --- | --- | --- | --- |
| Green | pull_request | ✅ all green | https://github.com/camunda/camunda-platform-helm/actions/runs/26933163217 |
| Red | merge_group | ❌ CI Gate failed | https://github.com/camunda/camunda-platform-helm/actions/runs/27024351392 |

Flaky failure modes observed

1. curl (28) — Helm binary download timeout (blocking)

Job: 8.8 - eske - install → step setup-helm / install for install on gke.

curl -fsSL "https://get.helm.sh/${tarball}" -o "/tmp/${tarball}"
curl: (28) Failed to connect to get.helm.sh port 443 after 133510 ms: Couldn't connect to server
##[error]Process completed with exit code 28.

The runner could not reach get.helm.sh (external egress) and timed out after ~133s. No retry/backoff — a single transient network/DNS hiccup fails the whole install job. This is a toolchain bootstrap flake, fully independent of the chart and the PR.
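A minimal sketch of what retry with exponential backoff around this download could look like. The `retry` helper, the attempt count, and the timeout values are illustrative assumptions, not taken from the existing workflow.

```shell
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the sleep between attempts.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt ${attempt} failed, sleeping ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Possible use in the install step (tarball is set by the existing step;
# the timeout values are assumptions):
# retry 5 2 curl -fsSL --connect-timeout 15 --max-time 120 \
#   "https://get.helm.sh/${tarball}" -o "/tmp/${tarball}"
```

A short `--connect-timeout` matters here: without it, a single attempt can block for over two minutes, as in the log above. curl's own `--retry` and `--retry-delay` flags may be a lighter alternative if a wrapper function is not wanted.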

2. net::ERR_CERT_AUTHORITY_INVALID — ephemeral-cluster TLS not yet trusted (blocking)

Job: 8.9 - keorg - install → Playwright e2e after install. 6 specs failed, all with the same error, and the failure persisted through retry1 and retry2:

Error: page.goto: net::ERR_CERT_AUTHORITY_INVALID
  at https://gke--intg-8-9-gke-keorg.ci.distro.ultrawombat.com/auth
6 failed
[15:48:59] ❌ All Playwright tests failed with code 1

The ingress cert for the ephemeral hostname was not yet issued/trusted when Playwright navigated — a classic ACME/cert-manager (or DNS-propagation) provisioning race on freshly-spun ephemeral clusters. Because every spec and every retry hit the same cert error, this is an environment readiness problem, not a test-logic flake.
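A readiness gate for this race could be sketched as a generic polling helper that runs before Playwright starts. The `wait_until` name, the 600s budget, and the 10s interval are assumptions. The probe relies on curl exiting non-zero while the certificate chain fails verification, which may not match Chromium's trust decisions exactly.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Polls COMMAND until it succeeds; fails once TIMEOUT_SECONDS have elapsed.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "wait_until: timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Possible use before the e2e step. Without -f, curl succeeds on any HTTP
# status as long as the TLS handshake and chain verification pass:
# host="gke--intg-8-9-gke-keorg.ci.distro.ultrawombat.com"
# wait_until 600 10 curl -sS --max-time 10 -o /dev/null "https://${host}/auth"
```

If cert-manager issues the certificate (an assumption here), `kubectl wait --for=condition=Ready certificate/<name> -n <namespace> --timeout=600s` would be a more direct check, with the certificate name and namespace filled in from the chart.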

Non-blocking noise (for completeness)

  • 8.7 / 8.10 - shde2 - [shadow] Full e2e also failed but are continue-on-error and do not gate the merge queue (known: shadow e2e is non-blocking).

Latent bug uncovered (masks diagnostics)

When the install job failed, the failed-pods-info action errored instead of dumping pod state:

Error: flags cannot be placed before plugin name: -n
##[error]Process completed with exit code 1.

The kubectl/helm invocation in that action places -n before the subcommand — worth fixing so failure triage isn't blind.
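The action's source is not shown in this issue, so the following is only a sketch of a flag ordering that avoids this error, assuming the action shells out to kubectl. The `dump_failed_pods` function and the `KUBECTL` override are hypothetical names. Putting the subcommand first and rejecting an empty namespace keeps a bare `-n` from ever reaching kubectl.

```shell
# KUBECTL can be overridden, e.g. for dry runs; defaults to the real binary.
KUBECTL="${KUBECTL:-kubectl}"

# dump_failed_pods NAMESPACE
# Lists pods that are not in the Running phase, with flags placed after
# the subcommand.
dump_failed_pods() {
  local namespace="${1:-}"
  if [ -z "$namespace" ]; then
    echo "dump_failed_pods: namespace must not be empty" >&2
    return 2
  fi
  "$KUBECTL" get pods --namespace "$namespace" \
    --field-selector 'status.phase!=Running' -o wide
}

# Example: dump_failed_pods "$NAMESPACE"
```

Setting `KUBECTL=echo` prints the command line instead of running it, which is a cheap way to check the ordering in a unit test for the action.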

Why this matters

These flakes gate the merge queue, so a perfectly good PR (#6324) was bounced by infra noise after already being green on its PR run. This erodes trust in CI Gate and wastes GKE re-runs.

Proposed investigation / fixes

  1. Helm download resiliency — add retries w/ backoff to the helm-binary fetch (or use azure/setup-helm with caching / an internal mirror) so a single get.helm.sh blip doesn't fail the job. Consider caching the helm binary on the runner image.
  2. Wait for cluster TLS readiness before E2E: gate the Playwright step on the ingress cert being issued and trusted (poll the cert/Ready condition, or probe https://<host>/ until the chain validates) instead of navigating immediately. Cross-check overlap with #6338 (CNPG not-Ready) and #6344 (flaky eske/docstr smoke).
  3. Fix failed-pods-info kubectl flag ordering so pod diagnostics actually print on failure.
  4. Consider quantifying flake rate (re-run the same SHA N times) to prioritize.
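For item 4, the tally itself can be sketched as a small filter over run conclusions, one per line. The `flake_rate` name is an assumption, and the `gh` invocation in the comment is one possible way to feed it, not something the repository already does.

```shell
# flake_rate: reads one run conclusion per line on stdin ("success" or
# "failure"; other values such as "cancelled" are ignored) and prints the
# failure rate.
flake_rate() {
  awk '
    $1 == "success" { ok++ }
    $1 == "failure" { bad++ }
    END {
      total = ok + bad
      if (total == 0) { print "no completed runs"; exit 1 }
      printf "%d/%d failed (%.1f%%)\n", bad, total, 100 * bad / total
    }'
}

# Possible use with the GitHub CLI after re-running the same SHA N times:
# gh run list --commit "$sha" --workflow "Test - Chart Version" \
#   --json conclusion --jq '.[].conclusion' | flake_rate
```

Because the code is identical across runs of one SHA, any failures counted this way can be attributed to the environment rather than the change.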

Repro / evidence

Metadata

Labels: area/ci, area/test, component/helm, kind/bug, likelihood/high, severity/mid, triage:completed

Urgency: next
