You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The Test - Chart Version workflow is non-deterministic on GKE CI: the same commit passed on the PR run and then failed on the merge-queue run of #6324, with zero code changes between them. #6324 (chore(deploy-camunda): strip digest overlay pins shadowed by --extra-values image overrides) touches only Go deploy-camunda logic + unit tests — it cannot affect helm-binary download or ephemeral-cluster TLS issuance. Both failures below are infrastructure/environment flakiness, not regressions.
Job: 8.8 - eske - install → step setup-helm / install for install on gke.
curl -fsSL "https://get.helm.sh/${tarball}" -o "/tmp/${tarball}"
curl: (28) Failed to connect to get.helm.sh port 443 after 133510 ms: Couldn't connect to server
##[error]Process completed with exit code 28.
The runner could not reach get.helm.sh (external egress) and timed out after ~133s. No retry/backoff — a single transient network/DNS hiccup fails the whole install job. This is a toolchain bootstrap flake, fully independent of the chart and the PR.
2. net::ERR_CERT_AUTHORITY_INVALID — ephemeral-cluster TLS not yet trusted (blocking)
Job: 8.9 - keorg - install → Playwright e2e after install. 6 specs failed, all with the same error, and the failure persisted through retry1 and retry2:
Error: page.goto: net::ERR_CERT_AUTHORITY_INVALID
at https://gke--intg-8-9-gke-keorg.ci.distro.ultrawombat.com/auth
6 failed
[15:48:59] ❌ All Playwright tests failed with code 1
The ingress cert for the ephemeral hostname was not yet issued/trusted when Playwright navigated — a classic ACME/cert-manager (or DNS-propagation) provisioning race on freshly-spun ephemeral clusters. Because every spec and every retry hit the same cert error, this is an environment readiness problem, not a test-logic flake.
Non-blocking noise (for completeness)
8.7 / 8.10 - shde2 - [shadow] Full e2e also failed but are continue-on-error and do not gate the merge queue (known: shadow e2e is non-blocking).
Latent bug uncovered (masks diagnostics)
When the install job failed, the failed-pods-info action errored instead of dumping pod state:
Error: flags cannot be placed before plugin name: -n
##[error]Process completed with exit code 1.
The kubectl/helm invocation in that action places -n before the subcommand — worth fixing so failure triage isn't blind.
Why this matters
These flakes gate the merge queue, so a perfectly good PR (#6324) was bounced by infra noise after already being green on its PR run. This erodes trust in CI Gate and wastes GKE re-runs.
Proposed investigation / fixes
Helm download resiliency — add retries w/ backoff to the helm-binary fetch (or use azure/setup-helm with caching / an internal mirror) so a single get.helm.sh blip doesn't fail the job. Consider caching the helm binary on the runner image.
Summary
The
Test - Chart Versionworkflow is non-deterministic on GKE CI: the same commit passed on the PR run and then failed on the merge-queue run of #6324, with zero code changes between them. #6324 (chore(deploy-camunda): strip digest overlay pins shadowed by --extra-values image overrides) touches only Godeploy-camundalogic + unit tests — it cannot affect helm-binary download or ephemeral-cluster TLS issuance. Both failures below are infrastructure/environment flakiness, not regressions.pull_requestmerge_groupFlaky failure modes observed
1.
curl (28)— Helm binary download timeout (blocking)Job:
8.8 - eske - install→ stepsetup-helm/ install for install on gke.The runner could not reach
get.helm.sh(external egress) and timed out after ~133s. No retry/backoff — a single transient network/DNS hiccup fails the whole install job. This is a toolchain bootstrap flake, fully independent of the chart and the PR.2.
net::ERR_CERT_AUTHORITY_INVALID— ephemeral-cluster TLS not yet trusted (blocking)Job:
8.9 - keorg - install→Playwright e2e after install. 6 specs failed, all with the same error, and the failure persisted through retry1 and retry2:The ingress cert for the ephemeral hostname was not yet issued/trusted when Playwright navigated — a classic ACME/cert-manager (or DNS-propagation) provisioning race on freshly-spun ephemeral clusters. Because every spec and every retry hit the same cert error, this is an environment readiness problem, not a test-logic flake.
Non-blocking noise (for completeness)
8.7 / 8.10 - shde2 - [shadow] Full e2ealso failed but arecontinue-on-errorand do not gate the merge queue (known: shadow e2e is non-blocking).Latent bug uncovered (masks diagnostics)
When the install job failed, the
failed-pods-infoaction errored instead of dumping pod state:The kubectl/helm invocation in that action places
-nbefore the subcommand — worth fixing so failure triage isn't blind.Why this matters
These flakes gate the merge queue, so a perfectly good PR (#6324) was bounced by infra noise after already being green on its PR run. This erodes trust in CI Gate and wastes GKE re-runs.
Proposed investigation / fixes
azure/setup-helmwith caching / an internal mirror) so a singleget.helm.shblip doesn't fail the job. Consider caching the helm binary on the runner image.https://<host>/until the chain validates) instead of navigating immediately. Cross-check overlap with CI flakiness: CloudNativePG postgresql-cluster fails to reach Ready on GKE CI (operator not reconciling) #6338 (CNPG not-Ready) and test(8.8): investigate flaky eske + docstr Playwright smoke tests #6344 (flaky eske/docstr smoke).failed-pods-infokubectl flag ordering so pod diagnostics actually print on failure.Repro / evidence