CI flakiness: get.helm.sh download timeout + ephemeral-cluster TLS (ERR_CERT_AUTHORITY_INVALID) fail merge queue on green code #6348

## Summary

The `Test - Chart Version` workflow is non-deterministic on GKE CI: the **same commit** passed on the PR run and then failed on the merge-queue run of #6324, with **zero code changes** between them. #6324 (`chore(deploy-camunda): strip digest overlay pins shadowed by --extra-values image overrides`) touches only Go `deploy-camunda` logic + unit tests — it cannot affect helm-binary download or ephemeral-cluster TLS issuance. Both failures below are **infrastructure/environment flakiness**, not regressions.

| Run | Event | Result | Link |
|-----|-------|--------|------|
| Green | `pull_request` | ✅ all green | https://github.com/camunda/camunda-platform-helm/actions/runs/26933163217 |
| Red | `merge_group` | ❌ CI Gate failed | https://github.com/camunda/camunda-platform-helm/actions/runs/27024351392 |

## Flaky failure modes observed

### 1. `curl (28)` — Helm binary download timeout (blocking)
Job: **`8.8 - eske - install`** → step `setup-helm` (part of the install on GKE).

```
curl -fsSL "https://get.helm.sh/${tarball}" -o "/tmp/${tarball}"
curl: (28) Failed to connect to get.helm.sh port 443 after 133510 ms: Couldn't connect to server
##[error]Process completed with exit code 28.
```

The runner could not reach `get.helm.sh` (external egress) and gave up after ~133 s (133510 ms). The fetch has no retry or backoff, so a single transient network/DNS hiccup fails the whole install job. This is a **toolchain bootstrap** flake, fully independent of the chart and the PR.
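A minimal sketch of what retry with backoff around the download could look like. The `retry` helper, its argument order, and the attempt/delay values are illustrative assumptions, not the workflow's current code. curl's built-in `--retry`, `--retry-delay`, and `--connect-timeout` options may be enough on their own; the wrapper is shown because it also works for other bootstrap commands.

```shell
# retry <max_attempts> <initial_delay_seconds> <command...>
# Hypothetical helper: runs the command until it succeeds, doubling the
# delay after each failed attempt, and gives up after max_attempts.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt ${attempt} failed, sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Assumed usage in the setup step (values are placeholders):
# retry 5 2 curl -fsSL --connect-timeout 15 --max-time 120 \
#   "https://get.helm.sh/${tarball}" -o "/tmp/${tarball}"
```

Capping the connect time keeps each failed attempt short, so five attempts still finish well inside the ~133 s that the single attempt consumed in the red run.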

### 2. `net::ERR_CERT_AUTHORITY_INVALID` — ephemeral-cluster TLS not yet trusted (blocking)
Job: **`8.9 - keorg - install`** → `Playwright e2e after install`. 6 specs failed, **all** with the same error, and the failure **persisted through retry1 and retry2**:

```
Error: page.goto: net::ERR_CERT_AUTHORITY_INVALID
  at https://gke--intg-8-9-gke-keorg.ci.distro.ultrawombat.com/auth
6 failed
[15:48:59] ❌ All Playwright tests failed with code 1
```

The ingress cert for the ephemeral hostname was not yet issued/trusted when Playwright navigated — a classic ACME/cert-manager (or DNS-propagation) provisioning race on freshly-spun ephemeral clusters. Because every spec and every retry hit the same cert error, this is an **environment readiness** problem, not a test-logic flake.
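One way to gate the E2E step on TLS readiness is to probe the ingress host until the certificate chain validates. The sketch below is an assumption about how such a gate might look: `probe_tls`, `wait_for_tls`, the `TLS_PROBE` override, and the timeout values are hypothetical names and defaults. Note that curl validates against the runner's system CA store, which may differ from the trust store used by the Playwright browser, so a passing probe is a strong hint rather than a guarantee.

```shell
# probe_tls <host>: succeeds once the TLS handshake completes and the chain
# validates. Without -f, HTTP error statuses still exit 0, so only
# connection and certificate problems count as "not ready".
probe_tls() {
  curl --silent --show-error --output /dev/null \
    --connect-timeout 10 --max-time 20 "https://$1/"
}

# wait_for_tls <host> [timeout_seconds] [interval_seconds]
# Polls the probe until it succeeds or the timeout elapses.
# TLS_PROBE can point at a different probe command (used here for testing).
wait_for_tls() {
  local host="$1" timeout="${2:-600}" interval="${3:-10}"
  local probe="${TLS_PROBE:-probe_tls}"
  local deadline
  deadline=$(( $(date +%s) + timeout ))
  until "$probe" "$host"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "TLS for ${host} still not trusted after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
  echo "TLS chain for ${host} validates"
}

# Assumed usage before the Playwright step:
# wait_for_tls "gke--intg-8-9-gke-keorg.ci.distro.ultrawombat.com" 600 10
```

If the certificates are issued by cert-manager (not confirmed here), `kubectl wait --for=condition=Ready certificate/<name> --namespace <ns> --timeout=600s` would be a complementary check on the issuance side.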

### Non-blocking noise (for completeness)
- The `8.7 / 8.10 - shde2 - [shadow] Full e2e` jobs also failed, but they are marked `continue-on-error` and do not gate the merge queue (known: shadow e2e is non-blocking).

### Latent bug uncovered (masks diagnostics)
When the install job failed, the `failed-pods-info` action errored instead of dumping pod state:
```
Error: flags cannot be placed before plugin name: -n
##[error]Process completed with exit code 1.
```
The kubectl/helm invocation in that action places `-n` before the subcommand — worth fixing so failure triage isn't blind.
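A sketch of a flag ordering that avoids this class of error, with the subcommand first and the namespace flag after it. The `dump_failed_pods` function, the `KUBECTL` override, and the field selector are illustrative assumptions and not the action's actual code. One possible trigger worth ruling out (also an assumption, not verified against the action) is an empty namespace variable: `kubectl -n $NS get pods` with an empty `$NS` makes `-n` consume `get` as its value, so the guard below rejects an empty namespace explicitly.

```shell
# dump_failed_pods <namespace>
# Hypothetical replacement for the diagnostics call: lists pods that are
# neither Running nor Succeeded. KUBECTL can be overridden for dry runs.
dump_failed_pods() {
  local namespace="$1"
  local kubectl="${KUBECTL:-kubectl}"
  if [ -z "$namespace" ]; then
    echo "dump_failed_pods: namespace must not be empty" >&2
    return 2
  fi
  # Subcommand first, flags afterwards, every expansion quoted.
  "$kubectl" get pods \
    --namespace "$namespace" \
    --field-selector='status.phase!=Running,status.phase!=Succeeded' \
    --output wide
}

# Dry run that only prints the arguments that would be passed:
# KUBECTL=echo dump_failed_pods my-namespace
```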

## Why this matters
These flakes gate the **merge queue**, so a perfectly good PR (#6324) was bounced by infra noise after already being green on its PR run. This erodes trust in CI Gate and wastes GKE re-runs.

## Proposed investigation / fixes
1. **Helm download resiliency** — add retries w/ backoff to the helm-binary fetch (or use `azure/setup-helm` with caching / an internal mirror) so a single `get.helm.sh` blip doesn't fail the job. Consider caching the helm binary on the runner image.
2. **Wait for cluster TLS readiness** before E2E — gate the Playwright step on the ingress cert being issued and trusted (poll the cert/Ready condition, or probe `https://<host>/` until the chain validates) instead of navigating immediately. Cross-check overlap with #6338 (CNPG not-Ready) and #6344 (flaky eske/docstr smoke).
3. **Fix `failed-pods-info`** kubectl flag ordering so pod diagnostics actually print on failure.
4. Consider quantifying flake rate (re-run the same SHA N times) to prioritize.

## Repro / evidence
- Green PR run: https://github.com/camunda/camunda-platform-helm/actions/runs/26933163217
- Red merge-queue run: https://github.com/camunda/camunda-platform-helm/actions/runs/27024351392/job/79762932789
- PR: https://github.com/camunda/camunda-platform-helm/pull/6324 (merged; same code in both runs)

