CI flakiness: CloudNativePG postgresql-cluster fails to reach Ready on GKE CI (operator not reconciling) #6338

### Summary

Integration scenarios that provision a per-namespace CloudNativePG `Cluster` (`postgresql-cluster`) intermittently fail because the cluster never reaches `Ready`. Root indicator: the `Cluster` CR applies successfully but the **CNPG operator does not reconcile it** — no `postgresql-cluster-1` instance pod is ever created. This is an infrastructure/operator fault on the shared GKE CI cluster, **not** a chart or test-code bug.

It surfaced as repeated red CI on PR #6318 (CI scenario-registry migration). That PR changes only test config and the Go matrix loader, and the same commit was green the day before, so the PR itself is not the cause.

### Evidence

Workflow `Test - Chart Version`. Same PR commit (#6318): **green 2026-06-03, red repeatedly 2026-06-04**.

- Green only after retries (`run_attempt: 3`): https://github.com/camunda/camunda-platform-helm/actions/runs/26966949424
- Failed runs: https://github.com/camunda/camunda-platform-helm/actions/runs/26959405631 · https://github.com/camunda/camunda-platform-helm/actions/runs/26963855638 · https://github.com/camunda/camunda-platform-helm/actions/runs/26966029910

Two surface signatures, **one root cause** — they differ only in how each pre-install hook reacts to the missing DB:

1. **Fixture-mode** hooks (`eske`/elasticsearch, `keyco`/keycloak-original, `esa0`/auth0) have no readiness gate → `web-modeler-restapi` / `identity` crash-loop on `java.net.UnknownHostException: postgresql-cluster-rw` → `helm upgrade --install --wait --timeout 1200s` fails after ~20 min with the opaque `context deadline exceeded`.
2. **Script-mode** hooks (`osss`/`osot`, opensearch-self-signed) gate with `kubectl wait --for=condition=Ready --timeout=300s cluster.postgresql.cnpg.io/postgresql-cluster` → fail fast at 5 min: `error: timed out waiting for the condition on clusters/postgresql-cluster`.

**Decisive detail** (job `osot`/opensearch-self-signed-os-trust): the `Cluster` CR is `serverside-applied` successfully, then never becomes Ready, and **no `postgresql-cluster-1` instance pod is ever created**. That rules out slow PVC-binding / image-pull (which would leave a Pending pod) — the operator simply was not reconciling the CR.
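The triage reasoning above (no pod at all versus a pod that exists but is stuck) can be sketched as a small classifier. This is a minimal illustration, not code from the repository: the `Pod` type, the `Diagnosis` values, and `Diagnose` are hypothetical names, and it makes the simplifying assumption that any pod whose name starts with `<cluster>-` (for example `postgresql-cluster-1`) counts as evidence that the operator acted on the CR.

```go
package main

import "strings"

// Pod is a hypothetical, minimal stand-in for the pod fields this sketch needs.
type Pod struct {
	Name  string
	Phase string // Kubernetes pod phase, e.g. "Pending" or "Running"
}

// Diagnosis is a hypothetical label for what a namespace snapshot suggests.
type Diagnosis string

const (
	// No pod derived from the Cluster exists: the operator never acted on the CR.
	OperatorNotReconciling Diagnosis = "operator-not-reconciling"
	// A pod exists but is not Running: look at PVC binding, image pull, scheduling.
	InstanceNotRunning Diagnosis = "instance-not-running"
	// At least one pod derived from the Cluster is Running.
	InstanceRunning Diagnosis = "instance-running"
)

// Diagnose classifies a namespace's pods for the given Cluster name.
// Assumption: pods created for the Cluster are named "<clusterName>-...".
func Diagnose(clusterName string, pods []Pod) Diagnosis {
	prefix := clusterName + "-"
	seen := false
	for _, p := range pods {
		if !strings.HasPrefix(p.Name, prefix) {
			continue // unrelated workload, e.g. web-modeler-restapi
		}
		seen = true
		if p.Phase == "Running" {
			return InstanceRunning
		}
	}
	if seen {
		return InstanceNotRunning
	}
	return OperatorNotReconciling
}
```

Fed the pod list from the `osot` job's namespace, a classifier like this would return `operator-not-reconciling`, which matches the conclusion drawn above; a `Pending` `postgresql-cluster-1` pod would instead point at storage or image-pull problems.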

Provisioning source: the `cnpg` hook applies `charts/camunda-platform-8.10/test/integration/scenarios/common/resources/postgresql-cluster.yaml` (a `postgresql.cnpg.io/v1 Cluster` named `postgresql-cluster`). Components connect to the in-namespace `postgresql-cluster-rw` service the operator is supposed to create.

### Why it looks "flaky"

The same scenario passes on retry, and passed on 2026-06-03 but failed on 2026-06-04, purely because CNPG operator health on the shared CI cluster varies over time. Nothing in the test inputs changed between green and red runs of the same commit.

### Track A — the fix (infra; removes the flake)

- Check the **CloudNativePG operator** pod health (Running, not crash-looping / OOM / rate-limited) on the CI clusters under `*.ci.distro.ultrawombat.com`; confirm `postgresql.cnpg.io` CRDs and the operator webhook are healthy.
- Review operator logs around **2026-06-04 14:00–17:30 UTC** for reconcile errors / leader-election loss.
- Check node capacity / preemptible churn in that window.
- Add operator-health monitoring/alerting and (optionally) a **nightly pre-flight gate** that asserts CNPG is healthy before the matrix fans out — so a bad cluster fails once, fast, with one clear message instead of N scenario timeouts.
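One possible shape for the pre-flight gate, as a hedged sketch: run two `kubectl` checks once before the matrix fans out and stop at the first failure. The namespace `cnpg-system` and deployment `cnpg-controller-manager` used in the usage note are assumptions based on a default CloudNativePG install and may not match the CI clusters; `Runner`, `ExecRunner`, and `PreflightCNPG` are hypothetical names. The command runner is injected so the gate can be exercised without a cluster.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Runner executes a command and returns its combined stdout/stderr.
type Runner func(name string, args ...string) ([]byte, error)

// ExecRunner runs the command for real via os/exec.
func ExecRunner(name string, args ...string) ([]byte, error) {
	return exec.Command(name, args...).CombinedOutput()
}

// PreflightCNPG returns an error if the Cluster CRD is missing or the
// operator deployment has not finished rolling out within 60 seconds.
// namespace and deployment are supplied by the caller because their
// values on the CI clusters are not known to this sketch.
func PreflightCNPG(run Runner, namespace, deployment string) error {
	checks := [][]string{
		// The CRD the cnpg hook depends on must be installed.
		{"get", "crd", "clusters.postgresql.cnpg.io"},
		// The operator deployment must be fully rolled out.
		{"rollout", "status", "deployment/" + deployment, "-n", namespace, "--timeout=60s"},
	}
	for _, args := range checks {
		out, err := run("kubectl", args...)
		if err != nil {
			return fmt.Errorf("CNPG pre-flight failed: kubectl %s: %w: %s",
				strings.Join(args, " "), err, strings.TrimSpace(string(out)))
		}
	}
	return nil
}
```

A CI entry point could call `PreflightCNPG(ExecRunner, "cnpg-system", "cnpg-controller-manager")` and exit non-zero on error, so an unhealthy cluster produces one clear failure rather than N scenario timeouts. Note that a rolled-out deployment does not prove the operator is actually reconciling; a stricter gate might also create and wait on a throwaway `Cluster`.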

### Track B — harness hardening (does NOT remove the flake; makes it fast & diagnosable)

- Give **fixture-mode** hooks the readiness gate that script-mode already has, so `eske`/`keyco`/`esa0` fail in ~5 min with an explicit "CNPG cluster `postgresql-cluster` not Ready" error instead of a 20-min opaque helm timeout. Implement `WaitForCNPGClusterReady(ctx, namespace, name, timeout)` in `scripts/camunda-core/pkg/kube`, using the existing `Client.dynamicClient` with GVR `postgresql.cnpg.io/v1, clusters` and polling `status.conditions[type=Ready]`. Call it from `ApplyLifecycleManifests` in `scripts/deploy-camunda/deploy/lifecycle.go`, guarded with `HasCRD`.
- When the cluster is not Ready, dump CNPG operator logs, `Cluster.status`, and namespace pod/event diagnostics.
- This touches shared Go tooling, so the blast radius is wide: validate green across **all** supported chart versions.

> Note: Track B alone will NOT make these scenarios pass when CNPG is down — the script-mode hooks already wait and still fail. It only converts slow/opaque failures into fast/diagnosable ones. **Track A is required to actually stop the flakiness.**

