Summary
Integration scenarios that provision a per-namespace CloudNativePG Cluster (postgresql-cluster) intermittently fail because the cluster never reaches Ready. Root indicator: the Cluster CR applies successfully but the CNPG operator does not reconcile it — no postgresql-cluster-1 instance pod is ever created. This is an infrastructure/operator fault on the shared GKE CI cluster, not a chart or test-code bug.
It surfaced as repeated red CI on PR #6318 (CI scenario-registry migration), but #6318 only changes test config + Go matrix loader — the same commit was green the day before.
Evidence
Workflow Test - Chart Version. Same PR commit (#6318): green 2026-06-03, red repeatedly 2026-06-04.
Two surface signatures, one root cause — they differ only in how each pre-install hook reacts to the missing DB:
- Fixture-mode hooks (
eske/elasticsearch, keyco/keycloak-original, esa0/auth0) have no readiness gate → web-modeler-restapi / identity crash-loop on java.net.UnknownHostException: postgresql-cluster-rw → helm upgrade --install --wait --timeout 1200s fails after ~20 min with the opaque context deadline exceeded.
- Script-mode hooks (
osss/osot, opensearch-self-signed) gate with kubectl wait --for=condition=Ready --timeout=300s cluster.postgresql.cnpg.io/postgresql-cluster → fail fast at 5 min: error: timed out waiting for the condition on clusters/postgresql-cluster.
Decisive detail (job osot/opensearch-self-signed-os-trust): the Cluster CR is serverside-applied successfully, then never becomes Ready, and no postgresql-cluster-1 instance pod is ever created. That rules out slow PVC-binding / image-pull (which would leave a Pending pod) — the operator simply was not reconciling the CR.
Provisioning source: the cnpg hook applies charts/camunda-platform-8.10/test/integration/scenarios/common/resources/postgresql-cluster.yaml (a postgresql.cnpg.io/v1 Cluster named postgresql-cluster). Components connect to the in-namespace postgresql-cluster-rw service the operator is supposed to create.
Why it looks "flaky"
The same scenario passes on retry / on 06-03 and fails on 06-04 purely because CNPG operator health on the shared CI cluster varies over time. Nothing in the test inputs changed between green and red runs of the same commit.
Track A — the fix (infra; removes the flake)
- Check the CloudNativePG operator pod health (Running, not crash-looping / OOM / rate-limited) on the CI clusters under
*.ci.distro.ultrawombat.com; confirm postgresql.cnpg.io CRDs and the operator webhook are healthy.
- Review operator logs around 2026-06-04 14:00–17:30 UTC for reconcile errors / leader-election loss.
- Check node capacity / preemptible churn in that window.
- Add operator-health monitoring/alerting and (optionally) a nightly pre-flight gate that asserts CNPG is healthy before the matrix fans out — so a bad cluster fails once, fast, with one clear message instead of N scenario timeouts.
Track B — harness hardening (does NOT remove the flake; makes it fast & diagnosable)
- Give fixture-mode hooks the readiness gate that script-mode already has, so
eske/keyco/esa0 fail in ~5 min with an explicit "CNPG cluster postgresql-cluster not Ready" error instead of a 20-min opaque helm timeout. Implement WaitForCNPGClusterReady(ctx, namespace, name, timeout) in scripts/camunda-core/pkg/kube (existing Client.dynamicClient, GVR postgresql.cnpg.io/v1, clusters, poll status.conditions[type=Ready]) and call it from scripts/deploy-camunda/deploy/lifecycle.go ApplyLifecycleManifests (guard with HasCRD).
- Dump CNPG operator +
Cluster.status + namespace pod/event diagnostics on not-Ready.
- Wide blast radius (shared Go tooling): validate green across all supported chart versions.
Note: Track B alone will NOT make these scenarios pass when CNPG is down — the script-mode hooks already wait and still fail. It only converts slow/opaque failures into fast/diagnosable ones. Track A is required to actually stop the flakiness.
Summary
Integration scenarios that provision a per-namespace CloudNativePG
Cluster(postgresql-cluster) intermittently fail because the cluster never reachesReady. Root indicator: theClusterCR applies successfully but the CNPG operator does not reconcile it — nopostgresql-cluster-1instance pod is ever created. This is an infrastructure/operator fault on the shared GKE CI cluster, not a chart or test-code bug.It surfaced as repeated red CI on PR #6318 (CI scenario-registry migration), but #6318 only changes test config + Go matrix loader — the same commit was green the day before.
Evidence
Workflow
Test - Chart Version. Same PR commit (#6318): green 2026-06-03, red repeatedly 2026-06-04.run_attempt: 3): https://github.com/camunda/camunda-platform-helm/actions/runs/26966949424Two surface signatures, one root cause — they differ only in how each pre-install hook reacts to the missing DB:
eske/elasticsearch,keyco/keycloak-original,esa0/auth0) have no readiness gate →web-modeler-restapi/identitycrash-loop onjava.net.UnknownHostException: postgresql-cluster-rw→helm upgrade --install --wait --timeout 1200sfails after ~20 min with the opaquecontext deadline exceeded.osss/osot, opensearch-self-signed) gate withkubectl wait --for=condition=Ready --timeout=300s cluster.postgresql.cnpg.io/postgresql-cluster→ fail fast at 5 min:error: timed out waiting for the condition on clusters/postgresql-cluster.Decisive detail (job
osot/opensearch-self-signed-os-trust): theClusterCR isserverside-appliedsuccessfully, then never becomes Ready, and nopostgresql-cluster-1instance pod is ever created. That rules out slow PVC-binding / image-pull (which would leave a Pending pod) — the operator simply was not reconciling the CR.Provisioning source: the
cnpghook appliescharts/camunda-platform-8.10/test/integration/scenarios/common/resources/postgresql-cluster.yaml(apostgresql.cnpg.io/v1 Clusternamedpostgresql-cluster). Components connect to the in-namespacepostgresql-cluster-rwservice the operator is supposed to create.Why it looks "flaky"
The same scenario passes on retry / on 06-03 and fails on 06-04 purely because CNPG operator health on the shared CI cluster varies over time. Nothing in the test inputs changed between green and red runs of the same commit.
Track A — the fix (infra; removes the flake)
*.ci.distro.ultrawombat.com; confirmpostgresql.cnpg.ioCRDs and the operator webhook are healthy.Track B — harness hardening (does NOT remove the flake; makes it fast & diagnosable)
eske/keyco/esa0fail in ~5 min with an explicit "CNPG clusterpostgresql-clusternot Ready" error instead of a 20-min opaque helm timeout. ImplementWaitForCNPGClusterReady(ctx, namespace, name, timeout)inscripts/camunda-core/pkg/kube(existingClient.dynamicClient, GVRpostgresql.cnpg.io/v1, clusters, pollstatus.conditions[type=Ready]) and call it fromscripts/deploy-camunda/deploy/lifecycle.go ApplyLifecycleManifests(guard withHasCRD).Cluster.status+ namespace pod/event diagnostics on not-Ready.