CI flakiness: CloudNativePG postgresql-cluster fails to reach Ready on GKE CI (operator not reconciling) #6338

@eamonnmoloney

Summary

Integration scenarios that provision a per-namespace CloudNativePG Cluster (postgresql-cluster) intermittently fail because the cluster never reaches Ready. Root indicator: the Cluster CR applies successfully but the CNPG operator does not reconcile it — no postgresql-cluster-1 instance pod is ever created. This is an infrastructure/operator fault on the shared GKE CI cluster, not a chart or test-code bug.

It surfaced as repeated red CI on PR #6318 (CI scenario-registry migration), but #6318 only changes test config + Go matrix loader — the same commit was green the day before.

Evidence

Workflow: Test - Chart Version. The same PR commit (#6318) was green on 2026-06-03 and repeatedly red on 2026-06-04.

Two surface signatures, one root cause — they differ only in how each pre-install hook reacts to the missing DB:

  1. Fixture-mode hooks (eske/elasticsearch, keyco/keycloak-original, esa0/auth0) have no readiness gate → web-modeler-restapi / identity crash-loop on java.net.UnknownHostException: postgresql-cluster-rw → helm upgrade --install --wait --timeout 1200s fails after ~20 min with the opaque context deadline exceeded.
  2. Script-mode hooks (osss/osot, opensearch-self-signed) gate with kubectl wait --for=condition=Ready --timeout=300s cluster.postgresql.cnpg.io/postgresql-cluster → fail fast at 5 min: error: timed out waiting for the condition on clusters/postgresql-cluster.

Decisive detail (job osot/opensearch-self-signed-os-trust): the Cluster CR is server-side applied successfully, then never becomes Ready, and no postgresql-cluster-1 instance pod is ever created. That rules out slow PVC binding or image pull, either of which would leave a Pending pod behind. The operator simply was not reconciling the CR.
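The reasoning above can be written down as a small decision function. This is only a sketch: `PodInfo`, `Diagnosis`, and `DiagnoseCluster` are hypothetical names that do not exist in the repo, and it assumes that pods created for the cluster carry the `<cluster-name>-` prefix (as postgresql-cluster-1 does).

```go
package main

import "strings"

// PodInfo is a hypothetical, minimal view of a pod in the scenario namespace.
type PodInfo struct {
	Name  string
	Phase string // e.g. "Pending", "Running"
}

// Diagnosis is a hypothetical label for what the namespace state suggests.
type Diagnosis string

const (
	OperatorNotReconciling Diagnosis = "operator-not-reconciling"
	PodStuckPending        Diagnosis = "pod-stuck-pending"
	PodsPresent            Diagnosis = "pods-present"
)

// DiagnoseCluster inspects the pods of a namespace in which the Cluster CR
// was applied but never became Ready.
//   - no pod with the cluster prefix: the operator never acted on the CR
//   - a Pending pod: scheduling, PVC binding, or image pull is the suspect
//   - otherwise: pods exist, so the problem lies elsewhere
func DiagnoseCluster(clusterName string, pods []PodInfo) Diagnosis {
	prefix := clusterName + "-"
	diagnosis := OperatorNotReconciling
	for _, p := range pods {
		if !strings.HasPrefix(p.Name, prefix) {
			continue // unrelated workload, e.g. web-modeler-restapi
		}
		if p.Phase == "Pending" {
			return PodStuckPending
		}
		diagnosis = PodsPresent
	}
	return diagnosis
}
```

For the failing job described here, the namespace contained no pod with the postgresql-cluster- prefix, which corresponds to the first branch.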

Provisioning source: the cnpg hook applies charts/camunda-platform-8.10/test/integration/scenarios/common/resources/postgresql-cluster.yaml (a postgresql.cnpg.io/v1 Cluster named postgresql-cluster). Components connect to the in-namespace postgresql-cluster-rw service the operator is supposed to create.

Why it looks "flaky"

The same scenario passes on retry / on 06-03 and fails on 06-04 purely because CNPG operator health on the shared CI cluster varies over time. Nothing in the test inputs changed between green and red runs of the same commit.

Track A — the fix (infra; removes the flake)

  • Check the CloudNativePG operator pod health (Running, not crash-looping / OOM / rate-limited) on the CI clusters under *.ci.distro.ultrawombat.com; confirm postgresql.cnpg.io CRDs and the operator webhook are healthy.
  • Review operator logs around 2026-06-04 14:00–17:30 UTC for reconcile errors / leader-election loss.
  • Check node capacity / preemptible churn in that window.
  • Add operator-health monitoring/alerting and (optionally) a nightly pre-flight gate that asserts CNPG is healthy before the matrix fans out — so a bad cluster fails once, fast, with one clear message instead of N scenario timeouts.
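A pre-flight gate along these lines could look roughly like the sketch below. It shells out to kubectl through an injectable `Runner` so it can be exercised without a cluster. `Runner`, `ExecRunner`, `PreflightCNPG`, and `ErrOperatorNotReady` are hypothetical names, and the operator namespace and deployment name are parameters because they depend on how CNPG was installed on the CI clusters (the upstream manifests default to cnpg-system / cnpg-controller-manager, which is an assumption here).

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// ErrOperatorNotReady is a hypothetical sentinel for "operator has no ready replicas".
var ErrOperatorNotReady = errors.New("CNPG operator has no ready replicas")

// Runner executes a command and returns its stdout. It is injected so the
// gate can be tested with a fake instead of a real kubectl.
type Runner func(name string, args ...string) (string, error)

// ExecRunner runs the command for real.
func ExecRunner(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).Output()
	return string(out), err
}

// PreflightCNPG fails fast, with one clear message, if the Cluster CRD is
// missing or the operator deployment reports no ready replicas.
func PreflightCNPG(run Runner, namespace, deployment string) error {
	if _, err := run("kubectl", "get", "crd", "clusters.postgresql.cnpg.io", "-o", "name"); err != nil {
		return fmt.Errorf("CNPG pre-flight: Cluster CRD not available: %w", err)
	}
	out, err := run("kubectl", "get", "deployment", deployment,
		"-n", namespace, "-o", "jsonpath={.status.readyReplicas}")
	if err != nil {
		return fmt.Errorf("CNPG pre-flight: cannot read deployment %s/%s: %w", namespace, deployment, err)
	}
	// An unset readyReplicas field prints as an empty string; treat it as 0.
	ready, _ := strconv.Atoi(strings.TrimSpace(out))
	if ready < 1 {
		return fmt.Errorf("CNPG pre-flight: %s/%s: %w", namespace, deployment, ErrOperatorNotReady)
	}
	return nil
}
```

Calling `PreflightCNPG(ExecRunner, "cnpg-system", "cnpg-controller-manager")` once before the matrix fans out would turn N scenario timeouts into a single early failure. Note that ready replicas alone would not catch an operator that is running but not reconciling, so this check complements the log and leader-election review rather than replacing it.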

Track B — harness hardening (does NOT remove the flake; makes it fast & diagnosable)

  • Give fixture-mode hooks the readiness gate that script-mode already has, so eske/keyco/esa0 fail in ~5 min with an explicit "CNPG cluster postgresql-cluster not Ready" error instead of a 20-min opaque helm timeout. Implement WaitForCNPGClusterReady(ctx, namespace, name, timeout) in scripts/camunda-core/pkg/kube, using the existing Client.dynamicClient with GVR postgresql.cnpg.io/v1, resource clusters, and polling status.conditions[type=Ready]. Call it from ApplyLifecycleManifests in scripts/deploy-camunda/deploy/lifecycle.go, guarded with HasCRD.
  • Dump CNPG operator + Cluster.status + namespace pod/event diagnostics on not-Ready.
  • Wide blast radius (shared Go tooling): validate green across all supported chart versions.
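As a starting point for the readiness gate, the polling logic might be sketched as follows. This is not the repo's implementation: to stay free of client-go dependencies, the sketch takes a hypothetical `ClusterGetter` callback in place of Client.dynamicClient, so its signature has one more parameter than the one proposed above. `ReadyCondition` and the 5-second `pollInterval` are likewise assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollInterval is an assumed value; tune it for the harness.
const pollInterval = 5 * time.Second

// ClusterGetter fetches the Cluster object as generic JSON-like data.
// In the real harness this would presumably wrap a dynamic-client Get.
type ClusterGetter func(ctx context.Context, namespace, name string) (map[string]any, error)

// ReadyCondition reports whether status.conditions holds an entry with
// type=Ready and status=True, plus that condition's message for diagnostics.
func ReadyCondition(obj map[string]any) (bool, string) {
	status, _ := obj["status"].(map[string]any)
	conds, _ := status["conditions"].([]any)
	for _, c := range conds {
		cond, ok := c.(map[string]any)
		if !ok || cond["type"] != "Ready" {
			continue
		}
		msg, _ := cond["message"].(string)
		return cond["status"] == "True", msg
	}
	return false, "no Ready condition reported yet"
}

// WaitForCNPGClusterReady polls until the cluster is Ready or the timeout
// expires, and returns an explicit error carrying the last observed state.
func WaitForCNPGClusterReady(ctx context.Context, get ClusterGetter, namespace, name string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()

	last := "cluster not observed yet"
	for {
		obj, err := get(ctx, namespace, name)
		if err != nil {
			last = err.Error()
		} else if ready, msg := ReadyCondition(obj); ready {
			return nil
		} else {
			last = msg
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("CNPG cluster %s/%s not Ready after %s: %s", namespace, name, timeout, last)
		case <-ticker.C:
		}
	}
}
```

Keeping the last observed condition message in the error gives the diagnostics dump a natural first line; when the operator is not reconciling at all, that message would be "no Ready condition reported yet".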

Note: Track B alone will NOT make these scenarios pass when CNPG is down — the script-mode hooks already wait and still fail. It only converts slow/opaque failures into fast/diagnosable ones. Track A is required to actually stop the flakiness.

Metadata

Assignees

No one assigned

    Labels

    area/ci
    area/test (Marks an issue as improving or extending the tests of the project)
    component/helm
    kind/bug (Something isn't working as intended)
    likelihood/high (A recurring issue)
    severity/mid (Marks a bug as having a noticeable impact but with a known workaround)
    triage:completed

    Type

    No type

    Urgency

    next

    Projects

    Status
    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests