Summary
While fixing the consistent `8.9 - nosec - install - gke` (noSecondaryStorage) CI failure (see #6346), a deeper, latent robustness bug surfaced that is worth investigating on its own. When the orchestration app is configured with a secondary-storage exporter/schema target that is unreachable, the Spring bootstrap blocks on `SchemaManager` "init schema" retries and never binds the `:9600` management port. As a result, the kubelet startup probe gets `connection refused`, the container is killed and restart-loops, and Helm `--wait` times out.
This is intermittent, which is why it hid for a while: the same config passed on 2026-06-04, while failing runs end in `context deadline exceeded`.
Controlled reproduction (GKE, back-to-back, same cluster)
| | misconfigured (exporter → absent ES) | corrected (no exporter) |
| --- | --- | --- |
| helm install | timeout, FAIL in 15m30s | success in 3m18s |
| broker pods | 0/1 Running, restarts | 1/1 Running, 0 restarts |
| startup probe | :9600 connection refused ×55 over 14m | healthy |
| init schema | retried to attempt 29 | n/a |
Broker signature:
io.camunda.search.schema.SchemaManager - Schema creation is enabled. Start Schema management.
RetryDecorator - Retrying operation for 'init schema': attempt 29. Message: Failed to check existence of index ...
io.camunda.zeebe.broker.exporter - Failed to open exporter 'camundaexporter'. Retrying...
Startup probe failed: dial tcp :9600: connect: connection refused
Container orchestration failed startup probe, will be restarted
#6346 fixes the immediate CI scenario by not pointing the exporter at a non-existent backend. But the underlying behavior is a bug regardless of that scenario.
Questions to investigate
- Should `SchemaManager` init run on the bootstrap/startup path synchronously at all, or should it be async / retried in the background so the management server can bind `:9600` and report an unhealthy (503) startup state instead of refusing connections?
- Should an unreachable/misconfigured secondary storage fail fast with a clear error rather than retry indefinitely behind a blocked startup?
- Is the startup probe budget (failureThreshold=30, period=10s, delay=30s ≈ 330s) appropriate, and should `connection refused` vs 503 be distinguished?
- Why intermittent? Determine the timing/race that let 2026-06-04 pass with the same config (e.g., schema-check timeout vs probe budget, node CPU at startup).
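
To make the first two questions concrete, here is a minimal, language-agnostic sketch in Python of the "bind first, initialize in the background" alternative. It is not Camunda or Spring code: `StartupState`, `init_with_retry`, `start`, and `probe` are hypothetical names, and `init_fn` stands in for whatever the real schema init does. The point is only that a probe hitting the port sees 503 while init is still retrying, rather than `connection refused`, and that an optional retry limit lets the process record that it gave up.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import HTTPError
from urllib.request import urlopen


class StartupState:
    """Shared between the init thread and the health handler (hypothetical)."""

    def __init__(self):
        self.ready = threading.Event()    # set once init succeeds
        self.gave_up = threading.Event()  # set if the retry limit is exhausted
        self.last_error = None


def init_with_retry(state, init_fn, delay_s=0.05, max_attempts=None):
    """Retry init_fn off the startup path; max_attempts=None retries forever."""
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            init_fn()
            state.ready.set()
            return
        except Exception as exc:
            state.last_error = f"attempt {attempt}: {exc}"
            time.sleep(delay_s)
    state.gave_up.set()


def make_handler(state):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # 200 once init has succeeded, 503 while retrying or after giving up.
            if state.ready.is_set():
                code, body = 200, b"UP"
            else:
                code, body = 503, f"DOWN: {state.last_error}".encode()
            self.send_response(code)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return Handler


def start(init_fn, port=0, max_attempts=None):
    """Bind the health port first, then run init in a background thread."""
    state = StartupState()
    server = ThreadingHTTPServer(("127.0.0.1", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    threading.Thread(
        target=init_with_retry,
        args=(state, init_fn),
        kwargs={"max_attempts": max_attempts},
        daemon=True,
    ).start()
    return server, state


def probe(server):
    """Return the HTTP status a probe would see on the bound port."""
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    try:
        with urlopen(url, timeout=2) as resp:
            return resp.status
    except HTTPError as err:
        return err.code
```

With an `init_fn` that always raises, `probe(server)` returns 503 instead of failing to connect; with one that succeeds, it returns 200 once `state.ready` is set. Whether the real fix should retry forever in the background or stop after a bounded number of attempts is exactly the open question above; the `max_attempts` parameter only shows that both options fit the same structure.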
Notes
io.camunda.search.schema.SchemaManager, exporter open lifecycle) in camunda-monorepo — may need a cross-repo issue/transfer.