fix(8.9): orchestration startup blocks on unreachable secondary-storage schema init (intermittent nosec failures) #6347

@eamonnmoloney

Summary

While fixing the consistent 8.9 - nosec - install - gke (noSecondaryStorage) CI failure (see #6346), a deeper, latent robustness bug surfaced that is worth investigating on its own. When the orchestration app is configured with a secondary-storage exporter/schema target that is unreachable, the Spring bootstrap blocks on SchemaManager "init schema" retries and never binds the :9600 management port. As a result, the kubelet startup probe gets connection refused, the container is killed and enters a restart loop, and Helm --wait times out.

This is intermittent, which is why it hid for a while:

Controlled reproduction (GKE, back-to-back, same cluster)

| | misconfigured (exporter → absent ES) | corrected (no exporter) |
|---|---|---|
| helm install | timeout, FAIL in 15m30s | success in 3m18s |
| broker pods | 0/1 Running, restarts | 1/1 Running, 0 restarts |
| startup probe :9600 | connection refused ×55 over 14m | healthy |
| init schema | retried to attempt 29 | n/a |

Broker signature:

io.camunda.search.schema.SchemaManager - Schema creation is enabled. Start Schema management.
RetryDecorator - Retrying operation for 'init schema': attempt 29. Message: Failed to check existence of index ...
io.camunda.zeebe.broker.exporter - Failed to open exporter 'camundaexporter'. Retrying...
Startup probe failed: dial tcp :9600: connect: connection refused
Container orchestration failed startup probe, will be restarted

#6346 fixes the immediate CI scenario by not pointing the exporter at a non-existent backend. But the underlying behavior is a bug regardless of that scenario.

Questions to investigate

  1. Should SchemaManager init run on the bootstrap/startup path synchronously at all, or should it be async / retried in the background so the management server can bind :9600 and report an unhealthy (503) startup state instead of refusing connections?
  2. Should an unreachable/misconfigured secondary storage fail fast with a clear error rather than retry indefinitely behind a blocked startup?
  3. Is the startup probe budget (failureThreshold=30, period=10s, delay=30s ≈ 330s) appropriate, and should connection refused vs 503 be distinguished?
  4. Why intermittent? Determine the timing/race that let the 2026-06-04 run pass with the same config (e.g., schema-check timeout vs probe budget, node CPU at startup).
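
To make questions 1 to 3 concrete, here is a minimal sketch of one possible shape for the startup path, using only the JDK. It is not Camunda's actual code: the class `NonBlockingStartupSketch`, the `/startup` path, and the `initSchema` callable are all hypothetical stand-ins. The idea is to bind the management port first, run schema init in the background with a bounded number of attempts, and answer probes with 503 until init succeeds. `probeBudget` reproduces the ≈ 330s arithmetic from question 3.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: none of these names come from the Camunda code base.
final class NonBlockingStartupSketch {
  enum State { STARTING, READY, FAILED }

  private final AtomicReference<State> state = new AtomicReference<>(State.STARTING);

  State state() {
    return state.get();
  }

  // Startup probe budget as in question 3: initial delay + failureThreshold * period.
  static Duration probeBudget(Duration initialDelay, int failureThreshold, Duration period) {
    return initialDelay.plus(period.multipliedBy(failureThreshold));
  }

  // 200 only once schema init has succeeded; 503 while starting or after giving up.
  static int statusFor(State s) {
    return s == State.READY ? 200 : 503;
  }

  // Bind the management port before schema init, so a probe gets an HTTP
  // answer (503) instead of "connection refused".
  HttpServer bindManagement(int port) throws IOException {
    HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
    server.createContext("/startup", exchange -> {
      State current = state.get();
      byte[] body = current.name().getBytes(StandardCharsets.UTF_8);
      exchange.sendResponseHeaders(statusFor(current), body.length);
      try (OutputStream out = exchange.getResponseBody()) {
        out.write(body);
      }
    });
    server.start();
    return server;
  }

  // Run init off the startup thread with a bounded number of attempts, then
  // settle on FAILED rather than retrying indefinitely.
  CompletableFuture<State> initInBackground(
      Callable<Boolean> initSchema, int maxAttempts, Duration backoff) {
    return CompletableFuture.supplyAsync(() -> {
      for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          if (initSchema.call()) {
            state.set(State.READY);
            return State.READY;
          }
        } catch (Exception e) {
          // Treat every failure as retryable in this sketch.
        }
        if (attempt < maxAttempts) {
          try {
            Thread.sleep(backoff.toMillis());
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break;
          }
        }
      }
      state.set(State.FAILED);
      return State.FAILED;
    });
  }
}
```

In this sketch a caller would invoke `bindManagement(9600)` first and `initInBackground(...)` second, choosing `maxAttempts` and `backoff` so that the retry window ends before `probeBudget(...)` runs out. A kubelet HTTP probe counts both a 503 and a refused connection as a failure, so the distinction mainly helps whoever reads the probe events and logs: a 503 shows the process is up and waiting on secondary storage, while connection refused says nothing about why.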

Notes

Metadata

Labels: kind/bug (Something isn't working as intended), likelihood/mid (Observed occasionally), severity/high (Marks a bug as having a noticeable impact on the user with no known workaround), triage:completed

Urgency: next
