Schema Job should complete before server pods start during helm upgrade #890

@noamyehudai

Description

Is your feature request related to a problem? Please describe.

During helm upgrade from chart 0.x to 1.0.0 (Temporal Server 1.29.x → 1.30.x), the schema migration Job and server Deployments are applied simultaneously. If the visibility schema migration involves slow DDL operations (e.g., ALTER TABLE ... ADD COLUMN ... GENERATED ALWAYS AS ... STORED on a large executions_visibility table), the new server pods start before the schema is ready, fail the schema version check, and enter CrashLoopBackOff.

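Combined with a rollout strategy that permits one unavailable pod on single-replica Deployments, the old server pods are terminated before the new ones are healthy, resulting in full downtime for the duration of the schema migration, which can take over an hour on large tables. (We attributed this to Kubernetes' default maxUnavailable: 25%; note that Kubernetes rounds a percentage maxUnavailable down, so 25% of one replica resolves to 0, and the effective strategy on the Deployments should be checked.)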
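The effective surge and unavailability limits for a given replica count can be checked by hand. Below is a minimal sketch, assuming the rounding rules in the Kubernetes Deployment documentation (a percentage maxSurge is rounded up, a percentage maxUnavailable is rounded down). `rollout_bounds` is a hypothetical helper, not part of any Kubernetes client, and it does not model API validation.

```python
def rollout_bounds(replicas: int,
                   max_surge: str = "25%",
                   max_unavailable: str = "25%") -> tuple[int, int]:
    """Return (surge, unavailable) pod counts for a RollingUpdate strategy.

    Values are either absolute integers ("1") or percentages ("25%").
    """
    def resolve(value: str, round_up: bool) -> int:
        if value.endswith("%"):
            percent = int(value[:-1])
            scaled = replicas * percent
            # Integer arithmetic avoids floating-point rounding surprises.
            if round_up:
                return -(-scaled // 100)  # ceiling division
            return scaled // 100          # floor division
        return int(value)

    surge = resolve(max_surge, round_up=True)              # maxSurge rounds up
    unavailable = resolve(max_unavailable, round_up=False)  # maxUnavailable rounds down
    return surge, unavailable


if __name__ == "__main__":
    # Single replica with the 25% defaults.
    print(rollout_bounds(1))
    # Single replica with an explicit absolute maxUnavailable of 1.
    print(rollout_bounds(1, max_unavailable="1"))
```

If the second case matches what is set on the server Deployments, the old pod can be removed before its replacement is Ready, which is consistent with the outage described here.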

This contradicts the official upgrade documentation, which recommends:

"Before initiating the Temporal Server upgrade, use one of the recommended upgrade tools to update your database schema."

The Helm chart does not enforce this ordering.

Our experience

Upgrading from 1.29.2 to 1.30.3 (chart 1.0.0) required migrating the visibility schema from v1.9 to v1.13. On our staging PostgreSQL database, this took ~2 hours due to multiple ALTER TABLE ... ADD COLUMN ... GENERATED ALWAYS AS ... STORED statements (which rewrite the entire table in PostgreSQL) and CREATE INDEX operations across v1.10–v1.13.

During this entire period, all Temporal server pods (frontend, history, matching, worker) were down.

Describe the solution you'd like

Ensure the schema migration Job completes before the server Deployment pods are rolled out. Some possible approaches:

  1. Helm pre-upgrade hook: Add the helm.sh/hook: pre-upgrade annotation to the schema Job so that Helm waits for the Job to complete before applying the server Deployment changes.

  2. Init container on server pods: Add an init container to the server Deployment that polls the schema version and blocks until it matches the expected version.

  3. Documentation: At minimum, document that users with large visibility tables should set manageSchema: false, run the schema migration manually while the old server is still running, and only then upgrade the server image, matching the recommended order in the server upgrade docs.
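The blocking logic behind approach 2 could look roughly like the sketch below. It is an illustration only: `wait_for_schema`, `parse_version`, and the injected `read_version` callable are hypothetical names, and how the current schema version is actually read from the database is deliberately left to the caller. Versions are compared as integer tuples, because comparing them as strings or floats would order 1.9 after 1.13.

```python
import time
from typing import Callable


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.13' or 'v1.13' into (1, 13) so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def wait_for_schema(read_version: Callable[[], str],
                    expected: str,
                    poll_seconds: float = 10.0,
                    timeout_seconds: float = 3 * 3600,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> str:
    """Block until read_version() reports at least the expected version.

    read_version is supplied by the caller (hypothetical: e.g. a query
    against the visibility database). Raises TimeoutError on deadline.
    """
    target = parse_version(expected)
    deadline = clock() + timeout_seconds
    while True:
        current = read_version()
        if parse_version(current) >= target:
            return current
        if clock() >= deadline:
            raise TimeoutError(
                f"schema still at {current}, expected at least {expected}")
        sleep(poll_seconds)
```

An init container running a loop like this would keep the new server pods in the Init state instead of CrashLoopBackOff. On its own it does not prevent downtime if the old pods are already gone, so it complements rather than replaces running the migration first.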

Additional context

  • PostgreSQL ALTER TABLE ... ADD COLUMN ... STORED rewrites the entire table, making it especially slow on large executions_visibility tables.
  • The v1.10–v1.13 visibility schema adds ~17 generated columns and ~17 indexes total.
  • Related: [Feature Request] Decouple server job manifests from Values.server.enabled #695 (requested decoupling schema jobs from server deployments, now closed).
  • The workaround we used for production: pin server/web/admintools image tags to the old version in values.yaml, let the chart upgrade deploy the schema Job with the new admin-tools image, wait for migration to complete, then remove the image pins in a second deploy.
