feat(migrations): enhance migration Job configuration for async execution and blocking behavior

nawazdhandala · nawazdhandala · commit 8e6aff08ba15 · 2026-06-22T11:13:41.000+01:00
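
As a rough illustration of the two modes this commit introduces, the override below sketches how the new keys might be set for a blocking first install. The key names (`migrate.enabled`, `migrate.hook`, `migrate.ttlSecondsAfterFinished`, `migrate.backoffLimit`) are taken from the chart's `values.yaml` in this diff; the file name `values-override.yaml` is an assumption made only for this example.

```yaml
# values-override.yaml (hypothetical file name, for illustration only)
migrate:
  enabled: true
  # false (chart default): async Job, helm returns without waiting for migrations.
  # true: blocking Helm hook (post-install on install, pre-upgrade on upgrade).
  hook: true
  # Only used when hook is false: seconds before a finished Job is cleaned up.
  ttlSecondsAfterFinished: 86400
  # Job retries (covers the database not being ready yet on a fresh install).
  backoffLimit: 6
```

Following the docs changed here, a first install could set `hook: true` once, and routine upgrades could then revert to the async default (`hook: false`).
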
diff --git a/HelmChart/Docs/Postgres.md b/HelmChart/Docs/Postgres.md
@@ -373,11 +373,31 @@ The chart **rejects `poolMode: transaction` unless `migrate.enabled: true`** to
 prevent the unsafe combination.
 
 Notes on the migration Job:
-- It is a `pre-upgrade` + `post-install` Helm hook. On upgrades it runs before
-  the new pods roll (old release still serving). On a **fresh** install an init
-  container waits for the database first, then the Job creates the schema; the
-  app pods may briefly fail readiness until it completes. A very slow first-time
-  cluster bootstrap may need a longer `helm upgrade --install --timeout`.
+- **Deploys do not block by default** (`migrate.hook: false`): the migration Job
+  runs as a regular async Job for both install and upgrade, so `helm` returns
+  immediately and pods roll while migrations run in the background. The trade-off
+  is that **pods can start before migrations finish**, so keep your migrations
+  *backward-compatible* (the new code tolerates the old schema — the
+  expand/contract pattern). Set `migrate.hook: true` to make deploys block on a
+  Helm hook instead (`post-install` on install, `pre-upgrade` on upgrade); the
+  deploy then waits and new code never hits an un-migrated schema.
+- **Fresh-install caveat with `migrate.hook: false`:** on a brand-new install the
+  app pods come up against an empty (un-migrated) database and will stay unready —
+  likely `CrashLoopBackOff` — until the async Job finishes creating the schema,
+  then self-heal. The Job's own init container waits for the database first. If
+  you want a clean first install, run it once with `--set migrate.hook=true`
+  (blocks until the schema exists; a slow first-time cluster bootstrap may need a
+  longer `helm upgrade --install --timeout`), then drop back to the async default
+  for routine upgrades.
+
+More on the migration Job:
+- `helm upgrade --wait` (with `--wait-for-jobs`) waits for Jobs too, so omit those
+  flags if you want the async upgrade to stay non-blocking.
+- Finished async Jobs auto-clean after `migrate.ttlSecondsAfterFinished` (default
+  1 day).
+- Alternatively, run migrations as a **separate step** before the deploy (e.g.
+  `helm template` the Job and `kubectl apply` it, or a CI stage) and keep the
+  app deploy itself migration-free.
 - Set `migrate.enabled: false` to restore the legacy "every pod migrates on
   boot" model (required if you keep `poolMode: session` and want boot
   migrations). The advisory lock stays in the code — it still protects that boot
diff --git a/HelmChart/Public/oneuptime/README.md b/HelmChart/Public/oneuptime/README.md
@@ -503,7 +503,7 @@ clickhouse:
 - [ ] For production high availability, run PostgreSQL and ClickHouse under their bundled operators instead of the single, standalone built-ins. Set `postgresOperator.cnpg.enabled: true` (CloudNativePG — streaming replication and automatic failover) and `clickhouseOperator.altinity.enabled: true` (Altinity — replication, sharding, and declarative lifecycle management). See the **Operator-managed PostgreSQL** and **Operator-managed ClickHouse** sections above for the full configuration. Enabling an operator bootstraps a fresh, empty cluster — if you already run a standalone database, follow the migration runbooks ([PostgreSQL](../../Docs/MigratePostgresStandaloneToOperator.md), [ClickHouse](../../Docs/MigrateClickhouseStandaloneToOperator.md)) to move your data first.
 - [ ] Enable the dedicated worker deployment so background jobs (telemetry ingestion, notifications, incident/alert processing, workflows) run in their own pods instead of competing with API requests on the shared event loop. Set `worker.enabled: true` — the `app` pods then stop consuming queues and the worker drains them. The worker becomes REQUIRED for all background work, so keep `worker.keda.minReplicas >= 1`, and set `app.keda.targetCPUUtilizationPercentage` (with `app.resources.requests.cpu`) so the API tier still autoscales once its queue-size trigger is disabled.
 - [ ] Put the bundled PgBouncer connection pooler in front of PostgreSQL if you autoscale workers (KEDA) or use a connection-limited managed/external PostgreSQL — it keeps a connection storm (for example, many worker pods booting at once) from exhausting the database. Set `pgbouncer.enabled: true`. It runs in `transaction` pool mode by default (the largest connection reduction, since idle client connections hold no backend connection), which is safe because migrations run in a dedicated Job (`migrate.enabled`, on by default) instead of on the pooled pods. Keep `pgbouncer.defaultPoolSize` and `pgbouncer.maxDbConnections` below your PostgreSQL `max_connections`. For an external/managed PostgreSQL, point `externalPostgres.host`/`.port` at the database and enable the pooler — or point them at your provider's own pooled endpoint (RDS Proxy, Neon `-pooler`, Supabase Supavisor) instead. See the **Connection pooling with PgBouncer** section in [Postgres.md](../../Docs/Postgres.md).
-- [ ] Confirm the database migration Job is healthy. With `migrate.enabled: true` (the default), schema and data migrations run once per release in a dedicated pre-upgrade / post-install Job rather than on every pod — so deploys gate on it. On a fresh install, an init container waits for the database before migrating; a slow first-time CloudNativePG bootstrap may need a longer `helm upgrade --install --timeout` (for example `--timeout 15m`). Check the Job with `kubectl get jobs -l app.kubernetes.io/component=migrate` and its logs if a deploy stalls.
+- [ ] Confirm the database migration Job is healthy. With `migrate.enabled: true` (the default), schema and data migrations run once per release in a dedicated Job rather than on every pod. By default the Job runs **asynchronously** (`migrate.hook: false`) so deploys never block. Pods may therefore start before migrations finish, so keep your migrations backward-compatible, or set `migrate.hook: true` to make the deploy wait. Note: with the async default, a brand-new install leaves the app pods unready (CrashLoopBackOff) until the Job creates the schema; for a clean first install run it once with `--set migrate.hook=true` (and a longer `helm upgrade --install --timeout`, e.g. `--timeout 15m`, for a slow first-time CloudNativePG bootstrap), then drop back to the async default. Check the Job with `kubectl get jobs -l app.kubernetes.io/component=migrate` and its logs if a deploy looks wrong.
 - [ ] Please make sure you have static passwords for your database passwords (for Redis, ClickHouse and PostgreSQL).
 - [ ] Please set `oneuptimeSecret` and `encryptionSecret` (or setup in `externalSecrets` section) to a long random string. You can use a password generator to generate these strings.
 - [ ] Please set `probes.<key>.key` to a long random string. This is used to secure your probes.
diff --git a/HelmChart/Public/oneuptime/templates/migrate-job.yaml b/HelmChart/Public/oneuptime/templates/migrate-job.yaml
@@ -1,17 +1,24 @@
 {{- if and .Values.migrate.enabled (not .Values.deployment.disableDeployments) }}
+{{- /*
+  Block the deploy on migrations (as a Helm hook) only when migrate.hook is true.
+  When false (default), render a regular async Job for BOTH install and upgrade,
+  so helm returns immediately and migrations run in the background. NOTE: on a
+  FRESH install with hook=false the app pods come up against an un-migrated
+  database and stay unready (likely CrashLoopBackOff) until this async Job
+  finishes creating the schema — set hook=true if you need the install to block.
+*/ -}}
+{{- $useHook := .Values.migrate.hook }}
 # =============================================================================
-# Database migration Job (opt-in: migrate.enabled).
+# Database migration Job.
 #
 # Runs schema + data migrations ONCE per release, in a single dedicated pod,
 # instead of on every app/worker/nginx replica on boot. The runtime pods are
 # then gated off (RUN_DATABASE_MIGRATIONS_ON_BOOT=false), so the data-migration
 # session advisory lock never runs on a pooled runtime connection — which is
 # what makes PgBouncer transaction-mode pooling safe.
 #
-# Helm hook: runs post-install (fresh installs, after the DB exists) and
-# pre-upgrade (before new pods roll, while the old release + DB are still up).
 # Connects DIRECTLY to the backend (DirectPostgres) so migrations never depend
-# on the pooler.
+# on the pooler. Blocking (hook) vs async is controlled by migrate.hook.
 # =============================================================================
 apiVersion: batch/v1
 kind: Job
@@ -24,12 +31,22 @@ metadata:
     app.kubernetes.io/managed-by: Helm
     app.kubernetes.io/component: migrate
     appname: oneuptime
+  {{- if $useHook }}
+  # Blocking: a Helm hook — the deploy waits for migrations before rolling the
+  # new pods (new code never runs against an un-migrated schema). post-install on
+  # a fresh install; pre-upgrade on an upgrade (only when migrate.hook=true).
   annotations:
-    "helm.sh/hook": post-install,pre-upgrade
+    "helm.sh/hook": {{ ternary "post-install" "pre-upgrade" .Release.IsInstall }}
     "helm.sh/hook-weight": "-5"
     "helm.sh/hook-delete-policy": before-hook-creation
+  {{- end }}
 spec:
   backoffLimit: {{ .Values.migrate.backoffLimit | default 6 }}
+  {{- if not $useHook }}
+  # Async (non-hook) Job — `helm upgrade` does not wait for it. Auto-clean after
+  # it finishes so old per-revision Jobs don't accumulate.
+  ttlSecondsAfterFinished: {{ .Values.migrate.ttlSecondsAfterFinished | default 86400 }}
+  {{- end }}
   {{- if .Values.migrate.activeDeadlineSeconds }}
   activeDeadlineSeconds: {{ .Values.migrate.activeDeadlineSeconds }}
   {{- end }}
diff --git a/HelmChart/Public/oneuptime/values.schema.json b/HelmChart/Public/oneuptime/values.schema.json
@@ -746,6 +746,12 @@
                 "enabled": {
                     "type": "boolean"
                 },
+                "hook": {
+                    "type": "boolean"
+                },
+                "ttlSecondsAfterFinished": {
+                    "type": "integer"
+                },
                 "backoffLimit": {
                     "type": "integer"
                 },
diff --git a/HelmChart/Public/oneuptime/values.yaml b/HelmChart/Public/oneuptime/values.yaml
@@ -477,6 +477,22 @@ clickhouseOperator:
 # docker-compose and enabled: false installs.
 migrate:
   enabled: true
+  # Whether the deploy WAITS for migrations:
+  #   false (default): migrations run as an async background Job (both install and
+  #     upgrade) — helm returns immediately and pods roll while migrations run.
+  #     Non-blocking deploys, BUT pods may run before migrations finish, so keep
+  #     your migrations backward-compatible (expand/contract: new code must
+  #     tolerate the old schema). On a FRESH install the app pods come up against
+  #     an un-migrated database and stay unready (likely CrashLoopBackOff) until
+  #     the Job finishes creating the schema — they self-heal once it does.
+  #   true: migrations run as a BLOCKING Helm hook (post-install on install,
+  #     pre-upgrade on upgrade) — the deploy waits before rolling new pods, so new
+  #     code never hits an un-migrated schema.
+  # Note: `helm upgrade --wait --wait-for-jobs` waits for the async Job too, so
+  # omit those flags if you want it non-blocking.
+  hook: false
+  # Auto-clean finished async Jobs after this many seconds (hook: false only).
+  ttlSecondsAfterFinished: 86400
   # Job retries (handles the DB not being ready yet on a fresh install).
   backoffLimit: 6
   # Optional hard timeout for the whole migration run (seconds). Empty = none.