Commit 8e6aff0

feat(migrations): enhance migration Job configuration for async execution and blocking behavior
1 parent 7dbacc8 commit 8e6aff0

5 files changed

Lines changed: 70 additions & 11 deletions

HelmChart/Docs/Postgres.md

Lines changed: 25 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -373,11 +373,31 @@ The chart **rejects `poolMode: transaction` unless `migrate.enabled: true`** to
373373
prevent the unsafe combination.
374374

375375
Notes on the migration Job:
376-
- It is a `pre-upgrade` + `post-install` Helm hook. On upgrades it runs before
377-
the new pods roll (old release still serving). On a **fresh** install an init
378-
container waits for the database first, then the Job creates the schema; the
379-
app pods may briefly fail readiness until it completes. A very slow first-time
380-
cluster bootstrap may need a longer `helm upgrade --install --timeout`.
376+
- **Deploys do not block by default** (`migrate.hook: false`): the migration Job
377+
runs as a regular async Job for both install and upgrade, so `helm` returns
378+
immediately and pods roll while migrations run in the background. The trade-off
379+
is that **pods can start before migrations finish**, so keep your migrations
380+
*backward-compatible* (the new code tolerates the old schema — the
381+
expand/contract pattern). Set `migrate.hook: true` to make deploys block on a
382+
Helm hook instead (`post-install` on install, `pre-upgrade` on upgrade); the
383+
deploy then waits and new code never hits an un-migrated schema.
384+
- **Fresh-install caveat with `migrate.hook: false`:** on a brand-new install the
385+
app pods come up against an empty (un-migrated) database and will stay unready —
386+
likely `CrashLoopBackOff` — until the async Job finishes creating the schema,
387+
then self-heal. The Job's own init container waits for the database first, so a
388+
slow first-time cluster bootstrap may need a longer `helm upgrade --install
389+
--timeout`. If you want a clean first install, run it once with
390+
`--set migrate.hook=true` (blocks until the schema exists), then drop back to
391+
the async default for routine upgrades.
392+
393+
More on the migration Job:
394+
- `helm upgrade --wait` blocks until the release's pods are ready, and adding
395+
`--wait-for-jobs` also waits for Jobs, so pass neither if you want the async upgrade to stay non-blocking.
396+
- Finished async Jobs auto-clean after `migrate.ttlSecondsAfterFinished` (default
397+
1 day).
398+
- Alternatively, run migrations as a **separate step** before the deploy (e.g.
399+
`helm template` the Job and `kubectl apply` it, or a CI stage) and keep the
400+
app deploy itself migration-free.
381401
- Set `migrate.enabled: false` to restore the legacy "every pod migrates on
382402
boot" model (required if you keep `poolMode: session` and want boot
383403
migrations). The advisory lock stays in the code — it still protects that boot
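
As an illustration of the blocking mode described in the notes above, a values override for a clean first install might look like the following. This is a sketch under stated assumptions, not the chart's recommended configuration: the keys (`migrate.enabled`, `migrate.hook`, `migrate.backoffLimit`) come from this commit, while the file name `values-first-install.yaml` is hypothetical.

```yaml
# values-first-install.yaml (hypothetical file name)
# Blocking mode: the migration Job runs as a Helm hook, so the deploy
# waits until the schema exists before the app pods roll.
migrate:
  enabled: true
  hook: true
  # Job retries; 6 is the default shown in the chart's values.yaml.
  backoffLimit: 6
```

For routine upgrades, dropping this override lets `migrate.hook` fall back to `false`, at which point `migrate.ttlSecondsAfterFinished` governs cleanup of the finished async Job.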

HelmChart/Public/oneuptime/README.md

Lines changed: 1 addition & 1 deletion
@@ -503,7 +503,7 @@ clickhouse:
503503
- [ ] For production high availability, run PostgreSQL and ClickHouse under their bundled operators instead of the single, standalone built-ins. Set `postgresOperator.cnpg.enabled: true` (CloudNativePG — streaming replication and automatic failover) and `clickhouseOperator.altinity.enabled: true` (Altinity — replication, sharding, and declarative lifecycle management). See the **Operator-managed PostgreSQL** and **Operator-managed ClickHouse** sections above for the full configuration. Enabling an operator bootstraps a fresh, empty cluster — if you already run a standalone database, follow the migration runbooks ([PostgreSQL](../../Docs/MigratePostgresStandaloneToOperator.md), [ClickHouse](../../Docs/MigrateClickhouseStandaloneToOperator.md)) to move your data first.
504504
- [ ] Enable the dedicated worker deployment so background jobs (telemetry ingestion, notifications, incident/alert processing, workflows) run in their own pods instead of competing with API requests on the shared event loop. Set `worker.enabled: true` — the `app` pods then stop consuming queues and the worker drains them. The worker becomes REQUIRED for all background work, so keep `worker.keda.minReplicas >= 1`, and set `app.keda.targetCPUUtilizationPercentage` (with `app.resources.requests.cpu`) so the API tier still autoscales once its queue-size trigger is disabled.
505505
- [ ] Put the bundled PgBouncer connection pooler in front of PostgreSQL if you autoscale workers (KEDA) or use a connection-limited managed/external PostgreSQL — it keeps a connection storm (for example, many worker pods booting at once) from exhausting the database. Set `pgbouncer.enabled: true`. It runs in `transaction` pool mode by default (the largest connection reduction, since idle client connections hold no backend connection), which is safe because migrations run in a dedicated Job (`migrate.enabled`, on by default) instead of on the pooled pods. Keep `pgbouncer.defaultPoolSize` and `pgbouncer.maxDbConnections` below your PostgreSQL `max_connections`. For an external/managed PostgreSQL, point `externalPostgres.host`/`.port` at the database and enable the pooler — or point them at your provider's own pooled endpoint (RDS Proxy, Neon `-pooler`, Supabase Supavisor) instead. See the **Connection pooling with PgBouncer** section in [Postgres.md](../../Docs/Postgres.md).
506-
- [ ] Confirm the database migration Job is healthy. With `migrate.enabled: true` (the default), schema and data migrations run once per release in a dedicated pre-upgrade / post-install Job rather than on every pod, so deploys gate on it. On a fresh install, an init container waits for the database before migrating; a slow first-time CloudNativePG bootstrap may need a longer `helm upgrade --install --timeout` (for example `--timeout 15m`). Check the Job with `kubectl get jobs -l app.kubernetes.io/component=migrate` and its logs if a deploy stalls.
506+
- [ ] Confirm the database migration Job is healthy. With `migrate.enabled: true` (the default), schema and data migrations run once per release in a dedicated Job rather than on every pod. By default the Job runs **asynchronously** (`migrate.hook: false`), so deploys never block. Pods may therefore start before migrations finish: keep your migrations backward-compatible, or set `migrate.hook: true` to make the deploy wait. Note: with the async default, a brand-new install leaves the app pods unready (likely `CrashLoopBackOff`) until the Job creates the schema; for a clean first install, run it once with `--set migrate.hook=true` (and a longer `helm upgrade --install --timeout`, e.g. `--timeout 15m`, for a slow first-time CloudNativePG bootstrap), then drop back to the async default. Check the Job with `kubectl get jobs -l app.kubernetes.io/component=migrate`, and read its logs if a deploy looks wrong.
507507
- [ ] Please make sure you have static passwords for your database passwords (for Redis, ClickHouse and PostgreSQL).
508508
- [ ] Please set `oneuptimeSecret` and `encryptionSecret` (or setup in `externalSecrets` section) to a long random string. You can use a password generator to generate these strings.
509509
- [ ] Please set `probes.<key>.key` to a long random string. This is used to secure your probes.
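
The PgBouncer and migration checklist items above interact, so a combined values sketch may help. It assumes that `poolMode` lives under the `pgbouncer` key (the docs name `poolMode` without its full path) and uses placeholder pool sizes; only the other key names are taken from the checklist text.

```yaml
# Sketch only: numeric values are placeholders, and the location of
# `poolMode` under `pgbouncer` is an assumption.
pgbouncer:
  enabled: true
  poolMode: transaction   # stated default; the chart rejects it unless migrate.enabled is true
  defaultPoolSize: 20     # placeholder; keep below PostgreSQL max_connections
  maxDbConnections: 50    # placeholder; keep below PostgreSQL max_connections
migrate:
  enabled: true           # default; migrations run in a dedicated Job, not on pooled pods
  hook: false             # default; async, so keep migrations backward-compatible
```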

HelmChart/Public/oneuptime/templates/migrate-job.yaml

Lines changed: 22 additions & 5 deletions
@@ -1,17 +1,24 @@
11
{{- if and .Values.migrate.enabled (not .Values.deployment.disableDeployments) }}
2+
{{- /*
3+
Block the deploy on migrations (as a Helm hook) only when migrate.hook is true.
4+
When false (default), render a regular async Job for BOTH install and upgrade,
5+
so helm returns immediately and migrations run in the background. NOTE: on a
6+
FRESH install with hook=false the app pods come up against an un-migrated
7+
database and stay unready (likely CrashLoopBackOff) until this async Job
8+
finishes creating the schema — set hook=true if you need the install to block.
9+
*/ -}}
10+
{{- $useHook := .Values.migrate.hook }}
211
# =============================================================================
3-
# Database migration Job (opt-in: migrate.enabled).
12+
# Database migration Job.
413
#
514
# Runs schema + data migrations ONCE per release, in a single dedicated pod,
615
# instead of on every app/worker/nginx replica on boot. The runtime pods are
716
# then gated off (RUN_DATABASE_MIGRATIONS_ON_BOOT=false), so the data-migration
817
# session advisory lock never runs on a pooled runtime connection — which is
918
# what makes PgBouncer transaction-mode pooling safe.
1019
#
11-
# Helm hook: runs post-install (fresh installs, after the DB exists) and
12-
# pre-upgrade (before new pods roll, while the old release + DB are still up).
1320
# Connects DIRECTLY to the backend (DirectPostgres) so migrations never depend
14-
# on the pooler.
21+
# on the pooler. Blocking (hook) vs async is controlled by migrate.hook.
1522
# =============================================================================
1623
apiVersion: batch/v1
1724
kind: Job
@@ -24,12 +31,22 @@ metadata:
2431
app.kubernetes.io/managed-by: Helm
2532
app.kubernetes.io/component: migrate
2633
appname: oneuptime
34+
{{- if $useHook }}
35+
# Blocking: a Helm hook — the deploy waits for migrations before rolling the
36+
# new pods (new code never runs against an un-migrated schema). post-install on
37+
# a fresh install; pre-upgrade on an upgrade (only when migrate.hook=true).
2738
annotations:
28-
"helm.sh/hook": post-install,pre-upgrade
39+
"helm.sh/hook": {{ ternary "post-install" "pre-upgrade" .Release.IsInstall }}
2940
"helm.sh/hook-weight": "-5"
3041
"helm.sh/hook-delete-policy": before-hook-creation
42+
{{- end }}
3143
spec:
3244
backoffLimit: {{ .Values.migrate.backoffLimit | default 6 }}
45+
{{- if not $useHook }}
46+
# Async (non-hook) Job — `helm upgrade` does not wait for it. Auto-clean after
47+
# it finishes so old per-revision Jobs don't accumulate.
48+
ttlSecondsAfterFinished: {{ .Values.migrate.ttlSecondsAfterFinished | default 86400 }}
49+
{{- end }}
3350
{{- if .Values.migrate.activeDeadlineSeconds }}
3451
activeDeadlineSeconds: {{ .Values.migrate.activeDeadlineSeconds }}
3552
{{- end }}
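
To make the two branches of the template concrete, here is roughly what the changed parts would render to. These are fragments with all other fields omitted, and they assume the default `backoffLimit` and `ttlSecondsAfterFinished` values from `values.yaml`; treat them as an illustration of the `$useHook` logic, not as captured `helm template` output.

```yaml
# migrate.hook=true during an upgrade: .Release.IsInstall is false,
# so ternary returns its second argument ("pre-upgrade").
metadata:
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 6
---
# migrate.hook=false (default): no hook annotations are rendered,
# and the finished Job is cleaned up after the TTL.
spec:
  backoffLimit: 6
  ttlSecondsAfterFinished: 86400
```

On a fresh install with `migrate.hook=true`, the same annotation would render as `post-install` instead, because `.Release.IsInstall` is then true.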

HelmChart/Public/oneuptime/values.schema.json

Lines changed: 6 additions & 0 deletions
@@ -746,6 +746,12 @@
746746
"enabled": {
747747
"type": "boolean"
748748
},
749+
"hook": {
750+
"type": "boolean"
751+
},
752+
"ttlSecondsAfterFinished": {
753+
"type": "integer"
754+
},
749755
"backoffLimit": {
750756
"type": "integer"
751757
},

HelmChart/Public/oneuptime/values.yaml

Lines changed: 16 additions & 0 deletions
@@ -477,6 +477,22 @@ clickhouseOperator:
477477
# docker-compose and enabled: false installs.
478478
migrate:
479479
enabled: true
480+
# Whether the deploy WAITS for migrations:
481+
# false (default): migrations run as an async background Job (both install and
482+
# upgrade) — helm returns immediately and pods roll while migrations run.
483+
# Non-blocking deploys, BUT pods may run before migrations finish, so keep
484+
# your migrations backward-compatible (expand/contract: new code must
485+
# tolerate the old schema). On a FRESH install the app pods come up against
486+
# an un-migrated database and stay unready (likely CrashLoopBackOff) until
487+
# the Job finishes creating the schema — they self-heal once it does.
488+
# true: migrations run as a BLOCKING Helm hook (post-install on install,
489+
# pre-upgrade on upgrade) — the deploy waits before rolling new pods, so new
490+
# code never hits an un-migrated schema.
491+
# Note: `helm upgrade --wait` blocks until pods are ready (and `--wait-for-jobs`
492+
# also waits for the async Job), so pass neither if you want it non-blocking.
493+
hook: false
494+
# Auto-clean finished async Jobs after this many seconds (hook: false only).
495+
ttlSecondsAfterFinished: 86400
480496
# Job retries (handles the DB not being ready yet on a fresh install).
481497
backoffLimit: 6
482498
# Optional hard timeout for the whole migration run (seconds). Empty = none.
