hub: optional startupProbe for slow-startup migration windows #44

Open
brandonSc wants to merge 2 commits into earthly:main from brandonSc:brandon/hub-startup-probe

Conversation

@brandonSc

Contributor

Summary

Adds an optional hub.startupProbe (disabled by default to preserve existing behaviour). When enabled, kubelet gives Hub the full failureThreshold × periodSeconds window to become Ready before liveness/readiness probes start counting failures.

Default tuning when enabled: 30 × 5s = 150s startup window.
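As a rough illustration, opting in from a values override might look like the sketch below. The `enabled`, `periodSeconds`, and `failureThreshold` names come from this description; the exact nesting under `hub.startupProbe` is an assumption based on the stated mirroring of `hub.livenessProbe` / `hub.readinessProbe`, so check the chart's `values.yaml` before relying on it.

```yaml
# values-override.yaml (hypothetical file name)
# Assumed layout: tuning keys sit directly under hub.startupProbe,
# mirroring the liveness/readiness probe shape.
hub:
  startupProbe:
    enabled: true          # default is false, which renders nothing
    periodSeconds: 5       # probe every 5s
    failureThreshold: 30   # 30 x 5s = 150s startup window
```

If a migration is expected to run longer than 150s, raising `failureThreshold` (for example 60 × 5s = 300s) lengthens the window without changing the steady-state liveness behaviour.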

Why

On 2026-05-26 the earthly-internal dogfood cluster was rolled from Hub 2.0.0 to e845ff9d. The new Hub takes ~52s to become Ready (Postgres schema migration v175 → v182, plus creation of the new sqlapi_user role). The existing chart probes are configured with initialDelaySeconds=0, periodSeconds=5, failureThreshold=3, which gives a 15s tolerance window, so kubelet killed the pod mid-migration and it went into a crashloop.
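To make the timing mismatch concrete, the existing probe timing described above corresponds to a fragment like the following. Only the three timing fields are taken from this description; the probe handler is omitted because its exact form is not stated here.

```yaml
# Timing fields of the existing liveness probe, as described above.
# The handler (httpGet/exec/tcpSocket) is intentionally left out.
livenessProbe:
  initialDelaySeconds: 0   # probing starts immediately
  periodSeconds: 5         # one attempt every 5s
  failureThreshold: 3      # 3 x 5s = 15s before kubelet restarts the container
# Observed startup on 2026-05-26: ~52s, well past the 15s budget
# but comfortably inside the 150s window (30 x 5s) of the new startupProbe.
```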

The workaround was an out-of-band kubectl patch deploy/lunar-hub that added a startupProbe imperatively. That patch would not survive a future helm upgrade, which would overwrite it with the chart-rendered Deployment.

Behaviour

| Mode | Rendered |
| --- | --- |
| hub.startupProbe.enabled: false (default) | nothing (same as today) |
| hub.startupProbe.enabled: true | full startupProbe: block with /health HTTP probe |

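While the startupProbe is running, kubelet does not run the liveness/readiness probes, so they cannot count failures against the pod during startup. Once the startupProbe succeeds for the first time, control hands back to liveness/readiness for ongoing health checks (no double-coverage). If the startupProbe exhausts its failureThreshold without succeeding, kubelet restarts the container.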
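For orientation, the block rendered into the Hub Deployment when the probe is enabled would plausibly resemble the sketch below. The `/health` path and the 30 × 5s tuning come from this description; the container name and the `port` value are placeholders, not taken from the chart.

```yaml
# Sketch of the rendered container spec with hub.startupProbe.enabled=true.
# "hub" and "http" are hypothetical names; the real chart may differ.
containers:
  - name: hub
    startupProbe:
      httpGet:
        path: /health      # HTTP probe path named in this PR
        port: http         # assumed named container port
      periodSeconds: 5
      failureThreshold: 30 # up to 150s for the first successful /health
```

With the default values nothing in this block is emitted, which is what the `grep -c startupProbe` check in the test plan confirms.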

Test plan

  • helm template with hub.startupProbe.enabled=true renders the full block (verified locally)
  • helm template with default values renders zero startupProbe: references (verified locally — grep -c startupProbe returns 0)
  • Apply to the earthly-internal cluster (replaces the kubectl-patched stopgap with chart-managed config) once this lands and the chart version bumps

Notes

  • No version bump in this PR — held for the next release cut. Chart 2.0.0 release happened May 20; the next minor would carry this (and other recent additions).
  • Mirrors the existing hub.livenessProbe / hub.readinessProbe shape exactly for consistency.

This PR was drafted by AI.

@brandonSc brandonSc requested a review from dchw as a code owner May 26, 2026 20:21

@me-bender me-bender Bot left a comment

Contributor

Opt-in, mirrors the liveness/readiness shape exactly, defaults render zero diff. Verified helm template both ways. Ship it.

@brandonSc brandonSc force-pushed the brandon/hub-startup-probe branch from 8f8e0c3 to 412afda Compare May 26, 2026 20:42
Adds an optional hub.startupProbe (disabled by default to preserve
existing behaviour). When enabled, kubelet gives Hub the full
failureThreshold × periodSeconds window to become Ready before the
liveness and readiness probes start counting failures.

Default tuning when enabled is 30 × 5s = 150s. Useful on a cold start
where Hub takes longer than the existing liveness probe's tolerance
(default 15s, periodSeconds=5 × failureThreshold=3) — for example
when Hub has to run a substantial Postgres schema migration as part
of an upgrade. Without this probe, large migrations risk being
interrupted mid-flight by kubelet restarting the pod.

Backward compatible: existing installs see no change unless they
explicitly set hub.startupProbe.enabled=true.
@brandonSc brandonSc force-pushed the brandon/hub-startup-probe branch from 412afda to 9706430 Compare May 26, 2026 20:43