fix(deploy): accept idempotent 200 in tenant bootstrap (fixes catch-all provision abort + dashboard outage)#343

Merged
epappas merged 1 commit into main from fix/lifecycle-bootstrap-accept-200 on May 30, 2026

@epappas epappas commented May 30, 2026

Collaborator

Outage fix

On 2026-05-30, a recreate using the just-merged catch-all PRs (#341 lifecycle + #342 proxy) aborted mid-provision, after the dashboard had been deleted but before it was recreated. The result was a live outage.

Root cause

Both the proxy (Option B, #342: ensure_tenant_exists at startup) and the lifecycle (Option A, #341) create the catch-all tenant. The proxy wins: it runs at startup, using the LLMTRACE_DEFAULT_TENANT_ID the lifecycle set. The lifecycle's subsequent POST /api/v1/tenants for the catch-all therefore goes through the idempotent create and returns 200 (already exists), not 201. _bootstrap_tenant_in_proxy raised on status != 201, so provisioning aborted before the dashboard was recreated. CI couldn't catch it because the bug only appears in the live A+B interaction.
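The ordering problem can be reproduced with a small in-memory model. This is only a sketch: `FakeProxy`, `strict_bootstrap`, the `"id"` body field, and the error message are hypothetical stand-ins, not the real proxy or lifecycle code. It shows an idempotent create returning 201 on first insert and 200 afterwards, and a strict `!= 201` check failing once the proxy has created the tenant first.

```python
class FakeProxy:
    """Hypothetical in-memory stand-in for the proxy's tenant store."""

    def __init__(self):
        self.tenants = {}

    def create_tenant(self, tenant_id):
        """Idempotent create: 201 on first insert, 200 if it already exists."""
        if tenant_id in self.tenants:
            return 200, self.tenants[tenant_id]
        body = {"id": tenant_id}  # assumed body shape
        self.tenants[tenant_id] = body
        return 201, body


def strict_bootstrap(proxy, tenant_id):
    """Models the pre-fix check: anything other than 201 is an error."""
    status, body = proxy.create_tenant(tenant_id)
    if status != 201:
        raise RuntimeError(f"tenant bootstrap failed: HTTP {status}")
    return body["id"]


proxy = FakeProxy()
catch_all = "catch-all"  # placeholder tenant id

# Step 1: the proxy self-provisions the catch-all at startup.
startup_status, _ = proxy.create_tenant(catch_all)

# Step 2: the lifecycle then tries to create the same tenant.
try:
    strict_bootstrap(proxy, catch_all)
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)

print(startup_status, outcome)
```

In this model the second create is a successful no-op, yet the strict check treats it as a failure, which mirrors the abort described above.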

Fix

_bootstrap_tenant_in_proxy now accepts both 200 and 201. Both carry the same body; 200 is the idempotent already-exists response, which is the expected case now that the proxy self-provisions the catch-all. This is a one-line behavior change plus 3 regression tests (accept-200, accept-201, reject-5xx).
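A minimal sketch of the relaxed check, with the three regression cases expressed as plain assertions. The helper name `tenant_id_from_response`, its signature, the `"id"` field, and the error message are assumptions for illustration; the actual _bootstrap_tenant_in_proxy body is not shown in this PR description.

```python
from typing import Any, Mapping

# 201 = created, 200 = idempotent "already exists"; both carry the same body.
ACCEPTED_STATUSES = (200, 201)


def tenant_id_from_response(status: int, body: Mapping[str, Any]) -> str:
    """Return the tenant id for an accepted status, else raise RuntimeError."""
    if status not in ACCEPTED_STATUSES:
        raise RuntimeError(f"tenant bootstrap failed: HTTP {status}")
    return str(body["id"])  # assumed body field


def check_regressions() -> bool:
    """Mirror the accept-200, accept-201, and reject-5xx cases."""
    body = {"id": "tenant-1"}  # placeholder body
    assert tenant_id_from_response(200, body) == "tenant-1"
    assert tenant_id_from_response(201, body) == "tenant-1"
    try:
        tenant_id_from_response(500, body)
    except RuntimeError:
        return True
    return False


regressions_ok = check_regressions()
```

Using an explicit allow-list rather than a broad 2xx range keeps other statuses, such as 204 with no body, on the error path.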

Validation

  • Fix logic verified directly (200→id, 201→id, 500→RuntimeError).
  • Live: re-ran the recreate with this fix. The dashboard was restored and provisioning completed, with operator tenant 550e8400 and generated catch-all a0f7d8a5. Tenant-less /v1 requests resolve to the catch-all, cost tracking and isolation are intact, and there is no 6ae1ab34.

Deploy tooling only — no image change, no redeploy needed (the live deployment already ran with this fix locally; this lands it on main).

…eady exists)

The proxy self-provisions the catch-all tenant at startup from
LLMTRACE_DEFAULT_TENANT_ID (ensure_tenant_exists), so the lifecycle's
subsequent POST /api/v1/tenants for the catch-all returns 200 (already exists),
not 201. _bootstrap_tenant_in_proxy raised on anything != 201, aborting
provisioning BEFORE the dashboard was recreated — a live outage on 2026-05-30.
Accept 200 and 201 (both carry the same body). Regression tests added.
@epappas epappas merged commit aa4c569 into main May 30, 2026
14 checks passed
@epappas epappas deleted the fix/lifecycle-bootstrap-accept-200 branch May 30, 2026 19:21