fix(deploy): accept idempotent 200 in tenant bootstrap (fixes catch-all provision abort + dashboard outage) #343
Merged
…eady exists)
The proxy self-provisions the catch-all tenant at startup from LLMTRACE_DEFAULT_TENANT_ID (ensure_tenant_exists), so the lifecycle's subsequent POST /api/v1/tenants for the catch-all returns 200 (already exists), not 201. _bootstrap_tenant_in_proxy raised on anything != 201, aborting provisioning BEFORE the dashboard was recreated (a live outage on 2026-05-30). Accept 200 and 201 (both carry the same body). Regression tests added.
Outage fix
On 2026-05-30 a recreate using the just-merged catch-all PRs (#341 lifecycle + #342 proxy) aborted mid-provision and left the dashboard deleted-not-recreated (live outage).
Root cause
Both the proxy (Option B, #342: ensure_tenant_exists at startup) and the lifecycle (Option A, #341) create the catch-all tenant. The proxy wins: it runs at startup, from the LLMTRACE_DEFAULT_TENANT_ID the lifecycle set. The lifecycle's subsequent POST /api/v1/tenants for the catch-all therefore returns 200 (already exists) via the idempotent create, not 201. _bootstrap_tenant_in_proxy raised on status != 201, throwing before the dashboard was recreated. CI couldn't catch it because the bug only appears in the live A+B interaction.
Fix
_bootstrap_tenant_in_proxy accepts 200 and 201 (both carry the same body; 200 = idempotent already-exists, the expected case now that the proxy self-provisions the catch-all). One-line behavior change + 3 regression tests (accept-200, accept-201, reject-5xx).
Validation
550e8400, generated catch-all a0f7d8a5, tenant-less /v1 → catch-all, cost tracking + isolation intact, no 6ae1ab34.
Deploy tooling only: no image change, no redeploy needed (the live deployment already ran with this fix locally; this lands it on main).
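For readers without the diff open, the behavior change described under Fix can be sketched roughly as below. This is an illustration under assumptions, not the code in this PR: FakeProxy, bootstrap_tenant, BootstrapError, and ACCEPTED_STATUSES are hypothetical names, and the in-memory proxy only mimics an idempotent create that returns 201 on first creation and 200 with the same body afterwards.

```python
from dataclasses import dataclass, field

# Statuses treated as success: 201 = created, 200 = idempotent "already exists".
ACCEPTED_STATUSES = frozenset({200, 201})


class BootstrapError(RuntimeError):
    """Raised when the tenant bootstrap request is rejected (hypothetical name)."""


@dataclass
class FakeProxy:
    """Hypothetical in-memory stand-in for an idempotent tenant-create endpoint."""

    tenants: dict = field(default_factory=dict)

    def post_tenant(self, tenant_id: str) -> tuple[int, dict]:
        if tenant_id in self.tenants:
            # Already exists: same body as on creation, but status 200.
            return 200, self.tenants[tenant_id]
        self.tenants[tenant_id] = {"id": tenant_id}
        return 201, self.tenants[tenant_id]


def bootstrap_tenant(post, tenant_id: str) -> dict:
    """Create the tenant, accepting both 'created' and 'already exists'."""
    status, body = post(tenant_id)
    if status not in ACCEPTED_STATUSES:
        raise BootstrapError(f"tenant bootstrap failed: HTTP {status}")
    return body


proxy = FakeProxy()
# The proxy self-provisions the catch-all at startup, so it gets the 201.
proxy.post_tenant("catch-all")
# The lifecycle's later create sees 200 and must not abort provisioning.
body = bootstrap_tenant(proxy.post_tenant, "catch-all")
```

Checking membership in a small set of accepted statuses, rather than comparing against 201 alone, keeps the 5xx path failing loudly while letting either creation order succeed.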