feat(basilica): mint scoped operator key for tenant runtime traffic #239

Merged: epappas merged 1 commit into main from feat/basilica-scoped-operator-key on May 19, 2026

epappas (Collaborator) commented May 19, 2026

Summary

Splits per-tenant proxy auth into two scoped keys so a compromised tenant app no longer holds admin scope on its own LLMTrace pod (PR 2 of 4 in the Basilica security follow-up series).

  • admin_key (bootstrap admin): retained by the caller for the self-service / admin portal and used internally by the lifecycle layer.
  • api_key (operator): minted on the live proxy via POST /api/v1/auth/keys after readiness. This is the bearer token the tenant's runtime apps use. It cannot mint keys, manage tenants, read audit logs, or change feature flags.

Per-file changes

deployments/basilica/lifecycle.py

  • New TenantInstances.admin_key: Optional[str] field.
  • New helpers (all under 50 LOC, fully typed, fail-fast):
    • _admin_http_request: single urllib-based HTTP boundary against the proxy admin API. No new pip deps.
    • _bootstrap_tenant_in_proxy: POST /api/v1/tenants to materialise the per-pod tenant row.
    • _mint_operator_key: POST /api/v1/auth/keys with {name: "tenant-runtime", role: "operator", tenant_id: <uuid>}.
    • _find_tenant_by_label: GET /api/v1/tenants matched by name (used on restart to rediscover the tenant UUID).
    • _find_operator_key_record: GET /api/v1/auth/keys filtered to non-revoked tenant-runtime/operator.
    • _verify_or_remint_operator_key: restart-strategy entry point.
    • _inject_runtime_key_into_dashboard: adds LLMTRACE_AUTH_RUNTIME_KEY to the dashboard env.
  • provision() rewritten: bootstrap admin key → deploy proxy → wait ready → create tenant row → mint operator key → deploy dashboard with both keys.
  • update(strategy="recreate") delegates to provision() so the operator key is always re-minted (DB is gone).
  • update(strategy="restart") rediscovers tenant + verifies operator key; re-mints only if missing. When the key persists, api_key is None (caller carries forward — proxy stores only a hash).
  • _apply_proxy_auth renamed param from api_key to admin_key for clarity; behaviour unchanged (admin key injected into both envs as LLMTRACE_AUTH_ADMIN_KEY).
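
For readers who want a feel for the urllib boundary and the mint call described above, here is a minimal sketch. It is not the code in lifecycle.py: the function names (build_admin_request, admin_http_request, operator_key_payload), the AdminApiError class, and the choice to send X-LLMTrace-Tenant-ID on this call are assumptions made for illustration. Only the endpoint, the bearer header, and the payload fields come from this PR.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Optional


class AdminApiError(RuntimeError):
    """Hypothetical error type: raised when the proxy answers with a non-2xx status."""


def build_admin_request(
    base_url: str,
    path: str,
    admin_key: str,
    method: str = "GET",
    payload: Optional[dict[str, Any]] = None,
    tenant_id: Optional[str] = None,
) -> urllib.request.Request:
    """Build an authenticated request for the proxy admin API (no I/O here)."""
    headers = {"Authorization": f"Bearer {admin_key}", "Accept": "application/json"}
    if tenant_id is not None:
        # Assumption: the tenant header is optional and only sent when known.
        headers["X-LLMTrace-Tenant-ID"] = tenant_id
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def admin_http_request(request: urllib.request.Request, timeout: float = 10.0) -> Any:
    """Send the request and decode JSON; fail fast on any HTTP error status."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
    except urllib.error.HTTPError as exc:
        detail = exc.read().decode("utf-8", errors="replace")
        raise AdminApiError(
            f"{request.get_method()} {request.full_url} -> {exc.code}: {detail}"
        ) from exc
    return json.loads(body) if body else None


def operator_key_payload(tenant_uuid: str) -> dict[str, str]:
    """Payload for POST /api/v1/auth/keys, using the fields listed in this PR."""
    return {"name": "tenant-runtime", "role": "operator", "tenant_id": tenant_uuid}
```

Splitting request construction from the network call keeps the header and JSON logic testable without a server; the real helper may well combine both steps.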

deployments/basilica/cli.py

  • _serialise() emits admin_key alongside api_key.

.github/workflows/tenant-lifecycle.yml

  • ::add-mask:: now masks BOTH api_key and admin_key before any cat result.json operation.
  • Both keys are exposed as step outputs (downstream consumers fetch them via the Actions API).
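
The masking step can be illustrated with a small Python helper that turns a result document into ::add-mask:: workflow commands. This is a sketch only: the workflow may do this in shell, and the helper name mask_commands and the assumption that api_key and admin_key are top-level fields of result.json are mine. It skips empty values because api_key is null when a restart carries the existing key forward.

```python
from typing import Any, Mapping

# Fields treated as secrets; taken from the two keys named in this PR.
SECRET_FIELDS = ("api_key", "admin_key")


def mask_commands(result: Mapping[str, Any]) -> list[str]:
    """Return one GitHub Actions ::add-mask:: command per non-empty secret."""
    commands = []
    for field in SECRET_FIELDS:
        value = result.get(field)
        # A null or empty key (e.g. api_key after a restart) has nothing to mask.
        if isinstance(value, str) and value:
            commands.append(f"::add-mask::{value}")
    return commands
```

A workflow step would print each returned line to stdout before anything else reads or prints the result file.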

deployments/basilica/configs/examples/{starter,pro}.yaml

  • Comments rewritten to describe the two-tier model and the bootstrap sequence.

deployments/basilica/README.md

  • Section "Per-tenant API key auth" rewritten to describe the two keys, the bootstrap sequence, update semantics (recreate re-mints; restart verifies-or-re-mints), and the dashboard wiring follow-up.

deployments/basilica/tests/ (new)

  • conftest.py: installs a basilica SDK stub so unit tests run without the upstream SDK.
  • test_operator_key_minting.py: 19 tests covering the admin HTTP boundary, error paths, restart-flow helpers, dashboard env injection, the TenantInstances shape, the CLI's _serialise(), and a provision() integration test using a real http.server and a fake Basilica client.

Validation evidence

$ python3 -c "from deployments.basilica import lifecycle, cli; print('ok')"
ok

$ python3 -m pytest deployments/basilica/tests/ -v
...
============================== 19 passed in 8.71s ==============================

Both provision() and update() integration paths run against a real in-process HTTP server with hand-crafted proxy responses, so the urllib client's headers (Authorization: Bearer, X-LLMTrace-Tenant-ID), JSON encoding, error decoding, list-vs-dict response unwrapping, and the lifecycle layer's overall sequencing are all exercised by the test suite.
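
As a rough illustration of that testing approach (not the actual test code), the sketch below starts a throwaway http.server on a local port, answers POST /api/v1/auth/keys with a hand-crafted body, and records what the urllib client sent. The response fields ("key", "role"), the 201 status, and all names here are hypothetical; the real proxy's CreateApiKeyResponse shape is not shown in this PR.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeProxyHandler(BaseHTTPRequestHandler):
    """Hypothetical fake proxy; records each request for later assertions."""

    seen: list = []

    def do_POST(self):
        length = int(self.headers.get("Content-Length", "0"))
        body = json.loads(self.rfile.read(length) or b"{}")
        type(self).seen.append((self.path, self.headers.get("Authorization"), body))
        if self.path == "/api/v1/auth/keys":
            # Assumed response shape, for illustration only.
            status, payload = 201, {"key": "fake-operator-key", "role": body.get("role")}
        else:
            status, payload = 404, {"error": "not found"}
        encoded = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(encoded)))
        self.end_headers()
        self.wfile.write(encoded)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def mint_against_fake_proxy():
    """Start the fake proxy, send one mint request, return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), FakeProxyHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        request = urllib.request.Request(
            f"http://127.0.0.1:{server.server_port}/api/v1/auth/keys",
            data=json.dumps({"name": "tenant-runtime", "role": "operator"}).encode("utf-8"),
            headers={"Authorization": "Bearer test-admin-key",
                     "Content-Type": "application/json"},
            method="POST",
        )
        # Bypass any proxy configured in the environment for this local call.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return response.status, json.loads(response.read())
    finally:
        server.shutdown()
        server.server_close()
```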

The workflow YAML was syntax-checked with yaml.safe_load.

What is NOT live-validated

Live Basilica end-to-end validation is pending and will be run by the maintainer after merge. This worktree has no Basilica credentials and no live proxy URL to provision against. The HTTP boundary contract is exercised against a fake proxy that mirrors the real proxy's documented response shapes (auth.rs::CreateApiKeyResponse, tenant_api.rs::CreateTenantResponse), but the actual handshake against a freshly-deployed proxy pod has not been observed.

Suggested post-merge validation:

  1. Trigger tenant-lifecycle.yml with action=provision for a test tenant.
  2. Verify the workflow result JSON has both api_key (operator-scoped) and admin_key (admin-scoped) populated and both ::add-mask::'d in the run log.
  3. Try a proxy call with api_key → expect 200 on /v1/chat/completions, 403 on POST /api/v1/auth/keys (no admin scope).
  4. Try the same call with admin_key → expect 200 on both.
  5. Run action=update strategy=restart and confirm api_key is null in the result (existing key carried forward).
  6. Run action=update strategy=recreate and confirm a fresh api_key is returned.
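
Steps 3 and 4 could be scripted along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the chat-completions call is assumed to be a POST, and the expected codes are exactly the ones listed above (a live proxy might, for example, answer a successful mint with a different 2xx code).

```python
import urllib.error
import urllib.request

# Expected status per (key role, call), copied from validation steps 3 and 4.
EXPECTED_STATUS = {
    ("operator", "POST /v1/chat/completions"): 200,
    ("operator", "POST /api/v1/auth/keys"): 403,
    ("admin", "POST /v1/chat/completions"): 200,
    ("admin", "POST /api/v1/auth/keys"): 200,
}


def observed_status(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Return the HTTP status of a call, treating 4xx/5xx as data, not errors."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


def scope_mismatches(observed: dict) -> list[str]:
    """Compare observed statuses with EXPECTED_STATUS and describe each mismatch."""
    problems = []
    for (role, call), want in EXPECTED_STATUS.items():
        got = observed.get((role, call))
        if got != want:
            problems.append(f"{role} key, {call}: expected {want}, got {got}")
    return problems
```

An empty list from scope_mismatches means the operator key is confined as intended; a 200 for the operator key on POST /api/v1/auth/keys would be the regression this PR is meant to prevent.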

Trade-offs

  • Restart-flow tenant rediscovery by label, not UUID. The lifecycle layer doesn't persist the proxy-side tenant UUID across calls (it's generated by the proxy at create time). On restart we list GET /api/v1/tenants and match by name == spec.tenant_id. This is O(n) in proxy-side tenants per restart and assumes the label is unique within the pod. Since the proxy is single-tenant-per-pod in this deployment model, that holds. If the pod ever holds multiple tenants we'd need to either propagate the UUID through the caller's DB or store it client-side.
  • urllib over requests/httpx. Keeps the dependency footprint identical (basilica-sdk + PyYAML). The trade-off is more boilerplate in _admin_http_request, but the API surface is small (5 calls).
  • Dashboard LLMTRACE_AUTH_RUNTIME_KEY is informational. The Next.js dashboard only reads LLMTRACE_AUTH_ADMIN_KEY today (dashboard/src/lib/api.ts, dashboard/src/lib/proxy-helpers.ts). The runtime key is set so the dashboard wiring follow-up becomes a pure dashboard change.
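
The restart-flow rediscovery in the first trade-off can be sketched as two pure functions over already-decoded list responses. The wrapper keys ("tenants", "keys") and the record fields ("name", "role", "revoked") are assumptions about the proxy's JSON, and the function names only echo the helpers listed earlier; the real implementations may differ.

```python
from typing import Any, Optional


def unwrap_items(payload: Any, wrapper_key: str) -> list[dict[str, Any]]:
    """Accept either a bare JSON list or a dict wrapping the list under a key."""
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict) and isinstance(payload.get(wrapper_key), list):
        return payload[wrapper_key]
    raise ValueError(f"unexpected response shape: {type(payload).__name__}")


def find_tenant_by_label(payload: Any, label: str) -> Optional[dict[str, Any]]:
    """Linear scan for the tenant whose name equals the label (spec.tenant_id)."""
    matches = [t for t in unwrap_items(payload, "tenants") if t.get("name") == label]
    if len(matches) > 1:
        # The single-tenant-per-pod assumption no longer holds; fail fast.
        raise ValueError(f"label {label!r} matches {len(matches)} tenants")
    return matches[0] if matches else None


def find_operator_key_record(payload: Any) -> Optional[dict[str, Any]]:
    """Return the first non-revoked tenant-runtime/operator key record, if any."""
    for record in unwrap_items(payload, "keys"):
        if (
            record.get("name") == "tenant-runtime"
            and record.get("role") == "operator"
            and not record.get("revoked", False)
        ):
            return record
    return None
```

When find_operator_key_record returns None, the restart path would mint a new key; otherwise it leaves api_key as None for the caller to carry forward.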

Follow-ups (out of scope, tracked separately)

  • PR 3 — admin key rotation. The bootstrap admin key lives forever today; rotation tooling and a key-rotation workflow are the next PR in the series.
  • Dashboard wiring. Switch the dashboard to use LLMTRACE_AUTH_RUNTIME_KEY for tenant-facing traffic and reserve LLMTRACE_AUTH_ADMIN_KEY for the admin pages.

Test plan

  • Import sanity: python3 -c "from deployments.basilica import lifecycle, cli; print('ok')"
  • Unit + integration tests: python3 -m pytest deployments/basilica/tests/ -v (19 pass)
  • Workflow YAML parses (yaml.safe_load)
  • Live Basilica provision — maintainer to run post-merge
  • Live operator-key call against the deployed proxy — maintainer to run post-merge
  • Live restart-flow key persistence check — maintainer to run post-merge

Splits per-tenant proxy auth into two roles so a compromised tenant app
no longer holds admin scope on its own LLMTrace pod:

- admin_key (bootstrap): retained by the caller. Used by the lifecycle
  layer to mint per-tenant keys and by the self-service / admin portal.
- api_key (operator): minted on the live proxy via POST /api/v1/auth/keys
  after readiness, returned in TenantInstances.api_key. This is the
  bearer the tenant's runtime apps use.

provision() now bootstraps the per-pod tenant row via POST /api/v1/tenants,
mints the operator key, and injects LLMTRACE_AUTH_RUNTIME_KEY into the
dashboard env (informational; dashboard wiring is a follow-up).

update(strategy="restart") rediscovers the tenant by label, lists keys,
and re-mints only when the operator record is missing. update(strategy=
"recreate") always re-mints since the DB volume is destroyed.

cli.py emits admin_key alongside api_key. The tenant-lifecycle workflow
masks BOTH keys via ::add-mask:: before any cat result.json and exposes
both as step outputs.

Adds deployments/basilica/tests/ with 19 unit tests that exercise the
real urllib admin-API client against an in-process http.server, plus a
provision() integration test using a fake Basilica client.