organize .env, point at keyvault #235

Draft
breardon2011 wants to merge 2 commits into main from keyvault-first

breardon2011 commented May 7, 2026

Summary

Reorganizes runtime config so Key Vault is the source of truth for everything that isn't strictly per-instance bootstrap. .env files keep covering the bootstrap (mode, region, paths, ports, per-worker IDs, the KV pointer); KV holds everything else — DB/Redis URLs, JWT secret, S3 credentials, sandbox defaults, autoscaler/Azure config, billing keys, observability tokens, and so on.

This PR is the code half. The prod-side prep (populating opencomputer-prod-kv, fixing drift, and pointing each prod VM at the vault) has already been done out-of-band, so the merge can land safely without further setup.

Code changes (two files)

| File | Change |
| --- | --- |
| internal/config/keyvault.go | Adds 29 mappings to secretMapping. New shared-* prefix for values both server and worker need (DB URL, Redis URL, JWT secret, S3, sandbox defaults, sandbox domain, MAX_CAPACITY, S3 path-style, Segment write key, Axiom). Adds the missing server-* mappings for autoscaler/Azure pool config, the WorkOS extras, the public CP HTTP_ADDR, and the three Stripe settings (success URL, cancel URL, Telegram-agent price ID). Existing server-* and worker-* entries are kept as legacy aliases so unmigrated KVs continue to work. |
| cmd/server/main.go | Thins the workerEnv template the autoscaler bakes into new worker VMs via cloud-init. Drops the 14 fat values that are now KV-loaded at worker startup (JWT_SECRET, DATABASE_URL, REDIS_URL, all S3_*, sandbox defaults, sandbox domain, Axiom tokens, Segment write key, the dead-on-worker AZURE_KEY_VAULT_NAME). Adds SECRETS_VAULT_NAME so spawned workers know which KV to query. Drops the localhost → CONTROLPLANE_IP rewrite block since KV holds real private IPs already. |

Why merging is safe

LoadSecretsFromKeyVault only sets an env var if the var is currently unset (keyvault.go:111). With the existing fat .env files in place, every value the binary needs is already populated by EnvironmentFile= before KV loading runs, so KV-sourced values get skipped. You'll see one log line at startup: keyvault: loaded N secrets from <vault> (M skipped, already set) with M >> N. Functional config is byte-for-byte identical to today.

For autoscaler-launched workers: the new workerEnv template writes a lean bootstrap .env that includes SECRETS_VAULT_NAME=opencomputer-prod-kv. The worker boots, KV loads shared-* (DB, Redis, JWT, S3, sandbox defaults, domain, etc.) into its empty env, and the worker behaves identically to a worker that booted off the old fat template. Same end-state, sourced differently.

Pre-merge prep (already complete on prod)

| Step | What was done | Where |
| --- | --- | --- |
| Bootstrap pointer | SECRETS_VAULT_NAME=opencomputer-prod-kv appended to /etc/opensandbox/server.env on oc-controlplane-2 and to /etc/opensandbox/worker.env on each running worker. No restart triggered; picks up on next deploy. | 4 VMs in opencomputer-prod |
| KV population | Every value in the live .env files that isn't bootstrap (and has a corresponding mapping in this PR's secretMapping) is now present in opencomputer-prod-kv. 30+ entries added across shared-* and server-*. | opencomputer-prod-kv (62 secrets total, was 31) |
| Drift fix | 4 stale entries in prod KV (server-database-url, server-redis-url, worker-database-url, worker-redis-url) all had old localhost / wrong CP IP / wrong Redis protocol. All updated to match current .env. | opencomputer-prod-kv |

The same prep was applied earlier to opensandbox-dev-kv and validated end-to-end (smoke test: create sandbox → exec → success).

What this means for prod after merge

| Path | Behavior |
| --- | --- |
| CP redeploy via CI | Reads new SECRETS_VAULT_NAME from .env, queries KV, every value is "skipped, already set" → boot identical to today. |
| Existing workers | Untouched. Their .env is unchanged, they don't depend on KV. |
| Autoscaler-spawned new workers | Receive lean worker.env template via cloud-init, pull shared-* from KV, boot cleanly with the same effective env as old fat-template workers. |
| Worker fails to boot from KV | Worker never registers in Redis → GetLeastLoadedWorker never picks it → no sandboxes placed → autoscaler terminates after pendingWorkerTTL (10 min) and retries with backoff. Existing sandboxes and workers untouched. |

Sentry footnote

server-sentry-dsn and worker-sentry-dsn were already populated in prod KV (pre-this-PR), but OPENSANDBOX_SENTRY_DSN is not in the current .env. After the post-merge restart, KV will load that DSN into the env and Sentry will start firing on the CP and any workers that subsequently restart. If that's intended (the values in KV suggest someone deliberately set them up), no action needed; otherwise az keyvault secret delete them before the deploy lands.

Optional: full cutover (separate operation, not in this PR)

When you want .env to actually shrink (rather than just be redundant with KV), run a per-VM thin operation:

  1. Back up the live .env (cp /etc/opensandbox/server.env /etc/opensandbox/server.env.bak-<ts>).
  2. Replace it with bootstrap-only (mode + region + paths + ports + per-instance IDs + SECRETS_VAULT_NAME).
  3. systemctl daemon-reload && systemctl restart opensandbox-{server,worker}.

After restart, the journalctl log line should show keyvault: loaded N secrets … M skipped with N now ≈ 30 on the CP and ≈ 16 on each worker (vs. ~5 pre-cutover; M is the duplication between server-*/worker-* legacy aliases and shared-*).

Rollback for the cutover is cp /etc/opensandbox/server.env.bak-<ts> /etc/opensandbox/server.env && systemctl daemon-reload && systemctl restart opensandbox-server (the worker equivalent uses worker.env and opensandbox-worker), at a cost of ~30 s of CP downtime per VM.

This PR doesn't perform the cutover; it only makes it possible. Cutover is opt-in per environment, applied when convenient, with no coupling to the merge.

Future cleanup

Once every environment is fully cut over and stable, drop the legacy aliases (server-database-url, server-redis-url, server-jwt-secret, server-axiom-*, worker-database-url, worker-redis-url, worker-jwt-secret, worker-s3-*, worker-axiom-*) from secretMapping and az keyvault secret delete their KV duplicates. They're functionally redundant with shared-* after both prefixes hold the same values, but kept here so unmigrated environments continue to work during the rollout window.

breardon2011 changed the title from "organize keyvault secrets, keep only bootstrap or perworker in .env" to "organize .env, point at keyvault" on May 7, 2026