organize .env, point at keyvault#235
Draft
breardon2011 wants to merge 2 commits intomainfrom
Draft
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reorganizes runtime config so Key Vault is the source of truth for everything that isn't strictly per-instance bootstrap.
.envfiles keep covering the bootstrap (mode, region, paths, ports, per-worker IDs, the KV pointer); KV holds everything else — DB/Redis URLs, JWT secret, S3 credentials, sandbox defaults, autoscaler/Azure config, billing keys, observability tokens, and so on.This PR is the code half. The prod-side prep (populating
opencomputer-prod-kv, fixing drift, and pointing each prod VM at the vault) has already been done out-of-band so the merge can be safely landed without further setup.Code changes (two files)
internal/config/keyvault.gosecretMapping. Newshared-*prefix for values both server and worker need (DB URL, Redis URL, JWT secret, S3, sandbox defaults, sandbox domain, MAX_CAPACITY, S3 path-style, Segment write key, Axiom). Adds the missingserver-*mappings for autoscaler/Azure pool config, the WorkOS extras, the public CP HTTP_ADDR, the three Stripe URLs (success / cancel / Telegram-agent price ID). Existingserver-*andworker-*entries are kept as legacy aliases so unmigrated KVs continue to work.cmd/server/main.goworkerEnvtemplate the autoscaler bakes into new worker VMs via cloud-init. Drops the 14 fat values that are now KV-loaded at worker startup (JWT_SECRET,DATABASE_URL,REDIS_URL, allS3_*, sandbox defaults, sandbox domain, Axiom tokens, Segment write key, the dead-on-workerAZURE_KEY_VAULT_NAME). AddsSECRETS_VAULT_NAMEso spawned workers know which KV to query. Drops thelocalhost→CONTROLPLANE_IPrewrite block since KV holds real private IPs already.Why merging is safe
LoadSecretsFromKeyVaultonly sets an env var if the var is currently unset (keyvault.go:111). With the existing fat.envfiles in place, every value the binary needs is already populated byEnvironmentFile=before KV loading runs, so KV-sourced values get skipped. You'll see one log line at startup:keyvault: loaded N secrets from <vault> (M skipped, already set)with M >> N. Functional config is byte-for-byte identical to today.For autoscaler-launched workers: the new
workerEnvtemplate writes a lean bootstrap.envthat includesSECRETS_VAULT_NAME=opencomputer-prod-kv. The worker boots, KV loadsshared-*(DB, Redis, JWT, S3, sandbox defaults, domain, etc.) into its empty env, and the worker behaves identically to a worker that booted off the old fat template. Same end-state, sourced differently.Pre-merge prep (already complete on prod)
SECRETS_VAULT_NAME=opencomputer-prod-kvappended to/etc/opensandbox/server.envonoc-controlplane-2and to/etc/opensandbox/worker.envon each running worker. No restart triggered — picks up on next deploy.opencomputer-prod.envfiles that isn't bootstrap (and has a corresponding mapping in this PR'ssecretMapping) is now present inopencomputer-prod-kv. 30+ entries added acrossshared-*andserver-*.opencomputer-prod-kv(62 secrets total, was 31)server-database-url,server-redis-url,worker-database-url,worker-redis-url) all had oldlocalhost/ wrong CP IP / wrong Redis protocol. All updated to match current.env.opencomputer-prod-kvThe same prep was applied earlier to
opensandbox-dev-kvand validated end-to-end (smoke test: create sandbox → exec → success).What this means for prod after merge
SECRETS_VAULT_NAMEfrom.env, queries KV, every value is "skipped, already set" → boot identical to today..envis unchanged, they don't depend on KV.worker.envtemplate via cloud-init, pullshared-*from KV, boot cleanly with the same effective env as old fat-template workers.GetLeastLoadedWorkernever picks it → no sandboxes placed → autoscaler terminates afterpendingWorkerTTL(10 min) and retries with backoff. Existing sandboxes and workers untouched.Sentry footnote
server-sentry-dsnandworker-sentry-dsnwere already populated in prod KV (pre-this-PR), butOPENSANDBOX_SENTRY_DSNis not in the current.env. After the post-merge restart, KV will load that DSN into the env and Sentry will start firing on the CP and any workers that subsequently restart. If that's intended (the values in KV suggest someone deliberately set them up), no action needed; otherwiseaz keyvault secret deletethem before the deploy lands.Optional: full cutover (separate operation, not in this PR)
When you want
.envto actually shrink (rather than just be redundant with KV), run a per-VM thin operation:.env(cp /etc/opensandbox/server.env /etc/opensandbox/server.env.bak-<ts>).SECRETS_VAULT_NAME).systemctl daemon-reload && systemctl restart opensandbox-{server,worker}.After restart, the journalctl log line should show
keyvault: loaded N secrets … M skippedwith N now ≈ 30 on the CP and ≈ 16 on each worker (vs. ~5 pre-cutover; M is the duplication betweenserver-*/worker-*legacy aliases andshared-*).Rollback for the cutover is
cp /etc/opensandbox/server.env.bak-<ts> /etc/opensandbox/server.env && systemctl daemon-reload && systemctl restart— ~30 s of CP downtime per VM.This PR doesn't perform the cutover; it only makes it possible. Cutover is opt-in per environment, applied when convenient, with no coupling to the merge.
Future cleanup
Once every environment is fully cut over and stable, drop the legacy aliases (
server-database-url,server-redis-url,server-jwt-secret,server-axiom-*,worker-database-url,worker-redis-url,worker-jwt-secret,worker-s3-*,worker-axiom-*) fromsecretMappingandaz keyvault secret deletetheir KV duplicates. They're functionally redundant withshared-*after both prefixes hold the same values, but kept here so unmigrated environments continue to work during the rollout window.