| 1 | +# Release strategy and self-serve instance updates |
| 2 | + |
| 3 | +Status: PROPOSAL — decision asks in section 6 require sign-off before any build. |
| 4 | +Issue: techlab-innov/llmtrace#383 (mirrored as Portal#66). Author: platform |
| 5 | +operations, 2026-06-12. |
| 6 | + |
| 7 | +## 1. Current state (verified) |
| 8 | + |
| 9 | +- Versioning: the cargo workspace is at `0.2.1`; the last git tag and GitHub |
| 10 | + Release are `v0.2.0` (2026-04-17). Anything merged since then can already be |
| 11 | + running on customers' instances (via `:latest`), yet it is unreleased and unannounced. |
| 12 | +- Images: `publish-images.yml` pushes `:sha-<7>`, `:main` and `:latest` on |
| 13 | + every qualifying push to main. There are no version-tagged images. Every |
| 14 | + published image IS uniquely addressable via its `sha-<7>` tag today. |
| 15 | +- Known trigger gap: the workflow's path filter does not include |
| 16 | + `config.example.yaml` (the baked proxy config), so config-only changes need |
| 17 | + a manual `workflow_dispatch` to rebuild. |
| 18 | +- Provisioning: the Portal pins `ghcr.io/...:latest` at provision time |
| 19 | + (`instances.image_version` exists but is never set, so the `"latest"` |
| 20 | + fallback applies). An instance therefore runs whatever `:latest` happened to |
| 21 | + be at its provision moment, and nothing records what that was. |
| 22 | +- Upgrade machinery: the Basilica SDK has create/get/delete/restart/scale — |
| 23 | + NO in-place update. The only rebuild path is recreate, and: |
| 24 | + - Basilica URLs derive from the deployment UUID, so recreate ALWAYS changes |
| 25 | + the customer's proxy and dashboard URLs. |
| 26 | + - Per-deployment sqlite dies with the pod: traces, customer-minted proxy |
| 27 | + API keys, and tenant records/tokens are lost. The portal re-seeds tenant |
| 28 | + identity (stable `tenant_uuid`) and upstream config automatically, and |
| 29 | + re-bakes the same admin key from the vault (dashboard SSO survives), but |
| 30 | + it stores only hashes/metadata for customer API keys — their plaintext |
| 31 | + cannot be restored, only re-minted with NEW values. |
| 32 | + - The proxy ML preload allows up to 1500s of startup, so a recreate can |
| 33 | + mean a multi-minute (worst case ~25 min) provisioning window. |
| 34 | + |
| 35 | +## 2. What the customer asked for (#383) |
| 36 | + |
| 37 | +1. A formal release cadence and release-notes policy that separates new |
| 38 | + features from security and bug fixes (including dependency-driven ones). |
| 39 | +2. In-portal "update available" notification per instance. |
| 40 | +3. A one-click upgrade button that shows the release notes before applying. |
| 41 | +4. No forced rollouts — single-tenant customers control their own timing. |
| 42 | + |
| 43 | +## 3. Proposed release process (product repo) |
| 44 | + |
| 45 | +### 3.1 Versioning and artifacts |
| 46 | + |
| 47 | +- Semver tags `vX.Y.Z` on main; one GitHub Release per tag (canonical record); |
| 48 | + CHANGELOG.md updated in the same PR that cuts the release. |
| 49 | +- Extend `publish-images.yml` with a tag trigger: a `v*` tag builds BOTH |
| 50 | + images unconditionally (no path filter on release builds — closes the |
| 51 | + config-only gap for releases) and pushes `:X.Y.Z`, `:X.Y` and `:latest`. |
| 52 | +- First action once approved: cut `v0.3.0` covering everything since v0.2.0 |
| 53 | + (the backlog includes security-relevant proxy changes — e.g. the /v1 |
| 54 | + de-dup, tenant-record upstream delivery — that deserve notes). |
| 55 | + |
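The tag fan-out described in 3.1 can be sketched as a small helper. This is only an illustration of the mapping from a `vX.Y.Z` git tag to the three image tags; the function name `image_tags_for_release` is hypothetical, and rejecting pre-release suffixes is an assumption, since the proposal only mentions plain semver tags.

```python
import re

# Strict X.Y.Z matcher behind a leading "v". Pre-release and build suffixes
# are rejected in this sketch (assumption: releases use plain semver tags).
_TAG_RE = re.compile(r"v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def image_tags_for_release(git_tag: str) -> list:
    """Return the image tags a release build would push for `git_tag`."""
    match = _TAG_RE.fullmatch(git_tag)
    if match is None:
        raise ValueError(f"not a vX.Y.Z release tag: {git_tag!r}")
    major, minor, patch = match.groups()
    # Full version, floating minor line, and the floating latest tag.
    return [f"{major}.{minor}.{patch}", f"{major}.{minor}", "latest"]


if __name__ == "__main__":
    print(image_tags_for_release("v0.3.0"))
```

In the real workflow this mapping would more likely live in the workflow's tag/metadata step than in Python; the sketch only pins down which tags a given release is expected to produce.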
| 56 | +### 3.2 Release-notes template |
| 57 | + |
| 58 | +Required sections, in this order, every release: |
| 59 | + |
| 60 | +``` |
| 61 | +## Security fixes <- CVEs / dependency bumps / hardening, each with impact |
| 62 | +## Bug fixes |
| 63 | +## Features |
| 64 | +## Operational notes <- migrations, env/config changes, expected downtime |
| 65 | +Upgrade urgency: security | recommended | optional |
| 66 | +``` |
| 67 | + |
| 68 | +The urgency line is machine-readable on purpose — the portal surfaces it on |
| 69 | +the update banner (a `security` release renders differently from `optional`). |
| 70 | + |
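To make the "machine-readable" intent concrete, here is a minimal sketch of how the portal might extract the urgency line from a release body. The function name `parse_upgrade_urgency` and the choice to return `None` for a missing or unrecognised value are assumptions, not part of the proposal.

```python
import re

ALLOWED_URGENCIES = ("security", "recommended", "optional")

# One whole line of the form "Upgrade urgency: <word>". Only spaces and tabs
# are allowed around the value so the match cannot spill onto the next line;
# an optional trailing "\r" tolerates CRLF release bodies.
_URGENCY_RE = re.compile(
    r"^Upgrade urgency:[ \t]*([A-Za-z]+)[ \t]*\r?$", re.MULTILINE
)


def parse_upgrade_urgency(release_notes: str):
    """Return the urgency keyword, or None if it is absent or unrecognised."""
    match = _URGENCY_RE.search(release_notes)
    if match is None:
        return None
    value = match.group(1).lower()
    return value if value in ALLOWED_URGENCIES else None


if __name__ == "__main__":
    notes = "## Security fixes\n- example entry\n\nUpgrade urgency: security\n"
    print(parse_upgrade_urgency(notes))
```

Returning `None` rather than guessing lets the banner fall back to a neutral rendering when a release was published with the template line left unfilled.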
| 71 | +### 3.3 Cadence |
| 72 | + |
| 73 | +- Minor releases: at least monthly (more often when customer-visible work lands). |
| 74 | +- Patch releases: as needed. |
| 75 | +- Security releases: out-of-band, as soon as fixed; notes may be terse at |
| 76 | + publish time and amended after customers upgrade. |
| 77 | + |
| 78 | +## 4. Proposed portal work (Portal repo) |
| 79 | + |
| 80 | +### 4.1 Version pinning at provision (low risk, build first) |
| 81 | + |
| 82 | +- New provisions resolve the latest release version and pin the concrete |
| 83 | + image tag (`:X.Y.Z`), recording it in `instances.image_version`. No more |
| 84 | + "whatever :latest was". Existing instances get `image_version` backfilled |
| 85 | + from their deployment's image reference where Basilica reports it, else |
| 86 | + marked `unknown (pre-versioning)`. |
| 87 | + |
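The backfill rule in 4.1 could look roughly like the sketch below. The helper name `image_version_from_ref`, the example registry path, and the decision to accept `sha-<7>` tags as concrete versions are all assumptions made for illustration; only the `unknown (pre-versioning)` marker comes from the proposal.

```python
import re

UNKNOWN = "unknown (pre-versioning)"

# Tags treated as concrete in this sketch: X.Y.Z releases and sha-<7 hex>.
_CONCRETE_TAG_RE = re.compile(r"\d+\.\d+\.\d+|sha-[0-9a-f]{7}")


def image_version_from_ref(image_ref):
    """Derive a recordable version from an image reference, else UNKNOWN."""
    if not image_ref:
        return UNKNOWN
    # Drop any "@sha256:..." digest, then look only at the last path segment
    # so a registry port such as "localhost:5000" is not mistaken for a tag.
    name = image_ref.split("@", 1)[0]
    last_segment = name.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return UNKNOWN
    tag = last_segment.rsplit(":", 1)[1]
    return tag if _CONCRETE_TAG_RE.fullmatch(tag) else UNKNOWN


if __name__ == "__main__":
    print(image_version_from_ref("registry.example/org/proxy:0.3.0"))
    print(image_version_from_ref("registry.example/org/proxy:latest"))
```

Floating tags such as `latest` or `main` map to the unknown marker on purpose, since they say nothing about which build an instance actually runs.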
| 88 | +### 4.2 Update-available detection |
| 89 | + |
| 90 | +- A small cached check (kv table, refreshed by the existing cron, ~hourly): |
| 91 | + latest GitHub Release tag + urgency vs each instance's `image_version`. |
| 92 | +- Banner on the portal dashboard and instance pages linking to the release |
| 93 | + notes. This ships WITHOUT the button and is honest on its own: customers |
| 94 | + learn that updates exist and what is in them. |
| 95 | + |
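The comparison behind the banner can be sketched as follows. The names `parse_version` and `update_available` are hypothetical, and returning `None` for instances whose version cannot be parsed (for example the pre-versioning marker) is an assumption about how the portal might treat them.

```python
import re
from typing import Optional, Tuple

_VERSION_RE = re.compile(r"v?(\d+)\.(\d+)\.(\d+)")


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Parse 'X.Y.Z' or 'vX.Y.Z' into a tuple of ints, or None."""
    match = _VERSION_RE.fullmatch(text.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def update_available(image_version: str, latest_tag: str) -> Optional[bool]:
    """True/False when both versions parse, None when either is unknown."""
    current = parse_version(image_version)
    latest = parse_version(latest_tag)
    if current is None or latest is None:
        return None
    # Tuples compare numerically field by field, so 0.10.0 > 0.9.0.
    return latest > current


if __name__ == "__main__":
    print(update_available("0.2.1", "v0.3.0"))
    print(update_available("unknown (pre-versioning)", "v0.3.0"))
```

Comparing integer tuples rather than strings avoids the classic mistake of ranking `0.9.0` above `0.10.0`.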
| 96 | +### 4.3 The upgrade button — gated on the URL-stability decision |
| 97 | + |
| 98 | +Upgrade-by-recreate today means: new URLs, traces wiped, API keys re-minted |
| 99 | +with new values, multi-minute window. Three paths: |
| 100 | + |
| 101 | +- **Option A — ship now, brutally honest.** The confirm modal enumerates |
| 102 | + exactly what changes (URL churn, trace loss, key re-mint, downtime |
| 103 | + estimate); after upgrade the portal shows the new URLs and new keys |
| 104 | + (re-minted automatically, displayed once). Cheapest; worst experience; |
| 105 | + every client integration breaks on every upgrade until re-pointed. |
| 106 | +- **Option B — portal-owned stable hostnames first (recommended).** One |
| 107 | + subdomain per instance (e.g. `{instance}.llmtrace.io`) on a portal-managed |
| 108 | + edge (Cloudflare DNS/Worker) pointing at the current Basilica URL. Upgrade |
| 109 | + becomes blue/green: create the new deployment, wait until ready (absorbs |
| 110 | + the ML-preload window with zero downtime), re-seed tenants/config, flip the |
| 111 | + alias, delete the old deployment. The customer URL never changes — this |
| 112 | + also retires URL churn for resizes and recovery recreates, and the |
| 113 | + trace-durability roadmap needs the same prerequisite. Residual loss: traces and |
| 114 | + minted keys still die with the sqlite volume (see 4.4). |
| 115 | +- **Option C — Basilica feature ask.** In-place image update, stable URLs, or |
| 116 | + persistent volumes. Zero portal work if delivered; timeline not ours. |
| 117 | + Worth requesting in parallel regardless (it is also the blocker for trace |
| 118 | + durability), but not a plan on its own. |
| 119 | + |
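The blue/green ordering in Option B can be sketched against an in-memory stand-in. Everything here is hypothetical: `FakePlatform` and its methods are not the Basilica SDK or any Cloudflare API, and `acme.example` is a placeholder hostname. The point is only the order of operations and the rollback when the new deployment never becomes ready.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakePlatform:
    """In-memory stand-in for the deployment platform plus the edge aliases."""
    deployments: Dict[str, str] = field(default_factory=dict)  # id -> image tag
    aliases: Dict[str, str] = field(default_factory=dict)      # hostname -> id
    log: List[str] = field(default_factory=list)
    counter: int = 0

    def create(self, image_tag: str) -> str:
        self.counter += 1
        dep_id = f"dep-{self.counter}"
        self.deployments[dep_id] = image_tag
        self.log.append(f"create {dep_id}")
        return dep_id

    def wait_ready(self, dep_id: str) -> bool:
        # A real check would poll until the ML preload finishes or times out.
        self.log.append(f"ready {dep_id}")
        return dep_id in self.deployments

    def seed(self, dep_id: str) -> None:
        self.log.append(f"seed {dep_id}")

    def flip_alias(self, hostname: str, dep_id: str) -> None:
        self.aliases[hostname] = dep_id
        self.log.append(f"alias {hostname} -> {dep_id}")

    def delete(self, dep_id: str) -> None:
        del self.deployments[dep_id]
        self.log.append(f"delete {dep_id}")


def blue_green_upgrade(platform, hostname: str, new_image_tag: str) -> str:
    """Create, wait, seed, flip, then delete the old deployment."""
    old_id = platform.aliases[hostname]
    new_id = platform.create(new_image_tag)
    if not platform.wait_ready(new_id):
        # Roll back: the alias still points at the old deployment.
        platform.delete(new_id)
        raise RuntimeError(f"{new_id} never became ready; upgrade aborted")
    platform.seed(new_id)
    platform.flip_alias(hostname, new_id)
    platform.delete(old_id)
    return new_id


if __name__ == "__main__":
    platform = FakePlatform()
    platform.aliases["acme.example"] = platform.create("0.2.1")
    blue_green_upgrade(platform, "acme.example", "0.3.0")
    print("\n".join(platform.log))
```

Deleting the old deployment last is what keeps the customer hostname serving throughout; the residual trace and key loss described in 4.4 is not addressed by this sequence.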
| 120 | +### 4.4 Customer API keys across upgrades (decision needed) |
| 121 | + |
| 122 | +Even with stable URLs, recreate loses proxy-side key material. Choices: |
| 123 | + |
| 124 | +- **Re-mint honestly (default):** the upgrade flow re-mints every active |
| 125 | + portal-managed key in the new proxy and shows the new values once; release |
| 126 | + notes say "rotate your clients". Zero product work; breaks client configs |
| 127 | + on every upgrade. |
| 128 | +- **Proxy key export/import (product feature):** an admin-API pair to export |
| 129 | + encrypted key records and import them into a fresh deployment, letting the |
| 130 | + portal restore keys with IDENTICAL plaintext. Touches credential-handling |
| 131 | + code — needs its own security review before being scheduled. |
| 132 | +- **External durable storage:** solves keys AND traces; blocked on Basilica |
| 133 | + persistent volumes / external DB (tracked separately). |
| 134 | + |
| 135 | +## 5. Sequencing |
| 136 | + |
| 137 | +1. Now (after sign-off): 3.1–3.3 — cut v0.3.0 with backfilled notes; tag- |
| 138 | + triggered image builds. |
| 139 | +2. Portal: 4.1 version pinning, then 4.2 update banner. No button yet. |
| 140 | +3. ADR with the operator: pick A/B/C (4.3) and the key story (4.4). |
| 141 | +4. Build the button on whatever step 3 decides. |
| 142 | + |
| 143 | +## 6. Decision asks |
| 144 | + |
| 145 | +- D1: approve the cadence + template (3.2/3.3) and cutting v0.3.0. |
| 146 | +- D2: approve version-pinned provisioning replacing `:latest` (4.1). |
| 147 | +- D3: choose the upgrade path: A (honest churn), B (stable hostnames, |
| 148 | + recommended), or C-first (wait on Basilica). |
| 149 | +- D4: choose the key-continuity story: honest re-mint, or schedule the proxy |
| 150 | + key export/import feature (with security review). |