Commit dfd0e39

docs(design): release strategy and self-serve update proposal
Design-first response to #383: current-state facts (stale v0.2.0 release, sha/latest-only images, :latest-pinned provisioning, recreate as the only upgrade path and its URL/data consequences), proposed cadence + release-notes template with machine-readable urgency, portal version pinning + update-available banner, and the three upgrade paths gated on the URL-stability decision. Section 6 lists the four sign-off decisions; nothing here is built yet.
1 parent 70177ed commit dfd0e39

1 file changed

Lines changed: 150 additions & 0 deletions

docs/design/release-strategy.md

@@ -0,0 +1,150 @@
# Release strategy and self-serve instance updates

Status: PROPOSAL (decision asks in section 6 require sign-off before any build).
Issue: techlab-innov/llmtrace#383 (mirrored as Portal#66). Author: platform
operations, 2026-06-12.

## 1. Current state (verified)

- Versioning: the cargo workspace is at `0.2.1`; the last git tag and GitHub
  Release are `v0.2.0` (2026-04-17). Everything merged since April is live on
  customers' instances but unreleased and unannounced.
- Images: `publish-images.yml` pushes `:sha-<7>`, `:main` and `:latest` on
  every qualifying push to main. There are no version-tagged images. Every
  published image IS uniquely addressable via its `sha-<7>` tag today.
- Known trigger gap: the workflow's path filter does not include
  `config.example.yaml` (the baked proxy config), so config-only changes need
  a manual `workflow_dispatch` to rebuild.
- Provisioning: the Portal pins `ghcr.io/...:latest` at provision time
  (`instances.image_version` exists but is never set, so the `"latest"`
  fallback applies). An instance therefore runs whatever `:latest` happened to
  be at its provision moment, and nothing records what that was.
- Upgrade machinery: the Basilica SDK has create/get/delete/restart/scale but
  NO in-place update. The only rebuild path is recreate, and:
  - Basilica URLs derive from the deployment UUID, so recreate ALWAYS changes
    the customer's proxy and dashboard URLs.
  - Per-deployment sqlite dies with the pod: traces, customer-minted proxy
    API keys, and tenant records/tokens are lost. The portal re-seeds tenant
    identity (stable `tenant_uuid`) and upstream config automatically, and
    re-bakes the same admin key from the vault (dashboard SSO survives), but
    it stores only hashes/metadata for customer API keys, so their plaintext
    cannot be restored, only re-minted with NEW values.
  - The proxy ML preload allows up to 1500s of startup, so a recreate can
    mean a multi-minute (worst case ~25 min) provisioning window.

## 2. What the customer asked for (#383)

1. A formal release cadence and release-notes policy that separates new
   features from security and bug fixes (including dependency-driven ones).
2. In-portal "update available" notification per instance.
3. A one-click upgrade button that shows the release notes before applying.
4. No forced rollouts: single-tenant customers control their own timing.

## 3. Proposed release process (product repo)

### 3.1 Versioning and artifacts

- Semver tags `vX.Y.Z` on main; one GitHub Release per tag (canonical record);
  CHANGELOG.md updated in the same PR that cuts the release.
- Extend `publish-images.yml` with a tag trigger: a `v*` tag builds BOTH
  images unconditionally and pushes `:X.Y.Z`, `:X.Y` and `:latest`. Release
  builds have no path filter, which closes the config-only gap for releases.
- First action once approved: cut `v0.3.0` covering everything since v0.2.0.
  The backlog includes security-relevant proxy changes (e.g. the /v1 de-dup
  and tenant-record upstream delivery) that deserve notes.
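
The tagging rules above can be sketched as a small function. This is only an illustration of the mapping from a git ref to image tags, not the workflow itself; the helper name `image_tags` and the use of full `refs/...` strings are assumptions made for the example.

```python
import re
from typing import List

# Release tags are assumed to look exactly like vX.Y.Z (no pre-release suffix).
RELEASE_TAG = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def image_tags(git_ref: str, sha: str) -> List[str]:
    """Return the image tags to push for a ref (hypothetical helper)."""
    if git_ref.startswith("refs/tags/"):
        match = RELEASE_TAG.fullmatch(git_ref[len("refs/tags/"):])
        if match is None:
            raise ValueError(f"not a release tag: {git_ref}")
        major, minor, patch = match.groups()
        # Proposed behaviour for a v* tag: :X.Y.Z, :X.Y and :latest.
        return [f"{major}.{minor}.{patch}", f"{major}.{minor}", "latest"]
    if git_ref == "refs/heads/main":
        # Current behaviour for a qualifying push to main (section 1).
        return [f"sha-{sha[:7]}", "main", "latest"]
    return []


print(image_tags("refs/tags/v0.3.0", "dfd0e39abcdef"))
print(image_tags("refs/heads/main", "dfd0e39abcdef"))
```

Rejecting tags that are not plain `vX.Y.Z` is a choice made for the sketch; the proposal does not say how pre-release tags should be handled.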

### 3.2 Release-notes template

Required sections, in this order, every release:

```
## Security fixes <- CVEs / dependency bumps / hardening, each with impact
## Bug fixes
## Features
## Operational notes <- migrations, env/config changes, expected downtime
Upgrade urgency: security | recommended | optional
```

The urgency line is machine-readable on purpose: the portal surfaces it on
the update banner (a `security` release renders differently from `optional`).
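
A minimal sketch of how a consumer might read the urgency line from a release body. The function name `parse_urgency` and the choice to return `None` for a missing or malformed line are assumptions, not part of the proposal.

```python
import re
from typing import Optional

URGENCY_LEVELS = ("security", "recommended", "optional")

# Matches a line that holds exactly one value, e.g. "Upgrade urgency: security".
_URGENCY_LINE = re.compile(r"^Upgrade urgency:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def parse_urgency(release_notes: str) -> Optional[str]:
    """Return the declared urgency, or None if it is absent or not recognised."""
    match = _URGENCY_LINE.search(release_notes)
    if match is None:
        return None
    value = match.group(1).lower()
    return value if value in URGENCY_LEVELS else None


notes = "## Security fixes\n- dependency bump\n\nUpgrade urgency: security\n"
print(parse_urgency(notes))
```

An unfilled template line (`security | recommended | optional`) deliberately fails to parse, so a release published without choosing a level is treated as having no urgency rather than guessing one.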

### 3.3 Cadence

- Minor releases: monthly floor (more often when customer-visible work lands).
- Patch releases: as needed.
- Security releases: out-of-band, as soon as fixed; notes may be terse at
  publish time and amended after customers upgrade.

## 4. Proposed portal work (Portal repo)

### 4.1 Version pinning at provision (low risk, build first)

- New provisions resolve the latest release version and pin the concrete
  image tag (`:X.Y.Z`), recording it in `instances.image_version`. No more
  "whatever :latest was". Existing instances get `image_version` backfilled
  from their deployment's image reference where Basilica reports it, else
  marked `unknown (pre-versioning)`.
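
The backfill rule could look roughly like the sketch below. It assumes Basilica reports the image reference as a plain `registry/name:tag` string, and it treats moving tags such as `latest` and `main` as unknown because they do not identify a build; both points are assumptions. The image names in the example are placeholders.

```python
import re
from typing import Optional

UNKNOWN = "unknown (pre-versioning)"
_RELEASE_TAG = re.compile(r"\d+\.\d+\.\d+")


def backfill_image_version(image_ref: Optional[str]) -> str:
    """Derive image_version from a reported image reference (hypothetical)."""
    if not image_ref:
        return UNKNOWN
    name = image_ref.split("@", 1)[0]   # drop any @sha256:... digest
    last = name.rsplit("/", 1)[-1]      # ignore a registry port such as :5000
    if ":" not in last:
        return UNKNOWN
    tag = last.rsplit(":", 1)[1]
    # Only immutable tags identify what is actually running.
    if _RELEASE_TAG.fullmatch(tag) or tag.startswith("sha-"):
        return tag
    return UNKNOWN


print(backfill_image_version("ghcr.io/example/proxy:0.3.0"))
print(backfill_image_version("ghcr.io/example/proxy:latest"))
```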

### 4.2 Update-available detection

- A small cached check (kv table, refreshed by the existing cron, ~hourly):
  latest GitHub Release tag + urgency vs each instance's `image_version`.
- Banner on the portal dashboard and instance pages linking to the release
  notes. This ships WITHOUT the button and is honest on its own: customers
  learn that updates exist and what is in them.
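
The comparison behind the banner might be sketched as follows. The function names, the shape of the returned dict, and the decision to show the banner for instances whose version is unknown are all assumptions; the URL is a placeholder.

```python
from typing import Dict, Optional, Tuple


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Parse 'X.Y.Z' or 'vX.Y.Z' into a tuple of ints, else None."""
    if text.startswith("v"):
        text = text[1:]
    parts = text.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    return tuple(int(p) for p in parts)


def update_banner(image_version: str, latest_tag: str, urgency: str,
                  notes_url: str) -> Optional[Dict[str, str]]:
    """Return banner data when the latest release is newer, else None."""
    latest = parse_version(latest_tag)
    if latest is None:
        return None  # cached release tag is unusable, so show nothing
    current = parse_version(image_version)
    # Unknown versions (e.g. pre-versioning instances) get the banner too.
    if current is not None and current >= latest:
        return None
    return {"current": image_version, "latest": latest_tag,
            "urgency": urgency, "notes_url": notes_url}


print(update_banner("0.2.0", "v0.3.0", "security",
                    "https://example.invalid/releases/v0.3.0"))
```

Comparing integer tuples rather than strings avoids ordering `0.10.0` before `0.9.0`.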

### 4.3 The upgrade button (gated on the URL-stability decision)

Upgrade-by-recreate today means: new URLs, traces wiped, API keys re-minted
with new values, multi-minute window. Three paths:

- **Option A: ship now, brutally honest.** The confirm modal enumerates
  exactly what changes (URL churn, trace loss, key re-mint, downtime
  estimate); after upgrade the portal shows the new URLs and new keys
  (re-minted automatically, displayed once). Cheapest; worst experience;
  every client integration breaks on every upgrade until re-pointed.
- **Option B: portal-owned stable hostnames first (recommended).** One
  subdomain per instance (e.g. `{instance}.llmtrace.io`) on a portal-managed
  edge (Cloudflare DNS/Worker) pointing at the current Basilica URL. Upgrade
  becomes blue/green: create the new deployment, wait until ready (absorbs
  the ML-preload window with zero downtime), re-seed tenants/config, flip the
  alias, delete the old deployment. The customer URL never changes. This
  also retires URL churn for resizes and recovery recreates, and it is a
  prerequisite shared with the trace-durability roadmap. Residual loss: traces
  and minted keys still die with the sqlite volume (see 4.4).
- **Option C: Basilica feature ask.** In-place image update, stable URLs, or
  persistent volumes. Zero portal work if delivered; timeline not ours.
  Worth requesting in parallel regardless (it is also the blocker for trace
  durability), but not a plan on its own.
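
The blue/green order in Option B can be sketched against an in-memory stand-in. Nothing here is the real Basilica SDK or the real edge: `FakeBasilica`, its method signatures, the `aliases` dict standing in for the DNS/Worker alias, and the URL shape are all hypothetical. The point is the ordering, and that a failure before the flip leaves the old deployment serving.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class FakeBasilica:
    """In-memory stand-in; the real SDK's names and signatures may differ."""
    deployments: Dict[str, str] = field(default_factory=dict)  # id -> image

    def create(self, image: str) -> str:
        dep_id = str(uuid.uuid4())
        self.deployments[dep_id] = image
        return dep_id

    def delete(self, dep_id: str) -> None:
        del self.deployments[dep_id]

    def url(self, dep_id: str) -> str:
        # URLs derive from the deployment UUID, as described in section 1.
        return f"https://{dep_id}.deployments.example.invalid"


def blue_green_upgrade(client: FakeBasilica, aliases: Dict[str, str],
                       hostname: str, old_id: str, new_image: str,
                       wait_ready: Callable[[str], None],
                       reseed: Callable[[str], None]) -> str:
    new_id = client.create(new_image)      # 1. create the new deployment
    try:
        wait_ready(new_id)                 # 2. absorb the ML-preload window
        reseed(new_id)                     # 3. re-seed tenants/config
    except Exception:
        client.delete(new_id)              # abandon green; blue keeps serving
        raise
    aliases[hostname] = client.url(new_id) # 4. flip the alias
    client.delete(old_id)                  # 5. delete the old deployment
    return new_id


client = FakeBasilica()
old = client.create("ghcr.io/example/proxy:0.2.0")
aliases = {"acme.llmtrace.io": client.url(old)}
new = blue_green_upgrade(client, aliases, "acme.llmtrace.io", old,
                         "ghcr.io/example/proxy:0.3.0",
                         wait_ready=lambda dep_id: None,
                         reseed=lambda dep_id: None)
print(aliases["acme.llmtrace.io"] == client.url(new))
```

Deleting the old deployment only after the alias flip is what keeps the customer hostname answering throughout; the sqlite loss described in 4.4 still applies to the new deployment.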

### 4.4 Customer API keys across upgrades (decision needed)

Even with stable URLs, recreate loses proxy-side key material. Choices:

- **Re-mint honestly (default):** the upgrade flow re-mints every active
  portal-managed key in the new proxy and shows the new values once; release
  notes say "rotate your clients". Zero product work; breaks client configs
  on every upgrade.
- **Proxy key export/import (product feature):** an admin-API pair to export
  encrypted key records and import them into a fresh deployment, letting the
  portal restore keys with IDENTICAL plaintext. Touches credential-handling
  code, so it needs its own security review before being scheduled.
- **External durable storage:** solves keys AND traces; blocked on Basilica
  persistent volumes / external DB (tracked separately).
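
To make the default choice concrete, here is a rough sketch of a re-mint pass over portal-side key records. The record fields, the `lt_` prefix, and SHA-256 as the stored hash are assumptions for illustration; the source only states that the portal keeps hashes/metadata and never the plaintext.

```python
import hashlib
import secrets
from typing import Dict, List


def remint_keys(key_records: List[dict]) -> Dict[str, str]:
    """Re-mint every active key; return {key_id: new plaintext} to show once."""
    shown_once: Dict[str, str] = {}
    for record in key_records:
        if not record["active"]:
            continue  # revoked keys are not carried over
        plaintext = "lt_" + secrets.token_urlsafe(32)  # hypothetical key format
        # Only the hash is persisted, so the old value cannot be recovered.
        record["hash"] = hashlib.sha256(plaintext.encode()).hexdigest()
        shown_once[record["id"]] = plaintext
    return shown_once


records = [
    {"id": "k1", "label": "ci", "hash": "old-hash-1", "active": True},
    {"id": "k2", "label": "legacy", "hash": "old-hash-2", "active": False},
]
new_values = remint_keys(records)
print(sorted(new_values))
```

Because the new plaintext exists only in the returned mapping, the caller has to display it immediately; this is the "shows the new values once" behaviour, and it is also why every client has to be reconfigured after an upgrade.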

## 5. Sequencing

1. Now (after sign-off): 3.1–3.3. Cut v0.3.0 with backfilled notes and add
   tag-triggered image builds.
2. Portal: 4.1 version pinning, then 4.2 update banner. No button yet.
3. ADR with the operator: pick A/B/C (4.3) and the key story (4.4).
4. Build the button on whatever step 3 decides.

## 6. Decision asks

- D1: approve the cadence + template (3.2/3.3) and cutting v0.3.0.
- D2: approve version-pinned provisioning replacing `:latest` (4.1).
- D3: choose the upgrade path: A (honest churn), B (stable hostnames,
  recommended), or C-first (wait on Basilica).
- D4: choose the key-continuity story: honest re-mint, or schedule the proxy
  key export/import feature (with security review).
