Commit 9ff7fcc

feat(infra): RUNNERS-count multi-runner provisioning; drop dead RUNNER_PRIVATE_IP
Add a `RUNNERS=N` knob to provision N runners on `sst deploy` with no API changes. The default runner is still auto-seeded by the API at boot; extras (runner-2..N) each get their own EC2 + minted token and are registered with the control plane after deploy via the admin API. Pairing is token-based and v2 runners self-report their address via healthcheck, so the API needs no multi-runner seed.

- sst.config.ts: add the `command` provider; create extra-runner EC2s (protect:true, same ignoreChanges as the default); RegisterExtraRunners (command.local.Command) runs scripts/register-runners.mjs after the EC2s.
- scripts/register-runners.mjs: poll GET /api/health, then POST /api/admin/runners per extra runner (Bearer admin key); 201 and 409 both treated as success, so redeploys are idempotent.
- Remove dead RUNNER_PRIVATE_IP / runnerEndpoint wiring + the DEFAULT_RUNNER_*URL env: they only fed the v0 seed branch, which this v2 deploy never hits. Keep DEFAULT_RUNNER_NAME/_API_KEY (the v2 seed needs them).
- .env.example / README: document RUNNERS + "Adding a runner"; remove the stale RUNNER_PRIVATE_IP "after first deploy" step.

Known limitations (follow-ups):

- scale-down is not symmetric: protect:true blocks a plain `RUNNERS` decrease (needs the decommission ceremony) and the provisioner does not deregister the control-plane row;
- extra EC2 tags come out as `boxlite-runner-runner-N`;
- `npx sst install` is required to fetch the `command` provider before deploy.
1 parent 71e1754 commit 9ff7fcc

4 files changed

Lines changed: 174 additions & 36 deletions

apps/infra/.env.example

Lines changed: 7 additions & 12 deletions
@@ -64,13 +64,11 @@ OIDC_CLIENT_ID=your-spa-client-id
 OIDC_AUDIENCE=https://dev.example.com/api

 # ─── 3. Runner ───────────────────────────────────────────────────────────────
-# [required at runtime] After the first deploy, the runner EC2 instance exists.
-# Get its private IP and redeploy so the API can route jobs to it. Unset → the
-# API points at localhost and no sandboxes work (deploy still succeeds).
-# aws ec2 describe-instances --region ap-southeast-1 \
-# --filters "Name=tag:Name,Values=boxlite-runner-default" \
-# --query 'Reservations[].Instances[].PrivateIpAddress' --output text
-RUNNER_PRIVATE_IP=
+# [optional] Total number of runners (the default runner is #1). Set >1 to add
+# more: each extra gets its own EC2 + token and is auto-registered with the
+# control plane during `sst deploy` (admin API). Unset = 1 (default only).
+# v2 runners self-report their address via healthcheck, so no IP wiring is needed.
+# RUNNERS=3

 # ─── 4. SSH Gateway ──────────────────────────────────────────────────────────
 # [required at runtime if you use SSH] Base64-encoded Ed25519 keys. Unset → the
@@ -121,12 +119,9 @@ SSH_HOST_KEY_B64=
 # DASHBOARD_BASE_API_URL=https://api.dev.example.com
 # APP_URL=

-# 5e. Runner wiring — derived from RUNNER_PRIVATE_IP; pin to override per-runner
-# endpoints.
+# 5e. Default runner name (rarely changed). v2 runners self-report their
+# address, so there are no domain/url knobs to pin.
 # DEFAULT_RUNNER_NAME=default
-# DEFAULT_RUNNER_DOMAIN=
-# DEFAULT_RUNNER_API_URL=
-# DEFAULT_RUNNER_PROXY_URL=

 # 5f. Auth0 Management API (account linking).
 # Gated behind OIDC_MANAGEMENT_API_ENABLED — off by default. Create a
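
Reviewer note: the `RUNNERS` value documented above is parsed in `sst.config.ts` as `Math.max(1, parseInt(envOr("RUNNERS", "1"), 10) || 1)`. The sketch below isolates that expression so its edge cases are easy to see. The helper name `parseRunnerCount` is hypothetical, and the fallback for an unset or empty value is an assumption about how `envOr` behaves.

```javascript
// Hypothetical helper wrapping the expression used in this commit.
// `raw` stands in for process.env.RUNNERS. Assumption: envOr falls back to
// "1" when the variable is undefined or empty.
function parseRunnerCount(raw) {
  const value = raw === undefined || raw === '' ? '1' : raw
  // parseInt yields NaN for non-numeric input; `|| 1` maps NaN and 0 to 1,
  // and Math.max clamps negative values up to 1.
  return Math.max(1, parseInt(value, 10) || 1)
}

console.log(parseRunnerCount(undefined)) // unset: default runner only
console.log(parseRunnerCount('3'))       // default + two extras
console.log(parseRunnerCount('0'))       // clamped to 1
console.log(parseRunnerCount('abc'))     // not a number, falls back to 1
```

In other words, any value that does not parse to an integer greater than 1 quietly behaves like the default single-runner setup, so a typo in `.env` would not fail the deploy.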

apps/infra/README.md

Lines changed: 23 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -38,21 +38,29 @@ from the failed step.
3838

3939
## After first deploy
4040

41-
One value needs to be fed back into `.env`:
41+
Nothing needs to be fed back into `.env`. The runner EC2 self-registers with the
42+
API on boot — v2 runners report their address via healthcheck — so sandboxes
43+
work as soon as the runner reaches `READY` (~30–60s), visible in the dashboard
44+
Runner table or `GET /admin/runners`.
4245

43-
```bash
44-
# Get runner private IP:
45-
aws ec2 describe-instances --region ap-southeast-1 \
46-
--filters "Name=tag:Name,Values=boxlite-runner" \
47-
--query 'Reservations[].Instances[].PrivateIpAddress' --output text
46+
### Adding a runner
4847

49-
# Add to .env:
50-
echo "RUNNER_PRIVATE_IP=10.0.x.y" >> .env
48+
The default runner is auto-seeded by the API at boot. To run more, set the total
49+
count and redeploy:
5150

52-
# Redeploy (~2 min):
51+
```bash
52+
echo "RUNNERS=3" >> .env # default runner (#1) + runner-2 + runner-3
5353
npx sst deploy --stage dev
5454
```
5555

56+
Each extra runner gets its own EC2 + minted token. Because the API only
57+
auto-seeds the single default, the extras are registered with the control plane
58+
by a post-deploy step (`RegisterExtraRunners` in `sst.config.ts`, which runs
59+
`scripts/register-runners.mjs` against the admin API once the API is healthy).
60+
It's idempotent — re-running `sst deploy` won't duplicate rows. Scaling **down**
61+
is the deliberate decommission ceremony under [Runner lifecycle](#runner-lifecycle),
62+
applied per runner.
63+
5664
> **Note:** `CLOUDFRONT_DOMAIN` is no longer needed — SST Router resolves
5765
> it automatically via your `STACK_DOMAIN`. The dashboard's API base URL
5866
> is likewise derived: `DASHBOARD_BASE_API_URL` defaults to
@@ -339,9 +347,12 @@ self-heals.
339347
**"Organization is suspended: Please verify your email address"** — Auth0 access_token
340348
missing `email_verified` claim. Deploy the Post-Login Action described above.
341349

342-
**Runner never registers with API**`RUNNER_PRIVATE_IP` in `.env` is stale or
343-
missing. Get the current IP and redeploy. The runner also self-registers via
344-
`RUNNER_DOMAIN` set from EC2 instance metadata.
350+
**Runner never reaches `READY`** — the runner pairs to its DB row by token
351+
(`BOXLITE_RUNNER_TOKEN`, baked into the EC2's user-data, must equal the row's
352+
`apiKey`), then self-reports its address via `POST /runners/healthcheck` using
353+
`RUNNER_DOMAIN` (set from EC2 instance metadata at boot). Check the runner's
354+
systemd logs (`aws ssm start-session``journalctl -u boxlite-runner`) for auth
355+
or connectivity errors to the API.
345356

346357
**Sandbox preview URL returns 503** — Proxy service may need a force-redeploy after
347358
initial setup: `aws ecs update-service --force-new-deployment --service Proxy`.
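
Reviewer note: the README says `RUNNERS=3` yields the default runner plus `runner-2` and `runner-3`, and the commit message lists the doubled `boxlite-runner-runner-N` tag as a known limitation. A minimal sketch of that naming, using the same `Array.from` pattern as `sst.config.ts`, makes both visible. The function names `extraRunnerNames` and `ec2NameTag` are hypothetical.

```javascript
// Extras are numbered from 2 because the default runner counts as #1.
function extraRunnerNames(total) {
  return Array.from({ length: Math.max(0, total - 1) }, (_, i) => `runner-${i + 2}`)
}

// The EC2 Name tag prepends "boxlite-runner-" to the runner name, which is
// where the doubled "runner-runner" in the known limitation comes from.
function ec2NameTag(name) {
  return `boxlite-runner-${name}`
}

for (const name of extraRunnerNames(3)) {
  console.log(name, '->', ec2NameTag(name))
}
```

With `total` equal to 1 the list is empty, which matches the commit's behavior of skipping the registration step when there are no extras.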
scripts/register-runners.mjs

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+// SPDX-License-Identifier: AGPL-3.0-only
+// Copyright (c) 2026 BoxLite AI
+
+/*
+ * Post-deploy registration of extra runners with the control plane.
+ *
+ * The API only auto-seeds the single default runner (from DEFAULT_RUNNER_*), so
+ * additional runners declared via RUNNERS must be registered through the admin
+ * API. This script is invoked by the `RegisterExtraRunners` command in
+ * sst.config.ts after the extra runner EC2s and the API service are up.
+ *
+ * Pairing is token-based: the runner row's apiKey must equal the
+ * BOXLITE_RUNNER_TOKEN baked into the matching EC2's user-data. SST mints one
+ * token per runner and passes the (name, token) pairs here via RUNNERS.
+ *
+ * Idempotent: a 409 (runner already exists in the region) is treated as
+ * success, so redeploys are safe.
+ *
+ * Env:
+ *   API_URL        base URL of the API service (e.g. https://api.example.com)
+ *   ADMIN_API_KEY  admin-scoped API key (Bearer) for POST /api/admin/runners
+ *   REGION_ID      region to register the runners in (default "us")
+ *   RUNNERS        JSON array of { name, apiKey }
+ */
+
+const { API_URL, ADMIN_API_KEY, REGION_ID = 'us', RUNNERS } = process.env
+
+const runners = JSON.parse(RUNNERS || '[]')
+if (runners.length === 0) {
+  process.exit(0)
+}
+if (!API_URL || !ADMIN_API_KEY) {
+  console.error('register-runners: API_URL and ADMIN_API_KEY are required')
+  process.exit(1)
+}
+
+const base = API_URL.replace(/\/+$/, '')
+const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
+
+// Wait for the API to start serving. /api/health returns 200 only after the
+// HTTP server is listening, which (in onApplicationBootstrap) is after the
+// default region and admin user have been seeded — both prerequisites for the
+// admin POST below.
+async function waitForApi() {
+  for (let attempt = 1; attempt <= 60; attempt++) {
+    try {
+      const res = await fetch(`${base}/api/health`)
+      if (res.ok) return
+    } catch {
+      // not up yet
+    }
+    await sleep(5000)
+  }
+  throw new Error(`register-runners: ${base}/api/health not ready after 5 minutes`)
+}
+
+async function register({ name, apiKey }) {
+  const res = await fetch(`${base}/api/admin/runners`, {
+    method: 'POST',
+    headers: {
+      'content-type': 'application/json',
+      authorization: `Bearer ${ADMIN_API_KEY}`,
+    },
+    body: JSON.stringify({ name, apiKey, apiVersion: '2', regionId: REGION_ID }),
+  })
+
+  if (res.status === 201) {
+    console.log(`register-runners: ${name} registered`)
+    return
+  }
+  if (res.status === 409) {
+    console.log(`register-runners: ${name} already registered`)
+    return
+  }
+  const body = await res.text().catch(() => '')
+  throw new Error(`register-runners: ${name} failed (${res.status}): ${body}`)
+}
+
+await waitForApi()
+for (const runner of runners) {
+  await register(runner)
+}
+console.log(`register-runners: done (${runners.length} runner(s))`)
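
Reviewer note: three small decisions in this script carry most of its behavior. It strips trailing slashes from `API_URL`, it treats 201 and 409 as success, and its "5 minutes" message follows from 60 attempts with a 5000 ms sleep. The sketch below restates them as pure functions so they can be checked without a network. The names `normalizeBase`, `registrationOutcome`, and `pollSleepBudgetMs` are hypothetical and do not appear in the commit.

```javascript
// Same regex as the script: drop one or more trailing slashes so that
// `${base}/api/health` never contains a double slash.
function normalizeBase(url) {
  return url.replace(/\/+$/, '')
}

// Mirrors the status handling in register(): 201 means created, 409 means the
// row already exists, and both count as success. Anything else is a failure.
function registrationOutcome(status) {
  if (status === 201) return 'registered'
  if (status === 409) return 'already registered'
  return 'failed'
}

// Total sleep time across the polling loop. Time spent inside each fetch is
// extra, so the real upper bound is somewhat longer than this figure.
function pollSleepBudgetMs(attempts, sleepMs) {
  return attempts * sleepMs
}

console.log(normalizeBase('https://api.example.com//'))
console.log(registrationOutcome(409))
console.log(pollSleepBudgetMs(60, 5000) / 60000, 'minutes of sleep')
```

Treating 409 as success is what makes a redeploy safe, but it also means a name collision with a different token would go unnoticed by this script; that is an observation about the sketch, not a confirmed behavior of the API.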

apps/infra/sst.config.ts

Lines changed: 61 additions & 12 deletions
@@ -90,14 +90,6 @@ const requireOidcIssuer = () => {
   return v;
 };

-// Runner endpoint overrides — use RUNNER_PRIVATE_IP shortcut when set.
-const runnerEndpoint = (override: string, port: number, scheme: string) =>
-  envOr(
-    override,
-    process.env.RUNNER_PRIVATE_IP
-      ? `${scheme}${process.env.RUNNER_PRIVATE_IP}:${port}`
-      : `${scheme}localhost:${port}`,
-  );

 // ── app config ───────────────────────────────────────────────────────────────
 export default $config({
@@ -110,6 +102,8 @@ export default $config({
         aws: { region: REGION, profile: envOr("AWS_PROFILE", "default") },
         cloudflare: "6.14.0",
         random: "4.16.6",
+        // Post-deploy runner registration (see RegisterExtraRunners in run()).
+        command: "1.0.1",
       },
     };
   },
@@ -341,12 +335,11 @@ export default $config({
         ...registryEnv("TRANSIENT", registry.url),
         ...registryEnv("INTERNAL", registry.url),

-        // Default runner — wire via RUNNER_PRIVATE_IP after the first deploy
+        // Default runner — the API seeds it at boot from name + apiKey. v2
+        // runners self-report their address via healthcheck, so no domain/url
+        // wiring is needed here.
         DEFAULT_RUNNER_NAME: envOr("DEFAULT_RUNNER_NAME", "default"),
         DEFAULT_RUNNER_API_KEY: envOr("DEFAULT_RUNNER_API_KEY", defaultRunnerApiKey.result),
-        DEFAULT_RUNNER_DOMAIN: runnerEndpoint("DEFAULT_RUNNER_DOMAIN", PORTS.RUNNER, ""),
-        DEFAULT_RUNNER_API_URL: runnerEndpoint("DEFAULT_RUNNER_API_URL", PORTS.RUNNER, "http://"),
-        DEFAULT_RUNNER_PROXY_URL: runnerEndpoint("DEFAULT_RUNNER_PROXY_URL", PORTS.PROXY, "http://"),

         // PostHog (enables the dashboard's "Create Sandbox" feature flag)
         ...(process.env.POSTHOG_API_KEY && {
@@ -595,6 +588,62 @@ export default $config({
       ignoreChanges: ["ami", "userDataBase64"],
       protect: true,
     });
+
+    // ── Extra runners (RUNNERS > 1) ──────────────────────────────────────────
+    // The default runner above is auto-seeded by the API at boot via
+    // DEFAULT_RUNNER_*. The API has no multi-runner seed, so any additional
+    // runners are provisioned here and registered with the control plane after
+    // deploy via the admin API (RegisterExtraRunners below). Each gets its OWN
+    // token — pairing is token-based (the runner row's apiKey must equal the
+    // BOXLITE_RUNNER_TOKEN baked into the matching EC2's user-data) — and the
+    // same protect/ignoreChanges options as the default so routine deploys never
+    // replace a state-holding runner.
+    const totalRunners = Math.max(1, parseInt(envOr("RUNNERS", "1"), 10) || 1);
+    const extraRunners = Array.from({ length: totalRunners - 1 }, (_, i) => {
+      const name = `runner-${i + 2}`; // default is runner #1
+      const apiKey = randomKey(`RunnerApiKey-${name}`);
+      const instance = new aws.ec2.Instance(`Runner-${name}`, {
+        ami: ubuntuAmi.then((a) => a.id),
+        instanceType: RUNNER.instanceType,
+        subnetId: vpc.publicSubnets[0],
+        iamInstanceProfile: runnerInstanceProfile.name,
+        cpuOptions: { nestedVirtualization: "enabled" },
+        associatePublicIpAddress: true,
+        userDataBase64: $resolve([api.url, apiKey.result, registry.url]).apply(
+          ([apiUrl, token, registryUrl]) => buildRunnerUserData({ apiUrl, token, registryUrl }),
+        ),
+        rootBlockDevice: { volumeSize: RUNNER.rootDiskGB },
+        tags: { Name: `boxlite-runner-${name}` },
+      }, {
+        ignoreChanges: ["ami", "userDataBase64"],
+        protect: true,
+      });
+      return { name, apiKey, instance };
+    });
+
+    // Register the extra runners with the control plane once the API is healthy.
+    // Idempotent (treats HTTP 409 as success), so redeploys are safe; only re-runs
+    // when the API URL or the runner set changes.
+    if (extraRunners.length > 0) {
+      const runnersPayload = $resolve(extraRunners.map((r) => r.apiKey.result)).apply((keys) =>
+        JSON.stringify(extraRunners.map((r, i) => ({ name: r.name, apiKey: keys[i] }))),
+      );
+      new command.local.Command(
+        "RegisterExtraRunners",
+        {
+          create: "node scripts/register-runners.mjs",
+          update: "node scripts/register-runners.mjs",
+          environment: {
+            API_URL: api.url,
+            ADMIN_API_KEY: adminApiKey.result,
+            REGION_ID: envOr("DEFAULT_REGION_ID", "us"),
+            RUNNERS: runnersPayload,
+          },
+          triggers: [api.url, runnersPayload],
+        },
+        { dependsOn: extraRunners.map((r) => r.instance) },
+      );
+    }
   },
 });

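Reviewer note: `runnersPayload` above pairs each runner name with its resolved key by index and serializes the result into the `RUNNERS` environment variable that `register-runners.mjs` later parses. The sketch below shows that round trip with plain arrays in place of the resolved SST outputs. The function name `buildRunnersPayload` and the sample keys are hypothetical.

```javascript
// Pairs names and keys by index, as the .apply() callback in sst.config.ts
// does once the key outputs have resolved.
function buildRunnersPayload(names, keys) {
  return JSON.stringify(names.map((name, i) => ({ name, apiKey: keys[i] })))
}

// Placeholder keys; the real ones are minted per runner by the deploy.
const payload = buildRunnersPayload(['runner-2', 'runner-3'], ['key-a', 'key-b'])

// Same parse the script performs, including the empty-string fallback.
const parsed = JSON.parse(payload || '[]')

console.log(payload)
console.log(parsed.map((r) => r.name))
```

Because the payload string is also listed in `triggers`, adding or removing a runner, or rotating a key, changes the string and would cause the command to run again on the next deploy.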