Commit 20d1619

Mohamed Habib authored and committed
feat: prepare burst workers cold-ready
1 parent 9d7553d commit 20d1619

9 files changed

Lines changed: 740 additions & 45 deletions

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# Burst Worker Cold-Ready Startup Plan

## Context

The burst worker launch test on June 10, 2026 showed two different timing
segments:

- EC2 instance creation to worker service start was roughly 90 seconds.
- Worker service start to control-plane registration was much longer because
  startup blocked on `PrepareGoldenSnapshot`.

The important observation is that the worker can be useful for cold boots
before the golden snapshot is ready. The current startup path does not expose
that intermediate state because the worker prepares the golden snapshot before
starting its servers and heartbeat.

## Goal

Make a newly launched burst worker register as soon as it is cold-boot capable,
while preparing the golden snapshot in the background.

Target behavior:

- Worker becomes schedulable for cold boots as soon as networking, env, shared
  mounts, gRPC, HTTP, and Redis heartbeat are ready.
- Golden snapshot preparation continues asynchronously.
- Once the golden snapshot is ready, the worker heartbeat advertises the golden
  version and the control plane can prefer it for fast creates.
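
The two heartbeat states in the target behavior could look roughly like the sketch below. It is illustrative only: the `golden_version` field and `SetGoldenVersion` are named in this plan, while the `Heartbeat` struct, the `worker_id` field, and the use of `omitempty` are assumptions made for the example. A payload that omits the field decodes to an empty string, which matches the "no golden snapshot available" meaning.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// heartbeatPayload is a hypothetical wire format. Only the golden_version
// field name comes from the plan; worker_id is an assumption.
type heartbeatPayload struct {
	WorkerID      string `json:"worker_id"`
	GoldenVersion string `json:"golden_version,omitempty"`
}

// Heartbeat holds the state that each periodic publish would serialize.
type Heartbeat struct {
	mu            sync.RWMutex
	workerID      string
	goldenVersion string
}

func NewHeartbeat(workerID string) *Heartbeat {
	return &Heartbeat{workerID: workerID}
}

// SetGoldenVersion is safe to call from a background goroutine while the
// publisher goroutine is calling Payload.
func (h *Heartbeat) SetGoldenVersion(version string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.goldenVersion = version
}

// Payload returns the JSON that the next heartbeat would publish.
func (h *Heartbeat) Payload() ([]byte, error) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return json.Marshal(heartbeatPayload{
		WorkerID:      h.workerID,
		GoldenVersion: h.goldenVersion,
	})
}

func main() {
	hb := NewHeartbeat("worker-1")

	cold, _ := hb.Payload()
	fmt.Println(string(cold)) // cold-ready: golden_version is absent

	hb.SetGoldenVersion("abc123")
	golden, _ := hb.Payload()
	fmt.Println(string(golden)) // golden-ready: golden_version is advertised
}
```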

This does not remove EC2 launch latency. It removes golden snapshot creation
from the critical path for worker registration.

## Proposed Changes

1. Move golden snapshot preparation out of the blocking worker startup path.

   Today `cmd/worker/main.go` calls `PrepareGoldenSnapshot()` before starting
   metadata, HTTP/gRPC, and Redis heartbeat. Move this after server startup and
   heartbeat setup, running in a background goroutine.

2. Register the worker as cold-ready first.

   Heartbeat should be published with no `golden_version` until the snapshot is
   ready. The control plane already treats empty `golden_version` as "no golden
   snapshot available"; keep that meaning.

3. Update heartbeat when golden prep completes.

   After background `PrepareGoldenSnapshot()` succeeds, call
   `hb.SetGoldenVersion(qemuMgr.GoldenVersion())`. The next heartbeat should
   update the registry.

4. Add explicit logs for readiness phases.

   Suggested log points:

   - `worker cold-ready: starting heartbeat before golden snapshot`
   - `worker golden snapshot preparation started in background`
   - `worker golden-ready: version=<hash>`
   - `worker golden preparation failed: <err>; continuing cold-ready`

5. Fix AMI/systemd ordering for burst workers.

   The burst AMI currently enables `opensandbox-worker.service`, so systemd can
   start it before user-data writes `/etc/opensandbox/worker.env`. That caused
   repeated `Failed to load environment files` messages during boot.

   Change the burst Packer file to install the worker unit but leave it
   disabled. User-data should start the worker exactly once after:

   - instance identity is known
   - shared volumes are attached/mounted
   - `/etc/opensandbox/worker.env` has been written and patched

6. Keep user-data minimal.

   User-data should only do runtime-specific work:

   - fetch instance identity
   - attach/mount shared volumes
   - write env
   - start worker

   Dependency installation, binaries, OCFS2 tools, AWS CLI, QEMU, kernel
   modules, and rootfs assets should stay baked into the AMI.
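
Changes 1 through 4 could be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `cmd/worker/main.go`: the method names `PrepareGoldenSnapshot`, `GoldenVersion`, and `SetGoldenVersion` and the log messages come from this plan, while the interfaces, the `startGoldenPrep` helper, the completion channel, and the fakes are hypothetical.

```go
package main

import (
	"errors"
	"log"
	"sync"
)

// goldenPreparer and versionSetter are hypothetical interfaces standing in
// for qemuMgr and hb; only the three method names come from the plan.
type goldenPreparer interface {
	PrepareGoldenSnapshot() error
	GoldenVersion() string
}

type versionSetter interface {
	SetGoldenVersion(version string)
}

// startGoldenPrep runs snapshot preparation off the startup path and reports
// the outcome on the returned channel (the channel is an assumption).
func startGoldenPrep(mgr goldenPreparer, hb versionSetter) <-chan error {
	done := make(chan error, 1)
	log.Println("worker golden snapshot preparation started in background")
	go func() {
		if err := mgr.PrepareGoldenSnapshot(); err != nil {
			// golden_version stays empty, so the worker remains cold-ready.
			log.Printf("worker golden preparation failed: %v; continuing cold-ready", err)
			done <- err
			return
		}
		version := mgr.GoldenVersion()
		hb.SetGoldenVersion(version)
		log.Printf("worker golden-ready: version=%s", version)
		done <- nil
	}()
	return done
}

// The fakes below exist only to make the sketch runnable.
var errPrepFailed = errors.New("simulated prep failure")

type fakeMgr struct {
	version string
	err     error
}

func (m fakeMgr) PrepareGoldenSnapshot() error { return m.err }
func (m fakeMgr) GoldenVersion() string        { return m.version }

type fakeHeartbeat struct {
	mu      sync.Mutex
	version string
}

func (h *fakeHeartbeat) SetGoldenVersion(version string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.version = version
}

func (h *fakeHeartbeat) Version() string {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.version
}

func main() {
	// Servers and the heartbeat loop would already be running at this point.
	log.Println("worker cold-ready: starting heartbeat before golden snapshot")

	hb := &fakeHeartbeat{}
	<-startGoldenPrep(fakeMgr{version: "abc123"}, hb) // a real worker would keep serving
	log.Printf("heartbeat now advertises golden_version=%q", hb.Version())

	failed := &fakeHeartbeat{}
	err := <-startGoldenPrep(fakeMgr{err: errPrepFailed}, failed)
	log.Printf("after failure: err=%v golden_version=%q", err, failed.Version())
}
```

Keeping the failure path non-fatal matches the fourth suggested log point: a worker whose golden prep fails keeps serving cold boots.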

## Non-Goals

- Do not change Spot instance type fallback strategy yet.
- Do not try to guarantee sub-10-second readiness from a brand-new EC2 launch.
- Do not implement downloaded/prebuilt QEMU memory snapshots in this pass.
- Do not change public API behavior.

## Expected Impact

Based on the June 10 test:

- Current EC2-created-to-registered time was about 6 minutes 24 seconds.
- Worker service started about 91 seconds after EC2 creation.
- Moving golden prep to the background could make cold-ready registration close
  to that worker-service-start time, likely around 90-100 seconds from EC2
  creation before further AMI cleanup.

With AMI/systemd cleanup, a realistic next target is roughly 45-70 seconds from
EC2 creation to cold-ready in favorable cases.
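
The arithmetic behind these figures can be checked with a small sketch. The 91-second and 6-minute-24-second offsets come from the June 10 test above; the `phase` type, the helper names, and the wall-clock start time are placeholders invented for the example. The second segment covers everything between service start and registration, which the plan attributes mainly to `PrepareGoldenSnapshot`.

```go
package main

import (
	"fmt"
	"time"
)

// phase is a hypothetical record of one captured timestamp.
type phase struct {
	Name string
	At   time.Time
}

// segments returns the elapsed time between each pair of consecutive phases.
func segments(phases []phase) []string {
	out := make([]string, 0, len(phases))
	for i := 1; i < len(phases); i++ {
		elapsed := phases[i].At.Sub(phases[i-1].At)
		out = append(out, fmt.Sprintf("%s -> %s: %s", phases[i-1].Name, phases[i].Name, elapsed))
	}
	return out
}

// june10Phases encodes the offsets reported in the plan. The clock time of
// the EC2 creation event is a placeholder; only the offsets matter.
func june10Phases() []phase {
	created := time.Date(2026, time.June, 10, 12, 0, 0, 0, time.UTC)
	return []phase{
		{Name: "EC2 instance created", At: created},
		{Name: "worker service start", At: created.Add(91 * time.Second)},
		{Name: "CP registration", At: created.Add(6*time.Minute + 24*time.Second)},
	}
}

func main() {
	for _, line := range segments(june10Phases()) {
		fmt.Println(line)
	}
	// Prints:
	// EC2 instance created -> worker service start: 1m31s
	// worker service start -> CP registration: 4m53s
}
```

The 4m53s segment is what this plan aims to take off the registration path.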

## Risks

- Cold-ready workers may serve slower first sandboxes until golden prep
  completes.
- Some scheduling paths may implicitly assume a non-empty `golden_version`.
  Those paths need review before allowing all workloads onto cold-ready workers.
- Migration/checkpoint paths that require a known source golden version should
  continue to require it.
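
One way to make the `golden_version` assumption explicit in scheduling code is sketched below. It is hypothetical throughout: the `Worker` type and both selection functions are invented for illustration and do not describe the control plane's actual scheduler. Matching the target's golden version to the source's is also an assumption about what "require a known source golden version" would mean in practice.

```go
package main

import (
	"errors"
	"fmt"
)

// Worker is a hypothetical registry view of a worker; an empty GoldenVersion
// means "no golden snapshot available", as the plan states.
type Worker struct {
	ID            string
	GoldenVersion string
}

var ErrNoWorker = errors.New("no eligible worker")

// pickForCreate prefers a golden-ready worker for fast creates and falls back
// to the first cold-ready worker when none is golden-ready.
func pickForCreate(workers []Worker) (Worker, error) {
	var fallback *Worker
	for i := range workers {
		if workers[i].GoldenVersion != "" {
			return workers[i], nil
		}
		if fallback == nil {
			fallback = &workers[i]
		}
	}
	if fallback != nil {
		return *fallback, nil
	}
	return Worker{}, ErrNoWorker
}

// pickForMigration never selects a cold-ready worker: it requires a known
// source version and (by assumption) a target advertising the same one.
func pickForMigration(workers []Worker, sourceVersion string) (Worker, error) {
	if sourceVersion == "" {
		return Worker{}, ErrNoWorker
	}
	for _, w := range workers {
		if w.GoldenVersion == sourceVersion {
			return w, nil
		}
	}
	return Worker{}, ErrNoWorker
}

func main() {
	workers := []Worker{
		{ID: "burst-cold"},
		{ID: "steady-golden", GoldenVersion: "abc123"},
	}

	w, _ := pickForCreate(workers)
	fmt.Println("create goes to:", w.ID)

	w, _ = pickForCreate(workers[:1])
	fmt.Println("create with only cold-ready workers goes to:", w.ID)

	_, err := pickForMigration(workers[:1], "abc123")
	fmt.Println("migration onto cold-ready worker:", err)
}
```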

## Validation Plan

1. Build and deploy a worker with background golden prep.
2. Launch a fresh burst worker and capture timestamps:
   - scaler launch decision
   - EC2 instance created
   - user-data start
   - worker service start
   - first Redis heartbeat / CP registration
   - golden snapshot ready
3. Confirm the CP sees the worker before golden snapshot readiness.
4. Create a sandbox on the cold-ready worker and verify it succeeds via cold
   boot.
5. Wait for golden-ready heartbeat and verify subsequent creates use the golden
   path.
6. Terminate the extra worker after the test to avoid unnecessary cost.

cmd/server/main.go

Lines changed: 14 additions & 11 deletions
@@ -501,17 +501,20 @@ func main() {
 
 	scalerState := controlplane.NewRedisScalerState(redisRegistry.RedisClient())
 	scaler := controlplane.NewScaler(controlplane.ScalerConfig{
-		Pool:         pool,
-		Registry:     redisRegistry,
-		Store:        opts.Store,
-		StateStore:   scalerState,
-		WorkerImage:  workerImage,
-		Cooldown:     time.Duration(cfg.ScaleCooldownSec) * time.Second,
-		MinWorkers:   cfg.MinWorkersPerRegion,
-		MaxWorkers:   cfg.MaxWorkersPerRegion,
-		IdleReserve:  cfg.IdleReserveWorkers,
-		WorkerPool:   cfg.WorkerPool,
-		MachineSizes: machineSizes,
+		Pool:               pool,
+		Registry:           redisRegistry,
+		Store:              opts.Store,
+		StateStore:         scalerState,
+		WorkerImage:        workerImage,
+		Cooldown:           time.Duration(cfg.ScaleCooldownSec) * time.Second,
+		MinWorkers:         cfg.MinWorkersPerRegion,
+		MaxWorkers:         cfg.MaxWorkersPerRegion,
+		IdleReserve:        cfg.IdleReserveWorkers,
+		MinIdleCapacity:    cfg.MinIdleCapacity,
+		MinIdleCPUs:        cfg.MinIdleCPUs,
+		DefaultSandboxCPUs: cfg.DefaultSandboxCPUs,
+		WorkerPool:         cfg.WorkerPool,
+		MachineSizes:       machineSizes,
 		// For "migrated" event emit after scaler-driven migrations
 		// (rolling replace, evacuation) — keeps D1 sandboxes_index
 		// worker_id in sync with cell-PG truth. Without this, the

internal/config/config.go

Lines changed: 4 additions & 0 deletions
@@ -154,6 +154,8 @@ type Config struct {
 	MinWorkersPerRegion int // Minimum total workers per region, default 1
 	MaxWorkersPerRegion int // Maximum workers per region (hard cap), default 10
 	IdleReserveWorkers  int // Target idle workers for burst absorption, default 1
+	MinIdleCapacity     int // Minimum spare sandbox capacity slots per region. When >0, overrides MinWorkers/IdleReserve.
+	MinIdleCPUs         int // Minimum spare sandbox CPU units per region. When >0, overrides MinIdleCapacity.
 
 	// Stripe billing
 	StripeSecretKey string
@@ -395,6 +397,8 @@ func Load() (*Config, error) {
 		MinWorkersPerRegion: envOrDefaultInt("OPENSANDBOX_MIN_WORKERS", 1),
 		MaxWorkersPerRegion: envOrDefaultInt("OPENSANDBOX_MAX_WORKERS", 10),
 		IdleReserveWorkers:  envOrDefaultInt("OPENSANDBOX_IDLE_RESERVE", 1),
+		MinIdleCapacity:     envOrDefaultInt("OPENSANDBOX_MIN_IDLE_CAPACITY", 0),
+		MinIdleCPUs:         envOrDefaultInt("OPENSANDBOX_MIN_IDLE_CPUS", 0),
 
 		StripeSecretKey:     os.Getenv("STRIPE_SECRET_KEY"),
 		StripeWebhookSecret: os.Getenv("STRIPE_WEBHOOK_SECRET"),
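
The override order stated in the new field comments can be sketched as below. This is an illustration only: `activeIdlePolicy` and the policy names are hypothetical, and the `envOrDefaultInt` shown here is a stand-in written for the example, since the real helper's body is not part of this diff. Its fallback to the default on unparsable values is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envOrDefaultInt is a stand-in for the helper of the same name in
// internal/config. Returning def on a missing or unparsable value is assumed.
func envOrDefaultInt(key string, def int) int {
	raw, ok := os.LookupEnv(key)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return def
	}
	return n
}

// idlePolicy names which idle-capacity rule is in effect (hypothetical type).
type idlePolicy string

const (
	policyIdleCPUs     idlePolicy = "min-idle-cpus"
	policyIdleCapacity idlePolicy = "min-idle-capacity"
	policyIdleReserve  idlePolicy = "min-workers/idle-reserve"
)

// activeIdlePolicy applies the precedence from the Config comments:
// MinIdleCPUs > 0 overrides MinIdleCapacity, and MinIdleCapacity > 0
// overrides MinWorkers/IdleReserve.
func activeIdlePolicy(minIdleCapacity, minIdleCPUs int) idlePolicy {
	switch {
	case minIdleCPUs > 0:
		return policyIdleCPUs
	case minIdleCapacity > 0:
		return policyIdleCapacity
	default:
		return policyIdleReserve
	}
}

func main() {
	capacity := envOrDefaultInt("OPENSANDBOX_MIN_IDLE_CAPACITY", 0)
	cpus := envOrDefaultInt("OPENSANDBOX_MIN_IDLE_CPUS", 0)
	fmt.Println("active idle policy:", activeIdlePolicy(capacity, cpus))
}
```

With both variables unset, the defaults of 0 leave the existing MinWorkers/IdleReserve behavior in place, which is consistent with the defaults chosen in `Load()`.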

internal/config/secrets.go

Lines changed: 2 additions & 0 deletions
@@ -75,6 +75,8 @@ var secretMapping = map[string]string{
 	"server-sentry-dsn":         "OPENSANDBOX_SENTRY_DSN",
 	"server-azure-vm-sizes":     "OPENSANDBOX_AZURE_VM_SIZES",
 	"server-ec2-instance-types": "OPENSANDBOX_EC2_INSTANCE_TYPES",
+	"server-min-idle-capacity":  "OPENSANDBOX_MIN_IDLE_CAPACITY",
+	"server-min-idle-cpus":      "OPENSANDBOX_MIN_IDLE_CPUS",
 	"server-ocfs2-node-ips":     "OPENSANDBOX_OCFS2_NODE_IPS",
 	"server-s3-access-key":      "OPENSANDBOX_S3_ACCESS_KEY_ID",
 	"server-s3-secret-key":      "OPENSANDBOX_S3_SECRET_ACCESS_KEY",
