| 1 | +# Burst Worker Cold-Ready Startup Plan |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +The burst worker launch test on June 10, 2026 showed two different timing |
| 6 | +segments: |
| 7 | + |
| 8 | +- EC2 instance creation to worker service start was roughly 90 seconds. |
| 9 | +- Worker service start to control-plane registration took roughly 5 minutes |
| 10 | + because startup blocked on `PrepareGoldenSnapshot`. |
| 11 | + |
| 12 | +The important observation is that the worker can be useful for cold boots |
| 13 | +before the golden snapshot is ready. The current startup path does not expose |
| 14 | +that intermediate state because the worker prepares the golden snapshot before |
| 15 | +starting its servers and heartbeat. |
| 16 | + |
| 17 | +## Goal |
| 18 | + |
| 19 | +Make a newly launched burst worker register as soon as it is cold-boot capable, |
| 20 | +while preparing the golden snapshot in the background. |
| 21 | + |
| 22 | +Target behavior: |
| 23 | + |
| 24 | +- Worker becomes schedulable for cold boots as soon as networking, env, shared |
| 25 | + mounts, gRPC, HTTP, and Redis heartbeat are ready. |
| 26 | +- Golden snapshot preparation continues asynchronously. |
| 27 | +- Once the golden snapshot is ready, the worker heartbeat advertises the golden |
| 28 | + version and the control plane can prefer it for fast creates. |
| 29 | + |
| 30 | +This does not remove EC2 launch latency. It removes golden snapshot creation |
| 31 | +from the critical path for worker registration. |
| 32 | + |
| 33 | +## Proposed Changes |
| 34 | + |
| 35 | +1. Move golden snapshot preparation out of the blocking worker startup path. |
| 36 | + |
| 37 | + Today `cmd/worker/main.go` calls `PrepareGoldenSnapshot()` before starting |
| 38 | + metadata, HTTP/gRPC, and Redis heartbeat. Move the call after server startup |
| 39 | + and heartbeat setup, and run it in a background goroutine. |
| 40 | + |
| 41 | +2. Register the worker as cold-ready first. |
| 42 | + |
| 43 | + Heartbeat should be published with no `golden_version` until the snapshot is |
| 44 | + ready. The control plane already treats empty `golden_version` as "no golden |
| 45 | + snapshot available"; keep that meaning. |
| 46 | + |
| 47 | +3. Update heartbeat when golden prep completes. |
| 48 | + |
| 49 | + After background `PrepareGoldenSnapshot()` succeeds, call |
| 50 | + `hb.SetGoldenVersion(qemuMgr.GoldenVersion())`. The next heartbeat should |
| 51 | + update the registry. |
| 52 | + |
| 53 | +4. Add explicit logs for readiness phases. |
| 54 | + |
| 55 | + Suggested log points: |
| 56 | + |
| 57 | + - `worker cold-ready: starting heartbeat before golden snapshot` |
| 58 | + - `worker golden snapshot preparation started in background` |
| 59 | + - `worker golden-ready: version=<hash>` |
| 60 | + - `worker golden preparation failed: <err>; continuing cold-ready` |
| 61 | + |
| 62 | +5. Fix AMI/systemd ordering for burst workers. |
| 63 | + |
| 64 | + The burst AMI currently enables `opensandbox-worker.service`, so systemd can |
| 65 | + start it before user-data writes `/etc/opensandbox/worker.env`. That caused |
| 66 | + repeated `Failed to load environment files` messages during boot. |
| 67 | + |
| 68 | + Change the burst Packer file to install the worker unit but leave it |
| 69 | + disabled. User-data should start the worker exactly once after: |
| 70 | + |
| 71 | + - instance identity is known |
| 72 | + - shared volumes are attached/mounted |
| 73 | + - `/etc/opensandbox/worker.env` has been written and patched |
| 74 | + |
| 75 | +6. Keep user-data minimal. |
| 76 | + |
| 77 | + User-data should only do runtime-specific work: |
| 78 | + |
| 79 | + - fetch instance identity |
| 80 | + - attach/mount shared volumes |
| 81 | + - write env |
| 82 | + - start worker |
| 83 | + |
| 84 | + Dependency installation, binaries, OCFS2 tools, AWS CLI, QEMU, kernel |
| 85 | + modules, and rootfs assets should stay baked into the AMI. |
| 86 | + |
| 87 | +## Non-Goals |
| 88 | + |
| 89 | +- Do not change Spot instance type fallback strategy yet. |
| 90 | +- Do not try to guarantee sub-10-second readiness from a brand-new EC2 launch. |
| 91 | +- Do not implement downloaded/prebuilt QEMU memory snapshots in this pass. |
| 92 | +- Do not change public API behavior. |
| 93 | + |
| 94 | +## Expected Impact |
| 95 | + |
| 96 | +Based on the June 10 test: |
| 97 | + |
| 98 | +- Current EC2-created-to-registered time was about 6 minutes 24 seconds. |
| 99 | +- Worker service started about 91 seconds after EC2 creation. |
| 100 | +- Moving golden prep to the background could make cold-ready registration close |
| 101 | + to that worker-service-start time, likely around 90-100 seconds from EC2 |
| 102 | + creation before further AMI cleanup. |
| 103 | + |
| 104 | +With AMI/systemd cleanup, a realistic next target is roughly 45-70 seconds from |
| 105 | +EC2 creation to cold-ready in favorable cases. |
| 106 | + |
| 107 | +## Risks |
| 108 | + |
| 109 | +- Sandboxes created on a cold-ready worker start via cold boot, so they are |
| 110 | + slower until golden prep completes. |
| 111 | +- Some scheduling paths may implicitly assume a non-empty `golden_version`. |
| 112 | + Those paths need review before allowing all workloads onto cold-ready workers. |
| 113 | +- Migration/checkpoint paths that require a known source golden version should |
| 114 | + continue to require it. |
| 115 | + |
| 116 | +## Validation Plan |
| 117 | + |
| 118 | +1. Build and deploy a worker with background golden prep. |
| 119 | +2. Launch a fresh burst worker and capture timestamps: |
| 120 | + - scaler launch decision |
| 121 | + - EC2 instance created |
| 122 | + - user-data start |
| 123 | + - worker service start |
| 124 | + - first Redis heartbeat / CP registration |
| 125 | + - golden snapshot ready |
| 126 | +3. Confirm the CP sees the worker before golden snapshot readiness. |
| 127 | +4. Create a sandbox on the cold-ready worker and verify it succeeds via cold |
| 128 | + boot. |
| 129 | +5. Wait for golden-ready heartbeat and verify subsequent creates use the golden |
| 130 | + path. |
| 131 | +6. Terminate the extra worker after the test to avoid unnecessary cost. |