| 1 | +# `tools/sandbox-bench/` — measured per-test overhead, three test paths |
| 2 | + |
| 3 | +> **Note 2026-06-08**: the Hyper-V hot-revert and portability primitives |
| 4 | +> measured by the scripts in this directory are now first-class API |
| 5 | +> calls in `vm-harness`: |
| 6 | +> |
| 7 | +> - `snapshotRunning` / `restoreSnapshot` / `removeSnapshot` |
| 8 | +> - `exportBaseline` / `importBaseline` |
| 9 | +> - CLI: `vm-harness snapshot create --running ...`, |
| 10 | +> `vm-harness baseline export|import ...` |
| 11 | +> |
| 12 | +> A backend-agnostic Nim benchmark that drives the same measurement |
| 13 | +> through the library API ships at |
| 14 | +> `metacraft-labs/vm-harness:tools/bench/snapshot_revert_bench.nim` |
| 15 | +> (build: `nimble buildBench`). The PowerShell scripts in this |
| 16 | +> directory remain the canonical reproducer for the original |
| 17 | +> measurements documented here, but new measurement work should |
| 18 | +> prefer the vm-harness bench so future backends (Tart `suspend`, |
| 19 | +> libvirt native snapshots) can re-measure under the same harness. |
| 20 | +> See `docs/per-backend-notes/hyperv-snapshot-benchmarks.md` in |
| 21 | +> vm-harness for the project-agnostic version of the numbers below. |
| 22 | + |
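| 23 | +Concrete wall-time measurements for the three environments reprobuild can |
| 24 | +use to run a destructive-class e2e test (bare host, Windows Sandbox, Hyper-V VM). The table splits Sandbox and Hyper-V into their variants: |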
| 25 | + |
| 26 | +| Path | Per-test overhead | Per-test isolation? | Batching unit | Isolation surface | |
| 27 | +|-------------|-------------------|---------------------|---------------|-------------------| |
| 28 | +| Bare host | 0 | n/a (with REPRO_REGISTRY_ROOT seam: yes for HKCU writes) | n/a | none — uses real HKCU/FS/PATH | |
| 29 | +| Windows Sandbox per-test | ~12 s | **yes** | one .wsb per test | fresh HKCU + FS, no Windows Update, no reboot | |
| 30 | +| Windows Sandbox batched in one session | ~0 amortized | **no** (no in-session reset; cleanup discipline required) | one .wsb session, many tests | same as above, but tests share state | |
| 31 | +| Hyper-V VM cold-boot per test | ~27–46 s | **yes** | one snapshot revert per test (VM stop+start) | fresh HKCU + FS + Windows Update + reboot | |
| 32 | +| Hyper-V VM **hot-snapshot revert** per test | **~5.4 s** | **yes** | one Restore-VMCheckpoint per test (no host stop) | same as cold-boot, but VM stays alive between tests | |
| 33 | +| Hyper-V Save-VM/Start-VM (hibernate) | ~4.6 s | **no** (resume preserves state — no reset) | warm-restart only | same as cold-boot | |
| 34 | + |
| 35 | +The Sandbox and Hyper-V cold-boot numbers below were collected on this dev host |
| 36 | +(2026-06-06) with the harnesses in this directory; the hot-revert and portability sections are dated 2026-06-08. They are reproducible: |
| 37 | +`run-sandbox-bench.ps1` and `run-hyperv-bench.ps1` write their raw |
| 38 | +timestamps to `D:\metacraft\sandbox-bench-out\TIMINGS*.txt`. |
| 39 | + |
| 40 | +## Measurement methodology |
| 41 | + |
| 42 | +Both harnesses run a payload inside their isolation environment and |
| 43 | +record host-side and in-env timestamps at fixed checkpoints. The |
| 44 | +"wall time" is from script start (host) to DONE/Stopped (host). |
| 45 | + |
| 46 | +The payload (the actual test) is **deliberately the same** across |
| 47 | +environments so the per-environment overhead is comparable. The test |
| 48 | +binary, `t_integration_plan_classifier_bucket_drift_is_cache_hit.exe`, |
| 49 | +is pre-built on the host once and copied/mapped into each environment. |
| 50 | + |
| 51 | +Caveat for fair comparison: the m80 test calls `installScoopAppAtVersion`, |
| 52 | +which needs a real `scoop.ps1` on PATH. The sandbox image doesn't have |
| 53 | +scoop installed, so the test fast-fails at the `resolveScoopBinary` |
| 54 | +assertion (89 ms). The bare-host run reaches further (1.75–2.9 s) before |
| 55 | +failing on a different assertion. **For the apples-to-apples overhead |
| 56 | +number, treat the test wall time as a constant and read the difference |
| 57 | +between the environment's total wall time and that constant.** |
| 58 | + |
| 59 | +## Windows Sandbox: measured ~12 s overhead floor (2026-06-06) |
| 60 | + |
| 61 | +``` |
| 62 | +T0_wsb_launch = 0.000 s |
| 63 | +T1_logon_fired_host_observed = +9.086 s ← Sandbox cold-boot + LogonCommand |
| 64 | +T2_script_started = +11.438 s ← cmd.exe → powershell.exe handoff |
| 65 | +T3_vc_staged = +11.747 s ← VC++ DLL copy to System32 (~0.3 s) |
| 66 | +T4_test_started = +11.790 s ← stage test exe + sqlite + repro |
| 67 | +T5_test_finished = +11.900 s ← test ran (89 ms — fast-fail on missing scoop) |
| 68 | +T6_done = +11.928 s |
| 69 | +TOTAL host wall = 12.072 s |
| 70 | +``` |
| 71 | + |
| 72 | +Cost breakdown: |
| 73 | + |
| 74 | +| Phase | Cost | Comment | |
| 75 | +|---|---|---| |
| 76 | +| Sandbox cold-boot | ~9 s | Win11 Sandbox is faster than its reputation; on older hosts expect 30–60 s | |
| 77 | +| Cmd→PowerShell hop | ~2 s | LogonCommand is `cmd.exe /c` for diagnostic reasons (see `migration.wsb` header) | |
| 78 | +| VC++ stage | <1 s | Copying 7 DLLs from the mapped folder to System32 | |
| 79 | +| Test wall | = bare-host wall | Sandbox's CPU is host-equivalent; the test runs at host speed | |
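The phase costs in the table are differences between consecutive cumulative checkpoints in the log above. A minimal sketch of that subtraction, with the checkpoint values copied from the log; the `phase_deltas` helper is hypothetical and not part of the harness scripts:

```python
# Cumulative host-observed checkpoints (seconds), copied from the Sandbox log above.
checkpoints = [
    ("T0_wsb_launch", 0.000),
    ("T1_logon_fired_host_observed", 9.086),
    ("T2_script_started", 11.438),
    ("T3_vc_staged", 11.747),
    ("T4_test_started", 11.790),
    ("T5_test_finished", 11.900),
    ("T6_done", 11.928),
]


def phase_deltas(points):
    """Return (checkpoint name, seconds since the previous checkpoint)."""
    return [
        (name, round(t - prev_t, 3))
        for (_, prev_t), (name, t) in zip(points, points[1:])
    ]


deltas = phase_deltas(checkpoints)
for name, dt in deltas:
    print(f"{name:32s} +{dt:.3f} s")
```

The first two deltas (9.086 s and 2.352 s) are the cold-boot and Cmd→PowerShell rows of the table; everything after them adds up to about half a second.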
| 80 | + |
| 81 | +**Implication.** Per-test sandbox cost is ~12 s + test wall time. So: |
| 82 | +- 1-second test → 12 s overhead → **13x** |
| 83 | +- 10-second test → 12 s overhead → **2.2x** |
| 84 | +- 60-second test → 12 s overhead → **1.2x** |
| 85 | +- 100 tests batched into ONE sandbox session → 12 s amortized across 100 → **+0.12 s per test** |
| 86 | + |
| 87 | +The cost is the cold-boot, not the per-test work. Batching is the |
| 88 | +right answer if Sandbox isolation suffices. |
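The multipliers above follow from `(test_wall + overhead) / test_wall`, and the batched figure from dividing one session's overhead across its tests. A small sketch of that arithmetic, assuming the ~12 s floor measured here; the function names are illustrative only:

```python
SANDBOX_OVERHEAD_S = 12.0  # measured Sandbox floor from this section


def slowdown(test_wall_s, overhead_s=SANDBOX_OVERHEAD_S):
    """Ratio of isolated wall time to bare-host wall time for one test."""
    return (test_wall_s + overhead_s) / test_wall_s


def amortized_overhead(n_tests, overhead_s=SANDBOX_OVERHEAD_S):
    """Overhead per test when n_tests share a single session (no per-test reset)."""
    return overhead_s / n_tests


for wall in (1, 10, 60):
    print(f"{wall:>3} s test -> {slowdown(wall):.1f}x")
print(f"100 batched -> +{amortized_overhead(100):.2f} s per test")
```

Passing a different `overhead_s` (for example 29.0 for the Hyper-V cold-boot path) reuses the same formula for the other environments.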
| 89 | + |
| 90 | +## Hyper-V VM: measured ~29 s overhead floor (2026-06-06) |
| 91 | + |
| 92 | +Reusing the M69 harness VM `repro-m69-hyperv` reverted to its `base-clean` |
| 93 | +snapshot: |
| 94 | + |
| 95 | +``` |
| 96 | +T0_start = 0.000 s |
| 97 | +T1_revert_done = +0.189 s ← Restore-VMCheckpoint (diff-layer drop) |
| 98 | +T2_psdirect_ready = +26.592 s ← Start-VM + Windows boot + PSDirect handshake |
| 99 | +T3_stage_done = +26.760 s ← Copy-VMFile a tiny payload (host → guest) |
| 100 | +T4_invoke_done = +28.315 s ← Invoke-Command -VMName ran Get-Date + Get-Content |
| 101 | +T5_stopped = +28.582 s ← Stop-VM -TurnOff |
| 102 | +TOTAL host wall = 28.586 s |
| 103 | +``` |
| 104 | + |
| 105 | +Cost breakdown: |
| 106 | + |
| 107 | +| Phase | Cost | Comment | |
| 108 | +|---|---|---| |
| 109 | +| Restore-VMCheckpoint | ~0.2 s | The differencing-disk revert is a metadata flip | |
| 110 | +| Start-VM + boot to PSDirect | ~26 s | Full Windows guest boot to where `Invoke-Command -VMName { hostname }` succeeds | |
| 111 | +| Copy-VMFile stage | <0.2 s | tiny file; scales with payload size | |
| 112 | +| Invoke-Command round-trip | ~1.5 s | PSDirect channel overhead per RPC, not the command itself | |
| 113 | +| Stop-VM -TurnOff | ~0.3 s | Hard power-off; no clean shutdown | |
| 114 | + |
| 115 | +**Implication.** Per-test Hyper-V cost is ~29 s + test wall time + ~1.5 s |
| 116 | +per Invoke-Command round-trip. If the per-test runner stages, runs, and |
| 117 | +collects logs as three separate Invoke-Commands, that adds ~4.5 s of RPC |
| 118 | +overhead on top of the 29 s boot. |
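As a rough model of the cost statement above, assuming the round-trip cost scales linearly with the number of Invoke-Command calls (only a single call was actually measured here):

```python
COLD_BOOT_OVERHEAD_S = 29.0  # revert + boot + stage + stop, rounded from the log above
RPC_ROUND_TRIP_S = 1.5       # approximate PSDirect cost per Invoke-Command


def cold_boot_test_cost(test_wall_s, round_trips):
    """Estimated wall time for one test on the cold-boot path.

    Assumption: each additional Invoke-Command costs the same ~1.5 s.
    """
    return COLD_BOOT_OVERHEAD_S + test_wall_s + RPC_ROUND_TRIP_S * round_trips


# A 10 s test driven by three separate Invoke-Commands (stage, run, collect logs).
print(cold_boot_test_cost(10.0, 3))
```

Folding the stage, run, and log-collection steps into one Invoke-Command would drop `round_trips` to 1 under this model.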
| 119 | + |
| 120 | +Comparison (cold-boot path): |
| 121 | +- 1-second test → Hyper-V = +29 s → **30x** |
| 122 | +- 10-second test → Hyper-V = +29 s → **3.9x** |
| 123 | +- 60-second test → Hyper-V = +29 s → **1.5x** |
| 124 | +- 100 tests batched into ONE Hyper-V session → 29 s amortized → **+0.29 s per test** |
| 125 | + |
| 126 | +Hyper-V is ~2.4x slower per session than Sandbox (29 s vs 12 s) but |
| 127 | +provides full Windows isolation including Windows Update access, |
| 128 | +persistent disk, and reboot capability — the three things Sandbox |
| 129 | +cannot provide. |
| 130 | + |
| 131 | +## Hyper-V VM with HOT-snapshot revert: measured ~5.4 s per test (2026-06-08) |
| 132 | + |
| 133 | +Standard Checkpoints in Hyper-V capture the memory + CPU + device state |
| 134 | +of a RUNNING VM. `Restore-VMCheckpoint` to such a snapshot returns the |
| 135 | +VM to that exact running state — no Windows boot, no re-OOBE, |
| 136 | +no rebuilding of the Win32 subsystem. The existing `base-clean` |
| 137 | +snapshot is a cold snapshot (taken with the VM Off) so it has no memory |
| 138 | +state; `run-hyperv-bench-hot.ps1` takes a fresh `base-hot` snapshot |
| 139 | +once with the VM running, then measures the revert cycle. |
| 140 | + |
| 141 | +``` |
| 142 | +Phase A — one-time setup: |
| 143 | + A0_start = 0.000 s |
| 144 | + A1_first_boot_done = +46.468 s ← cold boot, only paid ONCE |
| 145 | + A2_hot_snapshot_taken = +2.220 s ← captures RAM + CPU + devices |
| 146 | + |
| 147 | +Phase B — revert-from-hot × 3 iterations: |
| 148 | + iter1: restore 4.16 s + PSDirect 0.94 s = 5.10 s |
| 149 | + iter2: restore 4.72 s + PSDirect 0.93 s = 5.65 s |
| 150 | + iter3: restore 4.51 s + PSDirect 0.97 s = 5.48 s |
| 151 | + AVERAGE = 5.41 s |
| 152 | + |
| 153 | +Phase C — Save-VM / Start-VM (hibernate, NOT a reset): |
| 154 | + C1_save_returned = +1.673 s ← writes RAM to disk |
| 155 | + C2_start_returned = +2.003 s ← reads RAM back |
| 156 | + C3_psdirect_ready = +0.943 s |
| 157 | + TOTAL = 4.62 s |
| 158 | +``` |
| 159 | + |
| 160 | +**This changes the routine-CI picture.** With hot-snapshot revert: |
| 161 | + |
| 162 | +| Test wall | Bare host | Sandbox (per-test) | Hyper-V (hot revert) | |
| 163 | +|---|---|---|---| |
| 164 | +| 1 s | 1 s | 13 s (13×) | **6.4 s (6.4×)** | |
| 165 | +| 10 s | 10 s | 22 s (2.2×) | **15.4 s (1.5×)** | |
| 166 | +| 60 s | 60 s | 72 s (1.2×) | **65.4 s (1.1×)** | |
| 167 | +| 100 batched | 100 × test_wall | +0.12 s/test (one session, no per-test reset) | **46 s setup + 100 × (5.4 + test_wall)** | |
| 168 | + |
| 169 | +Hyper-V with hot revert is **competitive with per-test Sandbox** for |
| 170 | +sub-minute tests, AND it gives every test full pristine state without |
| 171 | +needing in-test cleanup discipline. For tests requiring DISM / Windows |
| 172 | +Update / reboot it's the only option — and the cost is no longer |
| 173 | +prohibitive. |
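One way to read the table is as a break-even question: after how many tests does the one-time hot-snapshot setup pay for itself against per-test Sandbox? The sketch below derives that from the numbers in this section. It assumes the Phase A values are additive (first boot, then snapshot), which the log does not state explicitly, and the helper names are hypothetical:

```python
import math

SANDBOX_PER_TEST_S = 12.0      # per-test Sandbox overhead
HOT_REVERT_PER_TEST_S = 5.4    # average hot-revert cycle (Phase B)
HOT_SETUP_S = 46.468 + 2.220   # Phase A: first boot + hot snapshot (assumed additive)


def sandbox_suite_wall(n_tests, test_wall_s):
    """Total wall time when every test gets a fresh Sandbox."""
    return n_tests * (SANDBOX_PER_TEST_S + test_wall_s)


def hot_revert_suite_wall(n_tests, test_wall_s):
    """Total wall time with one setup followed by a hot revert per test."""
    return HOT_SETUP_S + n_tests * (HOT_REVERT_PER_TEST_S + test_wall_s)


def break_even_tests():
    """Smallest suite size at which hot revert beats per-test Sandbox."""
    saving_per_test = SANDBOX_PER_TEST_S - HOT_REVERT_PER_TEST_S
    return math.ceil(HOT_SETUP_S / saving_per_test)


print(break_even_tests())  # 8
```

The test wall time cancels out of the comparison, so under these assumptions the break-even point depends only on the setup cost and the per-test overhead gap.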
| 174 | + |
| 175 | +**Save-VM / Start-VM is a different tool.** It's hibernate: state is |
| 176 | +preserved across the cycle, so it doesn't give you a reset. Useful only |
| 177 | +for "warm restart this same state" workflows (e.g., resume after a |
| 178 | +host-side power blip during a long test session). Don't confuse it |
| 179 | +with hot-snapshot revert. |
| 180 | + |
| 181 | +**Sandbox has no equivalent.** Windows Sandbox is a Hyper-V-isolated |
| 182 | +container, but its lifecycle is wrapped by the Sandbox Manager which |
| 183 | +exposes no save/checkpoint API. There is no `Save-Sandbox` cmdlet, no |
| 184 | +in-config checkpoint directive, and no `*-Sandbox` PowerShell command |
| 185 | +beyond launching one via `WindowsSandbox.exe <wsb-file>`. Mapped |
| 186 | +writable folders are the only state that survives a session. So the |
| 187 | +12 s Sandbox cost is paid on every reset: batching amortizes it only by |
| 188 | +giving up per-test isolation, unlike Hyper-V hot revert. |
| 189 | + |
| 190 | +## Hyper-V hot checkpoints are portable across hosts (2026-06-08) |
| 191 | + |
| 192 | +`run-hyperv-bench-portable.ps1` exports a VM with a hot Standard |
| 193 | +Checkpoint, then imports it back as a new VM with a fresh ID and |
| 194 | +times the resume cycle: |
| 195 | + |
| 196 | +``` |
| 197 | +Phase A (one-time setup, paid once per cached image): |
| 198 | + First boot to PSDirect 43.838 s |
| 199 | + Checkpoint-VM (Standard, hot) 2.111 s |
| 200 | + Stop-VM 0.313 s |
| 201 | + |
| 202 | +Phase B (Export-VM): |
| 203 | + Export-VM returned 1.737 s (same-volume reflink) |
| 204 | + export_total_gb 53.21 GB |
| 205 | + .vhdx files 52.53 GB (2 files, base + diff) |
| 206 | + .avhdx files 1.25 GB (snapshot diffs) |
| 207 | + .VMRS files (memory state) 0.69 GB (3 files; the big one is the hot checkpoint's RAM image) |
| 208 | + .vmgs files 0.01 GB |
| 209 | + .vmcx files ~120 KB |
| 210 | + |
| 211 | +Phase C (Import-VM and resume on the IMPORTED VM): |
| 212 | + Import-VM 3.023 s |
| 213 | + imported_snapshot_names = base-clean, exp-hot ← both came through |
| 214 | + Restore-VMCheckpoint exp-hot 0.128 s |
| 215 | + Start-VM (memory resume) 3.740 s |
| 216 | + PSDirect ready 0.979 s |
| 217 | + TOTAL import+resume 7.870 s |
| 218 | +``` |
| 219 | + |
| 220 | +**Bottom line:** |
| 221 | +- `.VMRS` files are the snapshot's memory + CPU + device state, and they ARE included in `Export-VM`. |
| 222 | +- `Import-VM` brings back the full snapshot tree. |
| 223 | +- `Restore-VMCheckpoint` to a hot snapshot on the imported VM works the same as on the original. |
| 224 | +- Same-volume export uses reflinks/hardlinks for VHDX files; **the real cross-host payload is ~10 GB** (the VHDX content) + 0.7 GB (memory state) + ~13 MB (config) ≈ **10.7 GB uncompressed**. VHDX content is highly compressible (lots of zeros from sparse provisioning). |
| 225 | + |
| 226 | +**CI artifact-caching model:** |
| 227 | +- ONE CI runner (the "warmer") pays the 44 s boot cost ONCE, takes the hot checkpoint, exports the VM, compresses the export folder, and uploads it as a CI artifact. |
| 228 | +- Every other runner pulls the artifact, decompresses, `Import-VM`s, `Restore-VMCheckpoint`s to the hot snapshot, `Start-VM`s. Total runner-side cost on a warm machine: **~8 s**. |
| 229 | +- Per-test cost on the imported VM: ~5.4 s (the same hot-revert cycle). |
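Whether the cached image beats a local cold boot on a given runner depends on how long the artifact takes to download and decompress, which was not measured here. A hedged sketch of that comparison, with `fetch_s` as a hypothetical, unmeasured parameter:

```python
FIRST_BOOT_S = 43.838     # first boot to PSDirect (Phase A above)
IMPORT_RESUME_S = 7.870   # Import-VM + restore + resume + PSDirect (Phase C above)


def runner_startup_s(fetch_s, cached=True):
    """Seconds until a runner has a ready VM.

    fetch_s: download + decompress time for the artifact. This is NOT
    measured in this document; supply your own estimate.
    """
    return fetch_s + IMPORT_RESUME_S if cached else FIRST_BOOT_S


def max_fetch_budget_s():
    """Fetch time below which the cached path is faster than a cold boot."""
    return round(FIRST_BOOT_S - IMPORT_RESUME_S, 3)


print(max_fetch_budget_s())  # 35.968
```

Under this model the artifact path only wins on startup time if the ~10.7 GB payload (less after compression) can be fetched and unpacked in under about 36 s; otherwise its value is in reproducibility rather than speed.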
| 230 | + |
| 231 | +**Cross-host caveats:** |
| 232 | +- **CPU compatibility.** Memory-state snapshots capture CPU registers and feature flags. Importing on a CPU that lacks features the snapshot expects (e.g. an AVX extension an older CPU does not implement) may fail or produce subtle errors. Hyper-V has a "Migrate to a physical computer with a different processor version" option in the VM's processor settings (`Set-VMProcessor -CompatibilityForMigrationEnabled $true` in PowerShell) that masks features down to a baseline. Set this on the warmer VM, before taking the hot checkpoint, if the CI fleet is heterogeneous. |
| 233 | +- **Hyper-V version skew.** An export should import on a host running the same or a newer Hyper-V version; importing on an older version is not supported. |
| 234 | +- **Generation 1 vs 2.** Keep the same VM generation on both ends. The harness VM here is Gen 2. |
| 235 | + |
| 236 | +## When portability is worth the bother |
| 237 | + |
| 238 | +It's worth it when: |
| 239 | +- The CI fleet has many runners and the per-runner boot cost (44 s) sums to a real wall-clock loss. |
| 240 | +- The test suite needs Hyper-V isolation (DISM, reboots, VS Installer — see `tools/hyperv-m69-system/README.md`) and so can't fall back to Sandbox. |
| 241 | +- The runners can store ~10 GB of cached image. |
| 242 | + |
| 243 | +It's NOT worth it when: |
| 244 | +- The whole suite is bare-host eligible (REPRO_REGISTRY_ROOT + per-test tempdirs cover it). |
| 245 | +- There are <10 runners in the fleet — the warmer-runner cost amortizes badly. |
| 246 | +- The test wall time per runner is dominated by per-test work, not by the one-time boot. |
| 247 | + |
| 248 | +The first-time provisioning cost (downloading the 20-50 GB Windows 11 |
| 249 | +dev VHDX, running OOBE, uninstalling VS, installing Nim/gcc, snapshotting) |
| 250 | +is **NOT** in the per-test overhead; it's a one-time bootstrap. See |
| 251 | +`tools/hyperv-m69-system/README.md`. |
| 252 | + |
| 253 | +## Which path to pick |
| 254 | + |
| 255 | +The four paths are complementary, not interchangeable. Use this |
| 256 | +decision table: |
| 257 | + |
| 258 | +| Test class | Use | |
| 259 | +|---|---| |
| 260 | +| Touches process-local state only (no HKCU/PATH/services) | bare host | |
| 261 | +| Writes to HKCU (env.userPath, registry resources) | bare host with REPRO_REGISTRY_ROOT (see project memory) — the leak fix supersedes the need to sandbox these | |
| 262 | +| Touches files in stable system paths (Program Files, ProgramData), needs per-test pristine | Hyper-V hot-revert — runs at 5.4 s/test with full reset | |
| 263 | +| Touches files BUT tests can be ordered/grouped so they don't collide | Sandbox per-test (12 s) OR Sandbox batched (if cleanup discipline is real) | |
| 264 | +| Needs DISM / OptionalFeature / Capability / WSL / VS Installer / reboot | Hyper-V VM (`tools/hyperv-m69-system/`) — Sandbox cannot provide Windows Update, persistent disk, or reboot capability | |
| 265 | +| Needs full Linux destructive scope | throwaway WSL (separate harness; see destructive-gate environments memo) | |
| 266 | + |
| 267 | +The combination of REPRO_REGISTRY_ROOT (driver-level seam, 0 overhead) |
| 268 | ++ Hyper-V hot-revert (5.4 s/test, full pristine state) covers the |
| 269 | +vast majority of the destructive-test surface without per-test cleanup |
| 270 | +discipline. Use Sandbox where its lower memory footprint (~4 GB vs |
| 271 | +Hyper-V's whole guest OS) matters more than the per-test isolation |
| 272 | +gap. |
| 273 | + |
| 274 | +Sandbox and Hyper-V both isolate from the host, but Sandbox can't run |
| 275 | +Windows Update / reboot / install VS. That's the dividing line |
| 276 | +documented in `tools/hyperv-m69-system/README.md` § "Why Hyper-V (and |
| 277 | +not Sandbox)" — quoting the empirical record (DISM payload fetch fails; |
| 278 | +VS Build Tools >1 hour; no reboots). |
| 279 | + |
| 280 | +## Files |
| 281 | + |
| 282 | +| File | Runs on | Purpose | |
| 283 | +|---|---|---| |
| 284 | +| `bench.wsb` | host | Windows Sandbox config; mapped folders + LogonCommand | |
| 285 | +| `provision-and-bench.ps1` | inside Sandbox | Stages VC++ DLLs and runs the bench payload; writes TIMINGS.txt | |
| 286 | +| `run-sandbox-bench.ps1` | host | Launches the sandbox, polls for DONE, reports timing | |
| 287 | +| `run-hyperv-bench.ps1` | host | Reverts the M69 harness VM (cold path), runs a trivial payload, reports timing | |
| 288 | +| `run-hyperv-bench-hot.ps1` | host | Takes a hot Standard Checkpoint, measures revert-to-running and Save-VM/Start-VM cycles | |
| 289 | +| `run-hyperv-bench-portable.ps1` | host | Round-trips a hot checkpoint through Export-VM / Import-VM; proves portability and reports import+resume cost | |
| 290 | +| `README.md` | — | This file | |