
Commit 4a3c8a2

tools(sandbox-bench): measured per-test overhead, three test paths
Reproducible PowerShell harnesses + decision-table README for the three options reprobuild has to run a destructive-class e2e test:

- bare host: 0 s per-test (no isolation)
- Windows Sandbox per-test: ~12 s per-test (fresh HKCU+FS, no Windows Update)
- Hyper-V VM cold-boot: ~29 s per-test (full Windows isolation)
- Hyper-V VM HOT-snapshot revert: ~5.4 s per-test (memory-image load)
- Hyper-V portable exported: ~8 s import+resume on a fresh runner

Files:

- bench.wsb: Windows Sandbox config
- provision-and-bench.ps1: in-Sandbox provisioner (timestamps T0-T6)
- run-sandbox-bench.ps1: host-side Sandbox launcher
- run-hyperv-bench.ps1: cold-boot Hyper-V revert (29 s)
- run-hyperv-bench-hot.ps1: Standard Checkpoint hot revert (5.4 s) + Save-VM/Start-VM hibernate (4.6 s)
- run-hyperv-bench-portable.ps1: Export-VM round-trip portability (8 s)
- README.md: decision table, methodology, raw numbers

The bench primitives measured here have been promoted to first-class API calls in metacraft-labs/vm-harness:

- snapshotRunning / restoreSnapshot / removeSnapshot
- exportBaseline / importBaseline
- CLI: `vm-harness snapshot create --running ...`, `vm-harness baseline export|import ...`

A backend-agnostic Nim bench driving the same measurements through the vm-harness library lives at vm-harness:tools/bench/. The PowerShell scripts here remain the canonical reproducers for the original numbers; new measurement work should prefer the vm-harness bench so future backends (Tart suspend, libvirt native snapshots) get the same harness for free.

See AGENTS.md memories:

- project_reprobuild_destructive_test_overhead.md
- project_reprobuild_user_path_pollution.md
1 parent bc3f328 commit 4a3c8a2

7 files changed

Lines changed: 1048 additions & 0 deletions


tools/sandbox-bench/README.md

Lines changed: 290 additions & 0 deletions
@@ -0,0 +1,290 @@
1+
# `tools/sandbox-bench/` — measured per-test overhead, three test paths
2+
3+
> **Note 2026-06-08**: the Hyper-V hot-revert and portability primitives
4+
> measured by the scripts in this directory are now first-class API
5+
> calls in `vm-harness`:
6+
>
7+
> - `snapshotRunning` / `restoreSnapshot` / `removeSnapshot`
8+
> - `exportBaseline` / `importBaseline`
9+
> - CLI: `vm-harness snapshot create --running ...`,
10+
> `vm-harness baseline export|import ...`
11+
>
12+
> A backend-agnostic Nim benchmark that drives the same measurement
13+
> through the library API ships at
14+
> `metacraft-labs/vm-harness:tools/bench/snapshot_revert_bench.nim`
15+
> (build: `nimble buildBench`). The PowerShell scripts in this
16+
> directory remain the canonical reproducer for the original
17+
> measurements documented here, but new measurement work should
18+
> prefer the vm-harness bench so future backends (Tart `suspend`,
19+
> libvirt native snapshots) can re-measure under the same harness.
20+
> See `docs/per-backend-notes/hyperv-snapshot-benchmarks.md` in
21+
> vm-harness for the project-agnostic version of the numbers below.
22+
23+
Concrete wall-time measurements for the three options reprobuild has
24+
to run a destructive-class e2e test:
25+
26+
| Path | Per-test overhead | Per-test isolation? | Batching unit | Isolation surface |
27+
|-------------|-------------------|---------------------|---------------|-------------------|
28+
| Bare host | 0 | n/a (with REPRO_REGISTRY_ROOT seam: yes for HKCU writes) | n/a | none — uses real HKCU/FS/PATH |
29+
| Windows Sandbox per-test | ~12 s | **yes** | one .wsb per test | fresh HKCU + FS, no Windows Update, no reboot |
30+
| Windows Sandbox batched in one session | ~0 amortized | **no** (no in-session reset; cleanup discipline required) | one .wsb session, many tests | same as above, but tests share state |
31+
| Hyper-V VM cold-boot per test | ~27–46 s | **yes** | one snapshot revert per test (VM stop+start) | fresh HKCU + FS + Windows Update + reboot |
32+
| Hyper-V VM **hot-snapshot revert** per test | **~5.4 s** | **yes** | one Restore-VMCheckpoint per test (no VM stop+start) | same as cold-boot, but VM stays alive between tests |
33+
| Hyper-V Save-VM/Start-VM (hibernate) | ~4.6 s | **no** (resume preserves state — no reset) | warm-restart only | same as cold-boot |
34+
35+
The Sandbox and Hyper-V numbers below were collected on this dev host
36+
(2026-06-06) with the harnesses in this dir. They are reproducible:
37+
`run-sandbox-bench.ps1` and `run-hyperv-bench.ps1` write their raw
38+
timestamps to `D:\metacraft\sandbox-bench-out\TIMINGS*.txt`.
39+
40+
## Measurement methodology
41+
42+
Both harnesses run a payload inside their isolation environment and
43+
record host-side and in-env timestamps at fixed checkpoints. The
44+
"wall time" is from script start (host) to DONE/Stopped (host).
45+
46+
The payload (the actual test) is **deliberately the same** across
47+
environments so the per-environment overhead is comparable. The test
48+
binary, `t_integration_plan_classifier_bucket_drift_is_cache_hit.exe`,
49+
is pre-built on the host once and copied/mapped into each environment.
50+
51+
Caveat for fair comparison: the m80 test calls `installScoopAppAtVersion`
52+
which needs a real `scoop.ps1` on PATH. The sandbox image doesn't have
53+
scoop installed, so the test fast-fails at the `resolveScoopBinary`
54+
assertion (89 ms). The bare-host run reaches further (1.75–2.9 s) before
55+
failing on a different assertion. **For the apples-to-apples overhead
56+
number, treat the test wall time as a constant and read the difference**
57+
between the environment's total wall time and that constant.
58+
59+
## Windows Sandbox: measured ~12 s overhead floor (2026-06-06)
60+
61+
```
62+
T0_wsb_launch = 0.000 s
63+
T1_logon_fired_host_observed = +9.086 s ← Sandbox cold-boot + LogonCommand
64+
T2_script_started = +11.438 s ← cmd.exe → powershell.exe handoff
65+
T3_vc_staged = +11.747 s ← VC++ DLL copy to System32 (~0.3 s)
66+
T4_test_started = +11.790 s ← stage test exe + sqlite + repro
67+
T5_test_finished = +11.900 s ← test ran (89 ms — fast-fail on missing scoop)
68+
T6_done = +11.928 s
69+
TOTAL host wall = 12.072 s
70+
```
71+
72+
Cost breakdown:
73+
74+
| Phase | Cost | Comment |
75+
|---|---|---|
76+
| Sandbox cold-boot | ~9 s | Win11 Sandbox is faster than its reputation suggests; on older hosts expect 30–60 s |
77+
| Cmd→PowerShell hop | ~2 s | LogonCommand is `cmd.exe /c` for diagnostic reasons (see `migration.wsb` header) |
78+
| VC++ stage | <1 s | Copying 7 DLLs from the mapped folder to System32 |
79+
| Test wall | = bare-host wall | Sandbox's CPU is host-equivalent; the test runs at host speed |
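
The phase costs in the table come from subtracting adjacent checkpoints in the timeline. A minimal Python sketch of that subtraction, with the checkpoint values copied from the run above; the list layout is illustrative and is not claimed to be the format of `TIMINGS.txt`:

```python
# Cumulative host-observed checkpoints (seconds since T0), copied from the
# 2026-06-06 Sandbox run in this README.
checkpoints = [
    ("T0_wsb_launch", 0.000),
    ("T1_logon_fired_host_observed", 9.086),
    ("T2_script_started", 11.438),
    ("T3_vc_staged", 11.747),
    ("T4_test_started", 11.790),
    ("T5_test_finished", 11.900),
    ("T6_done", 11.928),
]


def phase_durations(points):
    """Return (label, seconds) for each step between adjacent checkpoints."""
    return [
        (f"{a_name} -> {b_name}", round(b_t - a_t, 3))
        for (a_name, a_t), (b_name, b_t) in zip(points, points[1:])
    ]


for label, seconds in phase_durations(checkpoints):
    print(f"{label:55s} {seconds:6.3f} s")
```

The first two steps (cold-boot and the cmd-to-PowerShell hop) account for almost all of the ~12 s floor.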
80+
81+
**Implication.** Per-test sandbox cost is ~12 s + test wall time. So:
82+
- 1-second test → 12 s overhead → **13x**
83+
- 10-second test → 12 s overhead → **2.2x**
84+
- 60-second test → 12 s overhead → **1.2x**
85+
- 100 tests batched into ONE sandbox session → 12 s amortized across 100 → **+0.12 s per test**
86+
87+
The cost is the cold-boot, not the per-test work. Batching is the
88+
right answer if Sandbox isolation suffices.
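
The multipliers above are the ratio of isolated wall time to bare test wall time. A small Python sketch of that arithmetic, assuming a flat 12 s overhead per session; the function name is illustrative:

```python
SANDBOX_OVERHEAD_S = 12.0  # measured per-session floor from this README


def slowdown(test_wall_s, overhead_s):
    """Total time with isolation divided by the bare test wall time."""
    return (overhead_s + test_wall_s) / test_wall_s


for wall in (1, 10, 60):
    print(f"{wall:>3d} s test -> {slowdown(wall, SANDBOX_OVERHEAD_S):.1f}x")

# Batching: one 12 s boot spread over 100 tests sharing a session.
print(f"batched x100 -> +{SANDBOX_OVERHEAD_S / 100:.2f} s per test")
```

Passing a different `overhead_s` (for example 29 for the Hyper-V cold-boot path) gives the matching figures for the other paths.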
89+
90+
## Hyper-V VM: measured ~29 s overhead floor (2026-06-06)
91+
92+
Reusing the M69 harness VM `repro-m69-hyperv` reverted to its `base-clean`
93+
snapshot:
94+
95+
```
96+
T0_start = 0.000 s
97+
T1_revert_done = +0.189 s ← Restore-VMCheckpoint (diff-layer drop)
98+
T2_psdirect_ready = +26.592 s ← Start-VM + Windows boot + PSDirect handshake
99+
T3_stage_done = +26.760 s ← Copy-VMFile a tiny payload (host → guest)
100+
T4_invoke_done = +28.315 s ← Invoke-Command -VMName ran Get-Date + Get-Content
101+
T5_stopped = +28.582 s ← Stop-VM -TurnOff
102+
TOTAL host wall = 28.586 s
103+
```
104+
105+
Cost breakdown:
106+
107+
| Phase | Cost | Comment |
108+
|---|---|---|
109+
| Restore-VMCheckpoint | ~0.2 s | The differencing-disk revert is a metadata flip |
110+
| Start-VM + boot to PSDirect | ~26 s | Full Windows guest boot to where `Invoke-Command -VMName { hostname }` succeeds |
111+
| Copy-VMFile stage | <0.2 s | tiny file; scales with payload size |
112+
| Invoke-Command round-trip | ~1.5 s | PSDirect channel overhead per RPC, not the command itself |
113+
| Stop-VM -TurnOff | ~0.3 s | Hard power-off; no clean shutdown |
114+
115+
**Implication.** Per-test Hyper-V cost is ~29 s + test wall time + ~1.5 s
116+
per Invoke-Command round-trip (so if the per-test runner stages, runs,
117+
and collects logs as three separate Invoke-Commands, that's ~4.5 s of RPC
118+
overhead on top of the 29 s boot).
119+
120+
Comparison (cold-boot path):
121+
- 1-second test → Hyper-V = +29 s → **30x**
122+
- 10-second test → Hyper-V = +29 s → **3.9x**
123+
- 60-second test → Hyper-V = +29 s → **1.5x**
124+
- 100 tests batched into ONE Hyper-V session → 29 s amortized → **+0.29 s per test**
125+
126+
Hyper-V is ~2.4x slower per session than Sandbox (29 s vs 12 s) but
127+
provides full Windows isolation including Windows Update access,
128+
persistent disk, and reboot capability — the three things Sandbox
129+
cannot provide.
130+
131+
## Hyper-V VM with HOT-snapshot revert: measured ~5.4 s per test (2026-06-08)
132+
133+
Standard Checkpoints in Hyper-V capture the memory + CPU + device state
134+
of a RUNNING VM. `Restore-VMCheckpoint` to such a snapshot returns the
135+
VM to that exact running state — no Windows boot, no re-OOBE,
136+
no rebuilding of the Win32 subsystem. The existing `base-clean`
137+
snapshot is a cold snapshot (taken with the VM Off) so it has no memory
138+
state; `run-hyperv-bench-hot.ps1` takes a fresh `base-hot` snapshot
139+
once with the VM running, then measures the revert cycle.
140+
141+
```
142+
Phase A — one-time setup:
143+
A0_start = 0.000 s
144+
A1_first_boot_done = +46.468 s ← cold boot, only paid ONCE
145+
A2_hot_snapshot_taken = +2.220 s ← captures RAM + CPU + devices
146+
147+
Phase B — revert-from-hot × 3 iterations:
148+
iter1: restore 4.16 s + PSDirect 0.94 s = 5.10 s
149+
iter2: restore 4.72 s + PSDirect 0.93 s = 5.65 s
150+
iter3: restore 4.51 s + PSDirect 0.97 s = 5.48 s
151+
AVERAGE = 5.41 s
152+
153+
Phase C — Save-VM / Start-VM (hibernate, NOT a reset):
154+
C1_save_returned = +1.673 s ← writes RAM to disk
155+
C2_start_returned = +2.003 s ← reads RAM back
156+
C3_psdirect_ready = +0.943 s
157+
TOTAL = 4.62 s
158+
```
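
The 5.41 s average can be re-derived from the three Phase B iterations. A short Python sketch with the values copied from the output above; it also shows how much of each cycle is the restore itself:

```python
from statistics import mean

# (restore_s, psdirect_s) per iteration, copied from Phase B above.
iterations = [(4.16, 0.94), (4.72, 0.93), (4.51, 0.97)]

cycle_times = [restore + psdirect for restore, psdirect in iterations]
average = mean(cycle_times)
restore_share = sum(r for r, _ in iterations) / sum(cycle_times)

print(f"average revert cycle: {average:.2f} s")
print(f"share spent in Restore-VMCheckpoint: {restore_share:.0%}")
```

Three iterations is a small sample, so the average is best read as an order-of-magnitude figure rather than a tight bound.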
159+
160+
**This changes the routine-CI picture.** With hot-snapshot revert:
161+
162+
| Test wall | Bare host | Sandbox (per-test) | Hyper-V (hot revert) |
163+
|---|---|---|---|
164+
| 1 s | 1 s | 13 s (13×) | **6.4 s (6.4×)** |
165+
| 10 s | 10 s | 22 s (2.2×) | **15.4 s (1.5×)** |
166+
| 60 s | 60 s | 72 s (1.2×) | **65.4 s (1.1×)** |
167+
| 100 batched | 100 × test_wall | +0.12 s/test (one shared session, no per-test reset) | **46 s setup + 100 × (5.4 s + test_wall)** |
168+
169+
Hyper-V with hot revert is **competitive with per-test Sandbox** for
170+
sub-minute tests, AND it gives every test full pristine state without
171+
needing in-test cleanup discipline. For tests requiring DISM / Windows
172+
Update / reboot it's the only option — and the cost is no longer
173+
prohibitive.
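
To compare the paths for a whole suite rather than a single test, the table can be turned into a small cost model. This is a sketch under simple assumptions: tests run sequentially, each gets a fresh reset, and the overheads are the flat measured values from this README. The names are illustrative.

```python
# Measured per-test overheads from this README (seconds).
OVERHEAD_S = {
    "bare_host": 0.0,
    "sandbox_per_test": 12.0,
    "hyperv_cold_boot": 29.0,
    "hyperv_hot_revert": 5.4,
}
HOT_SETUP_S = 46.0  # one-time first boot before the hot snapshot is taken


def suite_cost(path, n_tests, test_wall_s):
    """Total wall time for n_tests run sequentially with a per-test reset."""
    total = n_tests * (OVERHEAD_S[path] + test_wall_s)
    if path == "hyperv_hot_revert":
        total += HOT_SETUP_S  # paid once, before the first test
    return total


for path in OVERHEAD_S:
    print(f"{path:20s} {suite_cost(path, 100, 1.0):8.1f} s for 100 x 1 s tests")
```

With these inputs, the one-time 46 s setup is recovered after about seven tests relative to per-test Sandbox, since each hot revert saves roughly 6.6 s.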
174+
175+
**Save-VM / Start-VM is a different tool.** It's hibernate: state is
176+
preserved across the cycle, so it doesn't give you a reset. Useful only
177+
for "warm restart this same state" workflows (e.g., resume after a
178+
host-side power blip during a long test session). Don't confuse it
179+
with hot-snapshot revert.
180+
181+
**Sandbox has no equivalent.** Windows Sandbox is a Hyper-V-isolated
182+
container, but its lifecycle is wrapped by the Sandbox Manager which
183+
exposes no save/checkpoint API. There is no `Save-Sandbox` cmdlet, no
184+
in-config checkpoint directive, and no `*-Sandbox` PowerShell command
185+
beyond launching one via `WindowsSandbox.exe <wsb-file>`. Mapped
186+
writable folders are the only state that survives a session. So the
187+
12 s Sandbox cost is paid for every pristine session; you can't amortize it
188+
the way you can with Hyper-V hot revert.
189+
190+
## Hyper-V hot checkpoints are portable across hosts (2026-06-08)
191+
192+
`run-hyperv-bench-portable.ps1` exports a VM with a hot Standard
193+
Checkpoint, then imports it back as a new VM with a fresh ID and
194+
times the resume cycle:
195+
196+
```
197+
Phase A (one-time setup, paid once per cached image):
198+
First boot to PSDirect 43.838 s
199+
Checkpoint-VM (Standard, hot) 2.111 s
200+
Stop-VM 0.313 s
201+
202+
Phase B (Export-VM):
203+
Export-VM returned 1.737 s (same-volume reflink)
204+
export_total_gb 53.21 GB
205+
.vhdx files 52.53 GB (2 files, base + diff)
206+
.avhdx files 1.25 GB (snapshot diffs)
207+
.VMRS files (memory state) 0.69 GB (3 files; the big one is the hot checkpoint's RAM image)
208+
.vmgs files 0.01 GB
209+
.vmcx files ~120 KB
210+
211+
Phase C (Import-VM and resume on the IMPORTED VM):
212+
Import-VM 3.023 s
213+
imported_snapshot_names = base-clean, exp-hot ← both came through
214+
Restore-VMCheckpoint exp-hot 0.128 s
215+
Start-VM (memory resume) 3.740 s
216+
PSDirect ready 0.979 s
217+
TOTAL import+resume 7.870 s
218+
```
219+
220+
**Bottom line:**
221+
- `.VMRS` files are the snapshot's memory + CPU + device state, and they ARE included in `Export-VM`.
222+
- `Import-VM` brings back the full snapshot tree.
223+
- `Restore-VMCheckpoint` to a hot snapshot on the imported VM works the same as on the original.
224+
- Same-volume export uses reflinks/hardlinks for VHDX files; **the real cross-host payload is ~10 GB** (the VHDX content) + 0.7 GB (memory state) + ~13 MB (config) ≈ **10.7 GB uncompressed**. VHDX content is highly compressible (lots of zeros from sparse provisioning).
225+
226+
**CI artifact-caching model:**
227+
- ONE CI runner (the "warmer") pays the 44 s boot cost ONCE, takes the hot checkpoint, exports the VM, compresses the export folder, and uploads it as a CI artifact.
228+
- Every other runner pulls the artifact, decompresses it, then runs `Import-VM`, `Restore-VMCheckpoint` to the hot snapshot, and `Start-VM`. Total runner-side cost on a warm machine: **~8 s** (import + resume only; download and decompression were not measured).
229+
- Per-test cost on the imported VM: ~5.4 s (the same hot-revert cycle).
230+
231+
**Cross-host caveats:**
232+
- **CPU compatibility.** Memory-state snapshots capture CPU registers and feature flags. Importing on a CPU that lacks features the snapshot expects (e.g. an older CPU without the AVX extensions present on the warmer) may fail or produce subtle errors. Hyper-V has a "Migrate to a physical computer with a different processor version" option in the VM's processor settings that masks features down to a baseline; set this on the warmer VM if the CI fleet is heterogeneous.
233+
- **Hyper-V version skew.** An export should import on a host running the same or a newer Hyper-V version; importing on an older version is not supported.
234+
- **Generation 1 vs 2.** Use the same VM generation on both ends. The harness VM here is Gen 2.
235+
236+
## When portability is worth the bother
237+
238+
It's worth it when:
239+
- The CI fleet has many runners and the per-runner boot cost (44 s) sums to a real wall-clock loss.
240+
- The test suite needs Hyper-V isolation (DISM, reboots, VS Installer — see `tools/hyperv-m69-system/README.md`) and so can't fall back to Sandbox.
241+
- The runners can store ~10 GB of cached image.
242+
243+
It's NOT worth it when:
244+
- The whole suite is bare-host eligible (REPRO_REGISTRY_ROOT + per-test tempdirs cover it).
245+
- There are <10 runners in the fleet — the warmer-runner cost amortizes badly.
246+
- The test wall time per runner is dominated by per-test work, not by the one-time boot.
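
The trade-off in the two lists above can be roughed out numerically. The sketch below uses the measured 44 s boot and ~8 s import+resume; the transfer and warmer-side times were not measured in this README, so they are explicit parameters, and the example values passed in are hypothetical.

```python
def fleet_savings_s(n_runners, transfer_s, warmer_extra_s,
                    boot_s=44.0, import_resume_s=8.0):
    """Seconds saved across the fleet by caching the exported VM.

    transfer_s: per-runner download + decompress time (not measured here).
    warmer_extra_s: warmer-only export + compress + upload time (not measured).
    Simplification: the warmer is counted separately from the n_runners.
    """
    per_runner_saving = boot_s - (import_resume_s + transfer_s)
    return n_runners * per_runner_saving - warmer_extra_s


# Hypothetical inputs: 20 s to fetch the image, 120 s of extra warmer work.
for runners in (5, 20):
    saved = fleet_savings_s(runners, transfer_s=20.0, warmer_extra_s=120.0)
    print(f"{runners:>2d} runners -> {saved:+.0f} s")
```

A negative result means the cache costs more than it saves. The break-even fleet size depends mostly on how long the ~10 GB transfer takes, which is worth measuring before committing to this model.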
247+
248+
The first-time provisioning cost (downloading the 20-50 GB Windows 11
249+
dev VHDX, running OOBE, uninstalling VS, installing Nim/gcc, snapshotting)
250+
is **NOT** in the per-test overhead; it's a one-time bootstrap. See
251+
`tools/hyperv-m69-system/README.md`.
252+
253+
## Which path to pick
254+
255+
The paths are complementary, not interchangeable. Use this
256+
decision table:
257+
258+
| Test class | Use |
259+
|---|---|
260+
| Touches process-local state only (no HKCU/PATH/services) | bare host |
261+
| Writes to HKCU (env.userPath, registry resources) | bare host with REPRO_REGISTRY_ROOT (see project memory) — the leak fix supersedes the need to sandbox these |
262+
| Touches files in stable system paths (Program Files, ProgramData), needs per-test pristine state | Hyper-V hot-revert (5.4 s/test with full reset) |
263+
| Touches files BUT tests can be ordered/grouped so they don't collide | Sandbox per-test (12 s) OR Sandbox batched (if cleanup discipline is real) |
264+
| Needs DISM / OptionalFeature / Capability / WSL / VS Installer / reboot | Hyper-V VM (`tools/hyperv-m69-system/`) — Sandbox cannot provide Windows Update, persistent disk, or reboot capability |
265+
| Needs full Linux destructive scope | throwaway WSL (separate harness; see destructive-gate environments memo) |
266+
267+
The combination of REPRO_REGISTRY_ROOT (driver-level seam, 0 overhead)
268+
+ Hyper-V hot-revert (5.4 s/test, full pristine state) covers the
269+
vast majority of the destructive-test surface without per-test cleanup
270+
discipline. Use Sandbox where its lower memory footprint (~4 GB vs
271+
Hyper-V's whole guest OS) matters more than the per-test isolation
272+
gap.
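
The decision table can be expressed as a small lookup for use in a test runner. The sketch below is illustrative only: the flag names are hypothetical, and the precedence (strongest requirement wins) is an assumption, since the table does not define an order.

```python
def pick_path(*, linux_destructive=False, windows_servicing=False,
              system_paths=False, pristine_per_test=False, writes_hkcu=False):
    """Map a test's needs to a path name from the decision table.

    The check order is an assumption of this sketch, not part of the table.
    """
    if linux_destructive:
        return "throwaway WSL"
    if windows_servicing:  # DISM, OptionalFeature, WSL, VS Installer, reboot
        return "Hyper-V VM"
    if system_paths and pristine_per_test:
        return "Hyper-V hot-revert"
    if system_paths:
        return "Sandbox (per-test or batched)"
    if writes_hkcu:
        return "bare host with REPRO_REGISTRY_ROOT"
    return "bare host"


print(pick_path(writes_hkcu=True))
print(pick_path(system_paths=True, pristine_per_test=True))
```

Keyword-only flags keep call sites readable and make it harder to pass requirements in the wrong position.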
273+
274+
Sandbox and Hyper-V both isolate from the host, but Sandbox can't run
275+
Windows Update / reboot / install VS. That's the dividing line
276+
documented in `tools/hyperv-m69-system/README.md` § "Why Hyper-V (and
277+
not Sandbox)" — quoting the empirical record (DISM payload fetch fails;
278+
VS Build Tools >1 hour; no reboots).
279+
280+
## Files
281+
282+
| File | Runs on | Purpose |
283+
|---|---|---|
284+
| `bench.wsb` | host | Windows Sandbox config; mapped folders + LogonCommand |
285+
| `provision-and-bench.ps1` | inside Sandbox | Stages VC++ DLLs and runs the bench payload; writes TIMINGS.txt |
286+
| `run-sandbox-bench.ps1` | host | Launches the sandbox, polls for DONE, reports timing |
287+
| `run-hyperv-bench.ps1` | host | Reverts the M69 harness VM (cold path), runs a trivial payload, reports timing |
288+
| `run-hyperv-bench-hot.ps1` | host | Takes a hot Standard Checkpoint, measures revert-to-running and Save-VM/Start-VM cycles |
289+
| `run-hyperv-bench-portable.ps1` | host | Round-trips a hot checkpoint through Export-VM / Import-VM; proves portability and reports import+resume cost |
290+
| `README.md` | n/a | This file |

tools/sandbox-bench/bench.wsb

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
<Configuration>
2+
<!--
3+
Sandbox-bench harness - Windows Sandbox config for measuring
4+
per-test-suite overhead vs bare-host runs. Modelled on
5+
tools/sandbox-migration/migration.wsb but stripped to the
6+
minimum needed to run a pre-built nim test exe:
7+
8+
- the test binary directory (build/test-bin)
9+
- the production repro binary directory (build/bin) so any
10+
subprocess apply has repro.exe + sqlite3_64.dll on hand
11+
- vcruntime DLLs (mandatory: a pristine Windows Sandbox image
12+
ships without the VC++ 2015-2022 runtime; mingw-gcc-built
13+
nim exes need msvcp140 et al. to LoadLibrary correctly)
14+
- the bench provision script
15+
- the writable OUTPUT directory
16+
17+
Nothing else is needed because the bench runs a PRE-BUILT exe,
18+
so the dev shell and the Nim/gcc toolchain stay on the host.
19+
-->
20+
<VGpu>Disable</VGpu>
21+
<Networking>Disable</Networking>
22+
<MemoryInMB>4096</MemoryInMB>
23+
<MappedFolders>
24+
<MappedFolder>
25+
<HostFolder>D:\metacraft\reprobuild\build\test-bin</HostFolder>
26+
<SandboxFolder>C:\harness\test-bin</SandboxFolder>
27+
<ReadOnly>true</ReadOnly>
28+
</MappedFolder>
29+
<MappedFolder>
30+
<HostFolder>D:\metacraft\reprobuild\build\bin</HostFolder>
31+
<SandboxFolder>C:\harness\repro-bin</SandboxFolder>
32+
<ReadOnly>true</ReadOnly>
33+
</MappedFolder>
34+
<MappedFolder>
35+
<HostFolder>D:\metacraft\reprobuild\tools\sandbox-migration\vcruntime</HostFolder>
36+
<SandboxFolder>C:\harness\vcruntime</SandboxFolder>
37+
<ReadOnly>true</ReadOnly>
38+
</MappedFolder>
39+
<MappedFolder>
40+
<HostFolder>D:\metacraft\reprobuild\tools\sandbox-bench</HostFolder>
41+
<SandboxFolder>C:\harness\scripts</SandboxFolder>
42+
<ReadOnly>true</ReadOnly>
43+
</MappedFolder>
44+
<MappedFolder>
45+
<HostFolder>D:\metacraft\sandbox-bench-out</HostFolder>
46+
<SandboxFolder>C:\harness\out</SandboxFolder>
47+
<ReadOnly>false</ReadOnly>
48+
</MappedFolder>
49+
</MappedFolders>
50+
51+
<LogonCommand>
52+
<Command>cmd.exe /c "echo logon-fired %DATE% %TIME% &gt; C:\harness\out\_logon-heartbeat.txt &amp; copy /Y C:\harness\scripts\provision-and-bench.ps1 C:\provision-and-bench.ps1 &gt;&gt; C:\harness\out\_logon-heartbeat.txt 2&gt;&amp;1 &amp; powershell.exe -NoProfile -ExecutionPolicy Bypass -File C:\provision-and-bench.ps1 &gt; C:\harness\out\_logon-powershell.log 2&gt;&amp;1"</Command>
53+
</LogonCommand>
54+
</Configuration>
