| 1 | +# Spec 047 §14 Phase 3 completion — ARM64 ratification capture (LAPTOP-4MEP83VI) |
| 2 | + |
| 3 | +**Result: NOT RATIFIED / INCONCLUSIVE.** This capture is valuable |
| 4 | +evidence but does **not** satisfy the ARM64 stable-AC ratification gate. |
| 5 | +The fixed variant-ordering run shows strong time/order drift (suspected |
| 6 | +thermal throttling) that systematically disadvantages whichever variant |
| 7 | +runs last, which is always `ReactorDescriptors`. Under the contaminated |
| 8 | +numbers the §13 Q1 gating bench M2 exceeds the 15% threshold (+36.1% vs |
| 9 | +ReactorV2); a controlled order-swap re-run (below) shows that this M2 |
| 10 | +"regression" is a position artifact, not a real descriptor cost. A thermally-clean Phase 3 |
| 11 | +ARM64 re-run is still required to formally close the §14 gate. |
| 12 | + |
| 13 | +This is the capture the spec defers to in §14 |
| 14 | +("ARM64 stable-AC ratification gate — pending"). It was captured on the authoritative |
| 15 | +**machine** (`LAPTOP-4MEP83VI`, the Phase 0/2 baseline machine), but the |
| 16 | +**run conditions did not stay stable**, so it cannot stand as the |
| 17 | +ratifying capture on its own. |
| 18 | + |
| 19 | +## Capture environment |
| 20 | + |
| 21 | +`LAPTOP-4MEP83VI`, ARM64-native (Qualcomm/Snapdragon, ARMv8 64-bit), |
| 22 | +Release, .NET 10.0.8, Windows 11 26200. AC power connected (battery 80%), |
| 23 | +Windows power plan forced to **High performance** for the run and restored |
| 24 | +to Balanced afterward. Branch `spec/047-phase3-completion` @ HEAD |
| 25 | +(PR #440). The `PerfBench.ControlModel` harness is **unchanged from |
| 26 | +`main` on this branch** — the bench's `DescriptorVariantFactory` |
| 27 | +registration set is identical to prior captures (so this measures the |
| 28 | +same descriptor interpreter, not the production `RegisterV1BuiltInHandlers` |
| 29 | +~76-type table). |
| 30 | + |
| 31 | +3 process launches × 5 reps × 13 benches × 4 variants = 780 measurements |
| 32 | +in `launch-1.jsonl` + `launch-2.jsonl` + `launch-3.jsonl`. The |
| 33 | +order-swap confirmation adds 180 measurements in |
| 34 | +`confirm-reversed-launch-{1,2,3}.jsonl`. |
| 35 | + |
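The row counts above follow directly from the run matrix. A minimal sketch of that arithmetic, assuming every (launch, rep, bench, variant) cell produces exactly one row and that the order-swap control keeps 5 reps per launch (the helper name `expected_rows` is hypothetical, not part of the harness):

```python
def expected_rows(launches: int, reps: int, benches: int, variants: int) -> int:
    """Rows expected when each (launch, rep, bench, variant) cell is measured once."""
    return launches * reps * benches * variants


# Primary capture: 13 benches, 4 variants.
primary_rows = expected_rows(launches=3, reps=5, benches=13, variants=4)

# Order-swap control: gating benches M1/M2/M5/M7, and the three variants
# passed to --variant (Direct is not included in that run).
control_rows = expected_rows(launches=3, reps=5, benches=4, variants=3)

print(primary_rows, control_rows)
```

A row count that differs from these totals would suggest a truncated or duplicated launch file.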
| 36 | +> **Note on power telemetry.** The bench records `powerState`/`powerPlan` |
| 37 | +> as `unknown` (env capture does not read them). The "High performance / |
| 38 | +> AC" conditions above are documented manually, not embedded in the JSON. |
| 39 | +> No CPU frequency / package-temperature / throttle telemetry was |
| 40 | +> captured, so "thermal throttling" below is the **suspected** mechanism |
| 41 | +> of the observed time/order drift, not a directly measured fact. |
| 42 | + |
| 43 | +## Headline — V1 ON (descriptors) vs V1 OFF (today), median-of-n=15 |
| 44 | + |
| 45 | +Primary run, fixed variant order per bench |
| 46 | +(Direct → ReactorToday → ReactorV2 → **ReactorDescriptors last**). |
| 47 | +Full per-cell table with 95% CI in `summary.md`. |
| 48 | + |
| 49 | +| Bench | Desc vs Today (ns, Δ%) | Desc vs ReactorV2 (ns, Δ%) | Trust |
| 50 | +|---|---:|---:|---| |
| 51 | +| M1 Mount_Leaf_NoCallback | +30.1% | -1.6% | **High** (fast, stable) | |
| 52 | +| M2 Mount_Leaf_OneCallback | +23.4% | +36.1% | **Low** (drift-contaminated) | |
| 53 | +| M3 Mount_Leaf_ThreeCallbacks | +175.3% | +119.0% | **Invalid** (drift-contaminated) | |
| 54 | +| M4 Dispatch_Switch_Cold | -17.4% | -22.3% | Low (drift) | |
| 55 | +| M5 Dispatch_Switch_Warm | -30.4% | -28.9% | Low (drift) | |
| 56 | +| M6 Dispatch_ExternalType | -4.0% | -0.8% | High | |
| 57 | +| M7 Update_NoChange | +8.9% | +3.5% | **High** (fast, stable) | |
| 58 | +| M8 Update_OneLeafChanged | +17.9% | +1.4% | **High** (fast, stable) | |
| 59 | +| M9 Update_AllChanged | +3.4% | +1.2% | Medium (long but alloc-bound) | |
| 60 | +| M10 EventHandlerState_Alloc | +17.4% | +15.9% | Low (drift) |
| 61 | +| M11 ModifierEHS_Frequency | +11.5% | +0.5% | **High** (fast, stable) | |
| 62 | +| M12 Pool_Rent_HotPath | +44.2% | +5.6% | Low (drift) | |
| 63 | +| M13 Setters_Suppression | -3.0% | -4.0% | High (correctness bench) | |
| 64 | + |
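For readers checking the table by hand, the deltas can be read as a relative difference of per-variant medians. This is a sketch under that assumption only: the exact formula lives in `aggregate.py`, which is not reproduced here, and the function names and sample timings below are made up for illustration.

```python
from statistics import median


def pct_delta(candidate_ns, baseline_ns):
    """Relative difference of medians in percent (positive = candidate slower)."""
    base = median(baseline_ns)
    return (median(candidate_ns) - base) / base * 100.0


def exceeds_threshold(delta_pct, threshold_pct=15.0):
    """True when a delta is above the 15% threshold mentioned for §13 Q1."""
    return delta_pct > threshold_pct


# Illustrative timings only; NOT taken from the capture.
reactor_v2_ns = [100.0, 102.0, 98.0, 101.0, 99.0]
descriptors_ns = [120.0, 118.0, 121.0, 119.0, 122.0]

delta = pct_delta(descriptors_ns, reactor_v2_ns)
print(f"{delta:+.1f}%", exceeds_threshold(delta))
```

Using medians rather than means limits the influence of a single slow rep, but it does not remove a systematic position effect like the one described in the next section.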
| 65 | +## Why these numbers are contaminated — the drift evidence |
| 66 | + |
| 67 | +Within a single launch, `ReactorDescriptors` mean ns climbs steeply from |
| 68 | +rep0 → rep4 on the long-running benches, while the short benches stay |
| 69 | +flat: |
| 70 | + |
| 71 | +| Bench | rep0 → rep4 climb (Descriptors) | per-rep duration | |
| 72 | +|---|---:|---| |
| 73 | +| M1 | +24% | ~40 µs | |
| 74 | +| M2 | +45% | ~95 µs | |
| 75 | +| M3 | +55% | ~1.7 ms | |
| 76 | +| M4 | +28% | ~100 µs | |
| 77 | +| M5 | +30% | ~100 µs | |
| 78 | +| M12 | +11% | ~55 µs | |
| 79 | +| M7 / M8 / M11 / M13 | ≈flat | ≤10 µs | |
| 80 | + |
| 81 | +The climb broadly tracks bench duration, not the variant, which is the |
| 82 | +pattern expected from a CPU shedding clock under sustained load on a fanless / |
| 83 | +thermally-limited ARM64 laptop. Because the four variants run |
| 84 | +back-to-back within each bench and `ReactorDescriptors` is **always |
| 85 | +scheduled last**, it runs against the hottest core in each bench window. |
| 86 | +The means are therefore **not independent of run position**. |
| 87 | + |
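The rep0 → rep4 climb in the table can be recomputed from the raw rows. The sketch below assumes each JSONL row exposes `bench`, `variant`, `rep`, and `mean_ns` fields; those field names are hypothetical, since the row schema is not shown in this README, and the sample rows are invented.

```python
from collections import defaultdict


def rep_climb(rows, first_rep=0, last_rep=4):
    """Percent change in mean ns from first_rep to last_rep, per (bench, variant).

    Rows from several launches are averaged per (bench, variant, rep) first.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [total ns, row count]
    for r in rows:
        cell = sums[(r["bench"], r["variant"], r["rep"])]
        cell[0] += r["mean_ns"]
        cell[1] += 1

    climbs = {}
    for bench, variant, rep in list(sums):
        if rep != first_rep:
            continue
        last = sums.get((bench, variant, last_rep))
        if last is None:
            continue  # no matching last rep for this cell
        first_total, first_n = sums[(bench, variant, first_rep)]
        first_mean = first_total / first_n
        last_mean = last[0] / last[1]
        climbs[(bench, variant)] = (last_mean - first_mean) / first_mean * 100.0
    return climbs


# Invented rows for illustration; not capture data.
sample = [
    {"bench": "M2", "variant": "ReactorDescriptors", "rep": 0, "mean_ns": 100.0},
    {"bench": "M2", "variant": "ReactorDescriptors", "rep": 4, "mean_ns": 145.0},
    {"bench": "M7", "variant": "ReactorDescriptors", "rep": 0, "mean_ns": 10.0},
    {"bench": "M7", "variant": "ReactorDescriptors", "rep": 4, "mean_ns": 10.0},
]
climbs = rep_climb(sample)
print(climbs)
```

Running the same computation for every variant, not only `ReactorDescriptors`, is what would confirm whether the climb is independent of the variant.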
| 88 | +## Decisive control — order-swap re-run (gating benches) |
| 89 | + |
| 90 | +To separate "real regression" from "position artifact" I re-ran the |
| 91 | +§13 Q1 gating benches (M1/M2/M5/M7) with the variant order **reversed** |
| 92 | +so `ReactorDescriptors` runs **first / cold** and `ReactorToday` runs |
| 93 | +last / hot (`--variant ReactorDescriptors ReactorV2 ReactorToday`, |
| 94 | +3 launches). Raw data: `confirm-reversed-launch-{1,2,3}.jsonl`. |
| 95 | + |
| 96 | +| Bench | Desc vs Today — Desc LAST | Desc vs Today — Desc FIRST | swing | |
| 97 | +|---|---:|---:|---:| |
| 98 | +| M1 | +30.1% | +32.5% | +2.4pp (stable) | |
| 99 | +| M2 | +23.4% | **-30.5%** | **-54.0pp (sign flip)** | |
| 100 | +| M5 | -30.4% | -7.2% | +23.3pp | |
| 101 | +| M7 | +8.9% | +128.2% | +119.4pp (see note) | |
| 102 | + |
| 103 | +| Bench | Desc vs ReactorV2 — LAST | Desc vs ReactorV2 — FIRST | |
| 104 | +|---|---:|---:| |
| 105 | +| M1 | -1.6% | +9.2% | |
| 106 | +| M2 | **+36.1%** | **+1.1%** | |
| 107 | +| M5 | -28.9% | -11.4% | |
| 108 | +| M7 | +3.5% | +113.7% (see note) | |
| 109 | + |
| 110 | +**Reading:** |
| 111 | + |
| 112 | +- **M2 is the headline proof.** Its Descriptors-vs-Today delta flips |
| 113 | +  from **+23.4%** (Descriptors last) to **−30.5%** (Descriptors first), |
| 114 | +  a 54-percentage-point swing driven by execution position. The |
| 115 | +  Descriptors-vs-ReactorV2 comparison, the axis Q1 gates on, collapses from +36.1% |
| 116 | +  to **+1.1%** once Descriptors no longer runs last. **There |
| 117 | +  is no real M2 descriptor regression**: the formal Q1 failure in the |
| 118 | +  primary table is a contamination artifact. |
| 119 | +- **M1 is order-robust** (+30% vs Today in both orderings) and is |
| 120 | + Descriptors ≈ ReactorV2 (±10pp). This is the genuine **V1-protocol |
| 121 | + vs legacy mount overhead** seen in every prior capture (it is not |
| 122 | + descriptor-specific — hand-coded V1 pays the same). |
| 123 | +- **M5** stays a Descriptors win in both orderings (direction robust). |
| 124 | +- **M7 reversed has its own artifact** — a rep0→rep1 step jump |
| 125 | + (~10 µs → ~27 µs) appears for Descriptors in the small-selection |
| 126 | + reversed run (likely JIT tiering / background recompilation specific |
| 127 | + to the reduced job set). The **full-run** M7 (+8.9% vs Today, +3.5% |
| 128 | + vs V2, flat across reps) is the trustworthy M7 number; the reversed |
| 129 | + M7 should be disregarded. |
| 130 | + |
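The swing column is the difference, in percentage points, between the two orderings. A small sketch using the rounded table values follows (function names are hypothetical). Recomputing from rounded inputs can differ from the published swing by 0.1pp, for example −53.9 here versus −54.0 in the table, presumably because the table was derived from unrounded deltas.

```python
def swing_pp(delta_desc_last, delta_desc_first):
    """Change in the delta, in percentage points, when Descriptors moves from last to first."""
    return delta_desc_first - delta_desc_last


def sign_flip(delta_desc_last, delta_desc_first):
    """True when the two orderings disagree on which variant is faster."""
    return delta_desc_last * delta_desc_first < 0


# (Desc LAST, Desc FIRST) deltas vs Today, copied from the table above.
gating = {"M1": (30.1, 32.5), "M2": (23.4, -30.5), "M5": (-30.4, -7.2)}

for bench, (last, first) in gating.items():
    print(bench, f"{swing_pp(last, first):+.1f}pp", "flip" if sign_flip(last, first) else "same sign")
```

A sign flip, as for M2, means the two orderings disagree on the direction of the result, so neither delta can be read as the real effect on its own.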
| 131 | +## What can and cannot be concluded |
| 132 | + |
| 133 | +**Supported by the thermally-insensitive (fast, flat) benches** — |
| 134 | +M1, M7, M8, M11, M13 — where Descriptors vs ReactorV2 is within ±5% |
| 135 | +(M1 -1.6%, M7 +3.5%, M8 +1.4%, M11 +0.5%, M13 -4.0%): in paths that do |
| 136 | +not heat the core, **descriptor dispatch/interpreter overhead over |
| 137 | +hand-coded V1 is small.** This is consistent with the Phase 2 stable-AC |
| 138 | +capture and the x64 advisory captures. It does **not** prove "descriptors |
| 139 | +add zero cost" globally — the drift-contaminated long benches are simply |
| 140 | +unmeasurable on this run. |
| 141 | + |
| 142 | +**Unresolved on this capture** (require a thermally-clean re-run): |
| 143 | +M2, M3, M4, M5, M10, M12. M3 +175.3% in particular is **invalidated by |
| 144 | +drift**, not shown to be real — but also not shown to be benign; M3 |
| 145 | +exercises the 3-callback wiring path and deserves a clean measurement. |
| 146 | + |
| 147 | +**Allocation note.** This README interprets timing. Allocation deltas are |
| 148 | +in `summary.md`; they are not the gating axis for §13 Q1 (which keys off |
| 149 | +ns vs ReactorV2). A few are worth a glance on the clean re-run — e.g. |
| 150 | +M7 Descriptors alloc is higher than Today (extra `EventHandlerState` / |
| 151 | +binding state on the V1 path), consistent with the known V1 memory |
| 152 | +profile rather than a new regression. |
| 153 | + |
| 154 | +## Recommendation |
| 155 | + |
| 156 | +1. **Do not cite these primary deltas in §13/§14 spec text.** Treat this |
| 157 | + capture as *inconclusive* for ratification. |
| 158 | +2. **Re-run on `LAPTOP-4MEP83VI` under controlled thermal conditions** |
| 159 | + before closing the §14 gate. Concretely, mitigate the order/thermal |
| 160 | + confound with one or more of: |
| 161 | + - randomize / rotate variant order per launch (or interleave per rep); |
| 162 | + - insert a cooldown (`Start-Sleep`) between variants and between benches; |
| 163 | + - reduce `--iterations` so each bench window is shorter / cooler; |
| 164 | + - capture CPU effective-clock / package-temp telemetry alongside the run |
| 165 | + so "thermal" stops being an inference. |
| 166 | + The clean run must put M2 back under the Q1 threshold (the order-swap |
| 167 | + says it will: Desc ≈ V2 at +1.1%) and give a real M3 number. |
| 168 | +3. Until that clean run lands, the §14 ARM64 gate stays **pending**, with |
| 169 | +   a named owner and date to be appended in the spec. |
| 170 | + |
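One way to rotate variant order per launch, as suggested in item 2, is a cyclic rotation. The sketch below only generates the argument order. It assumes the harness accepts all four variant names, including `Direct`, via `--variant` (only the three Reactor names appear in the command used here), and the helper name is hypothetical.

```python
def rotated_orders(variants, launches):
    """Cyclic rotations of the variant list, one per launch."""
    n = len(variants)
    return [variants[i % n:] + variants[:i % n] for i in range(launches)]


VARIANTS = ["Direct", "ReactorToday", "ReactorV2", "ReactorDescriptors"]

# With as many launches as variants, every variant occupies every position once.
orders = rotated_orders(VARIANTS, launches=4)
for launch, order in enumerate(orders, start=1):
    print(f"launch {launch}: --variant", *order)
```

With only 3 launches the rotation is not fully balanced across 4 variants, so a 4-launch run (or per-rep interleaving) would be needed for each variant to see each position equally often.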
| 171 | +## Files |
| 172 | + |
| 173 | +- `launch-{1,2,3}.jsonl` — primary 3×5 capture (fixed variant order). 780 rows. |
| 174 | +- `summary.md` — aggregator output (per-cell means + 95% CI + Q1 deltas). |
| 175 | +- `confirm-reversed-launch-{1,2,3}.jsonl` — order-swap control |
| 176 | + (Descriptors first), gating benches M1/M2/M5/M7. 180 rows. |
| 177 | +- `aggregate.py` — reads `launch-*.jsonl`; run with no args from this dir. |
| 178 | + |
| 179 | +## Reproduce |
| 180 | + |
| 181 | +```powershell |
| 182 | +dotnet build tests/perf_bench/PerfBench.ControlModel -c Release -p:Platform=ARM64 |
| 183 | +$exe = "tests\perf_bench\PerfBench.ControlModel\bin\ARM64\Release\net10.0-windows10.0.22621.0\PerfBench.ControlModel.exe" |
| 184 | +$out = "docs\specs\047\phase3-results\LAPTOP-4MEP83VI\2026-05-28-phase3-completion-3x5-stableac" |
| 185 | +$results = "tests\perf_bench\PerfBench.ControlModel\bin\ARM64\Release\net10.0-windows10.0.22621.0\results.jsonl" |
| 186 | +for ($i = 1; $i -le 3; $i++) { |
| 187 | + Remove-Item $results -ErrorAction SilentlyContinue |
| 188 | + Start-Process -FilePath $exe -Wait -NoNewWindow # -Wait required; & $exe does not block this WinUI app |
| 189 | + Copy-Item $results "$out\launch-$i.jsonl" |
| 190 | +} |
| 191 | +Push-Location $out; python .\aggregate.py > summary.md; Pop-Location  # run from the capture dir, per the Files note on aggregate.py |
| 192 | + |
| 193 | +# order-swap control: |
| 194 | +for ($i = 1; $i -le 3; $i++) { |
| 195 | + Remove-Item $results -ErrorAction SilentlyContinue |
| 196 | + Start-Process -FilePath $exe -Wait -NoNewWindow -ArgumentList @( |
| 197 | + "--test","M1","M2","M5","M7","--variant","ReactorDescriptors","ReactorV2","ReactorToday") |
| 198 | + Copy-Item $results "$out\confirm-reversed-launch-$i.jsonl" |
| 199 | +} |
| 200 | +``` |
| 201 | + |
| 202 | +## Captures index |
| 203 | + |
| 204 | +- `../../phase2-results/LAPTOP-4MEP83VI/2026-05-26-q1-fastpath-3x5-stableac/` |
| 205 | + — Phase 2 Q1 stable-AC capture (clean; M1 -1.0%, M2 +9.6%). The |
| 206 | + reference for what a thermally-clean ARM64 run looks like. |
| 207 | +- `../CPC-ander-YTZ3O-x64-advisory/2026-05-28-phase3-finish-3x5/` — |
| 208 | + latest x64 Cloud-PC advisory (M3 -1.8%, well within noise) — supports |
| 209 | + the "M3 +175% is contamination" reading but is itself advisory-only. |
| 210 | +- `./` (this dir) — Phase 3 completion ARM64 attempt. **Not ratifying** |
| 211 | + due to thermal/order drift; superseded once a clean ARM64 re-run lands. |