Commit 4fa02f7

spec(047): ARM64 Phase 3 ratification capture - inconclusive (thermal/order drift) (#441)
Adds the ARM64-native perf capture on LAPTOP-4MEP83VI (the Phase 0/2 baseline machine) for the spec 047 §14 ARM64 ratification gate, under docs/specs/047/phase3-results/LAPTOP-4MEP83VI/2026-05-28-phase3-completion-3x5-stableac/.

The fixed variant-ordering run drifted under sustained load (suspected thermal throttling - ReactorDescriptors always runs last), inflating long-bench deltas. A controlled order-swap re-run (Descriptors first) proves the contamination: M2 Descriptors-vs-Today flips +23.4% -> -30.5% and Descriptors-vs-ReactorV2 collapses +36.1% -> +1.1% (no real regression).

Capture is documented as NOT RATIFYING; §14 gate stays pending with the evidence + reproduction needed for a clean re-run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 6d2b1db commit 4fa02f7

10 files changed

Lines changed: 1368 additions & 1 deletion

docs/specs/047-extensible-control-model.md

Lines changed: 1 addition & 1 deletion
@@ -1455,7 +1455,7 @@ ARM64 stable-AC re-capture on `LAPTOP-4MEP83VI` remains deferred for the §14 ra
 - **New regressions vs close-out:** M8 +21.8% (+2.9pp: Lazy*Stack base-derived registration's added is-check in the Update path), M12 +30.7% (+12.2pp: Cloud-PC volatile; M12 has trended ±15pp across the last three captures and should be confirmed on stable AC).
 - **Net headline:** no bench exceeds the §13 Q1 reopen threshold. The structural wins (dispatch consolidation, single `IItemsBinderStrategy` arm) are in place; the absolute Cloud-PC numbers track the close-out baseline.

-**ARM64 stable-AC ratification gate**: pending. The Phase 3 finish §14 close-out is gated on either (a) a re-capture on `LAPTOP-4MEP83VI` landing under `docs/specs/047/phase3-results/`, or (b) a tracking issue with a named owner + target date filed and referenced here. *Owner / date assignment to be appended once filed.*
+**ARM64 stable-AC ratification gate**: **still pending; first capture attempt was inconclusive.** An ARM64-native 3×5 capture on `LAPTOP-4MEP83VI` (the Phase 0/2 baseline machine) landed under [`docs/specs/047/phase3-results/LAPTOP-4MEP83VI/2026-05-28-phase3-completion-3x5-stableac/`](047/phase3-results/LAPTOP-4MEP83VI/2026-05-28-phase3-completion-3x5-stableac/README.md) but **does not ratify the gate**: the fixed variant-ordering run drifted under sustained load (suspected thermal throttling; `ReactorDescriptors` always runs last and so against the hottest core), inflating long-bench deltas (M2 +23.4%, M3 +175.3%, M12 +44.2% vs Today). A controlled **order-swap re-run** (Descriptors first/cold) proves the contamination: M2's Descriptors-vs-Today delta flips from +23.4% to −30.5% (a 54pp position swing), and Descriptors-vs-ReactorV2 collapses from +36.1% to +1.1%, i.e. no real M2 regression. The thermally-insensitive fast benches confirm descriptors ≈ hand-coded V1 (M1/M7/M8/M11/M13 within ±5% vs ReactorV2), and M1's order-robust +30% vs Today is the known V1-protocol-vs-legacy mount overhead, not descriptor-specific. **A thermally-clean ARM64 re-run** (randomized/interleaved variant order, cooldowns, and/or CPU-clock telemetry) is still required to close the gate; until then it remains pending with a named owner + date to be appended. See the capture README for the full drift evidence and reproduction steps.

 **Carry-forward known defects from Phase 1:**
 - **KD-3**: dispatch fast-path for the ported built-ins (M4 was +88.9% V1 vs Today at Phase 1; final advisory shows M4 −21.2% / M5 −24.3% at amortized scope; KD-3 has materially closed at the batch-11 registration set).
Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
# Spec 047 §14 Phase 3 completion: ARM64 ratification capture (LAPTOP-4MEP83VI)

**Result: NOT RATIFIED / INCONCLUSIVE.** This capture is valuable
evidence but does **not** satisfy the ARM64 stable-AC ratification gate.
The fixed variant-ordering run shows strong time/order drift (suspected
thermal throttling) that systematically disadvantages whichever variant
runs last, which is always `ReactorDescriptors`. Under the contaminated
numbers the §13 Q1 gating bench M2 exceeds the 15% threshold; a
controlled order-swap re-run (below) proves that M2 "regression" is a
position artifact, not a real descriptor cost. A thermally-clean Phase 3
ARM64 re-run is still required to formally close the §14 gate.

This is the capture the spec defers to in §14
("ARM64 stable-AC ratification gate: pending"). It was taken on the
authoritative **machine** (`LAPTOP-4MEP83VI`, the Phase 0/2 baseline
machine) but the **run conditions did not stay stable**, so it cannot
stand as the ratifying capture on its own.

## Capture environment

`LAPTOP-4MEP83VI`, ARM64-native (Qualcomm/Snapdragon, ARMv8 64-bit),
Release, .NET 10.0.8, Windows 11 26200. AC power connected (battery 80%),
Windows power plan forced to **High performance** for the run and restored
to Balanced afterward. Branch `spec/047-phase3-completion` @ HEAD
(PR #440). The `PerfBench.ControlModel` harness is **unchanged from
`main` on this branch**: the bench's `DescriptorVariantFactory`
registration set is identical to prior captures (so this measures the
same descriptor interpreter, not the production `RegisterV1BuiltInHandlers`
~76-type table).

3 process launches × 5 reps × 13 benches × 4 variants = 780 measurements
in `launch-1.jsonl` + `launch-2.jsonl` + `launch-3.jsonl`. The
order-swap confirmation adds 180 measurements
(3 launches × 5 reps × 4 benches × 3 variants) in
`confirm-reversed-launch-{1,2,3}.jsonl`.

> **Note on power telemetry.** The bench records `powerState`/`powerPlan`
> as `unknown` (env capture does not read them). The "High performance /
> AC" conditions above are documented manually, not embedded in the JSON.
> No CPU frequency / package-temperature / throttle telemetry was
> captured, so "thermal throttling" below is the **suspected** mechanism
> of the observed time/order drift, not a directly measured fact.

## Headline: V1 ON (descriptors) vs V1 OFF (today), mean of n=15

Primary run, fixed variant order per bench
(Direct → ReactorToday → ReactorV2 → **ReactorDescriptors last**).
Full per-cell table with 95% CI in `summary.md`.

| Bench | Desc vs Today (ns) | Desc vs ReactorV2 (ns) | Trust |
|---|---:|---:|---|
| M1 Mount_Leaf_NoCallback | +30.1% | -1.6% | **High** (fast, stable) |
| M2 Mount_Leaf_OneCallback | +23.4% | +36.1% | **Low** (drift-contaminated) |
| M3 Mount_Leaf_ThreeCallbacks | +175.3% | +119.0% | **Invalid** (drift-contaminated) |
| M4 Dispatch_Switch_Cold | -17.4% | -22.3% | Low (drift) |
| M5 Dispatch_Switch_Warm | -30.4% | -28.9% | Low (drift) |
| M6 Dispatch_ExternalType | -4.0% | -0.8% | High |
| M7 Update_NoChange | +8.9% | +3.5% | **High** (fast, stable) |
| M8 Update_OneLeafChanged | +17.9% | +1.4% | **High** (fast, stable) |
| M9 Update_AllChanged | +3.4% | +1.2% | Medium (long but alloc-bound) |
| M10 EventHandlerState_Alloc | +17.4% | +15.9% | Low (drift) |
| M11 ModifierEHS_Frequency | +11.5% | +0.5% | **High** (fast, stable) |
| M12 Pool_Rent_HotPath | +44.2% | +5.6% | Low (drift) |
| M13 Setters_Suppression | -3.0% | -4.0% | High (correctness bench) |

## Why these numbers are contaminated: the drift evidence

Within a single launch, `ReactorDescriptors` mean ns climbs steeply from
rep0 → rep4 on the long-running benches, while the short benches stay
flat:

| Bench | rep0 → rep4 climb (Descriptors) | per-rep duration |
|---|---:|---|
| M1 | +24% | ~40 µs |
| M2 | +45% | ~95 µs |
| M3 | +55% | ~1.7 ms |
| M4 | +28% | ~100 µs |
| M5 | +30% | ~100 µs |
| M12 | +11% | ~55 µs |
| M7 / M8 / M11 / M13 | ≈flat | ≤10 µs |

The climb broadly tracks bench duration, not the variant: the classic
signature of a CPU shedding clock under sustained load on a fanless /
thermally-limited ARM64 laptop. Because the four variants run
back-to-back within each bench and `ReactorDescriptors` is **always
scheduled last**, it runs against the hottest core in each bench window.
The means are therefore **not independent of run position**.

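The rep0 → rep4 climb can be recomputed from a single launch file. The sketch below is illustrative only: it reuses the `benchId`, `variant`, `meanNs`, and `status` fields that `aggregate.py` reads, and it assumes (this is not confirmed by the capture format) that rows for each (bench, variant) pair appear in rep order, so the rep index is inferred from position. The function name `rep_climb` is hypothetical.

```python
import json
from collections import defaultdict


def rep_climb(jsonl_lines):
    """Return {(benchId, variant): percent change from first to last rep}.

    Assumption: rows for one launch appear in rep order for each
    (benchId, variant); the rep index is inferred from position only.
    """
    series = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if row.get("status") != "ok":
            continue
        series[(row["benchId"], row["variant"])].append(row["meanNs"])

    climbs = {}
    for key, values in series.items():
        # Need at least two reps and a positive baseline to form a ratio.
        if len(values) >= 2 and values[0] > 0:
            climbs[key] = (values[-1] - values[0]) / values[0] * 100.0
    return climbs


if __name__ == "__main__":
    # Synthetic rows for illustration only; not taken from the capture.
    demo = [
        json.dumps({"benchId": "M2", "variant": "ReactorDescriptors",
                    "meanNs": ns, "status": "ok"})
        for ns in (100.0, 110.0, 125.0, 140.0, 145.0)
    ]
    for (bench, variant), pct in rep_climb(demo).items():
        print(f"{bench} {variant}: {pct:+.1f}%")
```

Running it per launch file (for example by passing `open("launch-1.jsonl", encoding="utf-8")`) keeps launches separate, which matters because the drift is a within-launch effect.
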
## Decisive control: order-swap re-run (gating benches)

To separate "real regression" from "position artifact" I re-ran the
§13 Q1 gating benches (M1/M2/M5/M7) with the variant order **reversed**
so `ReactorDescriptors` runs **first / cold** and `ReactorToday` runs
last / hot (`--variant ReactorDescriptors ReactorV2 ReactorToday`,
3 launches). Raw data: `confirm-reversed-launch-{1,2,3}.jsonl`.

| Bench | Desc vs Today (Desc LAST) | Desc vs Today (Desc FIRST) | swing |
|---|---:|---:|---:|
| M1 | +30.1% | +32.5% | +2.4pp (stable) |
| M2 | +23.4% | **-30.5%** | **-54.0pp (sign flip)** |
| M5 | -30.4% | -7.2% | +23.3pp |
| M7 | +8.9% | +128.2% | +119.4pp (see note) |

| Bench | Desc vs ReactorV2 (Desc LAST) | Desc vs ReactorV2 (Desc FIRST) |
|---|---:|---:|
| M1 | -1.6% | +9.2% |
| M2 | **+36.1%** | **+1.1%** |
| M5 | -28.9% | -11.4% |
| M7 | +3.5% | +113.7% (see note) |

**Reading:**

- **M2 is the headline proof.** Its Descriptors-vs-Today delta flips
  from **+23.4%** (Descriptors last) to **−30.5%** (Descriptors first),
  a 54-percentage-point swing driven purely by execution position. The
  order-robust Descriptors-vs-ReactorV2 comparison collapses from +36.1%
  to **+1.1%** when both variants sit in comparable positions. **There
  is no real M2 descriptor regression**: the formal Q1 failure in the
  primary table is a contamination artifact.
- **M1 is order-robust** (+30% vs Today in both orderings), with
  Descriptors ≈ ReactorV2 (within ±10pp). This is the genuine
  **V1-protocol vs legacy mount overhead** seen in every prior capture
  (it is not descriptor-specific; hand-coded V1 pays the same).
- **M5** stays a Descriptors win in both orderings (direction robust).
- **M7 reversed has its own artifact**: a rep0→rep1 step jump
  (~10 µs → ~27 µs) appears for Descriptors in the small-selection
  reversed run (likely JIT tiering / background recompilation specific
  to the reduced job set). The **full-run** M7 (+8.9% vs Today, +3.5%
  vs V2, flat across reps) is the trustworthy M7 number; the reversed
  M7 should be disregarded.

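The swing column is plain arithmetic on the two deltas. A minimal sketch, using the rounded percentages from the order-swap table as inputs; the helper names `pct_delta` and `swing_pp` are hypothetical. Because the table's swings were presumably computed from unrounded means, recomputing from the rounded deltas can differ from the table by about 0.1pp.

```python
def pct_delta(value_ns, baseline_ns):
    """Percent delta of value vs baseline; positive means slower."""
    return (value_ns - baseline_ns) / baseline_ns * 100.0


def swing_pp(delta_last, delta_first):
    """Percentage-point change when Descriptors moves from last to first."""
    return delta_first - delta_last


if __name__ == "__main__":
    # Rounded Desc-vs-Today deltas: (Descriptors last, Descriptors first).
    reported = {"M1": (30.1, 32.5), "M2": (23.4, -30.5), "M5": (-30.4, -7.2)}
    for bench, (last, first) in reported.items():
        sign_flip = (last > 0) != (first > 0)
        print(f"{bench}: swing {swing_pp(last, first):+.1f}pp, "
              f"sign flip: {sign_flip}")
```

Only M2 changes sign between orderings, which is why it is treated as the decisive case.
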
## What can and cannot be concluded

**Supported by the thermally-insensitive (fast, flat) benches**
M1, M7, M8, M11, M13, where Descriptors vs ReactorV2 is within ±5%
(M1 -1.6%, M7 +3.5%, M8 +1.4%, M11 +0.5%, M13 -4.0%): in paths that do
not heat the core, **descriptor dispatch/interpreter overhead over
hand-coded V1 is small.** This is consistent with the Phase 2 stable-AC
capture and the x64 advisory captures. It does **not** prove "descriptors
add zero cost" globally; the drift-contaminated long benches are simply
unmeasurable on this run.

**Unresolved on this capture** (require a thermally-clean re-run):
M2, M3, M4, M5, M10, M12. M3 +175.3% in particular is **invalidated by
drift**, not shown to be real, but also not shown to be benign; M3
exercises the 3-callback wiring path and deserves a clean measurement.

**Allocation note.** This README interprets timing. Allocation deltas are
in `summary.md`; they are not the gating axis for §13 Q1 (which keys off
ns vs ReactorV2). A few are worth a glance on the clean re-run, e.g.
M7 Descriptors alloc is higher than Today (extra `EventHandlerState` /
binding state on the V1 path), consistent with the known V1 memory
profile rather than a new regression.

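Since §13 Q1 keys off the ns delta vs ReactorV2, it helps to be explicit about how a delta maps to a band. The `aggregate.py` in this capture bands on the absolute value of the delta, so a large improvement lands in the same band as a large regression. The sketch below is one possible sign-aware variant, under the assumption (not confirmed by the spec text quoted here) that only slowdowns should count against descriptors. The function name `q1_band` is hypothetical; the band labels are copied from `aggregate.py`.

```python
import math


def q1_band(vs_v2_pct):
    """Classify a Descriptors-vs-ReactorV2 ns delta given in percent.

    Assumption: a negative delta (Descriptors faster) is never a reason
    to prefer hand-coded V1, so it falls in the first band.
    """
    if math.isnan(vs_v2_pct):
        return "-"
    if vs_v2_pct <= 5:
        return "<=5%: ship descriptors"
    if vs_v2_pct <= 15:
        return "5-15%: judgment call"
    return ">15%: ship hand-coded"


if __name__ == "__main__":
    # Deltas taken from the primary headline table (Desc vs ReactorV2).
    for bench, delta in (("M2", 36.1), ("M4", -22.3), ("M12", 5.6)):
        print(f"{bench}: {delta:+.1f}% -> {q1_band(delta)}")
```

With the absolute-value banding, M4 at -22.3% would be reported as ">15%", which is worth keeping in mind when reading the `Q1 band` column in `summary.md`.
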
## Recommendation

1. **Do not cite these primary deltas in §13/§14 spec text.** Treat this
   capture as *inconclusive* for ratification.
2. **Re-run on `LAPTOP-4MEP83VI` under controlled thermal conditions**
   before closing the §14 gate. Concretely, mitigate the order/thermal
   confound with one or more of:
   - randomize / rotate variant order per launch (or interleave per rep);
   - insert a cooldown (`Start-Sleep`) between variants and between benches;
   - reduce `--iterations` so each bench window is shorter / cooler;
   - capture CPU effective-clock / package-temp telemetry alongside the run
     so "thermal" stops being an inference.

   The clean run must put M2 back under the Q1 threshold (the order-swap
   says it will: Desc ≈ V2 at +1.1%) and give a real M3 number.
3. Until that clean run lands, the §14 ARM64 gate stays **pending**, now
   with a named owner/date to be appended in the spec.

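One way to realize the "rotate variant order per launch" mitigation is a simple rotation, so that with three launches and three variants each variant runs in each position exactly once. This is a sketch under assumptions, not the harness's own mechanism: it assumes the bench executes variants in the order they are passed to `--variant`, as the order-swap control suggests, and the helper names `rotated_orders` and `bench_args` are hypothetical.

```python
def rotated_orders(variants, launches):
    """Yield one variant order per launch, rotating the start position."""
    n = len(variants)
    for launch in range(launches):
        shift = launch % n
        yield variants[shift:] + variants[:shift]


def bench_args(tests, order):
    """Build an argument list in the shape used by the order-swap control."""
    return ["--test", *tests, "--variant", *order]


if __name__ == "__main__":
    variants = ["ReactorToday", "ReactorV2", "ReactorDescriptors"]
    for i, order in enumerate(rotated_orders(variants, 3), start=1):
        # Each list could be handed to Start-Process -ArgumentList for launch i.
        print(i, bench_args(["M1", "M2", "M5", "M7"], order))
```

Rotation balances position across launches but does not remove within-launch heating, so it is best combined with cooldowns or clock telemetry from the same list.
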
## Files

- `launch-{1,2,3}.jsonl`: primary 3×5 capture (fixed variant order). 780 rows.
- `summary.md`: aggregator output (per-cell means + 95% CI + Q1 deltas).
- `confirm-reversed-launch-{1,2,3}.jsonl`: order-swap control
  (Descriptors first), gating benches M1/M2/M5/M7. 180 rows.
- `aggregate.py`: reads `launch-*.jsonl`; run with no args from this dir.

## Reproduce

```powershell
dotnet build tests/perf_bench/PerfBench.ControlModel -c Release -p:Platform=ARM64
$exe = "tests\perf_bench\PerfBench.ControlModel\bin\ARM64\Release\net10.0-windows10.0.22621.0\PerfBench.ControlModel.exe"
$out = "docs\specs\047\phase3-results\LAPTOP-4MEP83VI\2026-05-28-phase3-completion-3x5-stableac"
$results = "tests\perf_bench\PerfBench.ControlModel\bin\ARM64\Release\net10.0-windows10.0.22621.0\results.jsonl"
for ($i = 1; $i -le 3; $i++) {
    Remove-Item $results -ErrorAction SilentlyContinue
    Start-Process -FilePath $exe -Wait -NoNewWindow  # -Wait required; & $exe does not block this WinUI app
    Copy-Item $results "$out\launch-$i.jsonl"
}
# aggregate.py reads launch-*.jsonl from the current directory, so run it from $out.
Push-Location $out
python aggregate.py > summary.md
Pop-Location

# order-swap control:
for ($i = 1; $i -le 3; $i++) {
    Remove-Item $results -ErrorAction SilentlyContinue
    Start-Process -FilePath $exe -Wait -NoNewWindow -ArgumentList @(
        "--test","M1","M2","M5","M7","--variant","ReactorDescriptors","ReactorV2","ReactorToday")
    Copy-Item $results "$out\confirm-reversed-launch-$i.jsonl"
}
```

## Captures index

- `../../phase2-results/LAPTOP-4MEP83VI/2026-05-26-q1-fastpath-3x5-stableac/`:
  Phase 2 Q1 stable-AC capture (clean; M1 -1.0%, M2 +9.6%). The
  reference for what a thermally-clean ARM64 run looks like.
- `../CPC-ander-YTZ3O-x64-advisory/2026-05-28-phase3-finish-3x5/`:
  latest x64 Cloud-PC advisory (M3 -1.8%, well within noise). Supports
  the "M3 +175% is contamination" reading but is itself advisory-only.
- `./` (this dir): Phase 3 completion ARM64 attempt. **Not ratifying**
  due to thermal/order drift; superseded once a clean ARM64 re-run lands.
Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
"""Spec 047 §14 Phase 2 (Q1 spike): aggregate launch-N.jsonl into a means
+ 95% CI table per (bench, variant), and emit the Q1 decision-matrix deltas
(ReactorDescriptors vs ReactorV2, ReactorDescriptors vs ReactorToday).

Usage: python aggregate.py   # reads launch-*.jsonl in CWD
"""
import glob
import json
import math
import statistics
from collections import defaultdict


def main():
    rows = []
    for path in sorted(glob.glob("launch-*.jsonl")):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                row = json.loads(line)
                # Skip failed or skipped measurements.
                if row.get("status") != "ok":
                    continue
                rows.append(row)

    # Group by (benchId, variant).
    buckets = defaultdict(list)
    for r in rows:
        buckets[(r["benchId"], r["variant"])].append(r)

    benches = sorted({b for (b, _) in buckets}, key=_bench_key)
    variants = ["ReactorToday", "ReactorV2", "ReactorDescriptors"]

    def summarize(rs, key):
        """Return (mean, 95% CI half-width, n) for one field of a bucket."""
        vals = [r[key] for r in rs]
        if not vals:
            return (math.nan, math.nan, 0)
        mean = statistics.mean(vals)
        if len(vals) > 1:
            stdev = statistics.stdev(vals)
            # 95% CI half-width. The exact Student's t critical value for
            # n=15 (dof=14) is about 2.145; 1.96 (the normal approximation)
            # is used here for simplicity, which understates the half-width
            # by roughly 9% at n=15.
            ci_half = 1.96 * stdev / math.sqrt(len(vals))
        else:
            ci_half = math.nan
        return mean, ci_half, len(vals)

    # ── Per-(bench, variant) summary table. ──
    print("# Per-(bench, variant) means")
    print()
    print("| Bench | Variant | n | Mean ns | 95% CI ±ns | Mean alloc B | 95% CI ±B |")
    print("|---|---|---:|---:|---:|---:|---:|")
    for b in benches:
        for v in variants:
            rs = buckets.get((b, v), [])
            mean_ns, ci_ns, n = summarize(rs, "meanNs")
            mean_b, ci_b, _ = summarize(rs, "allocBytes")
            if n == 0:
                print(f"| {b} | {v} | 0 | - | - | - | - |")
            else:
                print(
                    f"| {b} | {v} | {n} | {mean_ns:,.0f} | {ci_ns:,.0f} "
                    f"| {mean_b:,.0f} | {ci_b:,.0f} |"
                )
        # Empty separator row between benches.
        print("| | | | | | | |")

    # ── Q1 decision-matrix deltas. ──
    print()
    print("# Q1 head-to-head — ReactorDescriptors deltas".replace("—", "-"))
    print()
    print(
        "| Bench | vs ReactorV2 ns | vs ReactorV2 alloc | vs ReactorToday ns | vs ReactorToday alloc | Q1 band |"
    )
    print("|---|---:|---:|---:|---:|---|")
    for b in benches:
        ds = buckets.get((b, "ReactorDescriptors"), [])
        v2 = buckets.get((b, "ReactorV2"), [])
        today = buckets.get((b, "ReactorToday"), [])
        d_ns, _, _ = summarize(ds, "meanNs")
        d_b, _, _ = summarize(ds, "allocBytes")
        v_ns, _, _ = summarize(v2, "meanNs")
        v_b, _, _ = summarize(v2, "allocBytes")
        t_ns, _, _ = summarize(today, "meanNs")
        t_b, _, _ = summarize(today, "allocBytes")

        def pct(a, base):
            # Percent delta of a vs base; NaN if either is missing or base is 0.
            if base and not math.isnan(base) and not math.isnan(a):
                return (a - base) / base * 100.0
            return math.nan

        vs_v2_ns = pct(d_ns, v_ns)
        vs_v2_b = pct(d_b, v_b)
        vs_t_ns = pct(d_ns, t_ns)
        vs_t_b = pct(d_b, t_b)

        # §13 Q1 matrix bands keyed off ns vs V2. Note: abs() puts a large
        # improvement (negative delta) in the same band as a large regression.
        worst = vs_v2_ns
        if math.isnan(worst):
            band = "-"
        elif abs(worst) <= 5:
            band = "<=5%: ship descriptors"
        elif abs(worst) <= 15:
            band = "5-15%: judgment call"
        else:
            band = ">15%: ship hand-coded"

        print(
            f"| {b} | {vs_v2_ns:+.1f}% | {vs_v2_b:+.1f}% | {vs_t_ns:+.1f}% | {vs_t_b:+.1f}% | {band} |"
        )


def _bench_key(s):
    # M1, M2, ..., M13: sort numerically.
    try:
        return int(s.lstrip("M"))
    except ValueError:
        return 999


if __name__ == "__main__":
    main()
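
The CI comment in `summarize` notes that 1.96 stands in for the Student's t critical value. A minimal sketch of the t-based alternative follows, using a small hard-coded table of standard two-sided 95% critical values (the Python standard library has no t quantile function). The names `T_CRIT_95` and `ci_half_width` are hypothetical and are not part of `aggregate.py`; the sample values are synthetic.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for selected degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 4: 2.776, 9: 2.262, 14: 2.145}


def ci_half_width(values, t_crit=None):
    """95% CI half-width of the mean; NaN when fewer than two values."""
    n = len(values)
    if n < 2:
        return math.nan
    if t_crit is None:
        # Fall back to the normal approximation for untabulated dof.
        t_crit = T_CRIT_95.get(n - 1, 1.96)
    return t_crit * statistics.stdev(values) / math.sqrt(n)


if __name__ == "__main__":
    # Synthetic n=15 sample for illustration only.
    sample = [100.0 + i for i in range(15)]
    t_based = ci_half_width(sample)
    z_based = ci_half_width(sample, t_crit=1.96)
    print(f"t-based: ±{t_based:.2f}  normal approx: ±{z_based:.2f}  "
          f"ratio: {t_based / z_based:.3f}")
```

At n=15 the ratio is 2.145 / 1.96, so the intervals in `summary.md` are somewhat narrower than a t-based interval would be; this does not change any of the reported deltas.
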
