
Commit b0c8174

spec-047 §4.9: ARM64 post-Phase-4 micro perf capture (LAPTOP-4MEP83VI) (#465)
* spec-047 §4.9: ARM64 post-Phase-4 micro perf capture (LAPTOP-4MEP83VI)

  Indicative M1–M13 capture on the §4.9 baseline box, ARM64-native/Release/.NET 10.0.8, reps=5, iters matched to the 2026-05-25 baseline (M1–M8 @5000, M9 @2000, M10–M13 @1000). Adds raw JSONL, the aggregator-out tables, RESULTS.md (cross-baseline comparison), and analyze.py.

  Allocation (deterministic, valid: Direct alloc matches baseline byte-for-byte):
  - §15.6 "M1–M3 alloc ≤ Today": M2 −5%, M3 −6% PASS; M1 +20% FAIL.
  - §11.6 byte gates: M3 PASS; M1 (3.2×) and M2 (2.4×) over target.
  - vs baseline ReactorV2: most benches flat/better (M9 −41% standout); M1 +20% and M12 +17% are real, deterministic regressions to investigate. Confirms the KD-3 trigger (M1 over budget).

  NOT a ratification sign-off: §15.5 isolation (AC/High-Perf/DRR/foreground) was not enforced, so the timing axis is environment-throttled (Direct ns +60–140% vs baseline) and must be disregarded; the §4.9 randomized/interleaved ordering + CPU-clock telemetry is not wired; and the macro suite (L1–L14) is unrunnable (its projects were deleted in Phase 4).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* spec-047: record §4.9/§11.6 measured results from the ARM64 capture (PR #465)

  Updates the spec body + both trackers to reflect the indicative LAPTOP-4MEP83VI capture now that the deterministic allocation axis is measured:
  - §4.4/§11.6 byte gates MEASURED: M3 PASS; M1 (3.2×) + M2 (2.4×) FAIL.
  - §15.6 "M1–M3 alloc ≤ Today": M2/M3 PASS, M1 +20.3% FAIL; M12 +17% regressed.
  - KD-3 trigger CONFIRMED (M1 over budget): fold warranted + investigate the bucketing regression.
  - Gate stays OPEN: timing axis throttled (no §15.5 isolation), macro suite unrunnable (projects deleted in Phase 4); needs an isolated re-capture + the M1/M12 alloc fix.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 50667ad commit b0c8174

13 files changed

Lines changed: 737 additions & 15 deletions

docs/specs/047-extensible-control-model.md

Lines changed: 14 additions & 4 deletions
@@ -1530,7 +1530,7 @@ ARM64 stable-AC re-capture on `LAPTOP-4MEP83VI` remains deferred for the §14 ra
15301530
- **New regressions vs close-out:** M8 +21.8% (+2.9pp — Lazy*Stack base-derived registration's added is-check in the Update path), M12 +30.7% (+12.2pp — Cloud-PC volatile; M12 has trended ±15pp across the last three captures and should be confirmed on stable AC).
15311531
- **Net headline:** no bench exceeds the §13 Q1 reopen threshold. The structural wins (dispatch consolidation, single `IItemsBinderStrategy` arm) are in place; the absolute Cloud-PC numbers track the close-out baseline.
15321532

1533-
**ARM64 stable-AC ratification gate** — **still pending; first capture attempt was inconclusive.** An ARM64-native 3×5 capture on `LAPTOP-4MEP83VI` (the Phase 0/2 baseline machine) landed under [`docs/specs/047/phase3-results/LAPTOP-4MEP83VI/2026-05-28-phase3-completion-3x5-stableac/`](047/phase3-results/LAPTOP-4MEP83VI/2026-05-28-phase3-completion-3x5-stableac/README.md) but **does not ratify the gate**: the fixed variant-ordering run drifted under sustained load (suspected thermal throttling — `ReactorDescriptors` always runs last and so against the hottest core), inflating long-bench deltas (M2 +23.4%, M3 +175.3%, M12 +44.2% vs Today). A controlled **order-swap re-run** (Descriptors first/cold) proves the contamination: M2's Descriptors-vs-Today delta flips from +23.4% to −30.5% (a 54pp position swing), and Descriptors-vs-ReactorV2 collapses from +36.1% to +1.1% — i.e. no real M2 regression. The thermally-insensitive fast benches confirm descriptors ≈ hand-coded V1 (M1/M7/M8/M11/M13 within ±5% vs ReactorV2), and M1's order-robust +30% vs Today is the known V1-protocol-vs-legacy mount overhead, not descriptor-specific. **A thermally-clean ARM64 re-run** (randomized/interleaved variant order, cooldowns, and/or CPU-clock telemetry) is still required to close the gate; until then it remains pending with a named owner + date to be appended. See the capture README for the full drift evidence and reproduction steps.
1533+
**ARM64 stable-AC ratification gate** — **still pending; first capture attempt was inconclusive.** An ARM64-native 3×5 capture on `LAPTOP-4MEP83VI` (the Phase 0/2 baseline machine) landed under [`docs/specs/047/phase3-results/LAPTOP-4MEP83VI/2026-05-28-phase3-completion-3x5-stableac/`](047/phase3-results/LAPTOP-4MEP83VI/2026-05-28-phase3-completion-3x5-stableac/README.md) but **does not ratify the gate**: the fixed variant-ordering run drifted under sustained load (suspected thermal throttling — `ReactorDescriptors` always runs last and so against the hottest core), inflating long-bench deltas (M2 +23.4%, M3 +175.3%, M12 +44.2% vs Today). A controlled **order-swap re-run** (Descriptors first/cold) proves the contamination: M2's Descriptors-vs-Today delta flips from +23.4% to −30.5% (a 54pp position swing), and Descriptors-vs-ReactorV2 collapses from +36.1% to +1.1% — i.e. no real M2 regression. The thermally-insensitive fast benches confirm descriptors ≈ hand-coded V1 (M1/M7/M8/M11/M13 within ±5% vs ReactorV2), and M1's order-robust +30% vs Today is the known V1-protocol-vs-legacy mount overhead, not descriptor-specific. **A thermally-clean ARM64 re-run** (randomized/interleaved variant order, cooldowns, and/or CPU-clock telemetry) is still required to close the gate; until then it remains pending with a named owner + date to be appended. See the capture README for the full drift evidence and reproduction steps. **Phase-4 update (PR #465):** a post-Phase-4 capture landed under [`docs/specs/047/phase4-results/LAPTOP-4MEP83VI/2026-05-29-arm64/`](047/phase4-results/LAPTOP-4MEP83VI/2026-05-29-arm64/RESULTS.md); it **still does not close the gate** (same gap — fixed ordering, no §15.5 isolation, so the timing axis is throttled and the macro suite is unrunnable post-Phase-4). 
Its value is the deterministic **allocation** axis: most benches held/improved vs the 2026-05-25 baseline (M9 −41%), but **M1 regressed +20%** (3.2× over its 407 B gate) and **M12 +17%** — so the M1 leaf-alloc work (KD-3 fold + bucketing-regression investigation) is now confirmed as required, ahead of the thermally-clean re-run.
15341534

15351535
**Carry-forward known defects from Phase 1:**
15361536
- **KD-3** — dispatch fast-path for the ported built-ins (M4 was +88.9% V1 vs Today at Phase 1; final advisory shows M4 −21.2% / M5 −24.3% at amortized scope — KD-3 has materially closed at the batch-11 registration set).
@@ -1541,7 +1541,14 @@ ARM64 stable-AC re-capture on `LAPTOP-4MEP83VI` remains deferred for the §14 ra
15411541
**Status: code-complete — migration closed; V1 is the unconditional production
15421542
path.** The only outstanding items are baseline-machine-only (ARM64
15431543
`LAPTOP-4MEP83VI`): the stable-AC perf ratification and the §11.6 hard byte-gate
1544-
*measurement/enforcement*. See the close-out tracker
1544+
*measurement/enforcement*. An **indicative ARM64 capture has landed** (PR #465,
1545+
`047/phase4-results/LAPTOP-4MEP83VI/2026-05-29-arm64/`): the deterministic
1546+
**allocation** axis is measured — M2/M3 meet the §15.6 "≤ Today" budget, **M1
1547+
regressed +20%** (and M1/M2 miss the absolute 407/1,520 B gates; M3 passes), plus
1548+
an **M12 +17%** pool-reuse regression. The **timing** axis (no §15.5 isolation)
1549+
and the **macro suite** (its projects were deleted in Phase 4) remain unratified,
1550+
so the gate is **not yet closed** — it needs an isolated stable-AC re-capture and
1551+
the M1/M12 alloc fix. See the close-out tracker
15451552
[`tasks/047-extensible-control-model-phase4-implementation.md`](tasks/047-extensible-control-model-phase4-implementation.md).
15461553

15471554
- ✅ Delete the private switch. *(Done §4.5 — dispatch is V1 registry →
@@ -1560,8 +1567,11 @@ path.** The only outstanding items are baseline-machine-only (ARM64
15601567
for no-callback / one-callback / three-callback; the stale `≤100 / ≤320 / ≤500`
15611568
estimates predate the Phase-0 baseline capture). *(Code-complete: the bucketed
15621569
`Element` base (§11.7, `ElementExtras`) ships and the target constants are
1563-
landed (`PerformanceBudgets.cs`); the gate **measurement/enforcement** is
1564-
ARM64-baseline-blocked — §4.4/§4.9 handoff.)*
1570+
landed (`PerformanceBudgets.cs`); the gate has now been **MEASURED** on
1571+
`LAPTOP-4MEP83VI` ARM64 (PR #465): **M1 1,289 B (FAIL, 3.2×), M2 3,687 B
1572+
(FAIL, 2.4×), M3 8,530 B (PASS)** per-render. The gates do **not** pass for
1573+
M1/M2 — enforcement stays open pending the M1 leaf-alloc fix + an isolated
1574+
re-capture. §4.4/§4.9 handoff.)*
15651575
- ✅ Document the final author-facing surface in `docs/guide/`. *(Done §4.8.)*
15661576

15671577
### Future: source generation (deferred, no committed timeline)
Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
1+
# Spec-047 Post-Phase-4 Perf Capture — vs 2026-05-25 ARM64 baseline
2+
3+
**Machine:** `LAPTOP-4MEP83VI` (Qualcomm ARMv8, the spec-047 §4.9 baseline box)
4+
**Arch/Runtime:** ARM64-native, Release, .NET 10.0.8 — identical to baseline
5+
**Date:** 2026-05-30 (UTC) · **Branch:** `main` (all of spec-047 incl. Phase 4 merged)
6+
**Suite:** micro M1–M13 (`PerfBench.ControlModel`), reps=5, iters matched to baseline
7+
(M1–M8 @5000, M9 @2000, M10–M13 @1000). 195 rows, 0 errors.
8+
9+
> ⚠️ **Scope caveat — this is an INDICATIVE capture, not the formal §4.9 ratification.**
10+
> The §15.5 environment-isolation requirements (AC power, High-Performance plan, DRR
11+
> off, foreground non-occluded window) could **not** be enforced from this automated
12+
> run, and the harness does not yet implement the §4.9-required randomized/interleaved
13+
> variant ordering + CPU-clock telemetry. **Consequence: the timing (ns) numbers are
14+
> environment-contaminated and must be disregarded** for cross-baseline comparison —
15+
> the `Direct` variant (pure WinUI, *zero* Reactor code) is itself inflated +60–140%
16+
> vs the baseline run, which can only be thermal/power throttling. **The allocation
17+
> (bytes) numbers ARE valid**: managed allocation is deterministic and
18+
> environment-independent — confirmed by `Direct` alloc matching the baseline
19+
> byte-for-byte (M1 Direct = 3,771,824 B in both runs).
20+
21+
---
22+
23+
## Headline findings (allocation — the valid, deterministic axis)
24+
25+
The macro suite (L1–L14: TTFF / working-set / FPS / GC) is **not runnable** — Phase 4
26+
deleted its projects (`StressPerf.ReactorV2`, `BlankReactorV2`). So only the §15.6
27+
micro budgets (per-element alloc M1–M3, dispatch M4–M6, update M7–M8) are covered here.
28+
29+
### 1. §15.6 "M1–M3 per-element alloc must improve/equal Today" — **M1 FAILS**
30+
31+
| Bench | Reactor (new) B/render | Today (base) B/render | Δ vs Today | Verdict |
32+
|---|---:|---:|---:|:--|
33+
| **M1** TextBlock, no callback | **1,289** | 1,071 | **+20.3%** | ❌ **regressed** |
34+
| M2 ToggleSwitch, 1 callback | 3,687 | 3,884 | −5.1% | ✅ improved |
35+
| M3 Button + 2 pointer mods | 8,530 | 9,075 | −6.0% | ✅ improved |
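
The B/render figures in this table appear to be the mean total allocation divided by the iteration count. That reading is inferred from the numbers (the aggregator's M1 Reactor alloc of 6,442,971 B over 5,000 iterations gives 1,289 B), not stated by the capture. A minimal sketch of the arithmetic follows; the helper names are hypothetical and not taken from `analyze.py`.

```python
# Hypothetical helpers; names are illustrative and not taken from analyze.py.

def bytes_per_render(total_alloc_bytes: float, iters: int) -> float:
    """Mean allocation per render, assuming the total is summed over `iters` renders."""
    return total_alloc_bytes / iters


def delta_pct(new: float, base: float) -> float:
    """Signed percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


# bench: (Reactor total alloc from the aggregator's absolute table,
#         iteration count, Today baseline B/render from the table above)
rows = {
    "M1": (6_442_971, 5000, 1_071),
    "M2": (18_436_637, 5000, 3_884),
    "M3": (42_649_842, 5000, 9_075),
}

for bench, (total, iters, today) in rows.items():
    new = bytes_per_render(total, iters)
    print(f"{bench}: {new:,.0f} B/render, {delta_pct(new, today):+.1f}% vs Today")
```

With these inputs the sketch lands on the same rounded values as the table (1,289 / 3,687 / 8,530 B and +20.3% / −5.1% / −6.0%), which supports the per-iteration reading without confirming it.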
36+
37+
### 2. Phase-4 refactor impact: current `Reactor` vs **baseline `ReactorV2`** (same V1 lineage)
38+
39+
This isolates what the post-baseline Phase-4 work (`ElementExtras` bucketing §4.4,
40+
EHS split §4.3, echo hybrid §4.2) did to the V1 path's allocation:
41+
42+
| Bench | new B/render | base-V2 B/render | Δ | Note |
43+
|---|---:|---:|---:|:--|
44+
| **M1** | **1,289** | 1,077 | **+19.6%** | ❌ leanest leaf got **heavier** |
45+
| M2 | 3,687 | 3,864 | −4.6% | ✅ |
46+
| M3 | 8,530 | 8,633 | −1.2% | ≈ flat |
47+
| M4 | 1,941 | 1,998 | −2.8% | ✅ |
48+
| M5 | 1,948 | 2,212 | −11.9% | ✅ |
49+
| M6 | 888 | 941 | −5.6% | ✅ |
50+
| M7 | 252 | 156 | +61.4% | tiny absolute (+96 B) |
51+
| M8 | 362 | 425 | −14.9% | ✅ |
52+
| **M9** | 184,431 | 312,246 | **−40.9%** | ✅ big win (keyed list) |
53+
| M10 | 3,411 | 3,949 | −13.6% | ✅ |
54+
| M11 | 1,641 | 1,670 | −1.7% | ✅ (per-element state) |
55+
| **M12** | 1,273 | 1,088 | **+17.0%** | ❌ pool-reuse regressed |
56+
| M13 | 29 | 29 | −0.4% | ≈ flat |
57+
58+
The M1 regression is **deterministic, not noise**: every new rep (6.34–6.51 MB)
59+
sits uniformly above every baseline rep (5.25–5.42 MB), a consistent ~+212 B/render (1,289 vs 1,077 in the table above).
60+
Likely sources to investigate: the added `Element.Extensions` slot on every element,
61+
the §4.3 EHS-split, or the `ReactorState.PendingEchoMatch` slot on the mount path.
62+
M12 (pool rent/return) similarly regressed +17%.
63+
64+
### 3. §11.6 absolute byte-gate (`PerformanceBudgets.cs`) — **M1, M2 FAIL**
65+
66+
| Bench | Target | Reactor (new) B/render | Pass? |
67+
|---|---:|---:|:---:|
68+
| M1 | ≤ 407 | 1,289 | ❌ (3.2×) |
69+
| M2 | ≤ 1,520 | 3,687 | ❌ (2.4×) |
70+
| M3 | ≤ 19,200 | 8,530 | ✅ |
71+
72+
Note the gate targets were defined as `baseline × 0.4`, but the *measured* ARM64
73+
baselines were ~1,077 / 3,864 / 8,633 — so M1/M2 never had a realistic path to
74+
407/1,520 without the deferred KD-3 binder-check fold + further leaf-alloc work, and
75+
M3's 19,200 target was already cleared at baseline. **The byte gates as written are
76+
not met for M1/M2.** This directly confirms the spec's own KD-3 trigger condition
77+
("fold the M1 leading-`if` binder check … if M1 is still above budget after §4.3/§4.4")
78+
— M1 *is* over budget, so that follow-up is now warranted.
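
The pass/fail column and the "3.2×" / "2.4×" multipliers can be rechecked from the table alone. The sketch below assumes the gate is a simple `measured <= target` comparison; the dict layout and function name are illustrative and do not reflect the actual structure of `PerformanceBudgets.cs`.

```python
# Targets and measurements copied from the byte-gate table above.
# The layout is illustrative; it is not the structure of PerformanceBudgets.cs.
GATES = {  # bench: (target B/render, measured B/render)
    "M1": (407, 1_289),
    "M2": (1_520, 3_687),
    "M3": (19_200, 8_530),
}


def check_gate(target: int, measured: int) -> tuple[bool, float]:
    """Return (passed, measured/target ratio), assuming a `measured <= target` gate."""
    return measured <= target, measured / target


for bench, (target, measured) in GATES.items():
    passed, ratio = check_gate(target, measured)
    print(f"{bench}: {'PASS' if passed else 'FAIL'} ({ratio:.1f}x of target)")
```

M3 passes at well under half its target, which matches the note that its 19,200 B target was already cleared at baseline.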
79+
80+
---
81+
82+
## Timing (ns) — captured but NOT comparable cross-baseline
83+
84+
Disregard for ratification. Evidence of environment contamination (identical `Direct`
85+
code, new vs baseline ns): M3 +139%, M4 +130%, M5 +60%, M7 +940µs absolute swing.
86+
Within-run `Reactor`-vs-`Direct` overhead is directionally consistent with baseline
87+
(Reactor adds dispatch cost on M1–M6, wins big on M7/M9 via pooling) but the absolute
88+
numbers are throttled and should be re-captured under §15.5 isolation before any
89+
timing-budget sign-off.
90+
91+
---
92+
93+
## Bottom line for §4.9
94+
95+
- **Build + capture reproducible on the actual ARM64 baseline box**; allocation is
96+
deterministic and matches baseline `Direct` byte-for-byte.
97+
- **Most of the V1 path held or improved** vs the captured baseline on allocation
98+
(M2/M3/M4/M5/M6/M8/M9/M10/M11), with a standout **−41% on M9** (keyed list).
99+
- **Two allocation regressions to fix before claiming the byte-gate pass:**
100+
**M1 +20%** (and 3.2× over its 407 B gate) and **M12 +17%**.
101+
- **Not a ratification sign-off:** timing axis is environment-throttled, the
102+
§4.9-mandated randomized/interleaved ordering + CPU-clock telemetry isn't wired,
103+
and the macro suite (L1–L14) can't run (projects deleted). A real §4.9 close needs
104+
an isolated stable-AC re-capture (and the macro suite rebuilt against the single
105+
`Reactor` variant).
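
The randomized/interleaved ordering that §4.9 asks for is not shown anywhere in this capture, so the following is only a sketch of the idea: shuffle the variant order independently for each repetition so that no variant always runs last on the hottest core. The function name, the seed, and the schedule shape are assumptions, and the real harness may schedule runs differently.

```python
import random

# Variant names as they appear in the aggregator tables.
VARIANTS = ["Direct", "Today", "Reactor"]


def interleaved_schedule(variants, reps, seed):
    """Return a list of (rep, variant) runs with the variant order shuffled per rep."""
    rng = random.Random(seed)  # seeded so a capture's ordering can be reproduced
    schedule = []
    for rep in range(reps):
        order = list(variants)
        rng.shuffle(order)  # a fresh order for every repetition
        schedule.extend((rep, variant) for variant in order)
    return schedule


# reps=5 matches this capture; the seed value is arbitrary.
schedule = interleaved_schedule(VARIANTS, reps=5, seed=47)
for rep, variant in schedule:
    print(rep, variant)
```

Recording the seed alongside the raw JSONL would let a later run replay the same order when checking for position effects.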
106+
107+
_Raw data: `perfbench-controlmodel-{m1-m8,m9,m10-m13}.jsonl` in this folder.
108+
Analysis: `analyze.py`. Baseline: `docs/specs/047/baseline-results/LAPTOP-4MEP83VI/2026-05-25-arm64/`._

docs/specs/047/phase4-results/LAPTOP-4MEP83VI/2026-05-29-arm64/aggregator-out/excluded.txt

Whitespace-only changes.
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
1+
# Spec 047 §15.6 (a) — Absolute Comparison
2+
3+
Mean ns per op + alloc bytes, per variant. Columns are dashes when a variant has < min-reps repetitions. Architecture column distinguishes ARM64-native from x64-emulated runs (spec §15.5 — non-comparable across architectures).
4+
5+
| Bench | Arch | Direct ns | Today ns | Reactor ns | Direct alloc | Today alloc | Reactor alloc |
6+
|---|---|---:|---:|---:|---:|---:|---:|
7+
| M1 | Arm64 | 35795.2 | 43113.7 | 43820.5 | 3771877 | 6410214 | 6442971 |
8+
| M10 | Arm64 | 33549.5 | 44664.7 | 43450.0 | 2958312 | 3558654 | 3410949 |
9+
| M11 | Arm64 | 33.9 | 38605.7 | 34239.4 | 40 | 1714131 | 1641088 |
10+
| M12 | Arm64 | 25807.4 | 33245.4 | 34545.0 | 760114 | 1306086 | 1273350 |
11+
| M13 | Arm64 | 35.5 | 106.8 | 220.6 | 24040 | 29373 | 29320 |
12+
| M2 | Arm64 | 50476.4 | 112278.4 | 98428.2 | 13425966 | 18317597 | 18436637 |
13+
| M3 | Arm64 | 422189.5 | 378451.1 | 390740.2 | 28890936 | 41535106 | 42649842 |
14+
| M4 | Arm64 | 76426.5 | 144403.7 | 139003.5 | 4674357 | 10708440 | 9707059 |
15+
| M5 | Arm64 | 34110.0 | 157595.6 | 134839.9 | 4674357 | 10736699 | 9741629 |
16+
| M6 | Arm64 | 43099.6 | 59261.6 | 55663.5 | 3869357 | 4181165 | 4438270 |
17+
| M7 | Arm64 | 1737075.3 | 22028.2 | 21858.9 | 122599664 | 996032 | 1258197 |
18+
| M8 | Arm64 | 7617.1 | 9601.4 | 9258.0 | 915536 | 1807872 | 1807872 |
19+
| M9 | Arm64 | 911654.9 | 3102404.2 | 2483238.8 | 96669803 | 368877754 | 368861339 |
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
1+
# Spec 047 §15.6 (b) — Reactor Delta (Reactor vs Today)
2+
3+
Positive % = Reactor slower / larger than Today. Negative = improvement. One row per (bench, architecture).
4+
5+
| Bench | Arch | ns delta % | ns 95% CI half-width | alloc delta % |
6+
|---|---|---:|---:|---:|
7+
| M1 | Arm64 | +1.6% | ±5.1% | +0.5% |
8+
| M10 | Arm64 | -2.7% | ±4.0% | -4.2% |
9+
| M11 | Arm64 | -11.3% | ±4.4% | -4.3% |
10+
| M12 | Arm64 | +3.9% | ±8.5% | -2.5% |
11+
| M13 | Arm64 | +106.6% | ±220.6% | -0.2% |
12+
| M2 | Arm64 | -12.3% | ±6.1% | +0.6% |
13+
| M3 | Arm64 | +3.2% | ±23.9% | +2.7% |
14+
| M4 | Arm64 | -3.7% | ±8.8% | -9.4% |
15+
| M5 | Arm64 | -14.4% | ±16.3% | -9.3% |
16+
| M6 | Arm64 | -6.1% | ±9.2% | +6.1% |
17+
| M7 | Arm64 | -0.8% | ±4.8% | +26.3% |
18+
| M8 | Arm64 | -3.6% | ±4.1% | 0.0% |
19+
| M9 | Arm64 | -20.0% | ±23.3% | 0.0% |
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
1+
# Spec 047 §15.6 (c) — WinUI Gap (Reactor vs Direct)
2+
3+
Absolute overhead Reactor still adds on top of raw WinUI. One row per (bench, architecture).
4+
5+
| Bench | Arch | Reactor ns | Direct ns | Reactor - Direct ns | Reactor alloc - Direct alloc |
6+
|---|---|---:|---:|---:|---:|
7+
| M1 | Arm64 | 43820.5 | 35795.2 | +8025.4 | +2671094 |
8+
| M10 | Arm64 | 43450.0 | 33549.5 | +9900.5 | +452637 |
9+
| M11 | Arm64 | 34239.4 | 33.9 | +34205.5 | +1641048 |
10+
| M12 | Arm64 | 34545.0 | 25807.4 | +8737.6 | +513237 |
11+
| M13 | Arm64 | 220.6 | 35.5 | +185.1 | +5280 |
12+
| M2 | Arm64 | 98428.2 | 50476.4 | +47951.8 | +5010670 |
13+
| M3 | Arm64 | 390740.2 | 422189.5 | -31449.3 | +13758906 |
14+
| M4 | Arm64 | 139003.5 | 76426.5 | +62577.0 | +5032702 |
15+
| M5 | Arm64 | 134839.9 | 34110.0 | +100729.9 | +5067272 |
16+
| M6 | Arm64 | 55663.5 | 43099.6 | +12563.8 | +568914 |
17+
| M7 | Arm64 | 21858.9 | 1737075.3 | -1715216.4 | -121341467 |
18+
| M8 | Arm64 | 9258.0 | 7617.1 | +1640.8 | +892336 |
19+
| M9 | Arm64 | 2483238.8 | 911654.9 | +1571583.9 | +272191536 |
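
Tables (b) and (c) look derivable from the absolute table (a): (b) as a percentage change of Reactor against Today, and (c) as a plain Reactor minus Direct difference. The sketch below checks that reading on the M10 row. The function names are hypothetical, and the aggregator's own code is not shown in this commit.

```python
# M10 row of the absolute table (a): mean ns per op and alloc bytes per variant.
m10 = {
    "Direct": {"ns": 33549.5, "alloc": 2_958_312},
    "Today": {"ns": 44664.7, "alloc": 3_558_654},
    "Reactor": {"ns": 43450.0, "alloc": 3_410_949},
}


def pct(new: float, base: float) -> float:
    """Signed percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


def reactor_delta(row):
    """Table (b): Reactor vs Today as (ns delta %, alloc delta %)."""
    return (
        pct(row["Reactor"]["ns"], row["Today"]["ns"]),
        pct(row["Reactor"]["alloc"], row["Today"]["alloc"]),
    )


def winui_gap(row):
    """Table (c): Reactor minus Direct as (ns, alloc bytes)."""
    return (
        row["Reactor"]["ns"] - row["Direct"]["ns"],
        row["Reactor"]["alloc"] - row["Direct"]["alloc"],
    )


ns_pct, alloc_pct = reactor_delta(m10)
gap_ns, gap_alloc = winui_gap(m10)
print(f"M10 (b): {ns_pct:+.1f}% ns, {alloc_pct:+.1f}% alloc")
print(f"M10 (c): {gap_ns:+.1f} ns, {gap_alloc:+d} B")
```

Some other rows differ from this recomputation in the last digit (M1's gap is listed as +8025.4 ns, while the rounded inputs give 8025.3), presumably because the aggregator subtracts unrounded means; that explanation is an assumption.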
