18 changes: 14 additions & 4 deletions docs/specs/047-extensible-control-model.md
@@ -1530,7 +1530,7 @@ ARM64 stable-AC re-capture on `LAPTOP-4MEP83VI` remains deferred for the §14 ra
- **New regressions vs close-out:** M8 +21.8% (+2.9pp — Lazy*Stack base-derived registration's added is-check in the Update path), M12 +30.7% (+12.2pp — Cloud-PC volatile; M12 has trended ±15pp across the last three captures and should be confirmed on stable AC).
- **Net headline:** no bench exceeds the §13 Q1 reopen threshold. The structural wins (dispatch consolidation, single `IItemsBinderStrategy` arm) are in place; the absolute Cloud-PC numbers track the close-out baseline.

**ARM64 stable-AC ratification gate** — **still pending; first capture attempt was inconclusive.** An ARM64-native 3×5 capture on `LAPTOP-4MEP83VI` (the Phase 0/2 baseline machine) landed under [`docs/specs/047/phase3-results/LAPTOP-4MEP83VI/2026-05-28-phase3-completion-3x5-stableac/`](047/phase3-results/LAPTOP-4MEP83VI/2026-05-28-phase3-completion-3x5-stableac/README.md) but **does not ratify the gate**: the fixed variant-ordering run drifted under sustained load (suspected thermal throttling — `ReactorDescriptors` always runs last and so against the hottest core), inflating long-bench deltas (M2 +23.4%, M3 +175.3%, M12 +44.2% vs Today). A controlled **order-swap re-run** (Descriptors first/cold) proves the contamination: M2's Descriptors-vs-Today delta flips from +23.4% to −30.5% (a 54pp position swing), and Descriptors-vs-ReactorV2 collapses from +36.1% to +1.1% — i.e. no real M2 regression. The thermally-insensitive fast benches confirm descriptors ≈ hand-coded V1 (M1/M7/M8/M11/M13 within ±5% vs ReactorV2), and M1's order-robust +30% vs Today is the known V1-protocol-vs-legacy mount overhead, not descriptor-specific. **A thermally-clean ARM64 re-run** (randomized/interleaved variant order, cooldowns, and/or CPU-clock telemetry) is still required to close the gate; until then it remains pending with a named owner + date to be appended. See the capture README for the full drift evidence and reproduction steps. **Phase-4 update (PR #465):** a post-Phase-4 capture landed under [`docs/specs/047/phase4-results/LAPTOP-4MEP83VI/2026-05-29-arm64/`](047/phase4-results/LAPTOP-4MEP83VI/2026-05-29-arm64/RESULTS.md); it **still does not close the gate** (same gap — fixed ordering, no §15.5 isolation, so the timing axis is throttled and the macro suite is unrunnable post-Phase-4). 
Its value is the deterministic **allocation** axis: most benches held/improved vs the 2026-05-25 baseline (M9 −41%), but **M1 regressed +20%** (3.2× over its 407 B gate) and **M12 +17%** — so the M1 leaf-alloc work (KD-3 fold + bucketing-regression investigation) is now confirmed as required, ahead of the thermally-clean re-run.
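
The order-swap arithmetic above can be checked directly. The following is a minimal sketch that uses only the percentages quoted in this paragraph; the helper name `position_swing_pp` is hypothetical and is not part of the bench harness.

```python
def position_swing_pp(delta_when_last: float, delta_when_first: float) -> float:
    """Percentage-point difference between the same delta measured with the
    variant run last (hot) and with the variant run first (cold)."""
    return delta_when_last - delta_when_first


# M2 Descriptors-vs-Today: +23.4% when Descriptors runs last, -30.5% when it runs first.
m2_vs_today_swing = position_swing_pp(23.4, -30.5)

# M2 Descriptors-vs-ReactorV2: +36.1% when last, +1.1% when first.
m2_vs_v2_swing = position_swing_pp(36.1, 1.1)

print(f"M2 vs Today swing:     {m2_vs_today_swing:.1f}pp")
print(f"M2 vs ReactorV2 swing: {m2_vs_v2_swing:.1f}pp")
```

A swing of this size from reordering alone is larger than either measured delta, which is why the fixed-order numbers cannot be read as a regression signal.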

**Carry-forward known defects from Phase 1:**
- **KD-3** — dispatch fast-path for the ported built-ins (M4 was +88.9% V1 vs Today at Phase 1; final advisory shows M4 −21.2% / M5 −24.3% at amortized scope — KD-3 has materially closed at the batch-11 registration set).
@@ -1541,7 +1541,14 @@ ARM64 stable-AC re-capture on `LAPTOP-4MEP83VI` remains deferred for the §14 ra
**Status: code-complete — migration closed; V1 is the unconditional production
path.** The only outstanding items are baseline-machine-only (ARM64
`LAPTOP-4MEP83VI`): the stable-AC perf ratification and the §11.6 hard byte-gate
*measurement/enforcement*. An **indicative ARM64 capture has landed** (PR #465,
`047/phase4-results/LAPTOP-4MEP83VI/2026-05-29-arm64/`): the deterministic
**allocation** axis is measured — M2/M3 meet the §15.6 "≤ Today" budget, **M1
regressed +20%** (and M1/M2 miss the absolute 407/1,520 B gates; M3 passes), plus
an **M12 +17%** pool-reuse regression. The **timing** axis (no §15.5 isolation)
and the **macro suite** (its projects were deleted in Phase 4) remain unratified,
so the gate is **not yet closed** — it needs an isolated stable-AC re-capture and
the M1/M12 alloc fix. See the close-out tracker
[`tasks/047-extensible-control-model-phase4-implementation.md`](tasks/047-extensible-control-model-phase4-implementation.md).

- ✅ Delete the private switch. *(Done §4.5 — dispatch is V1 registry →
@@ -1560,8 +1567,11 @@ path.** The only outstanding items are baseline-machine-only (ARM64
for no-callback / one-callback / three-callback; the stale `≤100 / ≤320 / ≤500`
estimates predate the Phase-0 baseline capture). *(Code-complete: the bucketed
`Element` base (§11.7, `ElementExtras`) ships and the target constants are
landed (`PerformanceBudgets.cs`); the gate has now been **MEASURED** on
`LAPTOP-4MEP83VI` ARM64 (PR #465): **M1 1,289 B (FAIL, 3.2×), M2 3,687 B
(FAIL, 2.4×), M3 8,530 B (PASS)** per-render. The gates do **not** pass for
M1/M2 — enforcement stays open pending the M1 leaf-alloc fix + an isolated
re-capture. §4.4/§4.9 handoff.)*
- ✅ Document the final author-facing surface in `docs/guide/`. *(Done §4.8.)*

### Future: source generation (deferred, no committed timeline)
@@ -0,0 +1,108 @@
# Spec-047 Post-Phase-4 Perf Capture — vs 2026-05-25 ARM64 baseline

**Machine:** `LAPTOP-4MEP83VI` (Qualcomm ARMv8, the spec-047 §4.9 baseline box)
**Arch/Runtime:** ARM64-native, Release, .NET 10.0.8 — identical to baseline
**Date:** 2026-05-30 (UTC) · **Branch:** `main` (all of spec-047 incl. Phase 4 merged)
**Suite:** micro M1–M13 (`PerfBench.ControlModel`), reps=5, iters matched to baseline
(M1–M8 @5000, M9 @2000, M10–M13 @1000). 195 rows, 0 errors.

> ⚠️ **Scope caveat — this is an INDICATIVE capture, not the formal §4.9 ratification.**
> The §15.5 environment-isolation requirements (AC power, High-Performance plan, DRR
> off, foreground non-occluded window) could **not** be enforced from this automated
> run, and the harness does not yet implement the §4.9-required randomized/interleaved
> variant ordering + CPU-clock telemetry. **Consequence: the timing (ns) numbers are
> environment-contaminated and must be disregarded** for cross-baseline comparison —
> the `Direct` variant (pure WinUI, *zero* Reactor code) is itself inflated +60–140%
> vs the baseline run, which can only be thermal/power throttling. **The allocation
> (bytes) numbers ARE valid**: managed allocation is deterministic and
> environment-independent — confirmed by `Direct` alloc matching the baseline
> byte-for-byte (M1 Direct = 3,771,824 B in both runs).

---

## Headline findings (allocation — the valid, deterministic axis)

The macro suite (L1–L14: TTFF / working-set / FPS / GC) is **not runnable** — Phase 4
deleted its projects (`StressPerf.ReactorV2`, `BlankReactorV2`). So only the §15.6
micro budgets (per-element alloc M1–M3, dispatch M4–M6, update M7–M8) are covered here.

### 1. §15.6 "M1–M3 per-element alloc must improve/equal Today" — **M1 FAILS**

| Bench | Reactor (new) B/render | Today (base) B/render | Δ vs Today | Verdict |
|---|---:|---:|---:|:--|
| **M1** TextBlock, no callback | **1,289** | 1,071 | **+20.3%** | ❌ **regressed** |
| M2 ToggleSwitch, 1 callback | 3,687 | 3,884 | −5.1% | ✅ improved |
| M3 Button + 2 pointer mods | 8,530 | 9,075 | −6.0% | ✅ improved |

### 2. Phase-4 refactor impact: current `Reactor` vs **baseline `ReactorV2`** (same V1 lineage)

This isolates what the post-baseline Phase-4 work (`ElementExtras` bucketing §4.4,
EHS split §4.3, echo hybrid §4.2) did to the V1 path's allocation:

| Bench | new B/render | base-V2 B/render | Δ | Note |
|---|---:|---:|---:|:--|
| **M1** | **1,289** | 1,077 | **+19.6%** | ❌ leanest leaf got **heavier** |
| M2 | 3,687 | 3,864 | −4.6% | ✅ |
| M3 | 8,530 | 8,633 | −1.2% | ≈ flat |
| M4 | 1,941 | 1,998 | −2.8% | ✅ |
| M5 | 1,948 | 2,212 | −11.9% | ✅ |
| M6 | 888 | 941 | −5.6% | ✅ |
| M7 | 252 | 156 | +61.4% | tiny absolute (+96 B) |
| M8 | 362 | 425 | −14.9% | ✅ |
| **M9** | 184,431 | 312,246 | **−40.9%** | ✅ big win (keyed list) |
| M10 | 3,411 | 3,949 | −13.6% | ✅ |
| M11 | 1,641 | 1,670 | −1.7% | ✅ (per-element state) |
| **M12** | 1,273 | 1,088 | **+17.0%** | ❌ pool-reuse regressed |
| M13 | 29 | 29 | −0.4% | ≈ flat |

The M1 regression is **deterministic, not noise**: every new rep (6.34–6.51 MB)
sits uniformly above every baseline rep (5.25–5.42 MB), a consistent ~+212 B/render (1,289 vs 1,077).
Likely sources to investigate: the added `Element.Extensions` slot on every element,
the §4.3 EHS-split, or the `ReactorState.PendingEchoMatch` slot on the mount path.
M12 (pool rent/return) similarly regressed +17%.
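
The "deterministic, not noise" claim rests on the two rep ranges not overlapping at all. A small sketch of that check, using only the MB ranges quoted above (the function name `ranges_separated` is illustrative and is not taken from `analyze.py`):

```python
def ranges_separated(new_range: tuple[float, float],
                     base_range: tuple[float, float]) -> bool:
    """True when the lowest new rep is still above the highest baseline rep."""
    new_min, _new_max = new_range
    _base_min, base_max = base_range
    return new_min > base_max


# M1 total allocation per rep, in MB, as quoted in the paragraph above.
new_reps_mb = (6.34, 6.51)
base_reps_mb = (5.25, 5.42)

separated = ranges_separated(new_reps_mb, base_reps_mb)
# Smallest possible gap between any new rep and any baseline rep.
min_gap_mb = new_reps_mb[0] - base_reps_mb[1]

print(f"separated={separated}, smallest gap={min_gap_mb:.2f} MB")
```

Because even the closest pair of reps differs by most of a megabyte, run-to-run variation cannot account for the shift.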

### 3. §11.6 absolute byte-gate (`PerformanceBudgets.cs`) — **M1, M2 FAIL**

| Bench | Target | Reactor (new) B/render | Pass? |
|---|---:|---:|:---:|
| M1 | ≤ 407 | 1,289 | ❌ (3.2×) |
| M2 | ≤ 1,520 | 3,687 | ❌ (2.4×) |
| M3 | ≤ 19,200 | 8,530 | ✅ |

Note the gate targets were defined as `baseline × 0.4`, but the *measured* ARM64
baselines were ~1,077 / 3,864 / 8,633 — so M1/M2 never had a realistic path to
407/1,520 without the deferred KD-3 binder-check fold + further leaf-alloc work, and
M3's 19,200 target was already cleared at baseline. **The byte gates as written are
not met for M1/M2.** This directly confirms the spec's own KD-3 trigger condition
("fold the M1 leading-`if` binder check … if M1 is still above budget after §4.3/§4.4")
— M1 *is* over budget, so that follow-up is now warranted.
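
The pass/fail column and the overshoot factors in the table above follow from a simple ratio. This sketch hard-codes the targets and measurements from that table; the dictionary and function names are hypothetical, and the real constants live in `PerformanceBudgets.cs`, whose layout is not shown here.

```python
# Targets and measured per-render bytes copied from the §11.6 table above.
GATE_TARGET_B = {"M1": 407, "M2": 1520, "M3": 19200}
MEASURED_B = {"M1": 1289, "M2": 3687, "M3": 8530}


def check_gates(targets: dict[str, int],
                measured: dict[str, int]) -> dict[str, tuple[str, float]]:
    """Return (verdict, measured/target ratio rounded to 1 decimal) per bench."""
    results = {}
    for bench, target in targets.items():
        ratio = measured[bench] / target
        verdict = "PASS" if measured[bench] <= target else "FAIL"
        results[bench] = (verdict, round(ratio, 1))
    return results


gate_results = check_gates(GATE_TARGET_B, MEASURED_B)
for bench, (verdict, ratio) in gate_results.items():
    print(f"{bench}: {verdict} ({ratio}x of target)")
```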

---

## Timing (ns) — captured but NOT comparable cross-baseline

Disregard for ratification. Evidence of environment contamination (identical `Direct`
code, new vs baseline ns): M3 +139%, M4 +130%, M5 +60%, M7 +940µs absolute swing.
Within-run `Reactor`-vs-`Direct` overhead is directionally consistent with baseline
(Reactor adds dispatch cost on M1–M6 and wins big on M7 via pooling; on M9 it beats `Today` by 20.0% but trails `Direct` in this run) but the absolute
numbers are throttled and should be re-captured under §15.5 isolation before any
timing-budget sign-off.

---

## Bottom line for §4.9

- ✅ **Build + capture reproducible on the actual ARM64 baseline box**; allocation is
deterministic and matches baseline `Direct` byte-for-byte.
- ✅ **Most of the V1 path held or improved** vs the captured baseline on allocation
(M2/M3/M4/M5/M6/M8/M9/M10/M11), with a standout **−41% on M9** (keyed list).
- ❌ **Two allocation regressions to fix before claiming the byte-gate pass:**
**M1 +20%** (and 3.2× over its 407 B gate) and **M12 +17%**.
- ⛔ **Not a ratification sign-off:** timing axis is environment-throttled, the
§4.9-mandated randomized/interleaved ordering + CPU-clock telemetry isn't wired,
and the macro suite (L1–L14) can't run (projects deleted). A real §4.9 close needs
an isolated stable-AC re-capture (and the macro suite rebuilt against the single
`Reactor` variant).

_Raw data: `perfbench-controlmodel-{m1-m8,m9,m10-m13}.jsonl` in this folder.
Analysis: `analyze.py`. Baseline: `docs/specs/047/baseline-results/LAPTOP-4MEP83VI/2026-05-25-arm64/`._
@@ -0,0 +1,19 @@
# Spec 047 §15.6 (a) — Absolute Comparison

Mean ns per op and alloc bytes, per variant. A cell shows a dash when a variant has fewer than min-reps repetitions. The Arch column distinguishes ARM64-native from x64-emulated runs (spec §15.5: results are not comparable across architectures).

| Bench | Arch | Direct ns | Today ns | Reactor ns | Direct alloc | Today alloc | Reactor alloc |
|---|---|---:|---:|---:|---:|---:|---:|
| M1 | Arm64 | 35795.2 | 43113.7 | 43820.5 | 3771877 | 6410214 | 6442971 |
| M10 | Arm64 | 33549.5 | 44664.7 | 43450.0 | 2958312 | 3558654 | 3410949 |
| M11 | Arm64 | 33.9 | 38605.7 | 34239.4 | 40 | 1714131 | 1641088 |
| M12 | Arm64 | 25807.4 | 33245.4 | 34545.0 | 760114 | 1306086 | 1273350 |
| M13 | Arm64 | 35.5 | 106.8 | 220.6 | 24040 | 29373 | 29320 |
| M2 | Arm64 | 50476.4 | 112278.4 | 98428.2 | 13425966 | 18317597 | 18436637 |
| M3 | Arm64 | 422189.5 | 378451.1 | 390740.2 | 28890936 | 41535106 | 42649842 |
| M4 | Arm64 | 76426.5 | 144403.7 | 139003.5 | 4674357 | 10708440 | 9707059 |
| M5 | Arm64 | 34110.0 | 157595.6 | 134839.9 | 4674357 | 10736699 | 9741629 |
| M6 | Arm64 | 43099.6 | 59261.6 | 55663.5 | 3869357 | 4181165 | 4438270 |
| M7 | Arm64 | 1737075.3 | 22028.2 | 21858.9 | 122599664 | 996032 | 1258197 |
| M8 | Arm64 | 7617.1 | 9601.4 | 9258.0 | 915536 | 1807872 | 1807872 |
| M9 | Arm64 | 911654.9 | 3102404.2 | 2483238.8 | 96669803 | 368877754 | 368861339 |
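
The per-render byte figures quoted in the capture write-up appear to be these alloc totals divided by the per-bench iteration count. That relationship is an inference from the numbers rather than something the tables state, so treat the sketch below as a consistency check; the function names are hypothetical.

```python
def iterations_for(bench: str) -> int:
    """Iteration counts from the capture header: M1-M8 @5000, M9 @2000, M10-M13 @1000."""
    n = int(bench.lstrip("M"))
    if n <= 8:
        return 5000
    if n == 9:
        return 2000
    return 1000


# Reactor alloc totals copied from the table above.
REACTOR_ALLOC_TOTAL = {
    "M1": 6442971, "M2": 18436637, "M3": 42649842,
    "M7": 1258197, "M9": 368861339, "M12": 1273350,
}


def bytes_per_render(bench: str, total_bytes: int) -> int:
    """Assumed conversion: total allocated bytes divided by iterations, rounded."""
    return round(total_bytes / iterations_for(bench))


per_render = {b: bytes_per_render(b, t) for b, t in REACTOR_ALLOC_TOTAL.items()}
print(per_render)
```

Under that assumption the results line up with the write-up's per-render values for the same benches (for example 1,289 B for M1 and 184,431 B for M9).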
@@ -0,0 +1,19 @@
# Spec 047 §15.6 (b) — Reactor Delta (Reactor vs Today)

Positive % = Reactor slower / larger than Today. Negative = improvement. One row per (bench, architecture).

| Bench | Arch | ns delta % | ns 95% CI half-width | alloc delta % |
|---|---|---:|---:|---:|
| M1 | Arm64 | +1.6% | ±5.1% | +0.5% |
| M10 | Arm64 | -2.7% | ±4.0% | -4.2% |
| M11 | Arm64 | -11.3% | ±4.4% | -4.3% |
| M12 | Arm64 | +3.9% | ±8.5% | -2.5% |
| M13 | Arm64 | +106.6% | ±220.6% | -0.2% |
| M2 | Arm64 | -12.3% | ±6.1% | +0.6% |
| M3 | Arm64 | +3.2% | ±23.9% | +2.7% |
| M4 | Arm64 | -3.7% | ±8.8% | -9.4% |
| M5 | Arm64 | -14.4% | ±16.3% | -9.3% |
| M6 | Arm64 | -6.1% | ±9.2% | +6.1% |
| M7 | Arm64 | -0.8% | ±4.8% | +26.3% |
| M8 | Arm64 | -3.6% | ±4.1% | 0.0% |
| M9 | Arm64 | -20.0% | ±23.3% | 0.0% |
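
The delta columns can be reproduced from the §15.6 (a) absolute table with a plain relative difference. A minimal sketch, assuming the deltas are computed as `(Reactor - Today) / Today`; the helper name `delta_pct` is illustrative and not taken from `analyze.py`.

```python
def delta_pct(reactor: float, today: float) -> float:
    """Relative difference of Reactor against Today, in percent."""
    return (reactor - today) / today * 100.0


# (Today, Reactor) pairs copied from the §15.6 (a) absolute table.
alloc = {"M1": (6410214, 6442971), "M4": (10708440, 9707059), "M7": (996032, 1258197)}
ns = {"M2": (112278.4, 98428.2), "M9": (3102404.2, 2483238.8)}

alloc_delta = {b: round(delta_pct(r, t), 1) for b, (t, r) in alloc.items()}
ns_delta = {b: round(delta_pct(r, t), 1) for b, (t, r) in ns.items()}

print("alloc delta %:", alloc_delta)
print("ns delta %:   ", ns_delta)
```

Note that these deltas compare `Reactor` with the `Today` variant from the same run, whereas the capture write-up's +20.3% for M1 is measured against the 2026-05-25 baseline `Today`, so the two M1 figures are not in conflict.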
@@ -0,0 +1,19 @@
# Spec 047 §15.6 (c) — WinUI Gap (Reactor vs Direct)

Absolute overhead Reactor still adds on top of raw WinUI. One row per (bench, architecture).

| Bench | Arch | Reactor ns | Direct ns | Reactor - Direct ns | Reactor alloc - Direct alloc |
|---|---|---:|---:|---:|---:|
| M1 | Arm64 | 43820.5 | 35795.2 | +8025.4 | +2671094 |
| M10 | Arm64 | 43450.0 | 33549.5 | +9900.5 | +452637 |
| M11 | Arm64 | 34239.4 | 33.9 | +34205.5 | +1641048 |
| M12 | Arm64 | 34545.0 | 25807.4 | +8737.6 | +513237 |
| M13 | Arm64 | 220.6 | 35.5 | +185.1 | +5280 |
| M2 | Arm64 | 98428.2 | 50476.4 | +47951.8 | +5010670 |
| M3 | Arm64 | 390740.2 | 422189.5 | -31449.3 | +13758906 |
| M4 | Arm64 | 139003.5 | 76426.5 | +62577.0 | +5032702 |
| M5 | Arm64 | 134839.9 | 34110.0 | +100729.9 | +5067272 |
| M6 | Arm64 | 55663.5 | 43099.6 | +12563.8 | +568914 |
| M7 | Arm64 | 21858.9 | 1737075.3 | -1715216.4 | -121341467 |
| M8 | Arm64 | 9258.0 | 7617.1 | +1640.8 | +892336 |
| M9 | Arm64 | 2483238.8 | 911654.9 | +1571583.9 | +272191536 |