Agent-eval: close the Reactor → HTML gap (6 proposals) #226

## Context

Agent evals on `gpt-5.5` after the work in #225 (plugin-format skills, `mur pack-local`, `dotnet new reactorapp`) put **`reactor-calc`** at $3.30 / 226 s / 393 K tokens / 11 turns (mean of 5 runs), with **first build success 5/5** and wall-time CV down from 40 % to 8 %. The Reactor / WinUI gap closed to **1.00× tokens, 0.62× wall** on calc and **1.20× tokens, 0.72× wall** on kanban: Reactor is now **faster than XAML at equal cost on calc**, and faster at 1.11× the cost on kanban.

The remaining ceiling is **HTML** at $1.68 / 88 s for calc and $1.98 / 170 s for kanban. The Reactor → HTML gap is **~3× tokens, ~2.6× wall** on calc and **~4× tokens, ~2.8× wall** on kanban.

| eval | wall (CV) | turns | tokens | cost USD | LoC | first build |
|---|---:|---:|---:|---:|---:|:---:|
| html-calc | 88 s (33 %) | 5.6 | 131 K | $1.68 | 305 | n/a |
| html-kanban | 170 s (27 %) | 6.6 | 183 K | $1.98 | 703 | n/a |
| reactor-calc | 226 s (8 %) | 11.0 | 393 K | $3.30 | 163 | 5/5 ✓ |
| reactor-kanban | 471 s (21 %) | 16.8 | 738 K | $5.04 | 277 | 5/5 ✓ |
| winui-xaml-calc | 363 s (62 %) | 11.2 | 394 K | $3.30 | 334 | n/a |
| winui-xaml-kanban | 651 s (31 %) | 13.6 | 615 K | $4.56 | 590 | n/a |

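The structural overhead in Reactor is **~270 K tokens per kanban run** on top of an HTML-equivalent baseline of ~130 K, using the ~400 K typical-run estimate broken down in the next section. That's the remaining gap.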
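
The ratios quoted in this section can be recomputed from the table. The following is a minimal sketch; the `means` object and the `ratio` helper are illustrative names, not part of the eval harness, and the numbers are copied by hand from the table.

```typescript
// Means copied from the results table (wall in seconds, tokens in thousands).
interface EvalMean { wallS: number; tokensK: number; costUsd: number }
interface Ratios { wall: number; tokens: number; cost: number }

const means: Record<string, EvalMean> = {
  "html-calc":         { wallS: 88,  tokensK: 131, costUsd: 1.68 },
  "html-kanban":       { wallS: 170, tokensK: 183, costUsd: 1.98 },
  "reactor-calc":      { wallS: 226, tokensK: 393, costUsd: 3.30 },
  "reactor-kanban":    { wallS: 471, tokensK: 738, costUsd: 5.04 },
  "winui-xaml-calc":   { wallS: 363, tokensK: 394, costUsd: 3.30 },
  "winui-xaml-kanban": { wallS: 651, tokensK: 615, costUsd: 4.56 },
};

// Ratio of one eval to a baseline eval, rounded to two decimals.
function ratio(evalName: string, baseline: string): Ratios {
  const a = means[evalName];
  const b = means[baseline];
  const r = (x: number, y: number) => Math.round((x / y) * 100) / 100;
  return {
    wall: r(a.wallS, b.wallS),
    tokens: r(a.tokensK, b.tokensK),
    cost: r(a.costUsd, b.costUsd),
  };
}

for (const app of ["calc", "kanban"]) {
  console.log(app, "vs WinUI:", ratio(`reactor-${app}`, `winui-xaml-${app}`));
  console.log(app, "vs HTML: ", ratio(`reactor-${app}`, `html-${app}`));
}
```

With these inputs, Reactor kanban comes out at 0.72× wall and 1.20× tokens against WinUI, and at 2.77× wall and 4.03× tokens against HTML.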

## Where the gap actually lives (event-log evidence)

A typical Reactor-kanban run breaks down as:

| category | turns | tokens (est.) |
|---|---:|---:|
| skill load | 1 | 5 K |
| `dotnet new` + scaffold inspect | 2-3 | 25 K |
| Sample-app reads (drag, dialog, flyout, context) | 3-5 | 80 K |
| `reactor.api.txt` ripgreps | 4-8 | 60 K |
| Apply-patch implementation | 3-4 | 80 K |
| Build + fix cycles | 2-4 | 150 K |
| **total** | **~17** | **~400 K** |

HTML's equivalent: ~6 turns, ~130 K tokens, no scaffold, no API exploration, no build cycles.
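
To see which categories dominate, the per-category estimates can be turned into shares of the typical run. This is a small sketch over the table's own numbers; the variable names are illustrative only.

```typescript
// Token estimates (in thousands) copied from the breakdown table.
const breakdownK: Array<[string, number]> = [
  ["skill load", 5],
  ["dotnet new + scaffold inspect", 25],
  ["sample-app reads", 80],
  ["reactor.api.txt ripgreps", 60],
  ["apply-patch implementation", 80],
  ["build + fix cycles", 150],
];

// Sum of all category estimates.
const totalK = breakdownK.reduce((sum, [, k]) => sum + k, 0);

// Share of the run's tokens per category, as a percentage with one decimal.
const shares = breakdownK.map(([category, k]) => ({
  category,
  tokensK: k,
  sharePct: Math.round((k / totalK) * 1000) / 10,
}));

console.log(`total: ${totalK} K`);
for (const s of shares) {
  console.log(`${s.category}: ${s.tokensK} K (${s.sharePct} %)`);
}
```

The estimates sum to 400 K, and build + fix cycles alone account for 37.5 % of that.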

## Proposals (ranked by predicted impact × feasibility)

### [ ] 1. Bigger, richer `dotnet new reactorapp` template — predicted ~−25 % tokens

Today the template generates a 12-line counter. Make it a multi-component app with the *shapes* the agent needs:

- `App` root with `UseReducer` + a typed `record` state
- A `Component<TProps>` child
- A `.Provide(Ctx, ...)` example
- `// REPLACE WITH YOUR LOGIC` comments at all the right places

**Rationale:** The agent reads the scaffold once (cheap, single `view`) and gets all structural patterns in-workspace. WinUI rarely loads its design skill body because the 30+ files of MVVM scaffolding from `dotnet new winui-mvvm` *are* the documentation. Net estimate: saves ~3-4 turns × ~30 K context ≈ 90-120 K tokens per kanban run.

**First step:** Edit `tools/Templates/templates/WinUIApp-CSharp/` to add a `Components/` directory, a `Models.cs` with a record state, and inline-comment guidance.

### [ ] 2. Inline a generated cheatsheet *into the workspace at scaffold time* — predicted −10-15 % tokens

When `dotnet new reactorapp` runs, drop `_reactor-api-cheatsheet.md` (top 80 % of factories/modifiers/hooks, ~100 lines) alongside `Program.cs`. Auto-include it in the csproj as `<None>` so the build ignores it.

**Rationale:** Cost-of-context per turn is the dominant lever. Moving signatures from "ripgrep'd 6-8× per session" to "one tiny file read at scaffold time" is a 2-3× compression of that overhead. Generate from the same source as `reactor.api.txt` to avoid drift.

**First step:** Add cheatsheet emission to `tools/Reactor.SignaturesGen/Program.cs` (it already writes `reactor.api.txt` to two paths; add a third for the cheatsheet) and reference it from the template.
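
One possible reading of "top 80 %" is "the most-used signatures that together cover 80 % of observed usage, capped at ~100 lines". The sketch below illustrates that selection rule only. It is written in TypeScript for illustration, whereas the real generator is the C# tool named above; the `SignatureUsage` shape, the `pickCheatsheet` helper, and the usage counts are all hypothetical.

```typescript
// Hypothetical input shape: one API signature plus how often sample apps use it.
interface SignatureUsage { signature: string; uses: number }

// Pick the most-used signatures until they cover `coverage` of all uses,
// never exceeding `maxLines` entries.
function pickCheatsheet(
  all: SignatureUsage[],
  coverage = 0.8,
  maxLines = 100,
): SignatureUsage[] {
  const ranked = [...all].sort((a, b) => b.uses - a.uses);
  const totalUses = ranked.reduce((sum, s) => sum + s.uses, 0);
  const picked: SignatureUsage[] = [];
  let covered = 0;
  for (const s of ranked) {
    if (picked.length >= maxLines || covered >= coverage * totalUses) break;
    picked.push(s);
    covered += s.uses;
  }
  return picked;
}

// Made-up usage counts, for illustration only.
const sampleUsage: SignatureUsage[] = [
  { signature: "ObscureHook (hypothetical)", uses: 3 },
  { signature: "UseState<T>", uses: 40 },
  { signature: "Component<TProps>", uses: 11 },
  { signature: "UseReducer", uses: 25 },
  { signature: "RareModifier (hypothetical)", uses: 5 },
  { signature: ".Provide(Ctx, ...)", uses: 16 },
];

const cheatsheet = pickCheatsheet(sampleUsage);
console.log(cheatsheet.map((s) => s.signature));
```

With these made-up counts the three most-used entries already cover 81 of 100 uses, so the remaining three are left for on-demand lookup in `reactor.api.txt`.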

### [ ] 5. Make `mur check` the default verification, not `dotnet build` — predicted −10-20 % tokens

Currently the agent runs `dotnet build`, which dumps 1.5-3 K tokens of MSBuild output per build. `mur check` returns ~50-150 tokens with skill-file pointers. With 2-4 build cycles per session at 17 turns, that is **up to ~150 K cache reads saved per run** (~9 K saved per turn × 17 turns).

Two-tier: `mur check` first, `dotnet build` only if `check` is clean but a deeper compile error is suspected.

**First step:** Update the eval prompt (`evals/lib/flavor-reactor.ts`) and the `reactor-build-and-check` skill to lead with `mur check`.
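
The two-tier order can be spelled out as a small decision function. In practice this lives in prompt and skill text rather than code, so treat the following as a sketch of the intended order only. The `Runner` type, the `verify` helper, and the assumption that `mur check` reports problems through a non-zero exit code are all illustrative and not confirmed by the CLI.

```typescript
// Result of running one command. The runner is injected so the flow can be
// exercised without the real CLIs installed.
interface RunResult { exitCode: number; output: string }
type Runner = (command: string, args: string[]) => RunResult;

interface Verification {
  tier: "mur check" | "dotnet build";
  ok: boolean;
  output: string;
}

// Tier 1: `mur check` (short output). Tier 2: `dotnet build`, only when tier 1
// is clean and the caller still suspects a deeper compile error.
function verify(run: Runner, suspectDeeperError: boolean): Verification {
  const check = run("mur", ["check"]);
  if (check.exitCode !== 0 || !suspectDeeperError) {
    return { tier: "mur check", ok: check.exitCode === 0, output: check.output };
  }
  const build = run("dotnet", ["build"]);
  return { tier: "dotnet build", ok: build.exitCode === 0, output: build.output };
}

// Fake runner for illustration: records each command and always succeeds.
const calls: string[] = [];
const fakeRunner: Runner = (command, args) => {
  calls.push([command, ...args].join(" "));
  return { exitCode: 0, output: "ok" };
};

console.log(verify(fakeRunner, false).tier);
console.log(calls);
```

Injecting the runner keeps the sketch testable; a real runner might wrap a process-spawning call and pass the captured output through unchanged.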

> **Recommended cut**: implement #1 + #2 + #5 together; they are independent and complementary. Predicted cumulative: kanban tokens 738 K → ~480 K, cost $5.04 → $3.30. Even at 2× pessimism, ~600 K is reachable, which would put Reactor under WinUI's 615 K for the first time.
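
The pessimistic figure can be checked with a one-line calculation. This sketch reads "2× pessimism" as "only half of the predicted saving materialises", which is an assumption about the intended meaning; `atPessimism` is an illustrative helper name.

```typescript
// Scale a predicted saving down by `factor` (2 means half the saving lands).
function atPessimism(current: number, predicted: number, factor: number): number {
  return current - (current - predicted) / factor;
}

const pessimisticTokensK = atPessimism(738, 480, 2);
const pessimisticCostUsd = atPessimism(5.04, 3.30, 2);

console.log(`tokens: ~${Math.round(pessimisticTokensK)} K (WinUI kanban: 615 K)`);
console.log(`cost: ~$${pessimisticCostUsd.toFixed(2)} (WinUI kanban: $4.56)`);
```

Under that reading the pessimistic case is ~609 K tokens and ~$4.17, so the margin against WinUI's 615 K is narrow on tokens and wider on cost.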

### [ ] 3. Defer-everything skill loading — predicted −5-10 % tokens, +5 % build-failure risk

Drop the always-loaded `reactor-getting-started` body. Keep only a 200-token stub that points the agent at `skill reactor-getting-started` on demand. Cache reads are paid every turn; a 5 K-token always-loaded skill costs 85 K cache reads per kanban run (5 K × 17 turns).
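
The cache-read arithmetic follows from a simple model, sketched below: content that enters the context on some turn is assumed to be read once on that turn and once on every later turn. That model and the `cacheReads` helper are simplifying assumptions for illustration, not a description of how the provider actually bills.

```typescript
// Tokens read over a session for content that first appears on `firstTurn`
// (1-based) and stays in context until the final turn.
function cacheReads(tokens: number, firstTurn: number, totalTurns: number): number {
  const turnsPresent = Math.max(0, totalTurns - firstTurn + 1);
  return tokens * turnsPresent;
}

const turns = 17; // typical Reactor kanban run, per the breakdown table

const alwaysLoaded = cacheReads(5000, 1, turns); // full skill body from turn 1
const stubOnly = cacheReads(200, 1, turns);      // 200-token stub from turn 1
// If the agent still pulls the full body on demand, say at turn 9:
const stubPlusLateLoad = stubOnly + cacheReads(5000, 9, turns);

console.log({ alwaysLoaded, stubOnly, stubPlusLateLoad });
```

Under this model the always-loaded body costs 85,000 reads, the stub alone 3,400, and a stub plus an on-demand load at turn 9 costs 48,400, so the saving depends on how often and how early the agent ends up loading the full skill anyway.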

**Risk:** First-build success rate may regress if the agent guesses API names without skill reference. **Don't ship if first-build OK rate drops below 90 %** — A/B 5×N batch first.
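
The ship gate can be written down as a check over per-run first-build outcomes. This is a sketch with illustrative names (`firstBuildRate`, `canShip`); only the 90 % threshold comes from the text.

```typescript
// One entry per eval run: did the first build succeed?
function firstBuildRate(firstBuildOk: boolean[]): number {
  if (firstBuildOk.length === 0) throw new Error("no runs to evaluate");
  return firstBuildOk.filter(Boolean).length / firstBuildOk.length;
}

// Ship only if the treatment arm stays at or above the threshold.
function canShip(treatmentRuns: boolean[], threshold = 0.9): boolean {
  return firstBuildRate(treatmentRuns) >= threshold;
}

const fiveClean = [true, true, true, true, true];
const fiveWithOneFailure = [true, true, true, true, false];
const tenWithOneFailure = [...fiveClean, ...fiveWithOneFailure];

console.log(canShip(fiveClean));          // 5/5 passes the gate
console.log(canShip(fiveWithOneFailure)); // 4/5 = 80 % fails the gate
console.log(canShip(tenWithOneFailure));  // 9/10 = 90 % passes the gate
```

With only 5 runs per arm, a single first-build failure already drops the rate to 80 %, so the batch needs at least 10 runs per arm for the 90 % gate to tolerate one failure.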

### [ ] 4. "Generate then port" — skill-directed two-pass authoring — medium win, +1-2 turns of cheaper turns

Skill text directs the agent: *for any non-trivial UI, sketch the component tree in JSX-like pseudocode in your head, then translate component-by-component to Reactor C# (1 line of pseudocode → 1 line of Reactor)*. The Rosetta-stone table is already there; emphasize the *pseudocode-first* workflow so React priors carry the design phase.

**First step:** Add to `reactor-getting-started`: a "## Authoring workflow: design in React, write in C#" section with one worked 5-line React → 8-line Reactor example.

### [ ] 6. C# / DSL ergonomics — small win on output tokens; longer-term lever

a. **Components-as-records:** `record App() : Component { override Render() => ... }` saves ~15 chars per component; marginal, but it compounds.
b. **Verify implicit `using static Microsoft.UI.Reactor.Factories`** is in the template's `GlobalUsings.cs` and document it in the skill (so the agent doesn't redundantly add it). **Documentation only; do this now.**
c. **`UseState<T>`-as-property syntax** (long-term framework work): `[Stateful] partial class App { State<int> Count = 0; }` needs source generator + analyzer support. Defer.

## Honest ceiling

**Hard floor is set by:** one build cycle (~50-80 K tokens HTML doesn't pay), some skill content load on first attempt (~30-60 K one-time), occasional ripgrep of less-common patterns (~10-30 K). Lower-bound estimate with all 6 ideas: kanban ~250-300 K tokens (vs HTML's 183 K). **~1.5× HTML is about as close as we can realistically land** while keeping correctness checks.

**Why "generate HTML, then port" doesn't pencil out:** HTML pass + port pass ≈ 380 K tokens / $5.50 / 340 s against today's 738 K / $5.04 / 471 s on kanban. Tokens and wall time improve, but cost is no better and you carry porting-fidelity risk. The mental version of this (idea #4) captures most of the benefit without the second pass.

**Aim:** match WinUI XAML on cost (done: 1.00× on calc, 1.11× on kanban) and land within 1.5×–2× of HTML as the realistic ceiling.

## Pointers

- Source for these numbers: `C:\Users\andersonch\Code\TokenCountTest\REPORT-2026-05-09.md` (5×6 batch on `gpt-5.5`, all 30 runs successful)
- Plugin-format skills + local NuGet flow + `dotnet new reactorapp`: #225 (merged)
