Agent-eval: close the Reactor → HTML gap (6 proposals) #226

@codemonkeychris

Context

Agent evals on gpt-5.5 after the work in #225 (plugin-format skills, mur pack-local, dotnet new reactorapp) put reactor-calc at $3.30 / 226 s / 393 K tokens / 11 turns (5-run mean), with first-build success 5/5 and CV down from 40 % to 8 %. Against WinUI XAML, Reactor now sits at 1.00× tokens and 0.62× wall on calc, and 1.20× tokens and 0.72× wall on kanban. Reactor is now faster than XAML at roughly equal cost (1.00× on calc, 1.11× on kanban).

The remaining ceiling is HTML at $1.68 / 88 s for calc and $1.98 / 170 s for kanban. The Reactor → HTML gap is ~3× tokens and ~2.6× wall on calc, and ~4× tokens and ~2.8× wall on kanban.

| eval | wall (CV) | turns | tokens | cost USD | LoC | first build |
|---|---|---|---|---|---|---|
| html-calc | 88 s (33 %) | 5.6 | 131 K | $1.68 | 305 | n/a |
| html-kanban | 170 s (27 %) | 6.6 | 183 K | $1.98 | 703 | n/a |
| reactor-calc | 226 s (8 %) | 11.0 | 393 K | $3.30 | 163 | 5/5 ✓ |
| reactor-kanban | 471 s (21 %) | 16.8 | 738 K | $5.04 | 277 | 5/5 ✓ |
| winui-xaml-calc | 363 s (62 %) | 11.2 | 394 K | $3.30 | 334 | n/a |
| winui-xaml-kanban | 651 s (31 %) | 13.6 | 615 K | $4.56 | 590 | n/a |
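The headline ratios can be rechecked directly from the table means. A minimal sketch follows; the `means` object and the `ratio` helper are illustrative names, not part of the eval harness.

```typescript
// Per-eval means copied from the results table (wall seconds, K tokens, USD).
interface EvalMean {
  wallSeconds: number;
  tokensK: number;
  costUsd: number;
}

const means: Record<string, EvalMean> = {
  "html-calc": { wallSeconds: 88, tokensK: 131, costUsd: 1.68 },
  "html-kanban": { wallSeconds: 170, tokensK: 183, costUsd: 1.98 },
  "reactor-calc": { wallSeconds: 226, tokensK: 393, costUsd: 3.3 },
  "reactor-kanban": { wallSeconds: 471, tokensK: 738, costUsd: 5.04 },
  "winui-xaml-calc": { wallSeconds: 363, tokensK: 394, costUsd: 3.3 },
  "winui-xaml-kanban": { wallSeconds: 651, tokensK: 615, costUsd: 4.56 },
};

// Look up one eval's means, failing loudly on a misspelled key.
function lookup(evalName: string): EvalMean {
  const row = means[evalName];
  if (row === undefined) {
    throw new Error("unknown eval: " + evalName);
  }
  return row;
}

// Ratio of an eval to a baseline on one metric, rounded to two decimals.
function ratio(evalName: string, baseline: string, metric: keyof EvalMean): number {
  const value = lookup(evalName)[metric] / lookup(baseline)[metric];
  return Math.round(value * 100) / 100;
}

// Reactor vs WinUI XAML
console.log(ratio("reactor-calc", "winui-xaml-calc", "wallSeconds"));     // 0.62
console.log(ratio("reactor-kanban", "winui-xaml-kanban", "tokensK"));     // 1.2
console.log(ratio("reactor-kanban", "winui-xaml-kanban", "costUsd"));     // 1.11
// Reactor vs HTML
console.log(ratio("reactor-calc", "html-calc", "tokensK"));               // 3
console.log(ratio("reactor-kanban", "html-kanban", "tokensK"));           // 4.03
```

The kanban rows show a wider token gap to HTML than the calc rows, which is why the proposals are sized against kanban.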

By the event-log estimate below, a Reactor kanban run spends ~400 K tokens where HTML spends ~130 K, so the structural overhead in Reactor is ~270 K tokens per kanban run. That's the remaining gap.

Where the gap actually lives (event-log evidence)

A typical Reactor-kanban run breaks down as:

| category | turns | tokens (est.) |
|---|---|---|
| skill load | 1 | 5 K |
| dotnet new + scaffold inspect | 2-3 | 25 K |
| Sample-app reads (drag, dialog, flyout, context) | 3-5 | 80 K |
| reactor.api.txt ripgreps | 4-8 | 60 K |
| Apply-patch implementation | 3-4 | 80 K |
| Build + fix cycles | 2-4 | 150 K |
| total | ~17 | ~400 K |

HTML's equivalent: ~6 turns, ~130 K tokens, no scaffold, no API exploration, no build cycles.
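As a quick consistency check, the category estimates can be summed and compared with the HTML figure. This is only a sketch over the numbers quoted above; the `Category` shape and variable names are made up for illustration.

```typescript
// Event-log estimates per category, copied from the breakdown table.
interface Category {
  label: string;
  minTurns: number;
  maxTurns: number;
  tokensK: number;
}

const breakdown: Category[] = [
  { label: "skill load", minTurns: 1, maxTurns: 1, tokensK: 5 },
  { label: "dotnet new + scaffold inspect", minTurns: 2, maxTurns: 3, tokensK: 25 },
  { label: "sample-app reads", minTurns: 3, maxTurns: 5, tokensK: 80 },
  { label: "reactor.api.txt ripgreps", minTurns: 4, maxTurns: 8, tokensK: 60 },
  { label: "apply-patch implementation", minTurns: 3, maxTurns: 4, tokensK: 80 },
  { label: "build + fix cycles", minTurns: 2, maxTurns: 4, tokensK: 150 },
];

const totalTokensK = breakdown.reduce((sum, c) => sum + c.tokensK, 0); // 400
const minTotalTurns = breakdown.reduce((sum, c) => sum + c.minTurns, 0); // 15
const maxTotalTurns = breakdown.reduce((sum, c) => sum + c.maxTurns, 0); // 25

// HTML's estimated spend for the equivalent run, from the line above.
const htmlTokensK = 130;
const structuralOverheadK = totalTokensK - htmlTokensK; // 270

console.log(totalTokensK, minTotalTurns, maxTotalTurns, structuralOverheadK);
```

The turn ranges sum to 15-25, so the quoted ~17 turns sits near the low end, and build + fix cycles alone account for 150 K of the 270 K overhead.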

Proposals (ranked by predicted impact × feasibility)

[ ] 1. Bigger, richer dotnet new reactorapp template — predicted ~−25 % tokens

Today the template generates a 12-line counter. Make it a multi-component app with the shapes the agent needs:

  • App root with UseReducer + a typed record state
  • A Component<TProps> child
  • A .Provide(Ctx, ...) example
  • // REPLACE WITH YOUR LOGIC comments at all the right places

Rationale: The agent reads the scaffold once (cheap, single `view`) and gets all structural patterns in-workspace. WinUI rarely loads its design skill body because the 30+ files of MVVM scaffolding from `dotnet new winui-mvvm` are the documentation. Net est: save ~3-4 turns × ~30 K context ≈ 90-120 K tokens per kanban run.

First step: Edit `tools/Templates/templates/WinUIApp-CSharp/` to add a `Components/` directory, a `Models.cs` with a record state, and inline-comment guidance.

[ ] 2. Inline a generated cheatsheet into the workspace at scaffold time — predicted −10-15 % tokens

When `dotnet new reactorapp` runs, drop `_reactor-api-cheatsheet.md` (the top 80 % of factories/modifiers/hooks, ~100 lines) alongside `Program.cs`. Auto-include it in the csproj as a non-compiled item so the build ignores it.

Rationale: Cost-of-context per turn is the dominant lever. Moving signatures from "ripgrep'd 6-8× per session" to "one tiny file read at scaffold time" is a 2-3× compression of that overhead. Generate from the same source as `reactor.api.txt` to avoid drift.

First step: Add cheatsheet emission to `tools/Reactor.SignaturesGen/Program.cs` (already writes `reactor.api.txt` to two paths; add a third for the cheatsheet) + reference from the template.
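One way to pick the cheatsheet contents is to rank signatures by how often they are used and stop once the selection covers 80 % of uses or hits the line cap. The sketch below assumes that reading of "top 80 %". The usage counts, the `ApiEntry` shape, and the placeholder signatures are all hypothetical, and the real generator is the C# tool named above, not this TypeScript.

```typescript
// One API signature plus how often it appears in sample apps (hypothetical data).
interface ApiEntry {
  signature: string;
  uses: number;
}

// Pick the most-used signatures until `coverage` of all uses is reached
// or `maxLines` entries have been selected, whichever comes first.
function buildCheatsheet(entries: ApiEntry[], coverage = 0.8, maxLines = 100): string[] {
  const totalUses = entries.reduce((sum, e) => sum + e.uses, 0);
  const ranked = entries.slice().sort((a, b) => b.uses - a.uses);
  const picked: string[] = [];
  let covered = 0;
  for (const entry of ranked) {
    if (picked.length >= maxLines || covered >= coverage * totalUses) {
      break;
    }
    picked.push(entry.signature);
    covered += entry.uses;
  }
  return picked;
}

// Placeholder entries; these are not real Reactor signatures.
const sampleEntries: ApiEntry[] = [
  { signature: "placeholder-hook", uses: 7 },
  { signature: "placeholder-factory", uses: 60 },
  { signature: "placeholder-rare", uses: 3 },
  { signature: "placeholder-modifier", uses: 30 },
];

console.log(buildCheatsheet(sampleEntries)); // ["placeholder-factory", "placeholder-modifier"]
```

Ranking by observed usage keeps the file short while still covering what the agent currently ripgreps for most often.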

[ ] 5. Make mur check the default verification, not dotnet build — predicted −10-20 % tokens

Currently the agent runs `dotnet build`, which dumps 1.5-3 K tokens of MSBuild output per build. `mur check` returns ~50-150 tokens with skill-file pointers. Build output stays in context and is re-read from cache on every later turn, so with 2-4 build cycles in a ~17-turn session the saving is roughly 9 K per turn × 17 turns ≈ 150 K cache-read tokens per run.

Two-tier: `mur check` first, `dotnet build` only if `check` is clean but a deeper compile error is suspected.
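The two-tier order could be sketched as below. The runners are injected so the logic stays independent of how `mur check` and `dotnet build` are actually invoked. The `StepResult` shape and the `suspectDeeperError` flag are assumptions for illustration, not an existing harness API.

```typescript
// Result of running one verification tool (hypothetical shape).
interface StepResult {
  ok: boolean;
  output: string;
}

type Runner = () => StepResult;

interface Verification {
  tool: "mur check" | "dotnet build";
  result: StepResult;
}

// Run the cheap check first. Fall through to the full build only when the
// check is clean and the caller still suspects a deeper compile error.
function verify(runCheck: Runner, runBuild: Runner, suspectDeeperError: boolean): Verification {
  const check = runCheck();
  if (!check.ok || !suspectDeeperError) {
    return { tool: "mur check", result: check };
  }
  return { tool: "dotnet build", result: runBuild() };
}

// Usage with stub runners: a failing check never triggers the build.
const failingCheck: Runner = () => ({ ok: false, output: "1 error" });
const stubBuild: Runner = () => ({ ok: true, output: "Build succeeded." });
console.log(verify(failingCheck, stubBuild, true).tool); // "mur check"
```

Keeping the expensive build behind the check means its larger output only enters the context when the cheap signal has run out.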

First step: Update the eval prompt (`evals/lib/flavor-reactor.ts`) and the `reactor-build-and-check` skill to lead with `mur check`.

Recommended cut: implement #1 + #2 + #5 together, since they are independent and complementary. Predicted cumulative: kanban tokens 738 K → ~480 K, cost $5.04 → $3.30. Even at 2× pessimism (half the predicted saving), ~600 K is reachable, which would put Reactor under WinUI's 615 K for the first time.
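The pessimism arithmetic can be made explicit. The sketch below treats "2× pessimism" as achieving half of the predicted saving and assumes cost scales linearly with tokens. Both are simplifying assumptions, and the function names are illustrative.

```typescript
// Tokens after applying only 1/pessimism of the predicted saving.
function pessimisticTokensK(currentK: number, predictedK: number, pessimism: number): number {
  return currentK - (currentK - predictedK) / pessimism;
}

// Cost projected by scaling linearly with tokens (an assumption), rounded to cents.
function scaledCostUsd(currentCostUsd: number, currentK: number, projectedK: number): number {
  return Math.round(((currentCostUsd * projectedK) / currentK) * 100) / 100;
}

const kanbanNowK = 738;
const kanbanPredictedK = 480;
const winuiKanbanK = 615;

const halfSavingK = pessimisticTokensK(kanbanNowK, kanbanPredictedK, 2);
console.log(halfSavingK);                                            // 609
console.log(halfSavingK < winuiKanbanK);                             // true
console.log(scaledCostUsd(5.04, kanbanNowK, kanbanPredictedK));      // 3.28
```

At half the predicted saving the margin under WinUI's 615 K is only about 6 K tokens, so the comparison is sensitive to run-to-run variance.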

[ ] 3. Defer-everything skill loading — predicted −5-10 % tokens, +5 % build-failure risk

Drop the always-loaded `reactor-getting-started` body. Keep only a 200-token stub that points the agent at `skill reactor-getting-started` on demand. Cache reads are paid every turn; a 5 K-token always-loaded skill costs 85 K cache reads per kanban run (5 K × ~17 turns).

Risk: First-build success rate may regress if the agent guesses API names without the skill reference. Run an A/B batch of 5×N runs first, and don't ship if the first-build OK rate drops below 90 %.
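Both the cache-read cost and the ship gate reduce to small calculations. The following is a sketch under stated assumptions: ~17 turns per kanban run, content resident for every turn, and helper names that are invented here rather than taken from the eval harness.

```typescript
// Cache-read tokens paid for content that stays in context for `turns` turns.
function cacheReadTokens(residentTokens: number, turns: number): number {
  return residentTokens * turns;
}

const turnsPerKanbanRun = 17; // approximate, from the event-log breakdown
const fullSkillCost = cacheReadTokens(5000, turnsPerKanbanRun); // 85000
const stubCost = cacheReadTokens(200, turnsPerKanbanRun);       // 3400
// The stub figure ignores any on-demand skill load, which adds tokens back.

// Ship gate: first-build success rate across the A/B batch must not drop below minRate.
function passesGate(firstBuildOk: number, runs: number, minRate = 0.9): boolean {
  if (runs <= 0) {
    throw new Error("runs must be positive");
  }
  return firstBuildOk / runs >= minRate;
}

console.log(fullSkillCost, stubCost);
console.log(passesGate(9, 10)); // true
console.log(passesGate(4, 5));  // false
```

With a single batch of 5, one failed first build already drops the rate to 80 %, which is one reason to run several batches before deciding.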

[ ] 4. "Generate then port": skill-directed two-pass authoring (medium win, +1-2 cheaper turns)

Skill text directs the agent: for any non-trivial UI, sketch the component tree in JSX-like pseudocode in your head, then translate component-by-component to Reactor C# (1 line of pseudocode → 1 line of Reactor). The Rosetta-stone table is already there; emphasize the pseudocode-first workflow so React priors carry the design phase.

First step: Add to `reactor-getting-started`: a "## Authoring workflow — design in React, write in C#" section with one worked 5-line React → 8-line Reactor example.

[ ] 6. C# / DSL ergonomics — small win on output tokens; longer-term lever

a. Components-as-records: `record App() : Component { override Render() => ... }` — saves ~15 chars per component, marginal but compounds.
b. Verify implicit `using static Microsoft.UI.Reactor.Factories` is in the template's `GlobalUsings.cs` and document it in the skill (so the agent doesn't redundantly add it). Documentation only — do this now.
c. `UseState`-as-property syntax (long-term framework work): `[Stateful] partial class App { State Count = 0; }` — needs source generator + analyzer support. Defer.

Honest ceiling

Hard floor is set by: one build cycle (~50-80 K tokens HTML doesn't pay), some skill content load on first attempt (~30-60 K one-time), occasional ripgrep of less-common patterns (~10-30 K). Lower-bound estimate with all 6 ideas: kanban ~250-300 K tokens (vs HTML's 183 K). ~1.5× HTML is about as close as we can realistically land while keeping correctness checks.

Why "generate HTML, then port" doesn't pencil out: HTML pass + port pass ≈ 380 K tokens / $5.50 / 340 s against today's 738 K / $5.04 / 471 s for reactor-kanban. Tokens improve, cost roughly ties, and you carry porting-fidelity risk. The mental version of this (idea #4) captures most of the benefit without the second pass.

Aim: match WinUI XAML on cost (done, at 1.00× on calc and 1.11× on kanban) and land at 1.5×–2× HTML as the realistic ceiling.

Pointers
