Context
Issue #226 §5 proposes making mur check the default verification path in place of dotnet build — a thin MSBuild wrapper that returns ~50–150 tokens with skill-file pointers per diagnostic instead of 1.5–3 K tokens of build output. This proposal extends that idea: turn mur check into a diagnostic-aware coach that augments compiler errors landing on Reactor types with concrete did-you-mean suggestions.
Motivating example, from the Phase-7 eval traces summarized in #226:
Program.cs(34,16): error CS1061: 'ButtonElement' does not contain a definition for 'OnClick'
After this proposal:
Program.cs:34:16 E CS1061 'ButtonElement' has no member 'OnClick'
-> try: Button(label, onClick: x) [factory has Action onClick parameter]
The change converts a 2–4-turn build-fix cycle (~150 K tokens of context) into a 1-turn correction. The 6-proposal cut in #226 already attacks scaffold and skill-load overhead; this attacks the build/fix loop overhead more aggressively than #226 §5 alone.
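As a rough illustration of the before/after shapes above, the following Python sketch parses a standard `file(line,col): severity CODE: message` build line and renders the short form. It is not the CheckCommand implementation; the `compact` helper and its message-shortening rule are assumptions made for this example.

```python
import re
from typing import Optional

# Shape of a compiler diagnostic line: file(line,col): severity CODE: message
_DIAG = re.compile(
    r"^(?P<file>.+?)\((?P<line>\d+),(?P<col>\d+)\):\s+"
    r"(?P<sev>error|warning)\s+(?P<code>[A-Z_]+\d+):\s+(?P<msg>.*)$"
)


def compact(raw: str, suggestion: Optional[str] = None) -> Optional[str]:
    """Render one build-output line in the short form; None if it is not a diagnostic."""
    m = _DIAG.match(raw.strip())
    if m is None:
        return None
    sev = "E" if m["sev"] == "error" else "W"
    # Hypothetical shortening rule, chosen to mirror the example in the text.
    msg = m["msg"].replace("does not contain a definition for", "has no member")
    line = f'{m["file"]}:{m["line"]}:{m["col"]} {sev} {m["code"]} {msg}'
    if suggestion:
        line += f"\n-> try: {suggestion}"
    return line
```

Lines that are not diagnostics (restore chatter, summary lines) return `None`, which is where the token savings over raw build output come from.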
Goal
For every C# compiler error (CS-prefixed) whose receiver, type, or member references a Microsoft.UI.Reactor.* symbol, emit a single-line suggestion with confidence ≥ T. Stay silent below T — a wrong suggestion is worse than no suggestion because the agent will trust it and burn turns.
Non-goals:
- Suggestions for non-Reactor compile errors (NuGet, target framework, etc.).
- Auto-fix / write-back to source — emit text only; the agent edits.
- Replacing the existing analyzers (REACTOR_*) — those keep their current pipeline.
Architecture: four tiers, ML is the last tier
mur check <path>
|
v spawn dotnet build, parse diagnostics (existing - CheckCommand.cs)
|
v for each diagnostic that references a Microsoft.UI.Reactor.* symbol:
|
| Tier 1 - analyzer-ID hint table (existing - HintFor() in CheckCommand)
| REACTOR_HOOKS_001 -> SKILL.md §Hooks
| 12 IDs covered today.
|
| Tier 2 - Roslyn semantic suggester (NEW)
| Load Compilation, resolve symbol at span,
| fuzzy-match against in-scope members + factory set.
| Handles CS1061, CS0103, CS0117, CS1503, CS7036.
|
| Tier 3 - pattern rules (React-ism reductions) (NEW)
| .OnClick(x) -> onClick: x
| .Style(...) -> .With(...) modifier chain
| .Children([...]) -> factory(elements[]) positional
| className= -> not a thing; surface modifier API
|
| Tier 4 - confidence ranker / tiebreaker (FUTURE - only if Tier 2+3 leave gaps)
| GBDT over hand-engineered features (Levenshtein,
| param-name overlap, factory-popularity-in-samples,
| AST-shape similarity).
|
v append "-> try: <suggestion>" only when confidence >= T
The tier structure is deliberate: most of the value is deterministic. ML only enters as a tiebreaker, and only after we have measured data showing where Tier 2+3 fall short.
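The fall-through and the "silent below T" gate can be sketched in a few lines of Python. This is an illustration only: the `Suggestion` record, the tier callable signature, and the choice to fall through to the next tier when a tier's best candidate is below T are assumptions, not decisions taken in this proposal.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Suggestion:
    text: str
    confidence: float  # 0.0 .. 1.0
    tier: str


# A tier takes (diag_code, message) and returns zero or more candidates.
Tier = Callable[[str, str], List[Suggestion]]


def suggest(diag_code: str, message: str,
            tiers: List[Tier], threshold: float) -> Optional[Suggestion]:
    """Return the first tier's best candidate that clears the threshold, else None."""
    for tier in tiers:
        candidates = tier(diag_code, message)
        if not candidates:
            continue
        best = max(candidates, key=lambda s: s.confidence)
        if best.confidence >= threshold:
            return best
    return None  # stay silent: no "-> try:" line is appended
```

Ordering the tiers list from deterministic to learned keeps the ML tier strictly a fallback.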
Data-generation strategy: mine the eval harness
To pick the right pattern rules and to set per-rule confidence thresholds, we need a frequency-ranked list of the mistakes the agent actually makes. Do not guess; measure. Two complementary streams, plus an optional third:
(a) Trace mining from existing evals
Every eval run already produces an event log (build failures, source diffs, eventual success). Add an offline extractor that walks each transcript and emits (broken_source, diagnostic, fixed_source) triples whenever a failed build is followed by a successful one in the same session. The agent's eventual fix is, by construction, a fix that compiled — which is the closest thing to ground truth we can get without hand-labeling.
Output schema:
{ run_id, turn, file, span,
diag_code, diag_msg,
before_text, after_text,
fix_kind: "renamed_member" | "fluent_to_named" | "overload_swap" | ... }
Rule induction is then "cluster by diag_code + AST-shape-of-before and pick the dominant transformation per cluster."
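A minimal sketch of the extractor, in Python, assuming a hypothetical event shape in which each build event carries `ok`, `turn`, `diagnostics`, and a `sources` map of file name to text. The real event log may look different (see the open question on trace format). Classifying `fix_kind` is left to a later step.

```python
from typing import Dict, Iterator, List


def mine_fixes(run_id: str, events: List[dict]) -> Iterator[Dict[str, object]]:
    """Pair every failed build with the next successful build in the same session."""
    pending: List[dict] = []  # failed builds not yet followed by a success
    for ev in events:
        if ev.get("kind") != "build":
            continue
        if not ev["ok"]:
            pending.append(ev)
            continue
        for failed in pending:
            for d in failed["diagnostics"]:
                before = failed["sources"].get(d["file"])
                after = ev["sources"].get(d["file"])
                if before is None or after is None or before == after:
                    continue  # file missing or untouched: no usable pair
                yield {
                    "run_id": run_id, "turn": failed["turn"],
                    "file": d["file"], "span": d["span"],
                    "diag_code": d["code"], "diag_msg": d["msg"],
                    "before_text": before, "after_text": after,
                }
        pending.clear()
```

Failed builds that are never followed by a success produce no records, which matches the "fix that compiled" ground-truth argument.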
(b) Random-app generator harness (the missing piece — proposed here)
The eval corpus today is two scenarios (calc, kanban). To bootstrap rules with statistical confidence we need broader coverage. Build a generator that emits prompts spanning the control × interaction × layout × theming space, e.g.:
build me a {single-page app | tabbed app} that uses a {Button | CheckBox | ComboBox | DataGrid | ...}
to {trigger | bind to | filter | edit} a {string | int | enum | record | list-of-records},
laid out in a {Grid | HStack | VStack | Flex}, themed with {default | dark | accent variant}.
A few hundred prompts ≈ a few hundred fresh transcripts ≈ a few thousand (broken, fixed) pairs after dedup. Crucially:
- Prompts can be enumerated combinatorially or LLM-rewritten for natural variation.
- Each run is a black-box call into the existing eval harness — no harness changes needed beyond "capture the trace."
- Cost is bounded and predictable: ~$3–5 per run × 500 runs ≈ $1.5–2.5 K of one-shot training corpus.
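The combinatorial enumeration and the cost bound can be sketched as follows. The axis values are only the ones named in the template above (the elided "..." controls are not filled in), and the helper names are hypothetical.

```python
import itertools
import random
from typing import List, Tuple

SHELLS = ["single-page app", "tabbed app"]
CONTROLS = ["Button", "CheckBox", "ComboBox", "DataGrid"]  # source list is longer
ACTIONS = ["trigger", "bind to", "filter", "edit"]
DATA = ["string", "int", "enum", "record", "list-of-records"]
LAYOUTS = ["Grid", "HStack", "VStack", "Flex"]
THEMES = ["default", "dark", "accent variant"]

TEMPLATE = ("build me a {shell} that uses a {control} to {action} a {data}, "
            "laid out in a {layout}, themed with {theme}.")


def all_prompts() -> List[str]:
    """Enumerate the full prompt grid (2 * 4 * 4 * 5 * 4 * 3 = 1920 prompts here)."""
    grid = itertools.product(SHELLS, CONTROLS, ACTIONS, DATA, LAYOUTS, THEMES)
    return [TEMPLATE.format(shell=s, control=c, action=a, data=d, layout=l, theme=t)
            for s, c, a, d, l, t in grid]


def sample_prompts(n: int, seed: int = 0) -> List[str]:
    """Draw n distinct prompts reproducibly."""
    return random.Random(seed).sample(all_prompts(), n)


def cost_range_usd(runs: int, low: int = 3, high: int = 5) -> Tuple[int, int]:
    """Total spend bounds for a given number of runs at $low..$high per run."""
    return runs * low, runs * high
```

With the per-run cost quoted above, `cost_range_usd(500)` gives `(1500, 2500)`, the $1.5–2.5 K figure. Uniform sampling is the simplest choice; a stratified sampler would address the over-representation risk.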
The user observation that motivates this is correct: we don't have direct production frequency yet, but the random-app corpus is enough to seed the pattern rules. Production frequency comes later from telemetry on mur check itself (count which suggestions fire, which the agent accepts, which it overrides).
(c) (Optional) Synthetic mutations as a third stream
Take samples in samples/apps/ (13 apps as of today: a11y-showcase, chat, monaco-editor, regedit, wordpuzzle, ...), apply rule-based corruptions (rename methods, fluentize named args, swap overloads), and pair each corrupted file with its original as (broken, fixed). This scales cheaply to 100K pairs, but the distribution is artificial, so use it as a validation set rather than a training set to keep training honest.
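One of the corruptions, "fluentize named args", can be illustrated with a toy Python mutator. It works on a single-line call with a regex purely for illustration; a real mutator would presumably operate on the syntax tree, and the `fluentize` name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches a trailing named argument such as `, onClick: Save)` in a simple call.
_NAMED_LAST_ARG = re.compile(r",\s*(?P<name>[a-z]\w*):\s*(?P<value>[^(),]+)\)")


def fluentize(source: str) -> Optional[Tuple[str, str]]:
    """Turn the first trailing named argument into a fluent call.

    Returns (broken, fixed), where `fixed` is the untouched original,
    or None when the snippet has no named argument to corrupt.
    """
    m = _NAMED_LAST_ARG.search(source)
    if m is None:
        return None
    name = m["name"]
    method = name[0].upper() + name[1:]  # onClick -> OnClick
    broken = (source[:m.start()]
              + f").{method}({m['value'].strip()})"
              + source[m.end():])
    return broken, source
```

For example, `Button("Save", onClick: Save)` becomes `Button("Save").OnClick(Save)`, the same shape as the motivating CS1061 error.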
Implementation plan
Phased so each phase is independently shippable.
Phase 0 — instrumentation (no behavior change)
- Extend CheckCommand to emit a structured trace (mur check --trace <path>) capturing every CS-diagnostic that resolves to a Microsoft.UI.Reactor.* symbol, even when no suggestion fires today.
- Land the trace-mining script that walks an eval's event log to extract (broken, fixed) pairs.
- Run a 50×N sweep on existing calc + kanban prompts. Output: a frequency-ranked list of CS codes + receiver-types that touch Reactor.
Exit criterion: we have a ranked list of the top ~20 CS-diagnostic patterns the agent hits.
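The ranking itself is a small aggregation over the trace records. A sketch in Python, assuming each record carries `diag_code` and a hypothetical `receiver_type` field:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_patterns(records: Iterable[dict],
                  top: int = 20) -> List[Tuple[Tuple[str, str], int]]:
    """Count (diag_code, receiver_type) pairs and return the most frequent ones."""
    counts = Counter((r["diag_code"], r["receiver_type"]) for r in records)
    return counts.most_common(top)
```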
Phase 1 — Tier 2 (Roslyn semantic suggester)
- Add Reactor.Cli.Check.Suggesters.SymbolSuggester. Inputs: Compilation, the Diagnostic, the SyntaxNode at the span. Outputs: ordered list of (suggestion_text, confidence, evidence).
- Cover CS1061 (member missing), CS0103 (name not in scope), CS0117 (no static member), CS1503 (argument type mismatch), CS7036 (required argument not supplied).
- Fuzzy-match using JaroWinkler against:
  - in-scope members of the receiver's ITypeSymbol,
  - the factory set in Microsoft.UI.Reactor.Factories,
  - parameter names of all overloads of the enclosing factory call.
- Microsoft.CodeAnalysis.CSharp 4.8.0 is already a PackageReference in src/Reactor.Cli/Reactor.Cli.csproj, so no dependency churn.
- Compilation cost: load once per mur check invocation, cache by <csproj, file-hash> for incremental scenarios.
Exit criterion: for the top ~20 patterns from Phase 0, ≥ 70 % of cases get a correct suggestion at confidence ≥ T, with false-positive rate ≤ 5 %.
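For reference, the Jaro-Winkler scoring behind the fuzzy match is sketched below in Python, together with a thresholded lookup. The candidate list would come from the Roslyn symbol queries listed above; here it is a plain list of names. Comparing case-insensitively and defaulting the threshold to 0.85 are assumptions for this example.

```python
from typing import Iterable, Optional, Tuple


def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_m = [c for c, hit in zip(s, s_hit) if hit]
    t_m = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_m, t_m)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def best_match(missing: str, candidates: Iterable[str],
               threshold: float = 0.85) -> Optional[Tuple[str, float]]:
    """Best candidate at or above the threshold, else None (stay silent)."""
    scored = [(c, jaro_winkler(missing.lower(), c.lower())) for c in candidates]
    if not scored:
        return None
    name, score = max(scored, key=lambda pair: pair[1])
    return (name, score) if score >= threshold else None
```

With this scoring, a missing `OnClick` member matched against the parameter names `label`, `onClick`, `isEnabled` resolves to `onClick`, while an unrelated name yields no suggestion.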
Phase 2 — random-app generator harness
- Add tools/Reactor.PromptGen/ that enumerates the prompt grid above and writes a JSONL of prompts.
- Reuse the existing eval runner; add a --mine-fixes mode that extracts pairs into tools/Reactor.PromptGen/data/fixes.jsonl.
- Run 200–500 prompts. Aggregate to a patterns.json ranked by frequency.
Exit criterion: patterns.json covers ≥ 95 % of the (broken, fixed) pairs by frequency (long-tail truncated).
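The long-tail truncation in the exit criterion amounts to keeping the most frequent patterns until their cumulative share reaches the target. A Python sketch, with a hypothetical `head_covering` helper and a plain pattern-to-count mapping as input:

```python
from typing import Dict, List, Tuple


def head_covering(counts: Dict[str, int],
                  target: float = 0.95) -> Tuple[List[str], float]:
    """Keep the most frequent patterns until they cover `target` of all pairs."""
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    kept: List[str] = []
    covered = 0
    for pattern, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered / total >= target:
            break  # everything after this point is the truncated tail
        kept.append(pattern)
        covered += n
    return kept, covered / total
```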
Phase 3 — Tier 3 (pattern rules, induced)
- For each high-frequency cluster, hand-author a rule (IRulePattern) matched against the failing AST and the diagnostic code.
- Rules live in src/Reactor.Cli/Check/Rules/*.cs, one per pattern, each with a unit test against a captured pair.
Exit criterion: Tier 3 catches every cluster with frequency ≥ 5 % in the random-app corpus.
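The shape of one such rule, sketched in Python rather than C# and matching on text instead of the AST, might look like this. The `RulePattern` record is a hypothetical stand-in for IRulePattern, whose actual interface is not defined in this proposal.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class RulePattern:
    name: str
    diag_codes: Tuple[str, ...]          # diagnostics the rule is allowed to fire on
    regex: re.Pattern                    # stand-in for an AST-shape match
    render: Callable[[re.Match], str]    # builds the suggestion text

    def try_match(self, diag_code: str, failing_text: str) -> Optional[str]:
        if diag_code not in self.diag_codes:
            return None
        m = self.regex.search(failing_text)
        return self.render(m) if m else None


# .OnClick(x) -> onClick: x, one of the React-ism reductions from the tier diagram.
FLUENT_ONCLICK = RulePattern(
    name="fluent-onclick",
    diag_codes=("CS1061",),
    regex=re.compile(r"\.OnClick\((?P<arg>[^()]*)\)"),
    render=lambda m: f"onClick: {m['arg'].strip()}",
)
```

Keying each rule on both the diagnostic code and the code shape keeps it from firing on a look-alike expression that failed for a different reason.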
Phase 4 — telemetry-driven Tier 4 (only if needed)
- Log (diagnostic, candidates, picked, accepted-by-agent) from production mur check invocations.
- If Tier 2+3 still leave a meaningful tail, train a small GBDT ranker over hand-engineered features. Defer until data justifies it.
Risks
| Risk | Mitigation |
| Wrong suggestions corrupt the agent's reasoning | Per-rule confidence threshold; emit only above T. Telemetry on agent-accept rate per rule, auto-suppress rules with low accept. |
| Roslyn Compilation load is slow on cold runs | Single load per mur check invocation, ~200–500 ms one-time cost vs. multi-second dotnet build we're already paying. |
| reactor.api.txt and the live Reactor.dll drift apart | Tier 2 reads from the live Compilation, not the api index. The api index is a fast pre-filter only. |
| Random-app corpus over-represents simple controls | Stratify the prompt grid; weight rules by both frequency and per-cluster turn-cost (rare patterns that cost 5 turns matter more than common ones that cost 1). |
| Pattern rules become a maintenance treadmill | Auto-generate the samples/-derived validation set; CI fails when an analyzer change regresses a captured pair. |
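Two of the mitigations above reduce to simple arithmetic, sketched here in Python. The cut-offs (`min_fired`, `min_accept`) are placeholder values, not numbers proposed in this issue, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def suppressed_rules(stats: Dict[str, Tuple[int, int]],
                     min_fired: int = 20,
                     min_accept: float = 0.5) -> List[str]:
    """Rules to auto-suppress. `stats` maps rule name -> (times fired, times accepted).

    A rule is only judged once it has fired often enough to give a stable rate.
    """
    return sorted(
        rule for rule, (fired, accepted) in stats.items()
        if fired >= min_fired and accepted / fired < min_accept
    )


def rule_priority(frequency: float, turn_cost: float) -> float:
    """Weight a cluster by how often it occurs and how many turns it costs."""
    return frequency * turn_cost
```

Under this weighting, a pattern seen in 2 % of runs that costs 5 turns (priority 0.10) outranks one seen in 8 % of runs that costs 1 turn (priority 0.08).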
Predicted impact
If we accept #226's per-kanban-run breakdown (build + fix cycles ≈ 150 K tokens, 2–4 turns), and Tier 2+3 remove ~70 % of those turns:
- Tokens: ~−100 K per kanban run (~14 % of total).
- Turns: ~−2 (16.8 → ~14.8).
- Cost: ~−$0.70 per run.
Stacking with #226 §1 (richer template) + §2 (inline cheatsheet): kanban ~480 K → ~380 K tokens, putting Reactor decisively under WinUI on cost and within ~2× HTML.
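The arithmetic behind these estimates, using only the figures quoted above, can be checked with a short Python sketch. The exact product is 105 K tokens, which the text rounds to ~100 K; the function name is hypothetical.

```python
from typing import Dict


def predicted_impact(build_fix_tokens: int = 150_000,
                     removed_pct: int = 70,
                     turns_low: int = 2,
                     turns_high: int = 4,
                     stacked_baseline: int = 480_000,
                     rounded_saving: int = 100_000) -> Dict[str, object]:
    """Reproduce the estimate: share of the build/fix loop removed by Tier 2+3."""
    tokens_saved = build_fix_tokens * removed_pct // 100           # 105_000
    turns_saved = (turns_low * removed_pct / 100,
                   turns_high * removed_pct / 100)                 # (1.4, 2.8)
    return {
        "tokens_saved": tokens_saved,
        "turns_saved_range": turns_saved,
        # Uses the rounded ~100 K saving, as the text does.
        "stacked_tokens_after": stacked_baseline - rounded_saving,  # 380_000
    }
```

The ~2-turn figure sits at the midpoint of the 1.4 to 2.8 range.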
Open questions
- Trace format for the random-app generator: does the existing eval harness produce a transcript rich enough to extract before/after source pairs without re-running the build? If not, what is the smallest harness change?
- Confidence threshold T: start strict (≥ 0.85 JaroWinkler-equivalent) and loosen with telemetry, or start loose and tighten? Defaulting to strict preserves "silent is OK" — propose strict.
- Should Tier 2 also rewrite the surface form? Today suggestions are text-only ("try: Button(label, onClick: x)"). A future variant could emit a unified-diff hunk, but that is a separate scope.
- Roslyn workspaces vs. compilations: workspaces give us project-graph reasoning but cost more to load. Start with Compilation only; revisit if rules need cross-project context.
Pointers
- Existing mur check implementation: src/Reactor.Cli/Check/CheckCommand.cs
- Existing analyzers (12 REACTOR_* IDs): src/Reactor.Analyzers/{HookRulesAnalyzer,UseMemoCellsAnalyzer,UseThemeRefAnalyzer,RequestedThemeSetAnalyzer,UseLightweightStylingAnalyzer,AccessibilityAnalyzers,MissingWithKeyAnalyzer}.cs
- API surface index generator: tools/Reactor.SignaturesGen/Program.cs → writes skills/reactor.api.txt + plugins/reactor/skills/reactor-dsl/references/reactor.api.txt
- Build/check skill (cheat table for known IDs): plugins/reactor/skills/reactor-build-and-check/SKILL.md
- Sample apps (training/validation corpus): samples/apps/ (13 projects)
- Roslyn version available today: Microsoft.CodeAnalysis.CSharp 4.8.0 in src/Reactor.Cli/Reactor.Cli.csproj
Extends and depends on #226 §5. Independent of, but complementary to, #226 §1, §2.