Design proposal: Roslyn-backed did-you-mean suggestions for mur check (extends #226 §5)

## Context

Issue #226 §5 proposes replacing `dotnet build` with `mur check` as the default verification path. `mur check` is a thin MSBuild wrapper that returns ~50–150 tokens with skill-file pointers per diagnostic instead of 1.5–3 K tokens of build output. This proposal extends that idea: turn `mur check` into a **diagnostic-aware coach** that augments compiler errors landing on Reactor types with concrete *did-you-mean* suggestions.

Motivating example, from the Phase-7 eval traces summarized in #226:

```
Program.cs(34,16): error CS1061: 'ButtonElement' does not contain a definition for 'OnClick'
```

After this proposal:

```
Program.cs:34:16  E  CS1061  'ButtonElement' has no member 'OnClick'
                              -> try: Button(label, onClick: x)         [factory has Action onClick parameter]
```

The change converts a 2–4-turn build-fix cycle (~150 K tokens of context) into a 1-turn correction. The 6-proposal cut in #226 already attacks scaffold and skill-load overhead; this attacks the **build/fix loop overhead** more aggressively than #226 §5 alone.

## Goal

For every C# compiler error (CS-prefixed) whose receiver, type, or member references a `Microsoft.UI.Reactor.*` symbol, emit a single-line suggestion whenever the best candidate's confidence is at least a threshold T (see Open questions for how T is chosen). Stay silent below T: a wrong suggestion is worse than no suggestion because the agent will trust it and burn turns.

Non-goals:
- Suggestions for non-Reactor compile errors (NuGet, target framework, etc.).
- Auto-fix / write-back to source — emit text only; the agent edits.
- Replacing the existing analyzers (REACTOR_*) — those keep their current pipeline.

## Architecture: four tiers, ML is the *last* tier

```
mur check <path>
    |
    v spawn dotnet build, parse diagnostics  (existing - CheckCommand.cs)
    |
    v for each diagnostic that references a Microsoft.UI.Reactor.* symbol:
    |
    |   Tier 1 - analyzer-ID hint table              (existing - HintFor() in CheckCommand)
    |             REACTOR_HOOKS_001 -> SKILL.md §Hooks
    |             12 IDs covered today.
    |
    |   Tier 2 - Roslyn semantic suggester           (NEW)
    |             Load Compilation, resolve symbol at span,
    |             fuzzy-match against in-scope members + factory set.
    |             Handles CS1061, CS0103, CS0117, CS1503, CS7036.
    |
    |   Tier 3 - pattern rules (React-ism reductions) (NEW)
    |             .OnClick(x)        -> onClick: x
    |             .Style(...)        -> .With(...) modifier chain
    |             .Children([...])   -> factory(elements[])  positional
    |             className=         -> not a thing; surface modifier API
    |
    |   Tier 4 - confidence ranker / tiebreaker      (FUTURE - only if Tier 2+3 leave gaps)
    |             GBDT over hand-engineered features (Levenshtein,
    |             param-name overlap, factory-popularity-in-samples,
    |             AST-shape similarity).
    |
    v append "-> try: <suggestion>" only when confidence >= T
```

The tier structure is deliberate: **most of the value is deterministic.** ML only enters as a tiebreaker, and only after we have measured data showing where Tier 2+3 fall short.
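The tier ordering and the "silent below T" rule can be sketched as a small cascade. This is a minimal illustration in Python rather than the C# of `CheckCommand`, and it assumes that tiers are tried in order and that the first suggestion at or above T wins; the proposal does not specify how tiers combine. `Suggestion`, `suggest`, `render`, and the two demo tiers are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Suggestion:
    text: str
    confidence: float  # 0.0 .. 1.0
    tier: str


# A tier maps a parsed diagnostic (a plain dict here) to a Suggestion or None.
Tier = Callable[[dict], Optional[Suggestion]]


def suggest(diag: dict, tiers: list, threshold: float) -> Optional[Suggestion]:
    """Return the first suggestion that clears the threshold, else None (stay silent)."""
    for tier in tiers:
        candidate = tier(diag)
        if candidate is not None and candidate.confidence >= threshold:
            return candidate
    return None


def render(diag: dict, suggestion: Optional[Suggestion]) -> str:
    """Format one diagnostic line, appending '-> try:' only when a suggestion exists."""
    line = f"{diag['file']}:{diag['line']}:{diag['col']}  E  {diag['code']}  {diag['message']}"
    if suggestion is not None:
        line += f"\n    -> try: {suggestion.text}"
    return line


# Hypothetical demo tiers: the hint table has nothing for CS codes,
# the symbol suggester recognizes the motivating example.
def hint_table(diag: dict) -> Optional[Suggestion]:
    return None


def symbol_suggester(diag: dict) -> Optional[Suggestion]:
    if diag["code"] == "CS1061" and "OnClick" in diag["message"]:
        return Suggestion("Button(label, onClick: x)", 0.92, "tier2")
    return None
```

With a threshold of 0.85 the demo diagnostic gets its `-> try:` line; raising the threshold above 0.92 makes the same call return `None`, and `render` then prints the bare diagnostic.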

## Data-generation strategy: mine the eval harness

To pick the right pattern rules and to set per-rule confidence thresholds, we need a frequency-ranked list of the mistakes the agent actually makes. **Do not guess — measure.** Two complementary streams:

### (a) Trace mining from existing evals

Every eval run already produces an event log (build failures, source diffs, eventual success). Add an offline extractor that walks each transcript and emits (broken_source, diagnostic, fixed_source) triples whenever a failed build is followed by a successful one in the same session. The agent's eventual fix is, by construction, a fix that compiled — which is the closest thing to ground truth we can get without hand-labeling.

**Output schema:**

```
{ run_id, turn, file, span,
  diag_code, diag_msg,
  before_text, after_text,
  fix_kind: "renamed_member" | "fluent_to_named" | "overload_swap" | ... }
```
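A minimal sketch of the extractor that would produce these records, in Python. The event-log shape is an assumption (see Open question 1): each build event is taken to carry `run_id`, `turn`, `ok`, `diagnostics`, and a `sources` map from file name to text. Only the failed build immediately preceding a successful one is paired, and `fix_kind` is left for a later classification step.

```python
from typing import Iterable, Iterator


def extract_pairs(events: Iterable[dict]) -> Iterator[dict]:
    """Yield one record per diagnostic of a failed build that is directly
    followed by a successful build in the same run."""
    last_failed = None
    for ev in events:
        if ev.get("kind") != "build":
            continue  # ignore edits, tool calls, and other event kinds
        if not ev["ok"]:
            last_failed = ev
            continue
        if last_failed is not None and last_failed["run_id"] == ev["run_id"]:
            for d in last_failed["diagnostics"]:
                yield {
                    "run_id": last_failed["run_id"],
                    "turn": last_failed["turn"],
                    "file": d["file"],
                    "span": d["span"],
                    "diag_code": d["code"],
                    "diag_msg": d["message"],
                    "before_text": last_failed["sources"].get(d["file"]),
                    "after_text": ev["sources"].get(d["file"]),
                }
        last_failed = None
```

Pairing only adjacent builds keeps each record attributable to one fix; multi-step repair chains would need a different pairing policy.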

Rule induction is then "cluster by `diag_code` + AST-shape-of-`before` and pick the dominant transformation per cluster."
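As a rough illustration of that clustering step, the sketch below groups records by diagnostic code and a shape label, then reports the dominant `fix_kind` and its share per cluster. The `before_shape` field is hypothetical: it stands for whatever AST-shape key the extractor ends up computing.

```python
from collections import Counter, defaultdict


def dominant_fix_per_cluster(records):
    """Map (diag_code, before_shape) -> (dominant fix_kind, share of cluster)."""
    clusters = defaultdict(Counter)
    for r in records:
        clusters[(r["diag_code"], r["before_shape"])][r["fix_kind"]] += 1

    result = {}
    for key, counts in clusters.items():
        fix_kind, n = counts.most_common(1)[0]
        result[key] = (fix_kind, n / sum(counts.values()))
    return result
```

A cluster whose dominant transformation has a low share is a sign that the shape key is too coarse, or that the pattern is not a good rule candidate.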

### (b) Random-app generator harness *(the missing piece — proposed here)*

The eval corpus today is two scenarios (calc, kanban). To bootstrap rules with statistical confidence we need broader coverage. Build a generator that emits prompts spanning the control × interaction × layout × theming space, e.g.:

```
build me a {single-page app | tabbed app} that uses a {Button | CheckBox | ComboBox | DataGrid | ...}
to {trigger | bind to | filter | edit} a {string | int | enum | record | list-of-records},
laid out in a {Grid | HStack | VStack | Flex}, themed with {default | dark | accent variant}.
```

A few hundred prompts ≈ a few hundred fresh transcripts ≈ a few thousand (broken, fixed) pairs after dedup. Crucially:

- Prompts can be enumerated combinatorially or LLM-rewritten for natural variation.
- Each run is a black-box call into the existing eval harness — no harness changes needed beyond "capture the trace."
- Cost is bounded and predictable: ~$3–5 per run × 500 runs ≈ $1.5–2.5 K, paid once to build the corpus.
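The combinatorial enumeration and the cost bound can be sketched as follows. The option lists contain only the values named in the template above (the control list in the template is open-ended), and `enumerate_prompts`, `sample_prompts`, and `cost_range` are hypothetical helper names.

```python
import itertools
import random

APP = ["single-page app", "tabbed app"]
CONTROL = ["Button", "CheckBox", "ComboBox", "DataGrid"]  # template lists more
ACTION = ["trigger", "bind to", "filter", "edit"]
DATA = ["string", "int", "enum", "record", "list-of-records"]
LAYOUT = ["Grid", "HStack", "VStack", "Flex"]
THEME = ["default", "dark", "accent variant"]

TEMPLATE = (
    "build me a {app} that uses a {control} to {action} a {data}, "
    "laid out in a {layout}, themed with {theme}."
)
KEYS = ("app", "control", "action", "data", "layout", "theme")


def enumerate_prompts():
    """Yield every prompt in the full grid, in a deterministic order."""
    for combo in itertools.product(APP, CONTROL, ACTION, DATA, LAYOUT, THEME):
        yield TEMPLATE.format(**dict(zip(KEYS, combo)))


def sample_prompts(n, seed=0):
    """Draw n distinct prompts from the grid with a fixed seed for reproducibility."""
    return random.Random(seed).sample(list(enumerate_prompts()), n)


def cost_range(runs, low_usd=3.0, high_usd=5.0):
    """Lower and upper cost bounds for a sweep at the per-run cost quoted above."""
    return runs * low_usd, runs * high_usd
```

Even this reduced grid has 2 × 4 × 4 × 5 × 4 × 3 = 1,920 combinations, so a 200–500 prompt sweep is a sample of the grid rather than a full enumeration; sampling uniformly here ignores the stratification concern raised under Risks.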

**Caveat:** we don't have direct *production* frequency yet, but the random-app corpus is enough to seed the pattern rules. Production frequency comes later from telemetry on `mur check` itself (count which suggestions fire, which the agent accepts, which it overrides).

### (c) (Optional) Synthetic mutations as a third stream

Take the samples in `samples/apps/` (13 apps as of today: `a11y-showcase`, `chat`, `monaco-editor`, `regedit`, `wordpuzzle`, ...), apply rule-based corruptions (rename methods, fluentize named args, swap overloads), and pair each corrupted source with its original as a (broken, fix) pair. This is cheap to scale to 100K pairs, but the distribution is artificial. Use it as a **validation** set, not a training set, to keep training honest.
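One such corruption, "fluentize named args", might look like the sketch below. It works on raw text with a regular expression purely for illustration; a real mutator would more plausibly operate on the syntax tree. `fluentize_named_arg` and `make_pair` are hypothetical names.

```python
import re

# Matches Factory(<head>, name: value) where value is a simple expression
# without parentheses or commas. Deliberately narrow: this is a sketch.
NAMED_ARG = re.compile(r"(\w+)\((.*?),\s*(\w+):\s*([^(),]+)\)")


def fluentize_named_arg(source):
    """Rewrite the first `F(head, name: v)` as `F(head).Name(v)`; None if no match."""
    def repl(m):
        factory, head, name, value = m.groups()
        return f"{factory}({head}).{name[0].upper()}{name[1:]}({value.strip()})"

    broken, count = NAMED_ARG.subn(repl, source, count=1)
    return broken if count else None


def make_pair(source):
    """Return a (broken, fix) record where the fix is the original source."""
    broken = fluentize_named_arg(source)
    return None if broken is None else {"broken": broken, "fix": source}
```

For example, `make_pair('Button(label, onClick: x)')` yields a record whose `broken` text is `Button(label).OnClick(x)`, the React-ism from the motivating example.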

## Implementation plan

Phased so each phase is independently shippable.

### Phase 0 — instrumentation (no behavior change)

- Extend `CheckCommand` to emit a structured trace (`mur check --trace <path>`) capturing every CS-diagnostic that resolves to a `Microsoft.UI.Reactor.*` symbol, even when no suggestion fires today.
- Land the trace-mining script that walks an eval's event log to extract (broken, fixed) pairs.
- Run a 50×N sweep on existing calc + kanban prompts. Output: a frequency-ranked list of CS codes + receiver-types that touch Reactor.

**Exit criterion:** we have a ranked list of the top ~20 CS-diagnostic patterns the agent hits.

### Phase 1 — Tier 2 (Roslyn semantic suggester)

- Add `Reactor.Cli.Check.Suggesters.SymbolSuggester`. Inputs: `Compilation`, the `Diagnostic`, the `SyntaxNode` at the span. Outputs: ordered list of `(suggestion_text, confidence, evidence)`.
- Cover CS1061 (member missing), CS0103 (name not in scope), CS0117 (no static member), CS1503 (argument type mismatch), CS7036 (required argument missing).
- Fuzzy-match using Jaro-Winkler similarity against:
  - in-scope members of the receiver's `ITypeSymbol`,
  - the factory set in `Microsoft.UI.Reactor.Factories`,
  - parameter names of all overloads of the enclosing factory call.
- `Microsoft.CodeAnalysis.CSharp` 4.8.0 is already a `PackageReference` in `src/Reactor.Cli/Reactor.Cli.csproj`, so no dependency churn.
- Compilation cost: load once per `mur check` invocation, cache by `<csproj, file-hash>` for incremental scenarios.
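For reference, the Jaro-Winkler scoring step can be sketched in a few lines. This is the textbook formulation written in Python for brevity, not the suggester itself; the case-insensitive comparison and the 0.85 default threshold in `best_match` are assumptions (the threshold mirrors the strict starting value floated in Open questions).

```python
def jaro(s, t):
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_m = [c for c, hit in zip(s, s_hit) if hit]
    t_m = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_m, t_m)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def best_match(name, candidates, threshold=0.85):
    """Return (candidate, score), with candidate None when nothing clears the threshold."""
    scored = [(jaro_winkler(name.lower(), c.lower()), c) for c in candidates]
    score, cand = max(scored, default=(0.0, None))
    return (cand, score) if score >= threshold else (None, score)
```

Matching `OnClick` against the parameter names `label`, `onClick`, and `isEnabled` selects `onClick`, while an unrelated name such as `Frobnicate` falls below the threshold and produces no candidate, which is the "stay silent" case.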

**Exit criterion:** for the top ~20 patterns from Phase 0, ≥ 70 % of cases get a correct suggestion at confidence ≥ T, with false-positive rate ≤ 5 %.

### Phase 2 — random-app generator harness

- Add `tools/Reactor.PromptGen/` that enumerates the prompt grid above and writes a JSONL of prompts.
- Reuse the existing eval runner; add a `--mine-fixes` mode that extracts pairs into `tools/Reactor.PromptGen/data/fixes.jsonl`.
- Run 200–500 prompts. Aggregate to a `patterns.json` ranked by frequency.

**Exit criterion:** `patterns.json` covers ≥ 95 % of the (broken, fixed) pairs by frequency (long-tail truncated).
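The "covers ≥ 95 % by frequency" criterion amounts to keeping the most frequent patterns until their cumulative share reaches the target. A minimal sketch, with `truncate_to_coverage` as a hypothetical helper and a plain pattern-to-count mapping standing in for `patterns.json`:

```python
from collections import Counter


def truncate_to_coverage(pattern_counts, target=0.95):
    """Keep the most frequent patterns until cumulative coverage reaches target.

    Returns (kept_patterns, achieved_coverage).
    """
    total = sum(pattern_counts.values())
    if total == 0:
        return [], 0.0
    kept, covered = [], 0
    for pattern, n in Counter(pattern_counts).most_common():
        if covered / total >= target:
            break  # everything after this point is the truncated long tail
        kept.append(pattern)
        covered += n
    return kept, covered / total
```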

### Phase 3 — Tier 3 (pattern rules, induced)

- For each high-frequency cluster, hand-author a rule (`IRulePattern`) matched against the failing AST and the diagnostic code.
- Rules live in `src/Reactor.Cli/Check/Rules/*.cs`, one per pattern, each with a unit test against a captured pair.

**Exit criterion:** Tier 3 catches every cluster with frequency ≥ 5 % in the random-app corpus.
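To make the shape of a Tier 3 rule concrete, here is a textual stand-in for the `.OnClick(x)` reduction, written in Python with a regular expression. The proposed rules are C# classes matched against the failing AST, so treat this only as an illustration of the contract (diagnostic code plus failing code in, suggestion or nothing out); `on_click_rule` is a hypothetical name.

```python
import re
from typing import Optional

# Fluent call with a single simple argument, e.g. `.OnClick(Save)`.
FLUENT_ONCLICK = re.compile(r"\.OnClick\(\s*(?P<arg>[^()]+?)\s*\)")


def on_click_rule(diag_code: str, failing_text: str) -> Optional[str]:
    """Suggest the named-argument form for a fluent .OnClick(...) call.

    Fires only on CS1061 (member missing); returns None otherwise.
    """
    if diag_code != "CS1061":
        return None
    m = FLUENT_ONCLICK.search(failing_text)
    return f"onClick: {m.group('arg')}" if m else None
```

The per-rule unit test against a captured pair then reduces to asserting that the rule, applied to the pair's broken text and diagnostic code, produces the fragment that appears in the fixed text.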

### Phase 4 — telemetry-driven Tier 4 (only if needed)

- Log `(diagnostic, candidates, picked, accepted-by-agent)` from production `mur check` invocations.
- If Tier 2+3 still leave a meaningful tail, train a small GBDT ranker over hand-engineered features. Defer until data justifies it.
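Before any ranker is trained, the logged tuples already support a simpler use: computing per-rule accept rates and suppressing rules the agent routinely overrides. The sketch below assumes each log entry has a `picked` rule identifier and an `accepted` flag; the minimum sample size and the 50 % cutoff are placeholder values, not proposed settings.

```python
from collections import defaultdict


def rules_to_suppress(events, min_samples=20, min_accept_rate=0.5):
    """Return rule ids whose accept rate is below the cutoff, given enough samples."""
    fired = defaultdict(int)
    accepted = defaultdict(int)
    for ev in events:
        fired[ev["picked"]] += 1
        accepted[ev["picked"]] += bool(ev["accepted"])
    return sorted(
        rule for rule, n in fired.items()
        if n >= min_samples and accepted[rule] / n < min_accept_rate
    )
```

Requiring a minimum sample count keeps a rule from being suppressed on the strength of a handful of unlucky firings.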

## Risks

| Risk | Mitigation |
|---|---|
| Wrong suggestions corrupt the agent's reasoning | Per-rule confidence threshold; emit only above T. Track the agent-accept rate per rule via telemetry and auto-suppress rules with a low accept rate. |
| Roslyn `Compilation` load is slow on cold runs | Single load per `mur check` invocation, ~200–500 ms one-time cost vs. multi-second `dotnet build` we're already paying. |
| `reactor.api.txt` and the live `Reactor.dll` drift apart | Tier 2 reads from the live `Compilation`, not the api index. The api index is a fast pre-filter only. |
| Random-app corpus over-represents simple controls | Stratify the prompt grid; weight rules by both frequency *and* per-cluster turn-cost (rare patterns that cost 5 turns matter more than common ones that cost 1). |
| Pattern rules become a maintenance treadmill | Auto-generate the `samples/`-derived validation set; CI fails when an analyzer change regresses a captured pair. |

## Predicted impact

If we accept #226's per-kanban-run breakdown (build + fix cycles ≈ 150 K tokens, 2–4 turns), and Tier 2+3 remove ~70 % of those turns:

- Tokens: ~−100 K per kanban run (~14 % of total).
- Turns: ~−2 (16.8 → ~14.8).
- Cost: ~−$0.70 per run.
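The token and turn figures follow directly from the stated assumptions, as this small calculation shows. It uses only the numbers quoted above (a 150 K-token, 2–4-turn build/fix cycle and a 70 % reduction); `predicted_savings` is a hypothetical helper.

```python
def predicted_savings(cycle_tokens=150_000, cycle_turns=(2, 4), removed=0.70):
    """Tokens and turns saved if `removed` of the build/fix cycle disappears."""
    tokens_saved = cycle_tokens * removed
    turns_saved = tuple(t * removed for t in cycle_turns)
    return tokens_saved, turns_saved


tokens, (turns_low, turns_high) = predicted_savings()
print(f"tokens saved: ~{tokens / 1000:.0f} K")
print(f"turns saved: {turns_low:.1f} to {turns_high:.1f}")
```

That gives roughly 105 K tokens and 1.4 to 2.8 turns, which the bullets round to ~100 K and ~2.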

Stacking with #226 §1 (richer template) + §2 (inline cheatsheet): kanban ~480 K → ~380 K tokens, putting Reactor decisively under WinUI on cost and within ~2× HTML.

## Open questions

1. **Trace format for the random-app generator:** does the existing eval harness produce a transcript rich enough to extract before/after source pairs without re-running the build? If not, what is the smallest harness change?
2. **Confidence threshold T:** start strict (≥ 0.85 Jaro-Winkler-equivalent) and loosen with telemetry, or start loose and tighten? Starting strict preserves "silent is OK", so this proposal recommends strict.
3. **Should Tier 2 also rewrite the surface form?** Today suggestions are text-only ("try: `Button(label, onClick: x)`"). A future variant could emit a unified-diff hunk, but that is out of scope here.
4. **Roslyn workspaces vs. compilations:** workspaces give us project-graph reasoning but cost more to load. Start with `Compilation` only; revisit if rules need cross-project context.

## Pointers

- Existing `mur check` implementation: `src/Reactor.Cli/Check/CheckCommand.cs`
- Existing analyzers (12 REACTOR_* IDs): `src/Reactor.Analyzers/{HookRulesAnalyzer,UseMemoCellsAnalyzer,UseThemeRefAnalyzer,RequestedThemeSetAnalyzer,UseLightweightStylingAnalyzer,AccessibilityAnalyzers,MissingWithKeyAnalyzer}.cs`
- API surface index generator: `tools/Reactor.SignaturesGen/Program.cs` → writes `skills/reactor.api.txt` + `plugins/reactor/skills/reactor-dsl/references/reactor.api.txt`
- Build/check skill (cheat table for known IDs): `plugins/reactor/skills/reactor-build-and-check/SKILL.md`
- Sample apps (source for the synthetic-mutation validation set): `samples/apps/` (13 projects)
- Roslyn version available today: `Microsoft.CodeAnalysis.CSharp` 4.8.0 in `src/Reactor.Cli/Reactor.Cli.csproj`

---

Extends and depends on **#226 §5**. Independent of, but complementary to, #226 §1, §2.

