Merged
60 changes: 60 additions & 0 deletions CHANGELOG.md
@@ -29,6 +29,66 @@ to land under these conventions; subsequent specs follow this shape.

### Added

- `mur check --trace <path>` — append one JSONL row per parsed diagnostic
to `<path>` (in addition to stdout) for offline mining. Schema:
`{ts, code, severity, file, line, col, msg, receiver_type?, member?, mode}`.
Source code text is never written; absolute paths outside the project
root are redacted to `<external>`. (spec 038 §0.3)
- Tier-2 Roslyn semantic suggester for `mur check`. Covers CS1061, CS0103,
CS0117, CS1503, CS7036 against `Microsoft.UI.Reactor.*` symbols; emits
`→ try: <text> // [<evidence>]` on the diagnostic line above the per-code
confidence threshold (default 0.75). Tier-1 analyzer-ID hints still win
ties. (spec 038 §5, §1.1–§1.6)
- Per-code emit thresholds for the Tier-2 SymbolSuggester
(`src/Reactor.Cli/Check/Suggesters/Thresholds.cs`) calibrated against the
spec-037 50-run corpus. CS1061 raised to 0.80 (the structural-rewrite
fixes in the corpus would otherwise risk false positives); CS0103 / CS0117
/ CS1503 / CS7036 held at 0.75 default. Tuning harness lives in
`tests/Reactor.Tests/CheckCommandTests/Tuning/`; first run snapshot at
`docs/specs/tasks/038-tuning-reports/2026-05-10-50run.md`. (spec 038 §1.8,
Data Checkpoint B)
- EC1 5×N eval (2026-05-10): `reactor-kanban-mur-check` beats baseline on
cost mean (−24%), cost median (−33%), and wall-time variance (CV 24% vs
81%); paired analysis wins 4 of 5 rounds. `reactor-calc-mur-check`
regresses (+21% cost) because the suggester's per-invocation overhead
(~5–8s) does not amortize on ~150 LoC projects with no API exploration
surface to skip. Finding captured as a new spec 038 §11 risk + §14 open
question on a project-size / diagnostic-count gate; merge to `main`
pending product decision on path. No code change in this entry — eval
result + spec doc updates only.
- `MUR_TELEMETRY=1` opt-in: appends `(code, suggester, confidence,
evidence_short)` per emitted suggestion to
`~/.mur/telemetry/<yyyy-mm-dd>.jsonl`. Local-first, scoped to the active
project; no source code, file paths, or machine identifiers logged.
(spec 038 §10, §1.7)
- `mur check --suggest-threshold <N>` — gate Tier-2 suggestions by
per-invocation unique CS-prefixed diagnostic count. Default 3, set 0 to
always emit. Resolution of the EC1 calc-vs-kanban split: small builds
(1–2 errors) skip the ~5–8 s Tier-2 setup the agent doesn't need;
larger structural failures still get suggestions. Counts the same dedup
key `EmitDiagnostics` uses. (spec 038 §11 risk row, §14 #8)
- Data Checkpoint C (spec 038 / spec 037): 525-pair mining corpus mirrored
into `docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/`
(1,027 fixes / 1,233 ranker rows / 104 clusters from `gpt-5.5`). Analysis
in `2026-05-11-525run.md`. Cross-agent reproducibility bar still open —
a second-agent drop is required before Phase-3 rule PRs. Top Phase-3
targets surfaced: CS0117/Theme `*Background → SolidBackground`,
CS1061/`*Element` WinUI-name → Reactor-shortcut family, CS1955/GridSize
missing-parens-on-factory. Tier-2 per-code thresholds held at current
values; gate threshold (3) empirically defensible at 28.7% emit rate.
No code change in this entry — calibration + docs only. (spec 038 §1.8,
Data Checkpoint C)
- EC1 re-run with the diagnostic-count gate (2026-05-11): both arms PASS.
`reactor-calc-mur-check` cost −4% mean (was +21% in the prior batch);
`reactor-kanban-mur-check` cost −33% mean / −39% median (was −24% mean
— preserved and grew). First-build OK 5/5 both variant arms. Phase 1
acceptance bar met; Phase 1 cleared to merge to `main`. Watch-item
carried into Phase 2: kanban CV widened (24% prior → 54%) because one
of five runs hit 0 firings and took the long-tail base path — gate
behavior is path-dependent on the agent's exploration order. Below
the resolution threshold for a Phase-1 blocker; Phase 2 telemetry
should track per-run firing counts. (spec 038 §1.8 EC1 acceptance,
§11 risk row, §14 #8)
- `WindowSpec`, `ReactorWindow`, `WindowKey`, `WindowStartPosition`,
`PresenterKind`, `WindowState`, `WindowIcon`, `WindowDipSizeChangedEventArgs`,
`WindowClosingEventArgs`, `ReactorAppContext` — first-class Window primitive
44 changes: 44 additions & 0 deletions docs/specs/038-mur-check-did-you-mean-design.md
@@ -216,6 +216,47 @@ If the agent reads all of these every turn, two pathologies follow: (a) it spend

The agent-eval prompt for Reactor (#226 §5) directs the agent to use `mur check` (iteration) during the build/fix loop and `mur check --final` once iteration mode is clean. The transition is the explicit "I am done iterating" signal.

### CLI shape and MSBuild passthrough

```
mur check [<path>] [mur-flags...] [-- <msbuild args>...]
```

Today's `CheckCommand` accepts only `<path>` and hardcodes `--nologo`, `-v:m`, and `-p:Platform={host arch}` against `dotnet build`. That's the right default but a fragile contract — agents and humans both routinely need to override `Platform`, pick a `Configuration`, skip restore, change verbosity, or pass arbitrary `-p:` properties. Without an escape hatch the only fallback is "drop `mur check` and run `dotnet build` directly," which discards every benefit this spec adds.

The fix is the standard double-dash convention: everything after a literal `--` is forwarded verbatim to `dotnet build`. `mur`-owned flags appear before `--`; MSBuild-owned flags appear after.

```
# default: ranker on, iteration mode, host-arch platform
mur check

# build x64 even on an ARM64 host
mur check -- -p:Platform=x64

# release config + skip restore + final-pass mode
mur check --final -- -c Release --no-restore

# escalate verbosity for debugging the wrapper
mur check -- -v:n

# explicit TFM
mur check -- -f net10.0-windows10.0.22621.0

# multiple properties
mur check -- -p:Platform=x64 -p:DefineConstants=FOO

# all of the above with a non-default path
mur check ./MyApp -- -c Release -p:Platform=x64
```

**Default-merging rules.** `mur` always passes `--nologo` (output stability is non-negotiable for diagnostic parsing). For `-v:` and `-p:Platform=`, `mur` injects its defaults *only if* the user did not supply the same flag in the passthrough section. Detection is by flag name, not value — `-p:Platform=x64` in passthrough wins over the auto-injected host arch; `-v:n` in passthrough wins over the wrapper's `-v:m`.
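
The merge rule above can be sketched as follows. This is an illustrative Python sketch, not the `CheckCommand` implementation; `merge_defaults` and its parameters are hypothetical names, and it recognizes only the `-v:` and `-p:Platform=` spellings named in this section (other spellings that `dotnet build` accepts are out of scope here).

```python
from typing import List


def merge_defaults(passthrough: List[str], host_arch: str) -> List[str]:
    """Build the effective `dotnet build` argument list (hypothetical helper).

    Detection is by flag name only, never by value, as described above.
    """
    # Case-insensitive prefix match on the flag name; the value is ignored.
    has_verbosity = any(t.lower().startswith("-v:") for t in passthrough)
    has_platform = any(t.lower().startswith("-p:platform=") for t in passthrough)

    merged = ["--nologo"]  # always passed, regardless of passthrough
    if not has_verbosity:
        merged.append("-v:m")  # wrapper default verbosity
    if not has_platform:
        merged.append(f"-p:Platform={host_arch}")  # auto-injected host arch

    # User-supplied tokens follow the defaults; skip a duplicate --nologo.
    merged.extend(t for t in passthrough if t != "--nologo")
    return merged
```

With this sketch, `merge_defaults(["-p:Platform=x64"], "arm64")` keeps the user's `x64` and injects only `--nologo` and `-v:m`.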

**Boundary semantics.** A bare `--` is the unambiguous separator. Tokens before it are parsed against `mur`'s own grammar; tokens after are not. Emit a clear error if the user passes an unknown `mur-flag` before `--` rather than silently forwarding (helps catch typos like `mur check --quie -- -c Release`).
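
A minimal sketch of the boundary rule, assuming the `mur` flag grammar consists of the flags named in this spec (`--final`, `--strict`, `--quiet`, `--trace`, `--emit-threshold`, `--suggest-threshold`). The real grammar may differ, and `split_args` / `validate_mur_args` are hypothetical names written in Python for illustration only.

```python
from typing import List, Tuple

# Assumption: these sets mirror the flags mentioned in this spec, nothing more.
BOOLEAN_FLAGS = {"--final", "--strict", "--quiet"}
VALUE_FLAGS = {"--trace", "--emit-threshold", "--suggest-threshold"}


def split_args(args: List[str]) -> Tuple[List[str], List[str]]:
    """Split on the first bare `--`; later `--` tokens belong to MSBuild."""
    if "--" in args:
        i = args.index("--")  # index() returns the first occurrence
        return args[:i], args[i + 1:]
    return list(args), []


def validate_mur_args(mur_args: List[str]) -> None:
    """Raise ValueError on an unknown flag instead of forwarding it."""
    i = 0
    while i < len(mur_args):
        token = mur_args[i]
        if token in BOOLEAN_FLAGS:
            i += 1
        elif token in VALUE_FLAGS:
            if i + 1 >= len(mur_args):
                raise ValueError(f"{token} requires a value")
            i += 2  # consume the flag and its value
        elif token.startswith("-"):
            raise ValueError(f"unknown mur flag: {token}")
        else:
            i += 1  # positional <path>
```

Under these assumptions, the typo case `mur check --quie -- -c Release` splits into `["--quie"]` and `["-c", "Release"]`, and validating the left half raises an error rather than forwarding `--quie` to `dotnet build`.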

**Ranker is unchanged.** Passthrough alters how `dotnet build` runs, not how its diagnostics are scored. The ranker (§8) and the suggesters (Tiers 1–4) operate on the parsed diagnostic stream regardless of which build invocation produced it. The one exception: `--strict` (under which the ranker promotes warnings to errors) composes with `-p:TreatWarningsAsErrors=true` from passthrough. Both apply, and the more aggressive of the two wins on each diagnostic.

**Tracing.** When `mur check --trace` is on, the trace records the *full* effective command line passed to `dotnet build`, including default-merged flags, so replays are bit-faithful.

### Ranking policy

Each diagnostic gets a score from 0.0 (suppress) to 1.0 (always emit), computed as:
@@ -317,6 +358,7 @@ Telemetry is local-first, opt-in, scoped to the active project. No source code,
| The corpus encodes a model's idiosyncrasies | Spec 037 supports multi-agent rotation. Tier 3 rules ship only when a cluster reproduces across ≥ 2 agents. |
| Ranker hides a load-bearing warning the agent needed to see | Telemetry on suppress→error transitions; auto-promote codes whose suppression precedes a related error > N times. `mur check --final` is mandatory before "done" — captured in eval prompt and CI. |
| Over-suppression makes the build/fix loop *worse* by hiding novel diagnostics | Default base score for unknown codes is 0.5 (above iteration threshold) — better to over-emit a novel code than to silence a real bug. Threshold tunable per agent via telemetry. |
| `mur check` per-invocation overhead (~5–8s) regresses token cost on small projects with little API surface to explore (validated empirically at EC1: ~150 LoC calc app regressed +21% cost mean even with all prompt loopholes closed; kanban won −24% in the same batch). The suggestion mechanism's amortization floor is project-size-dependent. | **Mitigated 2026-05-10:** diagnostic-count gate in `CheckCommand.ShouldEmitSuggestions`. By default, Tier-2 is skipped when an invocation surfaces < 3 unique CS-prefixed diagnostics; `--suggest-threshold <N>` overrides (0 disables the gate, so Tier-2 always emits). Default N validated at Data Checkpoint C (28.7% emit rate matches calc-vs-kanban shape). **EC1 re-run with gate confirms the mitigation:** calc cost −4% (was +21%), kanban cost −33% (was −24% — preserved and grew); both pass the "tokens not regressed" bar. Residual: kanban CV widened (54% vs prior 24%) — one outlier run hit 0 firings; gate is path-dependent on agent's exploration order. Worth watching in Phase 2. See §14 #8 + spec 038 task doc "EC1 re-run" section. |

## §12 Predicted impact

@@ -368,6 +410,7 @@ Phased so each phase is independently shippable.

- Land §8's hand-authored `base_policy(code)` table covering the top ~30 highest-frequency diagnostic codes from Phase 0's sweep.
- Add `--strict`, `--final`, `--quiet`, `--emit-threshold` flags to `mur check`.
- **Land `--` MSBuild passthrough** per §8. Implementation: split `args` on the first bare `--`, validate the left half against `mur`'s flag grammar (error on unknowns), then default-merge the right half with `--nologo` / `-v:m` / `-p:Platform={host arch}` only where the user didn't already specify the same flag. Round-trip the effective command line into `--trace` output so replays are reproducible.
- Update the eval prompt and the `reactor-build-and-check` skill to direct agents to use iteration mode in the inner loop and `--final` once iteration is clean.
- Add the suppress→error CI guardrail: every `mur check --final` run on a successful build must surface no diagnostic that, by code alone, the table would have flagged in iteration mode but didn't.

@@ -390,6 +433,7 @@ Phased so each phase is independently shippable.
5. **Per-agent rule profiles?** If different LLM agents make different mistakes, Tier 3 rule sets could be agent-keyed. Likely premature; cross-agent rules are simpler. Reconsider once telemetry shows agent-specific deltas.
6. **Iteration-mode emit threshold (§8).** Default proposed at 0.6. Tuning lever: too high silences load-bearing warnings; too low restores the noise. Land conservative (0.5–0.55) and tighten as the policy table covers more codes? Or go aggressive (0.7) and accept that novel codes get surfaced via the unknown-code default? Settle empirically once Phase 0 produces the diagnostic-frequency distribution.
7. **Should the ranker score Tier 1 hints as well?** Today every `REACTOR_*` analyzer warning carries a static skill pointer and is implicitly emit-worthy. The ranker could in principle suppress low-priority `REACTOR_*` Info diagnostics in iteration mode (e.g. `REACTOR_HOOKS_006`, the non-idempotent fetcher heuristic). Recommendation: yes, treat Tier 1 emissions as just another diagnostic for ranking purposes; the policy table is the single source of truth for emit/suppress decisions.
8. **Project-size gate for the suggestion mechanism.** EC1 5×N (see spec 038 task doc) showed `mur check`'s per-invocation overhead regresses small projects (calc, ~150 LoC, +21% cost) even when the prompt closes every explore-around-the-suggestion side door, while medium projects (kanban) win cleanly (−24% cost, −33% median, 3.4× lower variance). The amortization floor is real and structural. Open design question: gate `SymbolSuggester` activation by (a) source-file count / total LoC, (b) per-invocation CS-diagnostic count, (c) `--mode iteration|small` flag the agent sets explicitly, or (d) leave gating to the agent prompt ("don't bother with `mur check` on apps under 200 LoC")? Approach (b) feels strongest — it self-tunes (a project with one error doesn't pay for fuzzy-match overhead) and composes with Phase 2's iteration/`--final` distinction. **Resolved 2026-05-10 with approach (b); validated 2026-05-11.** Implemented in `CheckCommand.ShouldEmitSuggestions` with default threshold N=3 unique CS-prefixed diagnostics per invocation; overridable via `--suggest-threshold <N>` (0 disables the gate, so suggestions always emit). Default cross-validated against the 525-run corpus (Data Checkpoint C, distribution: 71.3% of builds < 3, 28.7% ≥ 3). EC1 re-run with the gate (2026-05-11): calc cost −4% (was +21%), kanban cost −33% (was −24%); both pass. New residual question: kanban CV widened on the re-run because one of five runs hit 0 firings and tracked the long-tail base path — gate behavior is path-dependent on the agent's exploration order, not just the project's static shape. Telemetry at scale (post-Phase-4) will tell us whether this stays a 1-in-5 tail or grows.
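
The gate resolved in item 8 can be sketched as below. This is an illustrative Python sketch rather than the body of `CheckCommand.ShouldEmitSuggestions`. The `Diagnostic` shape and the dedup key (code, file, line, col) are assumptions, since this spec does not spell out the key `EmitDiagnostics` uses.

```python
from typing import Iterable, NamedTuple


class Diagnostic(NamedTuple):
    # Hypothetical shape; the whole tuple serves as the assumed dedup key.
    code: str
    file: str
    line: int
    col: int


def should_emit_suggestions(diagnostics: Iterable[Diagnostic], threshold: int = 3) -> bool:
    """Return True when Tier-2 suggestions should run for this invocation."""
    if threshold <= 0:
        return True  # 0 disables the gate: always emit
    # Count unique CS-prefixed diagnostics; duplicates collapse in the set.
    unique_cs = {d for d in diagnostics if d.code.startswith("CS")}
    return len(unique_cs) >= threshold
```

With the default of 3, a build that surfaces one or two distinct CS errors skips Tier-2 setup, while analyzer-only output (for example `REACTOR_*` codes) never counts toward the gate in this sketch.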

## §15 Pointers
