Spec 038 Phase 1: `mur check` did-you-mean engine (Tier 2 + diagnostic-count gate) by codemonkeychris · Pull Request #243 · microsoft/microsoft-ui-reactor

codemonkeychris · 2026-05-11T12:29:15Z

Lands Phase 0 (instrumentation) + Phase 1 (Tier-2 Roslyn semantic suggester) of spec 038. Closes Phase 1 of the task list at docs/specs/tasks/038-mur-check-did-you-mean-implementation.md.

Summary

Phase 0: instrumentation. mur check --trace <path> writes one JSONL row per parsed diagnostic to <path> in addition to stdout; the schema mirrors spec §0.3. Source code text never leaves the user's machine, and absolute paths outside the project root are redacted to <external>.
Phase 1: Tier-2 Roslyn semantic suggester. SymbolSuggester covers CS1061 / CS0103 / CS0117 / CS1503 / CS7036 against Microsoft.UI.Reactor.* types and appends → try: <text> // [<evidence>] to the diagnostic line when the suggestion's confidence clears the per-code threshold. Tier-1 REACTOR_* analyzer-ID hints still win ties (spec §9).
Per-code thresholds in Thresholds.cs are calibrated against the 50-run corpus (Data Checkpoint B) and re-validated against the 525-run corpus (Data Checkpoint C). All five values are intentionally conservative; the full rationale and history are in the file's header comment.
Diagnostic-count gate (spec §11 risk row, §14 #8): CheckCommand.ShouldEmitSuggestions skips Tier-2 when an invocation surfaces fewer than --suggest-threshold unique CS-prefixed diagnostics. Default 3, set 0 to disable. Resolves the EC1 calc-vs-kanban split.
MUR_TELEMETRY=1 opt-in. Local-first JSONL append to ~/.mur/telemetry/<yyyy-mm-dd>.jsonl with code/suggester/confidence/evidence_short. No source text, no file paths, no machine identifiers.
Tuning harness (tests/Reactor.Tests/CheckCommandTests/Tuning/) drives the suggester against a real corpus and writes per-code (precision, recall) curves. To reproduce, set the MUR_TUNING_CORPUS=<path> environment variable before running the tests.
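
The diagnostic-count gate in the summary can be pictured with a small sketch. It is written in Python for brevity (the shipped implementation is the C# method CheckCommand.ShouldEmitSuggestions), and the tuple used as the dedup key is an assumption: the PR only says the gate counts the same key EmitDiagnostics uses.

```python
DEFAULT_SUGGEST_THRESHOLD = 3  # default from the PR description; 0 disables the gate


def should_emit_suggestions(diagnostics, threshold=DEFAULT_SUGGEST_THRESHOLD):
    """Return True when Tier-2 suggestions should run for this invocation.

    diagnostics: iterable of (code, file, line, col) tuples. The tuple shape is
    a hypothetical stand-in for the real dedup key.
    """
    if threshold <= 0:
        return True  # gate disabled
    unique_cs = {d for d in diagnostics if d[0].startswith("CS")}
    return len(unique_cs) >= threshold


if __name__ == "__main__":
    small_build = [("CS1061", "Calc.cs", 10, 5), ("CS1061", "Calc.cs", 10, 5)]
    print(should_emit_suggestions(small_build))  # duplicates collapse to one diagnostic
```

Under these assumptions a build with one or two unique CS diagnostics skips Tier-2 entirely, which is the small-project case the gate was added for.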

Eval Checkpoint 1 — both arms pass

5×N batch on gpt-5.5, using the same round-3 prompt as #226's Phase-7 sweep.

Arm	Cost mean (Δ vs base)	Cost median (Δ)	First-build OK
`reactor-calc-mur-check`	−4%	parity	5/5
`reactor-kanban-mur-check`	−33%	−39%	5/5

Pre-gate EC1 (2026-05-10): calc +21% (FAIL), kanban −24% (PASS). Post-gate EC1 (2026-05-11): calc neutralized, kanban win preserved and grew. Tier-2 firing rate: 1/5 (20%) on calc, 4/5 (80%) on kanban — matches the corpus's 28.7% emit rate prediction. Full per-arm tables under docs/specs/tasks/038-mur-check-did-you-mean-implementation.md → "EC1 re-run (with gate)" subsection.

Watch-item carried into Phase 2: kanban CV widened (24% prior → 54% this batch). One of five runs hit 0 firings and tracked the long-tail base path. Gate behavior is path-dependent on the agent's exploration order, not just the project's static shape. Below Phase-1 blocker threshold; Phase 2 telemetry should track per-run firing counts.

What's in the tree

src/Reactor.Cli/Check/ — CheckCommand, CheckArgs, CompilationLoader, FactoryIndex, SuggesterOrchestrator, TraceWriter, Telemetry, plus Suggesters/ (ISuggester, SymbolSuggester, StringSimilarity, Thresholds).
tests/Reactor.Tests/CheckCommandTests/ — 95 tests across suggester contract, orchestrator, factory index, trace writer, telemetry, gate, args parser, and a tuning sub-harness.
tests/Reactor.IntegrationTests/MurCheck/ — MurCheckSmokeTest against a minimal fixture.
docs/specs/tasks/038-tuning-reports/ — calibration reports + raw tuner JSON for both data checkpoints, plus the 525-run mining corpus mirrored in-tree (≈ 8 MB across four JSONL/JSON files) so future analyses are reproducible against the exact bytes even if the upstream reactor-tokenusage repo rotates.

Phase 3 priorities surfaced

The 525-run corpus reveals where Tier-2 fuzzy match is empirically wrong on Reactor types and where Phase 3 rules should pick up. Top three (full list in the tuning report):

CS0117 / Theme: *Background → SolidBackground lookup (C0019, 16 events, 1.6%).
CS1061 / *Element: WinUI-name → Reactor-shortcut family (VerticalAlignment → VAlign, Style → fluent helpers).
CS1955 / GridSize: missing-parens-on-factory (C0004, 110 events, 10.7%; largest single bucket in the corpus).

Phase 3 also needs a second-agent corpus drop to clear the Validation Gate's cross-agent reproducibility bar (#2).

Test plan

dotnet test tests/Reactor.Tests/Reactor.Tests.csproj -p:Platform=x64 --filter FullyQualifiedName~CheckCommandTests — 95/95 green post-rebase.
mur check --help shows --trace <path> and --suggest-threshold <N>.
Phase-0 trace output passes the 2 KB-per-row + project-root-only path tests.
EC1 + EC1 re-run both meet the spec's "tokens not regressed" bar.
CI run on this PR.

Deferred follow-ups (not blocking merge; cleanly scoped)

Reactor-touching integration fixture for the CS1061 Button.OnClick canonical example (needs WindowsAppSDK restore on every test run).
Wall-time perf trait test against the WinUI fixture.
Full Hamming-vector overload ranking in CS7036 (today: parameter-count distance only).
Return-type assignability filter in CS0103.
still_present_at_run_end harness fingerprint bug — Phase-4 prerequisite, not a Phase-1 blocker.

🤖 Generated with Claude Code

Today's CheckCommand accepts only <path> and hardcodes --nologo, -v:m, and -p:Platform={host arch}. Without an escape hatch the only fallback for an agent that needs to override Platform, pick a Configuration, skip restore, or pass arbitrary -p: properties is to drop mur check and run dotnet build directly - which discards every benefit the spec adds. Adds a "CLI shape and MSBuild passthrough" subsection in §8 covering: the `mur check [path] [mur-flags] [-- <msbuild args>]` shape, default-merging rules (auto-inject only if user did not specify), boundary semantics (bare `--` is the unambiguous separator; unknown mur-flags error), ranker-unchanged invariant (passthrough alters the build, not diagnostic scoring), and trace faithfulness (record effective command line). Adds an implementation bullet to Phase 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
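
A minimal sketch of the boundary and default-merging rules this commit specifies, written in Python for illustration only. The passthrough itself lands in a later phase, so nothing here reflects shipped code; the prefix lists used to detect a user-supplied flag are assumptions, while the three defaults (`--nologo`, `-v:m`, `-p:Platform={host arch}`) come from the commit message.

```python
# (prefixes that mean "user already specified this", default to inject otherwise)
DEFAULTS = [
    (("--nologo", "-nologo"), "--nologo"),
    (("-v:", "--verbosity"), "-v:m"),
    (("-p:platform=",), "-p:Platform={arch}"),
]


def split_args(argv):
    """Split on the first bare '--': left side belongs to mur, right side to MSBuild."""
    if "--" in argv:
        i = argv.index("--")
        return list(argv[:i]), list(argv[i + 1:])
    return list(argv), []


def merge_defaults(msbuild_args, host_arch="x64"):
    """Auto-inject each default only if the user did not specify it."""
    lowered = [a.lower() for a in msbuild_args]
    injected = []
    for prefixes, default in DEFAULTS:
        if not any(a.startswith(prefixes) for a in lowered):
            injected.append(default.format(arch=host_arch))
    return injected + list(msbuild_args)


if __name__ == "__main__":
    mur, passthrough = split_args(
        ["src/App", "--trace", "t.jsonl", "--", "-p:Platform=arm64", "-c", "Release"]
    )
    print(mur)
    print(merge_defaults(passthrough))
```

The merged list is what the trace would record as the effective command line under the "trace faithfulness" rule.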

Mirrors the convention from docs/specs/tasks/036-window-design-implementation.md: phased checkboxes, "task is done only when" gate, conventions header. Adapted for spec 038's external dependency on spec 037's data corpus.
Adds three concepts not in earlier task lists because no other spec needed them:
- Human Validation Gate: six-bar checklist every Tier 3 rule must clear before merge (frequency >=5%, count >=10, cross-agent reproducibility, >=3 positive fixtures, >=2 negative, independent reviewer signoff). Plus auto-suppression policy for rules whose telemetry accept-rate drops below 50%.
- Data Checkpoints A/B/C/D: staged hand-offs from spec 037's harness with explicit blocking relationships. A is the current 3-pair smoke (already landed); B/C/D gate Phase 1 threshold tuning, Phase 3 rule authoring, and Phase 4 ranker training respectively.
- Eval Checkpoints EC1-EC4: 5xN batches on gpt-5.5 vs reactor-calc/kanban with predicted lift bands and pass criteria. EC1 after Tier 2 only; EC2 after deterministic ranker + 5 rules; EC3 at V1 ship (10-15 rules); EC4 if learned ranker is pursued.
Quantity bars (per the user's question): 5 rules before EC2, 10-15 rules covering >=80% of fix events for V1 ship.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gine Phase 0 (instrumentation, no behavior change): - `mur check --trace <path>` writes one JSONL row per parsed diagnostic to <path> in addition to stdout. Schema: {ts, code, severity, file, line, col, msg, receiver_type?, member?, mode}. Source code text is never written; absolute paths outside the project root are redacted to "<external>"; rows are bounded to 2 KB. - New folders src/Reactor.Cli/Check/{Suggesters,Rules}/ with README pointers to spec 038 §5/§6 and the Validation Gate. Phase 1 (Tier-2 Roslyn semantic suggester): - ISuggester contract + SymbolSuggester covering CS1061, CS0103, CS0117, CS1503, CS7036 against Microsoft.UI.Reactor.* symbols. - CompilationLoader: csproj/source resolution, project.assets.json reference resolution, (csproj-path, file-mtime-hash) cache, symlink containment, perf budget cold ≤ 500 ms / warm ≤ 50 ms. - FactoryIndex: pre-filter over Microsoft.UI.Reactor.Factories.* static methods with cached parameter-name arrays for named-argument suggestions. - StringSimilarity (Jaro–Winkler) for fuzzy member-name matching. - SuggesterOrchestrator wires the suggester into CheckCommand.Run; Tier-1 HintFor still wins ties at the format layer (spec §9). - MUR_TELEMETRY=1 opt-in: appends (code, suggester, confidence, evidence_short) to ~/.mur/telemetry/<yyyy-mm-dd>.jsonl. Fields bounded to 256 bytes; no source text, file paths, or machine identifiers. - Args parser recognises `--trace` and `--help`; rejects unknown flags rather than silently forwarding (full passthrough lands in Phase 2). Tests: - 66 unit tests under tests/Reactor.Tests/CheckCommandTests/ covering args parsing, trace schema/redaction/length cap, compilation loader, factory index, JaroWinkler, every suggester code path (positive + negative), orchestrator filtering (Reactor-touching gate), Tier-1-wins-ties, and telemetry opt-in/byte-bound. 
- Integration smoke test exercising mur.exe end-to-end against a deliberately broken fixture under tests/Reactor.IntegrationTests/MurCheck/Fixtures/SmokeFixture/. Reactor-touching CS1061 fixture is scoped as a follow-up because it requires WindowsAppSDK restore.
Pause point per spec 038 task list:
- Phase 1.8 exit needs Data Checkpoint B (≥50 unique pairs from spec 037's harness) for per-code threshold tuning before merging to main.
- Phase 2 + 3 are blocked by Phase 1 merge / Data Checkpoint C respectively.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
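
As an illustration of the Phase 0 trace contract in this commit (field names from the schema, `<external>` redaction, 2 KB row bound), here is a Python sketch. It is not the C# TraceWriter: trimming `msg` to enforce the bound, the `mode` value, and rendering in-root paths as project-relative forward-slash paths (a behavior added later in this PR's review round) are all assumptions about how the pieces fit together.

```python
import json
import os
from datetime import datetime, timezone

MAX_ROW_BYTES = 2048  # "rows are bounded to 2 KB"


def sanitize_path(path, project_root):
    """Project-relative forward-slash path, or "<external>" if outside the root."""
    root = os.path.abspath(project_root)
    full = os.path.abspath(os.path.join(root, path))
    try:
        inside = os.path.commonpath([root, full]) == root
    except ValueError:  # e.g. paths on different Windows drives
        inside = False
    if not inside:
        return "<external>"
    return os.path.relpath(full, root).replace(os.sep, "/")


def build_row(code, severity, file, line, col, msg, project_root, mode):
    """Serialize one diagnostic as a JSONL row no larger than MAX_ROW_BYTES."""
    row = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "code": code, "severity": severity,
        "file": sanitize_path(file, project_root),
        "line": line, "col": col, "msg": msg, "mode": mode,
    }
    encoded = json.dumps(row, ensure_ascii=False)
    # Assumed policy: trim the message until the encoded row fits.
    while len(encoded.encode("utf-8")) > MAX_ROW_BYTES and row["msg"]:
        over = len(encoded.encode("utf-8")) - MAX_ROW_BYTES
        row["msg"] = row["msg"][: max(0, len(row["msg"]) - over)]
        encoded = json.dumps(row, ensure_ascii=False)
    return encoded


if __name__ == "__main__":
    # "check" is a hypothetical mode value used only for this demo.
    print(build_row("CS1061", "error", "/proj/src/A.cs", 10, 5, "x" * 50, "/proj", "check"))
```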

… blocker - Status snapshot at the top of the task list — Phase 0/1 code-complete on feat/038-mur-check, not merged to main, blocked on threshold tuning. - Phase 0 (0.1–0.5) and Phase 1 (1.1–1.7) checkboxes ticked, with one-line notes on the four items deferred as follow-ups (Reactor-touching integration fixture, perf-trait wall-time test, full Hamming-vector overload ranking, CS0103 return-type filter). - Data Checkpoint A re-audit (2026-05-10): Gap #1 (receiver_type), #2 (dedup), #4 (cosmetic) all FIXED in the new harness output. Gap #3 (ranker negative class) still NOT fixed — all 3 ranker-labels rows are positive class; for 3 runs the spec expects ~30–80 rows. - Plus a fix_kind classifier nit: ButtonElement HorizontalAlignment → HAlign should be renamed_member, not other (receiver type unchanged, member name swapped). - Recommendation against kicking off the 50-run sweep until Gap #3 lands; records that Data Checkpoint B is therefore still blocked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…un_end as Phase-4 prereq Audit pass 2 of the 6-row ranker-labels output (2026-05-10): - Gap #3 (ranker negative class) is now FIXED. 4 positive / 2 negative on addressed_by_next_fix; harness emits per-build per-diagnostic rows; three CS8012 emissions in run 5d5fef… recorded as three independent training rows, exactly per spec 037 §3 "don't dedupe across builds". - fix_kind classifier nit partially fixed — both pairs now classify as renamed_member; debatable for pair 2's structural rewrite but acceptable. - New known limitation: still_present_at_run_end is uniformly false even when the diagnostic IS in the final build (CS8012 timing-tail fingerprint quirk). Primary ranker label addressed_by_next_fix is unaffected; auxiliary agent_ignored is corrupted, which breaks the spec 038 §11 auto-suppression-telemetry hook. Tracked as a Phase-4 prerequisite — file with harness owner before Data Checkpoint D. Status updates: - Status snapshot at top: 50-run sweep cleared to start. - Data Checkpoint B: status flipped from blocked to unblocked. Added a step-by-step "pickup procedure for the next session" so the next agent can run cold once the corpus lands (audit, threshold tuning, EC1, merge). - Data Checkpoint D: documents still_present_at_run_end as the remaining prerequisite before training the learned ranker. - Phase 1.8 exit criterion points at the pickup procedure instead of the stale "blocked" note. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… corpus Phase 1 ship-gate calibration (Data Checkpoint B). Adds per-code emit thresholds in src/Reactor.Cli/Check/Suggesters/Thresholds.cs, refactors SymbolSuggester so the threshold gate has a single source of truth (SuggestRaw exposes raw confidence; Suggest gates via Thresholds.For), and removes the now-redundant duplicate cut from SuggesterOrchestrator. Per-code values informed by tests/Reactor.Tests/CheckCommandTests/Tuning/ which runs the suggester against fixes.jsonl. Snapshot of the first run is in docs/specs/tasks/038-tuning-reports/2026-05-10-50run.md: - CS1061 → 0.80 (only firing in corpus was at conf 0.43; raised threshold blocks structural-rewrite false positives) - CS0103 → 0.75 default (2/2 firings at conf 1.00 matched) - CS0117 / CS1503 / CS7036 → 0.75 default (insufficient signal in 50-pair corpus; revisit at Data Checkpoint C, 500+ pairs) Tuning harness is gated on env var MUR_TUNING_CORPUS so it skips cleanly when the (sibling-repo) corpus isn't available. Phase-1 next gate: Eval Checkpoint EC1 vs. main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
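
The single-source-of-truth gating this commit describes (raw confidence from SuggestRaw, cut applied once via Thresholds.For) might look like the following Python sketch. The threshold values are the ones listed in the commit; the `>=` comparison and the `(text, confidence)` tuple shape are assumptions.

```python
DEFAULT_THRESHOLD = 0.75

# Per-code values from the 50-run calibration described in the commit message.
PER_CODE = {
    "CS1061": 0.80,
    "CS0103": 0.75,
    "CS0117": 0.75,
    "CS1503": 0.75,
    "CS7036": 0.75,
}


def threshold_for(code):
    """Single entry point for the emit threshold of a diagnostic code."""
    return PER_CODE.get(code, DEFAULT_THRESHOLD)


def suggest(code, raw):
    """Gate a raw suggestion; raw is a (text, confidence) tuple or None."""
    if raw is None:
        return None
    text, confidence = raw
    return text if confidence >= threshold_for(code) else None


if __name__ == "__main__":
    # The lone CS1061 firing in the 50-run corpus had confidence 0.43.
    print(suggest("CS1061", ("VAlign", 0.43)))
    print(suggest("CS0103", ("counter", 1.00)))
```

Keeping the cut in one function means the orchestrator has nothing to re-check, which is why the duplicate cut could be removed.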

…inding
EC1 5×N (n=5 paired rounds on gpt-5.5, round-3 prompt with reflection ban + trust-the-suggestion + mur-check-is-the-build rules):
kanban-variant: cost mean −24% / median −33% / CV 24% vs base 81%; wins 4 of 5 paired rounds; the suggestion mechanism is itself a variance stabilizer.
calc-variant: cost mean +21% / wall mean +23%, real and consistent across the batch. Diagnosis: ~5–8s per-invocation mur-check overhead does not amortize on ~150-LoC projects with no API exploration surface to skip. Validated empirically: the prompt iterations that helped (round-2 reflection ban, round-3 trust-mur-as-build) closed every explore-around-the-suggestion loophole, but the floor remains.
Strict EC1 pass criterion ("tokens not regressed") fails on calc, passes cleanly on kanban. Captured as:
- spec 038 §11 risk row + §14 open question on a project-size / per-invocation-diagnostic-count gate
- task doc EC1 results section with per-arm means/sd, paired comparisons
- status snapshot updated; Phase 1 merge decision noted as pending
Decision options on the merge: (a) ship Phase 1 as-is and accept the calc tax; (b) land the gate before merging. Either is product-side; this commit only records the data and the design surface.
No code changes. SKILL.md intentionally left untouched: a larger mining sweep is running under the current SKILL.md to compare aided vs un-aided baseline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l-test race ThresholdsTests.Each_handled_code_has_a_threshold_in_valid_range was flaking because the tuner test (ThresholdTuningTests.Run_against_inline_one_row_corpus_produces_report) temporarily writes Thresholds.PerCode to an all-zero map for its duration, and xUnit's default cross-class parallelism let the validation tests observe the zeroed state mid-tuner-run. Switching the override channel to AsyncLocal isolates the tuner's scoped change to its own logical thread; concurrent readers in other tests see the immutable production defaults. Production code paths still go through the same Thresholds.For(code) entry point — behaviour for the CLI is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
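
The isolation property this fix relies on can be sketched with Python's `contextvars`, which plays roughly the role AsyncLocal plays in .NET. This is an analogy rather than the project's code: the table contents, function names, and the two-thread demo are assumptions made for illustration.

```python
import contextvars
import threading
from contextlib import contextmanager

PRODUCTION = {"CS1061": 0.80, "CS0103": 0.75}  # immutable production defaults
_override = contextvars.ContextVar("threshold_override", default=None)


def threshold_for(code):
    """Same entry point for production and tests; honors a scoped override."""
    override = _override.get()
    table = PRODUCTION if override is None else override
    return table.get(code, 0.75)


@contextmanager
def scoped_override(table):
    """Override thresholds for the current logical context only."""
    token = _override.set(table)
    try:
        yield
    finally:
        _override.reset(token)


def demo():
    """A 'tuner' thread zeroes the thresholds while a 'validator' reads them."""
    inside, release = threading.Event(), threading.Event()
    seen_by_tuner = []

    def tuner():
        with scoped_override({"CS1061": 0.0}):
            seen_by_tuner.append(threshold_for("CS1061"))
            inside.set()
            release.wait(timeout=5)

    worker = threading.Thread(target=tuner)
    worker.start()
    inside.wait(timeout=5)
    seen_by_validator = threshold_for("CS1061")  # read while the override is live
    release.set()
    worker.join()
    return seen_by_tuner[0], seen_by_validator


if __name__ == "__main__":
    print(demo())
```

With a shared mutable map, the validator's read could observe the zeroed value; with a context-scoped override it sees the production default.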

…ation Two related changes: 1. Diagnostic-count gate (spec §11 risk row, §14 #8) — resolves the EC1 calc-vs-kanban split where ~150-LoC calc regressed +21% cost while kanban won −24%. `CheckCommand.ShouldEmitSuggestions` skips Tier-2 when an invocation surfaces fewer than `--suggest-threshold` unique CS-prefixed diagnostics; default 3, 0 disables. Counts the same dedup key `EmitDiagnostics` uses. 2. Data Checkpoint C — 525-pair mining corpus from spec 037's harness mirrored at `docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/` (1,027 fixes / 1,233 ranker rows / 104 clusters; `gpt-5.5` only). Tuning report at `2026-05-11-525run.md`. The empirical CS-diagnostics-per-build distribution from the corpus (43% of builds have 1 diagnostic, 28% have 2, 28.7% have ≥ 3) confirms T=3 is the right initial cut-line. The 525-run corpus also surfaces that JaroWinkler fuzzy match has near-0% empirical precision on CS1061/CS0117 against Reactor types — the agent's typical mistake is reaching for a WinUI-style name (`.VerticalAlignment`, `Theme.AppBackground`) whose Reactor replacement is too far in edit-distance to find, so we pick a wrong sibling. All per-code thresholds held; systematic fix is Phase-3 rule authoring. Top three rule targets identified in the report. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
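
To make the fuzzy-match limitation above concrete, here is a textbook Jaro-Winkler implementation in Python. It is a generic version of the algorithm, not a port of the project's StringSimilarity class; case-sensitive comparison and the standard 0.1 prefix weight are assumptions.

```python
def jaro(s, t):
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a transposition.
    k = 0
    out_of_order = 0
    for i in range(len(s)):
        if s_hit[i]:
            while not t_hit[k]:
                k += 1
            if s[i] != t[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, prefix_weight=0.1):
    """Jaro similarity boosted by a shared prefix of up to four characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
    print(jaro_winkler("VerticalAlignment", "VAlign") < 0.80)
```

In this sketch the WinUI-style name `VerticalAlignment` scores below 0.80 against `VAlign`, so a similarity-only matcher would not surface the intended replacement at the CS1061 threshold.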

5×N re-run on `feat/038-mur-check` @ aaa4cce (gate enabled, default --suggest-threshold 3), same matrix as the prior EC1 batch. reactor-calc-mur-check : cost -4% (was +21%) — PASS reactor-kanban-mur-check: cost -33% (was -24%) — PASS, grew the win First-build OK 5/5 both variant arms. Calc neutralized; kanban win preserved and grew. Phase 1 acceptance bar met. Spec §11 risk row + §14 #8 updated to mark the mitigation validated. Watch-item carried into Phase 2: kanban CV widened (24% prior -> 54% this batch) because one of five runs hit 0 firings and tracked the long-tail base path. Gate is path-dependent on agent's exploration order, not just the project's static shape. Below Phase-1 blocker threshold; Phase 2 telemetry should track per-run firing counts. No code change in this commit — eval results + spec doc updates only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Implements Spec 038 Phase 0/1 for mur check: JSONL trace instrumentation, a Tier-2 Roslyn semantic “did-you-mean” suggester (with per-code thresholds), an opt-in local telemetry append, and a diagnostic-count gate to avoid Tier-2 overhead on small builds.

Changes:

Add --trace JSONL output and opt-in local telemetry for suggestion emissions.
Add Tier-2 Roslyn-based suggestions for select CS diagnostics, including orchestration + per-code confidence thresholds and a per-invocation diagnostic-count gate.
Add extensive unit-test coverage plus a smoke integration fixture for mur check.

Show a summary per file

File	Description
tests/Reactor.Tests/CheckCommandTests/Tuning/ThresholdTuningTests.cs	End-to-end + unit surfaces for the threshold tuning harness.
tests/Reactor.Tests/CheckCommandTests/Tuning/SuggesterTuner.cs	Drives `SymbolSuggester` over corpus rows to compute tuning summaries.
tests/Reactor.Tests/CheckCommandTests/Tuning/ReactorCorpusStubs.cs	Stub Reactor/WinUI surface for corpus compilation in tuning runs.
tests/Reactor.Tests/CheckCommandTests/Tuning/CorpusReaderTests.cs	Tests JSONL corpus parsing behavior (tolerant read, required fields).
tests/Reactor.Tests/CheckCommandTests/Tuning/CorpusReader.cs	JSONL corpus parser into `CorpusFix` structures.
tests/Reactor.Tests/CheckCommandTests/TraceWriterTests.cs	Validates trace schema, size cap, and path redaction behavior.
tests/Reactor.Tests/CheckCommandTests/TestCompilation.cs	In-memory Roslyn compilation builder excluding real Reactor/WinUI assemblies.
tests/Reactor.Tests/CheckCommandTests/TelemetryTests.cs	Verifies opt-in telemetry append and field-size constraints.
tests/Reactor.Tests/CheckCommandTests/Suggesters/ThresholdsTests.cs	Pins `Thresholds` contract and override behavior.
tests/Reactor.Tests/CheckCommandTests/Suggesters/SymbolSuggesterTests.cs	Unit tests for Tier-2 suggester behaviors across supported CS codes.
tests/Reactor.Tests/CheckCommandTests/Suggesters/SuggesterContractTests.cs	Contract/shape tests for suggester context/result types.
tests/Reactor.Tests/CheckCommandTests/Suggesters/StringSimilarityTests.cs	Tests for Jaro–Winkler similarity helper.
tests/Reactor.Tests/CheckCommandTests/SuggesterOrchestratorTests.cs	Orchestrator wiring tests, including Tier-1 vs Tier-2 precedence.
tests/Reactor.Tests/CheckCommandTests/FactoryIndexTests.cs	Tests indexing of `Microsoft.UI.Reactor.Factories` for suggestions.
tests/Reactor.Tests/CheckCommandTests/CompilationLoaderTests.cs	Tests compilation loading, caching, and exclusions (obj/bin).
tests/Reactor.Tests/CheckCommandTests/CheckCommandPipelineTests.cs	Pipeline tests for parsing, dedupe, trace emission, and gate behavior.
tests/Reactor.Tests/CheckCommandTests/CheckArgsTests.cs	Tests for the new `mur check` argument parsing/help text.
tests/Reactor.IntegrationTests/Reactor.IntegrationTests.csproj	Excludes broken fixture `.cs` from compilation; includes them as `None`.
tests/Reactor.IntegrationTests/MurCheck/MurCheckSmokeTest.cs	End-to-end smoke test invoking `mur check` on a broken fixture.
tests/Reactor.IntegrationTests/MurCheck/Fixtures/SmokeFixture/SmokeFixture.csproj	Minimal broken fixture project for smoke integration test.
tests/Reactor.IntegrationTests/MurCheck/Fixtures/SmokeFixture/Program.cs	Broken code to trigger CS1061 on a non-Reactor receiver.
src/Reactor.Cli/Check/TraceWriter.cs	JSONL trace writer with redaction + per-row size constraints.
src/Reactor.Cli/Check/Telemetry.cs	Opt-in local telemetry writer for suggestion emissions.
src/Reactor.Cli/Check/Suggesters/Thresholds.cs	Per-diagnostic-code confidence thresholds (async-local override for tests).
src/Reactor.Cli/Check/Suggesters/SymbolSuggester.cs	Tier-2 semantic suggester across CS1061/0103/0117/1503/7036.
src/Reactor.Cli/Check/Suggesters/StringSimilarity.cs	Jaro–Winkler similarity implementation used by Tier-2 matching.
src/Reactor.Cli/Check/Suggesters/README.md	Tier-2 suggester design/constraints documentation.
src/Reactor.Cli/Check/Suggesters/ISuggester.cs	Suggester contract + context/result types.
src/Reactor.Cli/Check/SuggesterOrchestrator.cs	Orchestrates suggesters against MSBuild diagnostics + Roslyn compilation.
src/Reactor.Cli/Check/Rules/README.md	Tier-3 rules overview and validation gate pointer.
src/Reactor.Cli/Check/FactoryIndex.cs	Factory method index for suggestions (name + overload metadata).
src/Reactor.Cli/Check/CompilationLoader.cs	Loads/caches Roslyn compilation from a project on disk and assets.json refs.
src/Reactor.Cli/Check/CheckCommand.cs	Adds args parsing/help, trace/telemetry wiring, suggestions + gate.
src/Reactor.Cli/Check/CheckArgs.cs	Parses `--trace` and `--suggest-threshold`, emits help text.
docs/specs/tasks/038-tuning-reports/2026-05-11-525run.md	Corpus analysis report for Data Checkpoint C.
docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/unresolved.jsonl	Mirrored corpus artifact for reproducibility (unresolved rows).
docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/README.md	Explains mirrored corpus files and intended uses.
docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/gate-distribution.py	Script to compute gate distribution from ranker-labels JSONL.
docs/specs/tasks/038-tuning-reports/2026-05-10-50run.md	Data Checkpoint B tuning report snapshot.
docs/specs/tasks/038-tuning-reports/2026-05-10-50run.json	Data Checkpoint B raw tuner JSON snapshot.
docs/specs/tasks/038-mur-check-did-you-mean-implementation.md	Implementation task tracker updated with Phase 0/1 status/results.
docs/specs/038-mur-check-did-you-mean-design.md	Spec updated for passthrough and gate rationale/resolution.
CHANGELOG.md	Documents new `mur check` features and calibration artifacts.

Copilot's findings

Comments suppressed due to low confidence (1)

src/Reactor.Cli/Check/CheckCommand.cs:55

CheckArgs/CheckCommand documentation says <path> can be a single .cs file, but Run() ultimately invokes dotnet build <path>. dotnet build does not accept an arbitrary .cs file as a build target, so passing a .cs path will reliably fail. Suggest either resolving the nearest .csproj (and using that for both dotnet build and projectRoot/CompilationLoader), or rejecting .cs inputs and updating help text to match the supported contract.

        var path = parsed.Path;
        if (!File.Exists(path) && !Directory.Exists(path))
        {
            Console.Error.WriteLine($"mur check: '{path}' not found.");
            return 1;
        }

Files reviewed: 44/47 changed files
Comments generated: 7

Layered walkthrough of the system as shipped in this PR (Phase 0 + Phase 1 + diagnostic-count gate): plain-language intro, four-tier architecture, mining pipeline, threshold tuning, end-to-end recommendation flow, the small-project gate, and a future-improvements section sketching Phases 2–4. Lives at docs/reference/ alongside the other developer-facing references; points back to docs/specs/038… for decision history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses six review comments from Copilot reviewer on the Phase-1 PR:
1. Hoist CompilationLoader.Load to once per `mur check` invocation instead of once per emitted diagnostic. Even with the internal cache, Load() re-enumerates `.cs` files and recomputes the file-set hash on every call, so the prior wiring was O(diagnostics × files). Now O(1) per invocation via SuggesterOrchestrator.SuggestAgainst.
2. FindTreeFor suffix match now requires a path-separator boundary, so a diagnostic on bare "Program.cs" no longer mis-binds to a sibling tree like "MyProgram.cs". Cross-platform: matches against both '/' and '\' prefixes since CSharpCompilation trees can carry either separator depending on the project's source-list. Added regression test.
3. Filter synthesized property/event accessors (get_X / set_X / add_X / remove_X) out of CollectStaticMembers + CollectInstanceMembers. The 525-run calibration report flagged conf=0.88 emissions of `Theme.get_Background`; the same hazard would surface on CS1061 against instance-member walks. Fixed at the source.
4. Telemetry.Truncate now enforces a true UTF-8 byte limit, not a char limit. Pure-ASCII content still hits the cheap fast path. New test covers a 200-glyph CJK string (600 bytes pre-truncation).
5. StringSimilarity header comment no longer claims "allocation-free on the hot path": Jaro() allocates two bool[]s per call. Comment now reflects reality with a pointer to revisit via stackalloc if perf-trait tests show it on the hot path.
6. Drop the "`<path>` accepts a .cs file" claim from CheckArgs.HelpText and the CheckCommand.cs header. `dotnet build <single.cs>` doesn't work end-to-end so the help text was misleading. CompilationLoader still walks up to the nearest .csproj when seeded from a .cs file (tooling/test seam only).
Skipped: #6 (MurCheckSmokeTest.cs build-on-demand for mur.exe). It changes the integration-tests CI contract and is cleanly scoped as a separate follow-up.
Full suite: 7051/7097 pass (46 skipped: perf-trait + opt-in, unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
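
The byte-limit fix to Telemetry.Truncate in the commit above can be illustrated with a short Python sketch. The 256-byte default echoes the telemetry field bound stated earlier in this PR; the function name and the choice to drop a trailing partial character are assumptions, not the C# implementation.

```python
def truncate_utf8(text, max_bytes=256):
    """Truncate so the UTF-8 encoding is at most max_bytes, never splitting a character."""
    if len(text) <= max_bytes and text.isascii():
        return text  # fast path: ASCII is one byte per character
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # Cutting mid-character leaves an incomplete trailing sequence; "ignore" drops it.
    return data[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    cjk = "漢" * 200  # 3 bytes per glyph, 600 bytes before truncation
    out = truncate_utf8(cjk)
    print(len(out), len(out.encode("utf-8")))
```

A char-count limit of 256 would have let this string through at 600 bytes, which is the gap the review comment pointed at.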

…; resolve symbol-binding
Doc-only updates across spec, task list, and reference doc to reflect the decision that `mur check`'s did-you-mean engine is load-bearing infrastructure for the next 1-3 years, not a token-saving stopgap that base-model improvements will erode.
Why now: Reactor will keep churning faster than models retrain, and WinUI 3 is structurally weak / cross-confused with WPF/Silverlight/WinUI 1/2 in training data. The 525-run corpus directly evidences the confusion shape (`.VerticalAlignment` on Reactor `*Element` types, `Theme.AppBackground`, `.Style(...)`). Tier-2 fuzzy match cannot bridge those by edit distance; only deterministic vocabulary-translation rules can.
Spec changes (038-mur-check-did-you-mean-design.md):
- §1: new "Why this is load-bearing" subsection with the two-condition argument and explicit sunset criterion (≥12mo API stability + ≥90% first-build-OK on ≥2 vendor-distinct models).
- §6: rename "induced pattern rules" → "induced and authored pattern rules"; split into Class A (induced, frequency-justified) and Class B (vocabulary-translation, structurally justified, frequency bar waived). Add the resolved symbol-binding decision: rules bind to Roslyn ISymbol references via RuleSymbolResolver, not name strings, with a CI gate that fails the build on unresolved rule targets.
- §11: two new risks (Reactor API churn invalidating rules/corpus, and data-pipeline SLA/owner), both with mitigations.
- §13 Phase 5: reframe from "only if needed" to "scheduled, deferred" pending Data Checkpoint D. Add explicit sunset criterion subsection.
- §14: resolve question #8 (symbol-binding decision recorded with rationale); add question #9 on vocab-table provenance; renumber project-size gate to #10.
Task list (tasks/038-mur-check-did-you-mean-implementation.md):
- Top: new "Framing (read this first)" section with the load-bearing argument and sunset criterion link.
- Phase 3: split rule template into Class A / Class B variants.
Add §3.0 pre-phase prerequisites (corpus-pipeline owner; minor-release refresh cadence; in-repo `038-vocab-table.csv`). Add §3.1a symbol-binding contract with CI gate. Update quantity gates + exit criterion for the two-class split. - Phase 4: status change from "optional" to "scheduled, deferred until Data Checkpoint D"; escape hatch remains as the unexpected outcome. - New "Maintenance (load-bearing operation)" section under cross-cutting concerns: API-churn protocol per Reactor minor; corpus freshness rule; per-rule accept-rate monitoring; annual sunset-readiness check. Reference doc (docs/reference/mur-check-did-you-mean.md): - §1: new "Why this is load-bearing" subsection mirrored from the spec. - §9 Future improvements: new "The recurring failure mode: WinUI/WPF vocabulary confusion" subsection that names the structural failure shape, then reframes Phase 3 as Class A / Class B and Phase 4 as scheduled-not-optional. Symbol-binding decision called out. - Out-of-scope (small LLM generator): reasoning now explicitly linked to the load-bearing argument (weak training data is why we need the system AND why a smaller model trained on it won't fix the gap). - Glossary: new entries for Class A / Class B rule, Load-bearing, Sunset criterion. Validation Gate entry updated for the Class-B frequency-bar waiver. - Closing: pointer to the new task-doc Maintenance section. No code changes in this commit; all infrastructure decisions documented here will land in Phase-3 rule PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Quality and self-consistency pass on the Phase-1 branch. All changes are contained to Phase-1 surface; broader CR concerns about CS1061/CS0117 fuzzy behavior, MSBuild-accurate loading, and per-code precision gating remain scoped to Phase 2/3 per the spec's existing plan. - CR #10: drop CS8602 warning in CompilationLoader by routing EmptyCompilation through a named factory method with explicit Array.Empty<> coalescing. - CR #11: diagnostic regex now uses a reluctant file capture anchored on the (line,col): suffix so MSBuild lines with parenthesized path segments parse correctly. Added pipeline test. - CR #8: TraceWriter.SanitizePath normalizes in-root absolute paths to project-relative (forward-slash) form so traces no longer carry `C:\Users\<name>\...` prefixes. Tests strengthened to assert no row is ever an absolute path. - CR #13: replace misleading CS7036 tiebreak comment with accurate description; full Hamming-vector ranker stays a deferred follow-up per spec §1.5. - CR #14: 525-run report reworded — "500-pair volume bar met, cross-agent bar NOT met"; Theme.get_Background note marked "fixed in this branch" with code reference. - CR #2 (narrowed): tighten CS1061 factory-argument receiver check to require receiver IS-A factory return type only (dropped reverse IsAssignableFrom direction). Full AST-anchored receiver verification deferred to Phase-3 RuleSymbolResolver per §3.1a. - Implementation doc deferred-follow-ups list updated to include receiver-anchoring (e) and Phase-2 loader/precision-gate work (f). Verification: 98 CheckCommand unit tests pass (was 97 + new parens-path test); MurCheck integration test passes; CS8602 warning eliminated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codemonkeychris · 2026-05-11T14:28:57Z

Response to CR round 2

Triaged the CR feedback against the spec's phase plan. Goal: keep this PR high-quality and self-consistent without pulling Phase-2/3/4 work forward.

Addressed in this push (commit `a4938fb`)

| CR | Fix |
| --- | --- |
| #10 CS8602 warning in `CompilationLoader` | Routed `EmptyCompilation` through a named factory method with explicit `?? Array.Empty<MetadataReference>()`. Warning gone. |
| #11 Diagnostic regex drops paths with parens | File capture changed from `[^()]+` to reluctant `.+?`, anchored by the `(line,col):` suffix. New pipeline test for `C:\src\Reactor (test)\Program.cs(10,5):`. |
| #8 Trace path leaks absolute in-root paths | `TraceWriter.SanitizePath` now normalizes in-root absolute paths to project-relative (forward-slash). Test strengthened: no row may be an absolute path. |
| #13 CS7036 tiebreak comment is unimplemented | Comment replaced with an accurate description; full Hamming-vector ranker remains a deferred follow-up per spec §1.5. |
| #14 Doc inconsistencies | 525-run report: reworded "Checkpoint C bar met" → "500-pair volume bar met, cross-agent bar NOT met"; `Theme.get_Background` note marked "fixed in this branch" with code reference. |
| #2 (narrowed) CS1061 factory-argument receiver check | Dropped the reverse `IsAssignableFrom` direction; kept "receiver IS-A factory return type" only. Reduces over-fire surface without an AST walk. Full AST-anchored receiver verification deferred to Phase-3 `RuleSymbolResolver` per §3.1a. Code comment + impl-doc updated to reflect this. |
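
To make the #11 change concrete, here is a minimal C# sketch comparing the two file captures on the path from the new pipeline test. Only the `[^()]+` and `.+?` fragments and the `(line,col):` anchor come from this PR; the rest of the pattern, the group names, and the sample message are assumptions, since the full regex is not reproduced in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DiagnosticLineSketch
{
    // Illustrative tail: the (line,col): anchor followed by severity, code, and message.
    private const string Tail =
        @"\((?<line>\d+),(?<col>\d+)\):\s+(?<severity>error|warning)\s+(?<code>[A-Z]+\d+):\s*(?<message>.*)$";

    // Old capture: the file part may not contain parentheses at all.
    public static readonly Regex Old = new Regex(@"^(?<file>[^()]+)" + Tail);

    // New capture: reluctant match that grows until the (line,col): suffix fits.
    public static readonly Regex New = new Regex(@"^(?<file>.+?)" + Tail);
}

public static class Program
{
    public static void Main()
    {
        string input =
            @"C:\src\Reactor (test)\Program.cs(10,5): error CS0103: The name 'x' does not exist in the current context";

        // The old pattern stops at "(test)" and cannot match the line.
        Console.WriteLine(DiagnosticLineSketch.Old.IsMatch(input)); // False

        Match m = DiagnosticLineSketch.New.Match(input);
        Console.WriteLine(m.Groups["file"].Value); // C:\src\Reactor (test)\Program.cs
        Console.WriteLine($"{m.Groups["line"].Value},{m.Groups["col"].Value} {m.Groups["code"].Value}"); // 10,5 CS0103
    }
}
```

The reluctant quantifier alone would be too permissive; it is the numeric `(line,col):` suffix that forces the capture to extend past `(test)` and stop at the real location marker.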

Deferred to future phases (rationale, not pulling forward)

#1 (CS1061/CS0117 fuzzy 0-match in corpus). The spec explicitly designs the diagnostic-count gate as the Phase-1 safety net and routes the systematic fix to Phase-3 rules (Theme.SolidBackground, VAlign/HAlign, already the top-3 cluster targets in the 525-run report). EC1 re-run with the gate passed both arms (calc −4%, kanban −33%), which is the spec's accept criterion. The reviewer's "ship CS0103 only" alternative would discard the kanban win that EC1 measured.
#3 (CS0103 expected-type filter): already a deferred follow-up in §1.5.
#4 (MSBuild-accurate loader): this is Phase 2 ("MSBuild passthrough + deterministic ranker").
#5 (CS7036 weak ranker): already a deferred follow-up in §1.5; corpus shows only 3 firings.
#6 (per-code precision gate vs cost gate): Phase 2 ranker/policy-table territory.
#7 (real Reactor end-to-end tests): explicitly deferred in §1.6/1.7 (WindowsAppSDK restore cost).
#9 (corpus local paths): corpus-hygiene work; attached to the API-churn refresh protocol.
#12 (silent error handling): quality-of-life follow-up.

The implementation doc's deferred-follow-ups list was updated to call out items (e) receiver-anchoring and (f) Phase-2 loader / precision-gate work explicitly so reviewers can see them tracked.
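
For readers unfamiliar with the diagnostic-count gate referenced under #1, the following C# sketch shows one way such a gate could work. The method name `ShouldEmitSuggestions`, the default of 3, and "0 disables" come from the PR summary; the signature, the tuple shape, the choice to de-duplicate on code plus location, and the `REACTOR_001` ID are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SuggestionGateSketch
{
    public const int DefaultThreshold = 3; // default stated in the PR summary

    // Hypothetical signature; the real CheckCommand.ShouldEmitSuggestions may differ.
    public static bool ShouldEmitSuggestions(
        IEnumerable<(string Code, string Location)> diagnostics,
        int threshold = DefaultThreshold)
    {
        // A threshold of 0 disables the gate, so suggestions are always emitted.
        if (threshold <= 0) return true;

        // Assumption: "unique" means distinct (code, location) pairs among CS-prefixed diagnostics.
        int uniqueCs = diagnostics
            .Where(d => d.Code.StartsWith("CS", StringComparison.Ordinal))
            .Distinct()
            .Count();

        // Tier-2 is skipped when fewer than `threshold` unique CS diagnostics surface.
        return uniqueCs >= threshold;
    }
}

public static class Program
{
    public static void Main()
    {
        var diagnostics = new[]
        {
            ("CS1061", "a.cs(1,1)"),
            ("CS1061", "a.cs(1,1)"),      // duplicate row, counted once
            ("CS0103", "a.cs(2,1)"),
            ("REACTOR_001", "a.cs(3,1)"), // made-up analyzer ID, not CS-prefixed
        };

        Console.WriteLine(SuggestionGateSketch.ShouldEmitSuggestions(diagnostics));    // False
        Console.WriteLine(SuggestionGateSketch.ShouldEmitSuggestions(diagnostics, 2)); // True
        Console.WriteLine(SuggestionGateSketch.ShouldEmitSuggestions(diagnostics, 0)); // True
    }
}
```

Under these assumptions, an invocation with only a couple of compiler errors stays quiet, while an error-heavy one gets Tier-2 hints.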

Verification

dotnet test … --filter ~CheckCommandTests: 98 passed (was 97 + new parens-path test)
dotnet test … --filter ~MurCheck (integration): 1 passed
dotnet build src/Reactor.Cli/…: 0 errors, 0 new warnings (CS8602 eliminated; remaining warning is a pre-existing NU1903 NuGet vulnerability advisory on Nerdbank.MessagePack unrelated to this branch)

🤖 Generated with Claude Code

codemonkeychris and others added 10 commits May 11, 2026 05:25

codemonkeychris requested a review from Copilot May 11, 2026 12:31

Copilot started reviewing on behalf of codemonkeychris May 11, 2026 12:32

Copilot AI reviewed May 11, 2026

codemonkeychris and others added 4 commits May 11, 2026 05:40

codemonkeychris merged commit 5eec60d into main May 11, 2026
7 checks passed

codemonkeychris deleted the feat/038-mur-check branch May 11, 2026 14:45

codemonkeychris merged 14 commits into main from feat/038-mur-check

2 participants

Addressed in this push (commit `a4938fb`)