Spec 038 Phase 1: mur check did-you-mean engine (Tier 2 + diagnostic-count gate) #243
Today's CheckCommand accepts only <path> and hardcodes --nologo,
-v:m, and -p:Platform={host arch}. Without an escape hatch the only
fallback for an agent that needs to override Platform, pick a
Configuration, skip restore, or pass arbitrary -p: properties is to
drop mur check and run dotnet build directly - which discards every
benefit the spec adds.
Adds a "CLI shape and MSBuild passthrough" subsection in §8
covering: the `mur check [path] [mur-flags] [-- <msbuild args>]`
shape, default-merging rules (auto-inject only if user did not
specify), boundary semantics (bare `--` is the unambiguous
separator; unknown mur-flags error), ranker-unchanged invariant
(passthrough alters the build, not diagnostic scoring), and trace
faithfulness (record effective command line). Adds an
implementation bullet to Phase 4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
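The CLI shape and default-merging rules described in this commit can be sketched as follows. This is a minimal illustration in Python rather than the project's C#; the function names, the flag table, and the exact spellings checked for user-supplied MSBuild options are assumptions of the sketch, not the spec's implementation.

```python
# Hypothetical flag table for illustration: flag -> whether it takes a value.
KNOWN_MUR_FLAGS = {"--trace": True, "--help": False}


def split_args(argv):
    """Split argv into (path, mur_flags, msbuild_args) at the first bare `--`."""
    if "--" in argv:
        cut = argv.index("--")
        head, passthrough = argv[:cut], argv[cut + 1:]
    else:
        head, passthrough = argv, []

    path, flags = None, {}
    i = 0
    while i < len(head):
        token = head[i]
        if token.startswith("-"):
            # Unknown mur-flags are an error, never silently forwarded.
            if token not in KNOWN_MUR_FLAGS:
                raise ValueError(f"unknown flag: {token}")
            if KNOWN_MUR_FLAGS[token]:
                if i + 1 >= len(head):
                    raise ValueError(f"{token} requires a value")
                flags[token] = head[i + 1]
                i += 2
                continue
            flags[token] = True
        elif path is None:
            path = token
        else:
            raise ValueError(f"unexpected argument: {token}")
        i += 1
    return path, flags, passthrough


def merge_defaults(passthrough, host_arch):
    """Auto-inject each default only if the user did not specify it."""
    lowered = [arg.lower() for arg in passthrough]

    def user_set(*prefixes):
        return any(arg.startswith(prefixes) for arg in lowered)

    injected = []
    if not user_set("--nologo", "-nologo"):
        injected.append("--nologo")
    if not user_set("-v:", "--verbosity", "-verbosity"):
        injected.append("-v:m")
    if not user_set("-p:platform=", "--property:platform=", "-property:platform="):
        injected.append(f"-p:Platform={host_arch}")
    # The returned list is the effective command line a trace would record.
    return injected + list(passthrough)
```

Splitting only at a bare `--` keeps the boundary unambiguous: everything before it is validated against the mur-flag table, and everything after it reaches the build untouched apart from the injected defaults.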
Mirrors the convention from docs/specs/tasks/036-window-design-implementation.md:
phased checkboxes, "task is done only when" gate, conventions header.
Adapted for spec 038's external dependency on spec 037's data corpus.
Adds three concepts not in earlier task lists because no other spec
needed them:
- Human Validation Gate: six-bar checklist every Tier 3 rule must clear
  before merge (frequency >=5%, count >=10, cross-agent reproducibility,
  >=3 positive fixtures, >=2 negative, independent reviewer signoff).
  Plus auto-suppression policy for rules whose telemetry accept-rate
  drops below 50%.
- Data Checkpoints A/B/C/D: staged hand-offs from spec 037's harness
  with explicit blocking relationships. A is the current 3-pair smoke
  (already landed); B/C/D gate Phase 1 threshold tuning, Phase 3 rule
  authoring, and Phase 4 ranker training respectively.
- Eval Checkpoints EC1-EC4: 5xN batches on gpt-5.5 vs
  reactor-calc/kanban with predicted lift bands and pass criteria. EC1
  after Tier 2 only; EC2 after deterministic ranker + 5 rules; EC3 at
  V1 ship (10-15 rules); EC4 if learned ranker is pursued.
Quantity bars (per the user's question): 5 rules before EC2, 10-15
rules covering >=80% of fix events for V1 ship.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gine
Phase 0 (instrumentation, no behavior change):
- `mur check --trace <path>` writes one JSONL row per parsed diagnostic to
<path> in addition to stdout. Schema: {ts, code, severity, file, line, col,
msg, receiver_type?, member?, mode}. Source code text is never written;
absolute paths outside the project root are redacted to "<external>"; rows
are bounded to 2 KB.
- New folders src/Reactor.Cli/Check/{Suggesters,Rules}/ with README pointers
to spec 038 §5/§6 and the Validation Gate.
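A rough sketch of the trace-row constraints listed under Phase 0 (redaction of out-of-root absolute paths, no source text, 2 KB row bound), written in Python rather than the project's C#. The helper names and the choice to shrink `msg` to meet the bound are assumptions; the project-relative normalization of in-root paths follows a later commit in this PR.

```python
import json
import os

MAX_ROW_BYTES = 2048  # "rows are bounded to 2 KB"


def sanitize_path(path, project_root):
    """Redact absolute paths outside the project root to "<external>"."""
    if not os.path.isabs(path):
        return path.replace("\\", "/")
    root = os.path.abspath(project_root)
    full = os.path.abspath(path)
    try:
        inside = os.path.commonpath([root, full]) == root
    except ValueError:  # e.g. different drives on Windows
        inside = False
    if not inside:
        return "<external>"
    return os.path.relpath(full, root).replace(os.sep, "/")


def trace_row(diagnostic, project_root):
    """Serialize one diagnostic as a JSONL row no larger than MAX_ROW_BYTES."""
    row = dict(diagnostic)
    row["file"] = sanitize_path(row["file"], project_root)
    line = json.dumps(row, ensure_ascii=False)
    size = len(line.encode("utf-8"))
    if size > MAX_ROW_BYTES:
        # Assumption: the message is the only unbounded field, so trim it.
        over = size - MAX_ROW_BYTES
        row["msg"] = row["msg"][: max(0, len(row["msg"]) - over)]
        line = json.dumps(row, ensure_ascii=False)
    return line
```

Each character removed from `msg` shrinks the encoded row by at least one byte, so a single trim is enough whenever the message is the field that overflowed.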
Phase 1 (Tier-2 Roslyn semantic suggester):
- ISuggester contract + SymbolSuggester covering CS1061, CS0103, CS0117,
CS1503, CS7036 against Microsoft.UI.Reactor.* symbols.
- CompilationLoader: csproj/source resolution, project.assets.json reference
resolution, (csproj-path, file-mtime-hash) cache, symlink containment, perf
budget cold ≤ 500 ms / warm ≤ 50 ms.
- FactoryIndex: pre-filter over Microsoft.UI.Reactor.Factories.* static
methods with cached parameter-name arrays for named-argument suggestions.
- StringSimilarity (Jaro–Winkler) for fuzzy member-name matching.
- SuggesterOrchestrator wires the suggester into CheckCommand.Run; Tier-1
HintFor still wins ties at the format layer (spec §9).
- MUR_TELEMETRY=1 opt-in: appends (code, suggester, confidence,
evidence_short) to ~/.mur/telemetry/<yyyy-mm-dd>.jsonl. Fields bounded to
256 bytes; no source text, file paths, or machine identifiers.
- Args parser recognises `--trace` and `--help`; rejects unknown flags
rather than silently forwarding (full passthrough lands in Phase 2).
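For readers unfamiliar with the metric behind StringSimilarity, here is a textbook Jaro–Winkler sketch in Python. It is a generic rendering of the algorithm (standard prefix scale 0.1, prefix capped at 4 characters), not a transcription of the project's C# implementation.

```python
def jaro(a, b):
    """Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len(a)):
        if a_hit[i]:
            while not b_hit[k]:
                k += 1
            if a[i] != b[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(a) + matches / len(b) + (matches - t) / matches) / 3


def jaro_winkler(a, b, prefix_scale=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

Because the score rewards shared prefixes and nearby character matches, it suits small typos in member names; it has no notion of vocabulary, so a renamed concept with little character overlap scores poorly however obvious the mapping is to a human.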
Tests:
- 66 unit tests under tests/Reactor.Tests/CheckCommandTests/ covering args
parsing, trace schema/redaction/length cap, compilation loader, factory
index, JaroWinkler, every suggester code path (positive + negative),
orchestrator filtering (Reactor-touching gate), Tier-1-wins-ties, and
telemetry opt-in/byte-bound.
- Integration smoke test exercising mur.exe end-to-end against a deliberately
broken fixture under tests/Reactor.IntegrationTests/MurCheck/Fixtures/
SmokeFixture/. Reactor-touching CS1061 fixture is scoped as a follow-up
because it requires WindowsAppSDK restore.
Pause point per spec 038 task list:
- Phase 1.8 exit needs Data Checkpoint B (≥50 unique pairs from spec 037's
harness) for per-code threshold tuning before merging to main.
- Phase 2 + 3 are blocked by Phase 1 merge / Data Checkpoint C respectively.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… blocker
- Status snapshot at the top of the task list: Phase 0/1 code-complete on feat/038-mur-check, not merged to main, blocked on threshold tuning.
- Phase 0 (0.1–0.5) and Phase 1 (1.1–1.7) checkboxes ticked, with one-line notes on the four items deferred as follow-ups (Reactor-touching integration fixture, perf-trait wall-time test, full Hamming-vector overload ranking, CS0103 return-type filter).
- Data Checkpoint A re-audit (2026-05-10): Gap #1 (receiver_type), #2 (dedup), #4 (cosmetic) all FIXED in the new harness output. Gap #3 (ranker negative class) still NOT fixed: all 3 ranker-labels rows are positive class; for 3 runs the spec expects ~30–80 rows.
- Plus a fix_kind classifier nit: ButtonElement HorizontalAlignment → HAlign should be renamed_member, not other (receiver type unchanged, member name swapped).
- Recommendation against kicking off the 50-run sweep until Gap #3 lands; records that Data Checkpoint B is therefore still blocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…un_end as Phase-4 prereq
Audit pass 2 of the 6-row ranker-labels output (2026-05-10):
- Gap #3 (ranker negative class) is now FIXED. 4 positive / 2 negative on addressed_by_next_fix; harness emits per-build per-diagnostic rows; three CS8012 emissions in run 5d5fef… recorded as three independent training rows, exactly per spec 037 §3 "don't dedupe across builds".
- fix_kind classifier nit partially fixed: both pairs now classify as renamed_member; debatable for pair 2's structural rewrite but acceptable.
- New known limitation: still_present_at_run_end is uniformly false even when the diagnostic IS in the final build (CS8012 timing-tail fingerprint quirk). Primary ranker label addressed_by_next_fix is unaffected; auxiliary agent_ignored is corrupted, which breaks the spec 038 §11 auto-suppression-telemetry hook. Tracked as a Phase-4 prerequisite; file with harness owner before Data Checkpoint D.
Status updates:
- Status snapshot at top: 50-run sweep cleared to start.
- Data Checkpoint B: status flipped from blocked to unblocked. Added a step-by-step "pickup procedure for the next session" so the next agent can run cold once the corpus lands (audit, threshold tuning, EC1, merge).
- Data Checkpoint D: documents still_present_at_run_end as the remaining prerequisite before training the learned ranker.
- Phase 1.8 exit criterion points at the pickup procedure instead of the stale "blocked" note.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… corpus
Phase 1 ship-gate calibration (Data Checkpoint B). Adds per-code emit
thresholds in src/Reactor.Cli/Check/Suggesters/Thresholds.cs, refactors
SymbolSuggester so the threshold gate has a single source of truth
(SuggestRaw exposes raw confidence; Suggest gates via Thresholds.For),
and removes the now-redundant duplicate cut from SuggesterOrchestrator.
Per-code values informed by tests/Reactor.Tests/CheckCommandTests/Tuning/
which runs the suggester against fixes.jsonl. Snapshot of the first run
is in docs/specs/tasks/038-tuning-reports/2026-05-10-50run.md:
- CS1061 → 0.80 (only firing in corpus was at conf 0.43; raised
threshold blocks structural-rewrite false positives)
- CS0103 → 0.75 default (2/2 firings at conf 1.00 matched)
- CS0117 / CS1503 / CS7036 → 0.75 default (insufficient signal in
50-pair corpus; revisit at Data Checkpoint C, 500+ pairs)
Tuning harness is gated on env var MUR_TUNING_CORPUS so it skips
cleanly when the (sibling-repo) corpus isn't available.
Phase-1 next gate: Eval Checkpoint EC1 vs. main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
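The single-source-of-truth arrangement described in this commit (a raw scorer plus one gating entry point) might look roughly like this. The sketch is in Python rather than C#, the candidate representation is invented for illustration, and whether the real gate uses `>=` or `>` is an assumption; only the threshold values come from the commit message.

```python
DEFAULT_THRESHOLD = 0.75

# Per-code emit thresholds as listed in the commit message.
PER_CODE = {
    "CS1061": 0.80,
    "CS0103": 0.75,
    "CS0117": 0.75,
    "CS1503": 0.75,
    "CS7036": 0.75,
}


def threshold_for(code):
    """Analogue of Thresholds.For: the only place a cut-off is looked up."""
    return PER_CODE.get(code, DEFAULT_THRESHOLD)


def suggest_raw(code, candidates):
    """Return the best (text, confidence) pair without gating, or None."""
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])


def suggest(code, candidates):
    """Gate the raw suggestion through the per-code threshold."""
    raw = suggest_raw(code, candidates)
    if raw is None or raw[1] < threshold_for(code):
        return None
    return raw
```

Keeping the raw scorer separate lets a tuning harness sweep thresholds over recorded confidences while the CLI path and the tests share one gate.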
…inding
EC1 5×N (n=5 paired rounds on gpt-5.5, round-3 prompt with reflection
ban + trust-the-suggestion + mur-check-is-the-build rules):
kanban-variant — cost mean −24% / median −33% / CV 24% vs base 81%
wins 4 of 5 paired rounds; the suggestion mechanism
is itself a variance stabilizer.
calc-variant — cost mean +21% / wall mean +23%
real and consistent across the batch.
Diagnosis: ~5–8s per-invocation mur-check overhead does not amortize on
~150-LoC projects with no API exploration surface to skip. Validated
empirically — the prompt iterations that helped (round-2 reflection
ban, round-3 trust-mur-as-build) closed every explore-around-the-
suggestion loophole, but the floor remains.
Strict EC1 pass criterion ("tokens not regressed") fails on calc,
passes cleanly on kanban. Captured as:
- spec 038 §11 risk row + §14 open question on a project-size /
per-invocation-diagnostic-count gate
- task doc EC1 results section with per-arm means/sd, paired comparisons
- status snapshot updated; Phase 1 merge decision noted as pending
Decision options on the merge: (a) ship Phase 1 as-is and accept the
calc tax; (b) land the gate before merging. Either is product-side; this
commit only records the data and the design surface.
No code changes. SKILL.md intentionally left untouched — larger
mining sweep is running under the current SKILL.md to compare aided vs
un-aided baseline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l-test race
ThresholdsTests.Each_handled_code_has_a_threshold_in_valid_range was flaking because the tuner test (ThresholdTuningTests.Run_against_inline_one_row_corpus_produces_report) temporarily writes Thresholds.PerCode to an all-zero map for its duration, and xUnit's default cross-class parallelism let the validation tests observe the zeroed state mid-tuner-run.
Switching the override channel to AsyncLocal isolates the tuner's scoped change to its own logical thread; concurrent readers in other tests see the immutable production defaults. Production code paths still go through the same Thresholds.For(code) entry point, so behaviour for the CLI is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
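The isolation idea can be illustrated with Python's `contextvars`, which plays a role comparable to .NET's AsyncLocal. This is an analogy sketch with hypothetical names, not the project's code: an override set in one execution context is invisible to a reader running in another.

```python
import contextvars
from contextlib import contextmanager
from types import MappingProxyType

# Immutable production defaults (values taken from the tuning commit).
_DEFAULTS = MappingProxyType({"CS1061": 0.80, "CS0103": 0.75})
_DEFAULT_THRESHOLD = 0.75

# Context-local override channel; None means "use the defaults".
_override = contextvars.ContextVar("threshold_override", default=None)


def threshold_for(code):
    """Single entry point: honours a scoped override if one is active."""
    table = _override.get()
    if table is None:
        table = _DEFAULTS
    return table.get(code, _DEFAULT_THRESHOLD)


@contextmanager
def override_thresholds(table):
    """Scope an override to the current context only, then restore."""
    token = _override.set(dict(table))
    try:
        yield
    finally:
        _override.reset(token)
```

A tuner can wrap its run in `override_thresholds({...})` without mutating shared state, which is the property the flaky test was missing when the override lived in a process-wide static.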
…ation
Two related changes:
1. Diagnostic-count gate (spec §11 risk row, §14 #8): resolves the EC1 calc-vs-kanban split where ~150-LoC calc regressed +21% cost while kanban won −24%. `CheckCommand.ShouldEmitSuggestions` skips Tier-2 when an invocation surfaces fewer than `--suggest-threshold` unique CS-prefixed diagnostics; default 3, 0 disables. Counts the same dedup key `EmitDiagnostics` uses.
2. Data Checkpoint C: 525-pair mining corpus from spec 037's harness mirrored at `docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/` (1,027 fixes / 1,233 ranker rows / 104 clusters; `gpt-5.5` only). Tuning report at `2026-05-11-525run.md`.
The empirical CS-diagnostics-per-build distribution from the corpus (43% of builds have 1 diagnostic, 28% have 2, 28.7% have ≥ 3) confirms T=3 is the right initial cut-line.
The 525-run corpus also surfaces that JaroWinkler fuzzy match has near-0% empirical precision on CS1061/CS0117 against Reactor types: the agent's typical mistake is reaching for a WinUI-style name (`.VerticalAlignment`, `Theme.AppBackground`) whose Reactor replacement is too far in edit-distance to find, so we pick a wrong sibling. All per-code thresholds held; systematic fix is Phase-3 rule authoring. Top three rule targets identified in the report.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
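A minimal sketch of the gate's decision, in Python rather than C#. The dedup key `(code, file, line, col)` is an assumption standing in for whatever key `EmitDiagnostics` actually uses, and the function name is illustrative.

```python
def should_emit_suggestions(diagnostics, suggest_threshold=3):
    """Run Tier-2 only when enough unique CS-prefixed diagnostics surfaced.

    A threshold of 0 disables the gate, so suggestions are always emitted.
    """
    if suggest_threshold <= 0:
        return True
    unique = {
        (d["code"], d["file"], d["line"], d["col"])  # assumed dedup key
        for d in diagnostics
        if d["code"].startswith("CS")
    }
    return len(unique) >= suggest_threshold
```

With the default of 3, a build that surfaces one or two distinct compiler errors skips the Tier-2 pass entirely, which is the small-project case where the EC1 calc arm paid overhead without benefit.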
5×N re-run on `feat/038-mur-check` @ aaa4cce (gate enabled, default --suggest-threshold 3), same matrix as the prior EC1 batch.
reactor-calc-mur-check  : cost -4% (was +21%), PASS
reactor-kanban-mur-check: cost -33% (was -24%), PASS, grew the win
First-build OK 5/5 both variant arms.
Calc neutralized; kanban win preserved and grew. Phase 1 acceptance bar met. Spec §11 risk row + §14 #8 updated to mark the mitigation validated.
Watch-item carried into Phase 2: kanban CV widened (24% prior -> 54% this batch) because one of five runs hit 0 firings and tracked the long-tail base path. Gate is path-dependent on agent's exploration order, not just the project's static shape. Below Phase-1 blocker threshold; Phase 2 telemetry should track per-run firing counts.
No code change in this commit: eval results + spec doc updates only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Implements Spec 038 Phase 0/1 for mur check: JSONL trace instrumentation, a Tier-2 Roslyn semantic “did-you-mean” suggester (with per-code thresholds), an opt-in local telemetry append, and a diagnostic-count gate to avoid Tier-2 overhead on small builds.
Changes:
- Add `--trace` JSONL output and opt-in local telemetry for suggestion emissions.
- Add Tier-2 Roslyn-based suggestions for select CS diagnostics, including orchestration + per-code confidence thresholds and a per-invocation diagnostic-count gate.
- Add extensive unit-test coverage plus a smoke integration fixture for `mur check`.
| File | Description |
|---|---|
| tests/Reactor.Tests/CheckCommandTests/Tuning/ThresholdTuningTests.cs | End-to-end + unit surfaces for the threshold tuning harness. |
| tests/Reactor.Tests/CheckCommandTests/Tuning/SuggesterTuner.cs | Drives SymbolSuggester over corpus rows to compute tuning summaries. |
| tests/Reactor.Tests/CheckCommandTests/Tuning/ReactorCorpusStubs.cs | Stub Reactor/WinUI surface for corpus compilation in tuning runs. |
| tests/Reactor.Tests/CheckCommandTests/Tuning/CorpusReaderTests.cs | Tests JSONL corpus parsing behavior (tolerant read, required fields). |
| tests/Reactor.Tests/CheckCommandTests/Tuning/CorpusReader.cs | JSONL corpus parser into CorpusFix structures. |
| tests/Reactor.Tests/CheckCommandTests/TraceWriterTests.cs | Validates trace schema, size cap, and path redaction behavior. |
| tests/Reactor.Tests/CheckCommandTests/TestCompilation.cs | In-memory Roslyn compilation builder excluding real Reactor/WinUI assemblies. |
| tests/Reactor.Tests/CheckCommandTests/TelemetryTests.cs | Verifies opt-in telemetry append and field-size constraints. |
| tests/Reactor.Tests/CheckCommandTests/Suggesters/ThresholdsTests.cs | Pins Thresholds contract and override behavior. |
| tests/Reactor.Tests/CheckCommandTests/Suggesters/SymbolSuggesterTests.cs | Unit tests for Tier-2 suggester behaviors across supported CS codes. |
| tests/Reactor.Tests/CheckCommandTests/Suggesters/SuggesterContractTests.cs | Contract/shape tests for suggester context/result types. |
| tests/Reactor.Tests/CheckCommandTests/Suggesters/StringSimilarityTests.cs | Tests for Jaro–Winkler similarity helper. |
| tests/Reactor.Tests/CheckCommandTests/SuggesterOrchestratorTests.cs | Orchestrator wiring tests, including Tier-1 vs Tier-2 precedence. |
| tests/Reactor.Tests/CheckCommandTests/FactoryIndexTests.cs | Tests indexing of Microsoft.UI.Reactor.Factories for suggestions. |
| tests/Reactor.Tests/CheckCommandTests/CompilationLoaderTests.cs | Tests compilation loading, caching, and exclusions (obj/bin). |
| tests/Reactor.Tests/CheckCommandTests/CheckCommandPipelineTests.cs | Pipeline tests for parsing, dedupe, trace emission, and gate behavior. |
| tests/Reactor.Tests/CheckCommandTests/CheckArgsTests.cs | Tests for the new mur check argument parsing/help text. |
| tests/Reactor.IntegrationTests/Reactor.IntegrationTests.csproj | Excludes broken fixture .cs from compilation; includes them as None. |
| tests/Reactor.IntegrationTests/MurCheck/MurCheckSmokeTest.cs | End-to-end smoke test invoking mur check on a broken fixture. |
| tests/Reactor.IntegrationTests/MurCheck/Fixtures/SmokeFixture/SmokeFixture.csproj | Minimal broken fixture project for smoke integration test. |
| tests/Reactor.IntegrationTests/MurCheck/Fixtures/SmokeFixture/Program.cs | Broken code to trigger CS1061 on a non-Reactor receiver. |
| src/Reactor.Cli/Check/TraceWriter.cs | JSONL trace writer with redaction + per-row size constraints. |
| src/Reactor.Cli/Check/Telemetry.cs | Opt-in local telemetry writer for suggestion emissions. |
| src/Reactor.Cli/Check/Suggesters/Thresholds.cs | Per-diagnostic-code confidence thresholds (async-local override for tests). |
| src/Reactor.Cli/Check/Suggesters/SymbolSuggester.cs | Tier-2 semantic suggester across CS1061/0103/0117/1503/7036. |
| src/Reactor.Cli/Check/Suggesters/StringSimilarity.cs | Jaro–Winkler similarity implementation used by Tier-2 matching. |
| src/Reactor.Cli/Check/Suggesters/README.md | Tier-2 suggester design/constraints documentation. |
| src/Reactor.Cli/Check/Suggesters/ISuggester.cs | Suggester contract + context/result types. |
| src/Reactor.Cli/Check/SuggesterOrchestrator.cs | Orchestrates suggesters against MSBuild diagnostics + Roslyn compilation. |
| src/Reactor.Cli/Check/Rules/README.md | Tier-3 rules overview and validation gate pointer. |
| src/Reactor.Cli/Check/FactoryIndex.cs | Factory method index for suggestions (name + overload metadata). |
| src/Reactor.Cli/Check/CompilationLoader.cs | Loads/caches Roslyn compilation from a project on disk and assets.json refs. |
| src/Reactor.Cli/Check/CheckCommand.cs | Adds args parsing/help, trace/telemetry wiring, suggestions + gate. |
| src/Reactor.Cli/Check/CheckArgs.cs | Parses --trace and --suggest-threshold, emits help text. |
| docs/specs/tasks/038-tuning-reports/2026-05-11-525run.md | Corpus analysis report for Data Checkpoint C. |
| docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/unresolved.jsonl | Mirrored corpus artifact for reproducibility (unresolved rows). |
| docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/README.md | Explains mirrored corpus files and intended uses. |
| docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/gate-distribution.py | Script to compute gate distribution from ranker-labels JSONL. |
| docs/specs/tasks/038-tuning-reports/2026-05-10-50run.md | Data Checkpoint B tuning report snapshot. |
| docs/specs/tasks/038-tuning-reports/2026-05-10-50run.json | Data Checkpoint B raw tuner JSON snapshot. |
| docs/specs/tasks/038-mur-check-did-you-mean-implementation.md | Implementation task tracker updated with Phase 0/1 status/results. |
| docs/specs/038-mur-check-did-you-mean-design.md | Spec updated for passthrough and gate rationale/resolution. |
| CHANGELOG.md | Documents new mur check features and calibration artifacts. |
Copilot's findings
Comments suppressed due to low confidence (1)
src/Reactor.Cli/Check/CheckCommand.cs:55
`CheckArgs`/`CheckCommand` documentation says `<path>` can be a single `.cs` file, but `Run()` ultimately invokes `dotnet build <path>`. `dotnet build` does not accept an arbitrary `.cs` file as a build target, so passing a `.cs` path will reliably fail. Suggest either resolving the nearest `.csproj` (and using that for both `dotnet build` and `projectRoot`/`CompilationLoader`), or rejecting `.cs` inputs and updating help text to match the supported contract.
var path = parsed.Path;
if (!File.Exists(path) && !Directory.Exists(path))
{
Console.Error.WriteLine($"mur check: '{path}' not found.");
return 1;
}
- Files reviewed: 44/47 changed files
- Comments generated: 7
Layered walkthrough of the system as shipped in this PR (Phase 0 + Phase 1 + diagnostic-count gate): plain-language intro, four-tier architecture, mining pipeline, threshold tuning, end-to-end recommendation flow, the small-project gate, and a future-improvements section sketching Phases 2–4. Lives at docs/reference/ alongside the other developer-facing references; points back to docs/specs/038… for decision history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses six review comments from Copilot reviewer on the Phase-1 PR:
1. Hoist CompilationLoader.Load to once per `mur check` invocation instead of once per emitted diagnostic. Even with the internal cache, Load() re-enumerates `.cs` files and recomputes the file-set hash on every call, so the prior wiring was O(diagnostics × files). Now O(1) per invocation via SuggesterOrchestrator.SuggestAgainst.
2. FindTreeFor suffix match required a path-separator boundary so a diagnostic on bare "Program.cs" no longer mis-binds to a sibling tree like "MyProgram.cs". Cross-platform: matches against both '/' and '\' prefixes since CSharpCompilation trees can carry either separator depending on the project's source-list. Added regression test.
3. Filter synthesized property/event accessors (get_X / set_X / add_X / remove_X) out of CollectStaticMembers + CollectInstanceMembers. The 525-run calibration report flagged conf=0.88 emissions of `Theme.get_Background`; the same hazard would surface on CS1061 against instance-member walks. Fixed at the source.
4. Telemetry.Truncate now enforces a true UTF-8 byte limit, not a char limit. Pure-ASCII content still hits the cheap fast path. New test covers a 200-glyph CJK string (600 bytes pre-truncation).
5. StringSimilarity header comment no longer claims "allocation-free on the hot path": Jaro() allocates two bool[]s per call. Comment now reflects reality with a pointer to revisit via stackalloc if perf-trait tests show it on the hot path.
6. Drop the "`<path>` accepts a .cs file" claim from CheckArgs.HelpText and the CheckCommand.cs header. `dotnet build <single.cs>` doesn't work end-to-end so the help text was misleading. CompilationLoader still walks up to the nearest .csproj when seeded from a .cs file (tooling/test seam only).
Skipped: #6 (MurCheckSmokeTest.cs build-on-demand for mur.exe), which changes the integration-tests CI contract and is cleanly scoped as a separate follow-up.
Full suite: 7051/7097 pass (46 skipped — perf-trait + opt-in, unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
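Two of the fixes in this commit, the separator-boundary suffix match and the byte-accurate truncation, are easy to get subtly wrong, so here is a short Python sketch of each. The function names are hypothetical and the code illustrates the technique rather than the C# that landed.

```python
def path_matches_tree(diagnostic_path, tree_path):
    """True if tree_path ends with diagnostic_path on a separator boundary.

    Both '/' and '\\' are treated as separators, so "Program.cs" matches
    "src/Program.cs" but not "src/MyProgram.cs".
    """
    d = diagnostic_path.replace("\\", "/")
    t = tree_path.replace("\\", "/")
    return t == d or t.endswith("/" + d)


def truncate_utf8(text, max_bytes):
    """Truncate to at most max_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text  # covers the common pure-ASCII short case
    # errors="ignore" drops an incomplete multi-byte sequence at the cut.
    return data[:max_bytes].decode("utf-8", errors="ignore")
```

For a 200-glyph CJK string at 3 bytes per glyph, a 256-byte limit keeps 85 whole glyphs (255 bytes), whereas a 256-character limit would have let all 600 bytes through.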
…; resolve symbol-binding
Doc-only updates across spec, task list, and reference doc to reflect the decision that `mur check`'s did-you-mean engine is load-bearing infrastructure for the next 1-3 years, not a token-saving stopgap that base-model improvements will erode.
Why now: Reactor will keep churning faster than models retrain, and WinUI 3 is structurally weak / cross-confused with WPF/Silverlight/WinUI 1/2 in training data. The 525-run corpus directly evidences the confusion shape (`.VerticalAlignment` on Reactor `*Element` types, `Theme.AppBackground`, `.Style(...)`). Tier-2 fuzzy match cannot bridge those by edit distance; only deterministic vocabulary-translation rules can.
Spec changes (038-mur-check-did-you-mean-design.md):
- §1: new "Why this is load-bearing" subsection with the two-condition argument and explicit sunset criterion (≥12mo API stability + ≥90% first-build-OK on ≥2 vendor-distinct models).
- §6: rename "induced pattern rules" → "induced and authored pattern rules"; split into Class A (induced, frequency-justified) and Class B (vocabulary-translation, structurally justified, frequency bar waived). Add the resolved symbol-binding decision: rules bind to Roslyn ISymbol references via RuleSymbolResolver, not name strings, with a CI gate that fails the build on unresolved rule targets.
- §11: two new risks (Reactor API churn invalidating rules/corpus, and data-pipeline SLA/owner), both with mitigations.
- §13 Phase 5: reframe from "only if needed" to "scheduled, deferred" pending Data Checkpoint D. Add explicit sunset criterion subsection.
- §14: resolve question #8 (symbol-binding decision recorded with rationale); add question #9 on vocab-table provenance; renumber project-size gate to #10.
Task list (tasks/038-mur-check-did-you-mean-implementation.md):
- Top: new "Framing (read this first)" section with the load-bearing argument and sunset criterion link.
- Phase 3: split rule template into Class A / Class B variants. Add §3.0 pre-phase prerequisites (corpus-pipeline owner; minor-release refresh cadence; in-repo `038-vocab-table.csv`). Add §3.1a symbol-binding contract with CI gate. Update quantity gates + exit criterion for the two-class split.
- Phase 4: status change from "optional" to "scheduled, deferred until Data Checkpoint D"; escape hatch remains as the unexpected outcome.
- New "Maintenance (load-bearing operation)" section under cross-cutting concerns: API-churn protocol per Reactor minor; corpus freshness rule; per-rule accept-rate monitoring; annual sunset-readiness check.
Reference doc (docs/reference/mur-check-did-you-mean.md):
- §1: new "Why this is load-bearing" subsection mirrored from the spec.
- §9 Future improvements: new "The recurring failure mode: WinUI/WPF vocabulary confusion" subsection that names the structural failure shape, then reframes Phase 3 as Class A / Class B and Phase 4 as scheduled-not-optional. Symbol-binding decision called out.
- Out-of-scope (small LLM generator): reasoning now explicitly linked to the load-bearing argument (weak training data is why we need the system AND why a smaller model trained on it won't fix the gap).
- Glossary: new entries for Class A / Class B rule, Load-bearing, Sunset criterion. Validation Gate entry updated for the Class-B frequency-bar waiver.
- Closing: pointer to the new task-doc Maintenance section.
No code changes in this commit; all infrastructure decisions documented here will land in Phase-3 rule PRs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Quality and self-consistency pass on the Phase-1 branch. All changes are contained to Phase-1 surface; broader CR concerns about CS1061/CS0117 fuzzy behavior, MSBuild-accurate loading, and per-code precision gating remain scoped to Phase 2/3 per the spec's existing plan.
- CR #10: drop CS8602 warning in CompilationLoader by routing EmptyCompilation through a named factory method with explicit Array.Empty<> coalescing.
- CR #11: diagnostic regex now uses a reluctant file capture anchored on the (line,col): suffix so MSBuild lines with parenthesized path segments parse correctly. Added pipeline test.
- CR #8: TraceWriter.SanitizePath normalizes in-root absolute paths to project-relative (forward-slash) form so traces no longer carry `C:\Users\<name>\...` prefixes. Tests strengthened to assert no row is ever an absolute path.
- CR #13: replace misleading CS7036 tiebreak comment with accurate description; full Hamming-vector ranker stays a deferred follow-up per spec §1.5.
- CR #14: 525-run report reworded to "500-pair volume bar met, cross-agent bar NOT met"; Theme.get_Background note marked "fixed in this branch" with code reference.
- CR #2 (narrowed): tighten CS1061 factory-argument receiver check to require receiver IS-A factory return type only (dropped reverse IsAssignableFrom direction). Full AST-anchored receiver verification deferred to Phase-3 RuleSymbolResolver per §3.1a.
- Implementation doc deferred-follow-ups list updated to include receiver-anchoring (e) and Phase-2 loader/precision-gate work (f).
Verification: 98 CheckCommand unit tests pass (was 97 + new parens-path test); MurCheck integration test passes; CS8602 warning eliminated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
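The CR #11 parsing fix can be illustrated with a Python regular expression. The pattern below is an illustrative reconstruction of the idea (a reluctant file capture anchored on the `(line,col):` suffix), not the regex used in `CheckCommand`; the group names and the sample line are made up for the example.

```python
import re

# The lazy `.+?` for the file stops at the first "(digits,digits):" it can
# reach, so parentheses inside the path itself do not end the capture early.
DIAGNOSTIC = re.compile(
    r"^(?P<file>.+?)"
    r"\((?P<line>\d+),(?P<col>\d+)\):\s+"
    r"(?P<severity>error|warning)\s+"
    r"(?P<code>[A-Za-z]+\d+):\s+"
    r"(?P<msg>.*?)"
    r"(?:\s+\[(?P<project>[^\]]+)\])?$"
)


def parse_diagnostic(text):
    """Parse one MSBuild-style diagnostic line into a dict, or return None."""
    match = DIAGNOSTIC.match(text.strip())
    if match is None:
        return None
    row = match.groupdict()
    row["line"] = int(row["line"])
    row["col"] = int(row["col"])
    return row
```

A path segment such as `My App (v2)` contains parentheses but no `digits,digits` pair followed by `):`, so the file capture keeps extending until it reaches the real position suffix.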
Response to CR round 2
Triaged the CR feedback against the spec's phase plan. Goal: keep this PR high-quality and self-consistent without pulling Phase-2/3/4 work forward.
Addressed in this push (commit a4938fb)
Deferred to future phases (rationale, not pulling forward)
The implementation doc's deferred-follow-ups list was updated to call out items (e) receiver-anchoring and (f) Phase-2 loader / precision-gate work explicitly so reviewers can see them tracked.
Verification
🤖 Generated with Claude Code
Lands Phase 0 (instrumentation) + Phase 1 (Tier-2 Roslyn semantic suggester) of spec 038. Closes Phase 1 of the task list at `docs/specs/tasks/038-mur-check-did-you-mean-implementation.md`.
Summary
- `mur check --trace <path>` writes one JSONL row per parsed diagnostic alongside stdout, schema mirrored in spec §0.3. Source code text never leaves the user's machine; absolute paths outside the project root are redacted to `<external>`.
- `SymbolSuggester` covers CS1061 / CS0103 / CS0117 / CS1503 / CS7036 against `Microsoft.UI.Reactor.*` types; emits `→ try: <text> // [<evidence>]` on the diagnostic line above the per-code confidence threshold. Tier-1 `REACTOR_*` analyzer-ID hints still win ties (spec §9).
- `Thresholds.cs` calibrated against the 50-run corpus (Data Checkpoint B) and re-validated against the 525-run corpus (Data Checkpoint C). All five values intentionally conservative; full rationale + history in the file's header comment.
- `CheckCommand.ShouldEmitSuggestions` skips Tier-2 when an invocation surfaces fewer than `--suggest-threshold` unique CS-prefixed diagnostics. Default 3, set 0 to disable. Resolves the EC1 calc-vs-kanban split.
- `MUR_TELEMETRY=1` opt-in. Local-first JSONL append to `~/.mur/telemetry/<yyyy-mm-dd>.jsonl` with code/suggester/confidence/evidence_short. No source text, no file paths, no machine identifiers.
- Tuning harness (`tests/Reactor.Tests/CheckCommandTests/Tuning/`) drives the suggester against a real corpus and writes per-code (precision, recall) curves. Reproduces with `MUR_TUNING_CORPUS=<path>` env var.
Eval Checkpoint 1: both arms pass
5×N batch on `gpt-5.5`, identical round-3 prompt as #226's Phase-7 sweep.
Arms: `reactor-calc-mur-check`, `reactor-kanban-mur-check`.
Pre-gate EC1 (2026-05-10): calc +21% (FAIL), kanban −24% (PASS). Post-gate EC1 (2026-05-11): calc neutralized, kanban win preserved and grew. Tier-2 firing rate: 1/5 (20%) on calc, 4/5 (80%) on kanban; matches the corpus's 28.7% emit rate prediction. Full per-arm tables under `docs/specs/tasks/038-mur-check-did-you-mean-implementation.md` → "EC1 re-run (with gate)" subsection.
Watch-item carried into Phase 2: kanban CV widened (24% prior → 54% this batch). One of five runs hit 0 firings and tracked the long-tail base path. Gate behavior is path-dependent on the agent's exploration order, not just the project's static shape. Below Phase-1 blocker threshold; Phase 2 telemetry should track per-run firing counts.
What's in the tree
- `src/Reactor.Cli/Check/`: `CheckCommand`, `CheckArgs`, `CompilationLoader`, `FactoryIndex`, `SuggesterOrchestrator`, `TraceWriter`, `Telemetry`, plus `Suggesters/` (`ISuggester`, `SymbolSuggester`, `StringSimilarity`, `Thresholds`).
- `tests/Reactor.Tests/CheckCommandTests/`: 95 tests across suggester contract, orchestrator, factory index, trace writer, telemetry, gate, args parser, and a tuning sub-harness.
- `tests/Reactor.IntegrationTests/MurCheck/`: `MurCheckSmokeTest` against a minimal fixture.
- `docs/specs/tasks/038-tuning-reports/`: calibration reports + raw tuner JSON for both data checkpoints, plus the 525-run mining corpus mirrored in-tree (≈ 8 MB across four JSONL/JSON files) so future analyses are reproducible against the exact bytes even if the upstream `reactor-tokenusage` repo rotates.
Phase 3 priorities surfaced
The 525-run corpus reveals where Tier-2 fuzzy match is empirically wrong on Reactor types and where Phase 3 rules should pick up. Top three (full list in the tuning report):
- `*Background → SolidBackground` lookup (C0019, 16 events, 1.6%).
- … (`VerticalAlignment → VAlign`, `Style → fluent helpers`).
Phase 3 also needs a second-agent corpus drop to clear the Validation Gate's cross-agent reproducibility bar (#2).
Test plan
- `dotnet test tests/Reactor.Tests/Reactor.Tests.csproj -p:Platform=x64 --filter FullyQualifiedName~CheckCommandTests`: 95/95 green post-rebase.
- `mur check --help` shows `--trace <path>` and `--suggest-threshold <N>`.
Deferred follow-ups (not blocking merge; cleanly scoped)
- `Button.OnClick` canonical example (needs WindowsAppSDK restore on every test run).
- `still_present_at_run_end` harness fingerprint bug: Phase-4 prerequisite, not a Phase-1 blocker.
🤖 Generated with Claude Code