Spec 038 Phase 1: mur check did-you-mean engine (Tier 2 + diagnostic-count gate)#243

Merged
codemonkeychris merged 14 commits into main from feat/038-mur-check
May 11, 2026

@codemonkeychris
Collaborator

Lands Phase 0 (instrumentation) + Phase 1 (Tier-2 Roslyn semantic suggester) of spec 038. Closes Phase 1 of the task list at docs/specs/tasks/038-mur-check-did-you-mean-implementation.md.

Summary

  • Phase 0: instrumentation. mur check --trace <path> writes one JSONL row per parsed diagnostic alongside stdout, schema mirrored in spec §0.3. Source code text never leaves the user's machine; absolute paths outside the project root are redacted to <external>.
  • Phase 1: Tier-2 Roslyn semantic suggester. SymbolSuggester covers CS1061 / CS0103 / CS0117 / CS1503 / CS7036 against Microsoft.UI.Reactor.* types; it emits → try: <text> // [<evidence>] on the diagnostic line only when the suggestion's confidence clears the per-code threshold. Tier-1 REACTOR_* analyzer-ID hints still win ties (spec §9).
  • Per-code thresholds in Thresholds.cs calibrated against the 50-run corpus (Data Checkpoint B) and re-validated against the 525-run corpus (Data Checkpoint C). All five values intentionally conservative; full rationale + history in the file's header comment.
  • Diagnostic-count gate (spec §11 risk row, §14 #8): CheckCommand.ShouldEmitSuggestions skips Tier-2 when an invocation surfaces fewer than --suggest-threshold unique CS-prefixed diagnostics. Default 3, set 0 to disable. Resolves the EC1 calc-vs-kanban split.
  • MUR_TELEMETRY=1 opt-in. Local-first JSONL append to ~/.mur/telemetry/<yyyy-mm-dd>.jsonl with code/suggester/confidence/evidence_short. No source text, no file paths, no machine identifiers.
  • Tuning harness (tests/Reactor.Tests/CheckCommandTests/Tuning/) drives the suggester against a real corpus and writes per-code (precision, recall) curves. To reproduce, set the MUR_TUNING_CORPUS=<path> environment variable.

Eval Checkpoint 1 — both arms pass

5×N batch on gpt-5.5, using the same round-3 prompt as #226's Phase-7 sweep.

| Arm | Cost mean (Δ vs base) | Cost median (Δ) | First-build OK |
| --- | --- | --- | --- |
| reactor-calc-mur-check | −4% | parity | 5/5 |
| reactor-kanban-mur-check | −33% | −39% | 5/5 |

Pre-gate EC1 (2026-05-10): calc +21% (FAIL), kanban −24% (PASS). Post-gate EC1 (2026-05-11): calc neutralized, kanban win preserved and grew. Tier-2 firing rate: 1/5 (20%) on calc, 4/5 (80%) on kanban — matches the corpus's 28.7% emit rate prediction. Full per-arm tables under docs/specs/tasks/038-mur-check-did-you-mean-implementation.md → "EC1 re-run (with gate)" subsection.

Watch-item carried into Phase 2: kanban CV widened (24% prior → 54% this batch). One of five runs hit 0 firings and tracked the long-tail base path. Gate behavior is path-dependent on the agent's exploration order, not just the project's static shape. Below Phase-1 blocker threshold; Phase 2 telemetry should track per-run firing counts.

What's in the tree

  • src/Reactor.Cli/Check/CheckCommand, CheckArgs, CompilationLoader, FactoryIndex, SuggesterOrchestrator, TraceWriter, Telemetry, plus Suggesters/ (ISuggester, SymbolSuggester, StringSimilarity, Thresholds).
  • tests/Reactor.Tests/CheckCommandTests/ — 95 tests across suggester contract, orchestrator, factory index, trace writer, telemetry, gate, args parser, and a tuning sub-harness.
  • tests/Reactor.IntegrationTests/MurCheck/MurCheckSmokeTest against a minimal fixture.
  • docs/specs/tasks/038-tuning-reports/ — calibration reports + raw tuner JSON for both data checkpoints, plus the 525-run mining corpus mirrored in-tree (≈ 8 MB across four JSONL/JSON files) so future analyses are reproducible against the exact bytes even if the upstream reactor-tokenusage repo rotates.

Phase 3 priorities surfaced

The 525-run corpus reveals where Tier-2 fuzzy match is empirically wrong on Reactor types and where Phase 3 rules should pick up. Top three (full list in the tuning report):

  1. CS0117 / Theme*Background → SolidBackground lookup (C0019, 16 events, 1.6%).
  2. CS1061 / *Element: WinUI-name → Reactor-shortcut family (VerticalAlignment → VAlign, Style → fluent helpers).
  3. CS1955 / GridSize: missing-parens-on-factory (C0004, 110 events, 10.7%; largest single bucket in the corpus).

Phase 3 also needs a second-agent corpus drop to clear the Validation Gate's cross-agent reproducibility bar (#2).

Test plan

  • dotnet test tests/Reactor.Tests/Reactor.Tests.csproj -p:Platform=x64 --filter FullyQualifiedName~CheckCommandTests — 95/95 green post-rebase.
  • mur check --help shows --trace <path> and --suggest-threshold <N>.
  • Phase-0 trace output passes the 2 KB-per-row + project-root-only path tests.
  • EC1 + EC1 re-run both meet the spec's "tokens not regressed" bar.
  • CI run on this PR.

Deferred follow-ups (not blocking merge; cleanly scoped)

  • Reactor-touching integration fixture for the CS1061 Button.OnClick canonical example (needs WindowsAppSDK restore on every test run).
  • Wall-time perf trait test against the WinUI fixture.
  • Full Hamming-vector overload ranking in CS7036 (today: parameter-count distance only).
  • Return-type assignability filter in CS0103.
  • still_present_at_run_end harness fingerprint bug — Phase-4 prerequisite, not a Phase-1 blocker.

🤖 Generated with Claude Code

codemonkeychris and others added 10 commits May 11, 2026 05:25
Today's CheckCommand accepts only <path> and hardcodes --nologo,
-v:m, and -p:Platform={host arch}. Without an escape hatch the only
fallback for an agent that needs to override Platform, pick a
Configuration, skip restore, or pass arbitrary -p: properties is to
drop mur check and run dotnet build directly - which discards every
benefit the spec adds.

Adds a "CLI shape and MSBuild passthrough" subsection in §8
covering: the `mur check [path] [mur-flags] [-- <msbuild args>]`
shape, default-merging rules (auto-inject only if user did not
specify), boundary semantics (bare `--` is the unambiguous
separator; unknown mur-flags error), ranker-unchanged invariant
(passthrough alters the build, not diagnostic scoring), and trace
faithfulness (record effective command line). Adds an
implementation bullet to Phase 4.
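
The boundary and default-merging rules above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `Passthrough`, `Split`, and `MergeDefaults` are hypothetical names, the prefix checks cover only a few common spellings of each option, and rejecting unknown mur-flags is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Everything after the bare "--" is treated as MSBuild arguments.
var (murArgs, msbuildArgs) = Passthrough.Split(
    new[] { "app.csproj", "--trace", "t.jsonl", "--", "-p:Platform=arm64" });
var effective = Passthrough.MergeDefaults(msbuildArgs, hostPlatform: "x64");
Console.WriteLine(string.Join(" ", murArgs));   // app.csproj --trace t.jsonl
Console.WriteLine(string.Join(" ", effective)); // --nologo -v:m -p:Platform=arm64

// Hypothetical helper; names and prefix list are assumptions for this sketch.
static class Passthrough
{
    // Split at the first bare "--": left side is mur's own arguments.
    public static (string[] Mur, string[] MsBuild) Split(string[] args)
    {
        int i = Array.IndexOf(args, "--");
        return i < 0
            ? (args, Array.Empty<string>())
            : (args[..i], args[(i + 1)..]);
    }

    // Inject each default only when the user did not already pass that option.
    public static List<string> MergeDefaults(string[] userArgs, string hostPlatform)
    {
        bool Has(params string[] prefixes) => userArgs.Any(a =>
            prefixes.Any(p => a.StartsWith(p, StringComparison.OrdinalIgnoreCase)));

        var result = new List<string>();
        if (!Has("--nologo", "-nologo")) result.Add("--nologo");
        if (!Has("-v:", "--verbosity")) result.Add("-v:m");
        if (!Has("-p:Platform=", "--property:Platform="))
            result.Add($"-p:Platform={hostPlatform}");
        result.AddRange(userArgs);
        return result;
    }
}
```

Because the merge only adds options, the user's own arguments pass through unchanged, which is what lets the trace record the effective command line faithfully.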

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the convention from docs/specs/tasks/036-window-design-
implementation.md: phased checkboxes, "task is done only when"
gate, conventions header. Adapted for spec 038's external
dependency on spec 037's data corpus.

Adds three concepts not in earlier task lists because no other
spec needed them:

- Human Validation Gate: six-bar checklist every Tier 3 rule
  must clear before merge (frequency >=5%, count >=10, cross-
  agent reproducibility, >=3 positive fixtures, >=2 negative,
  independent reviewer signoff). Plus auto-suppression policy
  for rules whose telemetry accept-rate drops below 50%.

- Data Checkpoints A/B/C/D: staged hand-offs from spec 037's
  harness with explicit blocking relationships. A is the
  current 3-pair smoke (already landed); B/C/D gate Phase 1
  threshold tuning, Phase 3 rule authoring, and Phase 4 ranker
  training respectively.

- Eval Checkpoints EC1-EC4: 5xN batches on gpt-5.5 vs reactor-
  calc/kanban with predicted lift bands and pass criteria.
  EC1 after Tier 2 only; EC2 after deterministic ranker + 5
  rules; EC3 at V1 ship (10-15 rules); EC4 if learned ranker
  is pursued.

Quantity bars (per the user's question): 5 rules before EC2,
10-15 rules covering >=80% of fix events for V1 ship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gine

Phase 0 (instrumentation, no behavior change):
- `mur check --trace <path>` writes one JSONL row per parsed diagnostic to
  <path> in addition to stdout. Schema: {ts, code, severity, file, line, col,
  msg, receiver_type?, member?, mode}. Source code text is never written;
  absolute paths outside the project root are redacted to "<external>"; rows
  are bounded to 2 KB.
- New folders src/Reactor.Cli/Check/{Suggesters,Rules}/ with README pointers
  to spec 038 §5/§6 and the Validation Gate.
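
A minimal sketch of the path-redaction rule, assuming a hypothetical `TracePaths.Sanitize` helper (not the PR's TraceWriter). Returning a project-relative, forward-slash form for in-root paths is a choice made for this sketch, and symlinks are not resolved here.

```csharp
using System;
using System.IO;

string root = Path.Combine(Path.GetTempPath(), "app");
Console.WriteLine(TracePaths.Sanitize(root, Path.Combine(root, "src", "Main.cs"))); // src/Main.cs
Console.WriteLine(TracePaths.Sanitize(root, Path.GetTempPath()));                   // <external>

// Hypothetical helper illustrating the redaction rule.
static class TracePaths
{
    public static string Sanitize(string projectRoot, string path)
    {
        string fullRoot = Path.GetFullPath(projectRoot);
        string fullPath = Path.GetFullPath(path, fullRoot);
        string rel = Path.GetRelativePath(fullRoot, fullPath);

        // GetRelativePath yields a ".."-prefixed path when the target is outside
        // the root, and the rooted path itself when the two share no common root.
        bool outside = rel == ".."
            || rel.StartsWith(".." + Path.DirectorySeparatorChar, StringComparison.Ordinal)
            || rel.StartsWith(".." + Path.AltDirectorySeparatorChar, StringComparison.Ordinal)
            || Path.IsPathRooted(rel);

        return outside ? "<external>" : rel.Replace('\\', '/');
    }
}
```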

Phase 1 (Tier-2 Roslyn semantic suggester):
- ISuggester contract + SymbolSuggester covering CS1061, CS0103, CS0117,
  CS1503, CS7036 against Microsoft.UI.Reactor.* symbols.
- CompilationLoader: csproj/source resolution, project.assets.json reference
  resolution, (csproj-path, file-mtime-hash) cache, symlink containment, perf
  budget cold ≤ 500 ms / warm ≤ 50 ms.
- FactoryIndex: pre-filter over Microsoft.UI.Reactor.Factories.* static
  methods with cached parameter-name arrays for named-argument suggestions.
- StringSimilarity (Jaro–Winkler) for fuzzy member-name matching.
- SuggesterOrchestrator wires the suggester into CheckCommand.Run; Tier-1
  HintFor still wins ties at the format layer (spec §9).
- MUR_TELEMETRY=1 opt-in: appends (code, suggester, confidence,
  evidence_short) to ~/.mur/telemetry/<yyyy-mm-dd>.jsonl. Fields bounded to
  256 bytes; no source text, file paths, or machine identifiers.
- Args parser recognises `--trace` and `--help`; rejects unknown flags
  rather than silently forwarding (full passthrough lands in Phase 2).
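
For readers unfamiliar with the metric, here is a textbook Jaro–Winkler sketch. It is not the PR's StringSimilarity source; the class name `JaroWinklerSketch` and the standard prefix scale of 0.1 are assumptions of this example.

```csharp
using System;
using System.Globalization;

double score = JaroWinklerSketch.Similarity("MARTHA", "MARHTA");
Console.WriteLine(score.ToString("F4", CultureInfo.InvariantCulture)); // 0.9611

// Textbook implementation; illustrative only.
static class JaroWinklerSketch
{
    public static double Similarity(string a, string b, double prefixScale = 0.1)
    {
        double jaro = Jaro(a, b);
        // Winkler bonus: reward a shared prefix of up to four characters.
        int prefix = 0;
        int max = Math.Min(4, Math.Min(a.Length, b.Length));
        while (prefix < max && a[prefix] == b[prefix]) prefix++;
        return jaro + prefix * prefixScale * (1 - jaro);
    }

    static double Jaro(string a, string b)
    {
        if (a.Length == 0 && b.Length == 0) return 1.0;
        if (a.Length == 0 || b.Length == 0) return 0.0;

        int window = Math.Max(0, Math.Max(a.Length, b.Length) / 2 - 1);
        var matchedA = new bool[a.Length];
        var matchedB = new bool[b.Length];
        int matches = 0;
        for (int i = 0; i < a.Length; i++)
        {
            int start = Math.Max(0, i - window);
            int end = Math.Min(b.Length, i + window + 1);
            for (int j = start; j < end; j++)
            {
                if (matchedB[j] || a[i] != b[j]) continue;
                matchedA[i] = matchedB[j] = true;
                matches++;
                break;
            }
        }
        if (matches == 0) return 0.0;

        // Count matched characters that appear in a different order.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < a.Length; i++)
        {
            if (!matchedA[i]) continue;
            while (!matchedB[k]) k++;
            if (a[i] != b[k]) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / a.Length + m / b.Length + (m - outOfOrder / 2.0) / m) / 3.0;
    }
}
```

The metric rewards shared prefixes and near-transpositions, so it suits typo-style mistakes; renames that change most of the name score low.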

Tests:
- 66 unit tests under tests/Reactor.Tests/CheckCommandTests/ covering args
  parsing, trace schema/redaction/length cap, compilation loader, factory
  index, JaroWinkler, every suggester code path (positive + negative),
  orchestrator filtering (Reactor-touching gate), Tier-1-wins-ties, and
  telemetry opt-in/byte-bound.
- Integration smoke test exercising mur.exe end-to-end against a deliberately
  broken fixture under tests/Reactor.IntegrationTests/MurCheck/Fixtures/
  SmokeFixture/. Reactor-touching CS1061 fixture is scoped as a follow-up
  because it requires WindowsAppSDK restore.

Pause point per spec 038 task list:
- Phase 1.8 exit needs Data Checkpoint B (≥50 unique pairs from spec 037's
  harness) for per-code threshold tuning before merging to main.
- Phase 2 + 3 are blocked by Phase 1 merge / Data Checkpoint C respectively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… blocker

- Status snapshot at the top of the task list — Phase 0/1 code-complete on
  feat/038-mur-check, not merged to main, blocked on threshold tuning.
- Phase 0 (0.1–0.5) and Phase 1 (1.1–1.7) checkboxes ticked, with one-line
  notes on the four items deferred as follow-ups (Reactor-touching
  integration fixture, perf-trait wall-time test, full Hamming-vector
  overload ranking, CS0103 return-type filter).
- Data Checkpoint A re-audit (2026-05-10): Gap #1 (receiver_type), #2
  (dedup), #4 (cosmetic) all FIXED in the new harness output. Gap #3
  (ranker negative class) still NOT fixed — all 3 ranker-labels rows are
  positive class; for 3 runs the spec expects ~30–80 rows.
- Plus a fix_kind classifier nit: ButtonElement HorizontalAlignment →
  HAlign should be renamed_member, not other (receiver type unchanged,
  member name swapped).
- Recommendation against kicking off the 50-run sweep until Gap #3 lands;
  records that Data Checkpoint B is therefore still blocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…un_end as Phase-4 prereq

Audit pass 2 of the 6-row ranker-labels output (2026-05-10):

- Gap #3 (ranker negative class) is now FIXED. 4 positive / 2 negative on
  addressed_by_next_fix; harness emits per-build per-diagnostic rows; three
  CS8012 emissions in run 5d5fef… recorded as three independent training
  rows, exactly per spec 037 §3 "don't dedupe across builds".
- fix_kind classifier nit partially fixed — both pairs now classify as
  renamed_member; debatable for pair 2's structural rewrite but acceptable.
- New known limitation: still_present_at_run_end is uniformly false even
  when the diagnostic IS in the final build (CS8012 timing-tail
  fingerprint quirk). Primary ranker label addressed_by_next_fix is
  unaffected; auxiliary agent_ignored is corrupted, which breaks the
  spec 038 §11 auto-suppression-telemetry hook. Tracked as a Phase-4
  prerequisite — file with harness owner before Data Checkpoint D.

Status updates:

- Status snapshot at top: 50-run sweep cleared to start.
- Data Checkpoint B: status flipped from blocked to unblocked. Added a
  step-by-step "pickup procedure for the next session" so the next agent
  can run cold once the corpus lands (audit, threshold tuning, EC1, merge).
- Data Checkpoint D: documents still_present_at_run_end as the remaining
  prerequisite before training the learned ranker.
- Phase 1.8 exit criterion points at the pickup procedure instead of the
  stale "blocked" note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… corpus

Phase 1 ship-gate calibration (Data Checkpoint B). Adds per-code emit
thresholds in src/Reactor.Cli/Check/Suggesters/Thresholds.cs, refactors
SymbolSuggester so the threshold gate has a single source of truth
(SuggestRaw exposes raw confidence; Suggest gates via Thresholds.For),
and removes the now-redundant duplicate cut from SuggesterOrchestrator.

Per-code values informed by tests/Reactor.Tests/CheckCommandTests/Tuning/
which runs the suggester against fixes.jsonl. Snapshot of the first run
is in docs/specs/tasks/038-tuning-reports/2026-05-10-50run.md:
  - CS1061 → 0.80 (only firing in corpus was at conf 0.43; raised
    threshold blocks structural-rewrite false positives)
  - CS0103 → 0.75 default (2/2 firings at conf 1.00 matched)
  - CS0117 / CS1503 / CS7036 → 0.75 default (insufficient signal in
    50-pair corpus; revisit at Data Checkpoint C, 500+ pairs)

Tuning harness is gated on env var MUR_TUNING_CORPUS so it skips
cleanly when the (sibling-repo) corpus isn't available.

Phase-1 next gate: Eval Checkpoint EC1 vs. main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inding

EC1 5×N (n=5 paired rounds on gpt-5.5, round-3 prompt with reflection
ban + trust-the-suggestion + mur-check-is-the-build rules):

  kanban-variant — cost mean −24% / median −33% / CV 24% vs base 81%
                   wins 4 of 5 paired rounds; the suggestion mechanism
                   is itself a variance stabilizer.

  calc-variant   — cost mean +21% / wall mean +23%
                   real and consistent across the batch.

Diagnosis: ~5–8s per-invocation mur-check overhead does not amortize on
~150-LoC projects with no API exploration surface to skip. Validated
empirically — the prompt iterations that helped (round-2 reflection
ban, round-3 trust-mur-as-build) closed every explore-around-the-
suggestion loophole, but the floor remains.

Strict EC1 pass criterion ("tokens not regressed") fails on calc,
passes cleanly on kanban. Captured as:

  - spec 038 §11 risk row + §14 open question on a project-size /
    per-invocation-diagnostic-count gate
  - task doc EC1 results section with per-arm means/sd, paired comparisons
  - status snapshot updated; Phase 1 merge decision noted as pending

Decision options on the merge: (a) ship Phase 1 as-is and accept the
calc tax; (b) land the gate before merging. Either is product-side; this
commit only records the data and the design surface.

No code changes. SKILL.md intentionally left untouched — larger
mining sweep is running under the current SKILL.md to compare aided vs
un-aided baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l-test race

ThresholdsTests.Each_handled_code_has_a_threshold_in_valid_range was flaking
because the tuner test (ThresholdTuningTests.Run_against_inline_one_row_corpus_produces_report)
temporarily writes Thresholds.PerCode to an all-zero map for its duration,
and xUnit's default cross-class parallelism let the validation tests observe
the zeroed state mid-tuner-run.

Switching the override channel to AsyncLocal isolates the tuner's scoped
change to its own logical thread; concurrent readers in other tests see
the immutable production defaults. Production code paths still go through
the same Thresholds.For(code) entry point — behaviour for the CLI is
unchanged.
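
The isolation argument can be illustrated with a small sketch. Only `Thresholds.For` is named in this PR; `OverrideForScope`, the `Restore` helper, and the two-entry default map are hypothetical stand-ins.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

var zeroed = new Dictionary<string, double> { ["CS1061"] = 0.0 };
var overrideActive = new ManualResetEventSlim();
double seenByOtherThread = double.NaN;

// Started before the override is installed, like a test class running in parallel.
var other = new Thread(() =>
{
    overrideActive.Wait();
    seenByOtherThread = Thresholds.For("CS1061");
});
other.Start();

using (Thresholds.OverrideForScope(zeroed))
{
    overrideActive.Set();
    other.Join();
    Console.WriteLine(Thresholds.For("CS1061")); // scoped (zeroed) value
}
Console.WriteLine(seenByOtherThread);            // production default

// Hypothetical shape; only the For(code) entry point comes from the PR text.
static class Thresholds
{
    const double Fallback = 0.75;
    static readonly IReadOnlyDictionary<string, double> Defaults =
        new Dictionary<string, double> { ["CS1061"] = 0.80, ["CS0103"] = 0.75 };

    // AsyncLocal values flow with the logical call context, not across unrelated threads.
    static readonly AsyncLocal<IReadOnlyDictionary<string, double>?> Scoped = new();

    public static double For(string code)
    {
        var map = Scoped.Value ?? Defaults;
        return map.TryGetValue(code, out double t) ? t : Fallback;
    }

    public static IDisposable OverrideForScope(IReadOnlyDictionary<string, double> map)
    {
        var previous = Scoped.Value;
        Scoped.Value = map;
        return new Restore(previous);
    }

    sealed class Restore : IDisposable
    {
        readonly IReadOnlyDictionary<string, double>? _previous;
        public Restore(IReadOnlyDictionary<string, double>? previous) => _previous = previous;
        public void Dispose() => Scoped.Value = _previous;
    }
}
```

With a plain mutable static, the second thread would have read the zeroed map while the override was active.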

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation

Two related changes:

1. Diagnostic-count gate (spec §11 risk row, §14 #8) — resolves the EC1
   calc-vs-kanban split where ~150-LoC calc regressed +21% cost while
   kanban won −24%. `CheckCommand.ShouldEmitSuggestions` skips Tier-2
   when an invocation surfaces fewer than `--suggest-threshold` unique
   CS-prefixed diagnostics; default 3, 0 disables. Counts the same
   dedup key `EmitDiagnostics` uses.

2. Data Checkpoint C — 525-pair mining corpus from spec 037's harness
   mirrored at `docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/`
   (1,027 fixes / 1,233 ranker rows / 104 clusters; `gpt-5.5` only).
   Tuning report at `2026-05-11-525run.md`.

The empirical CS-diagnostics-per-build distribution from the corpus
(43% of builds have 1 diagnostic, 28% have 2, 28.7% have ≥ 3) confirms
T=3 is the right initial cut-line. The 525-run corpus also surfaces
that JaroWinkler fuzzy match has near-0% empirical precision on
CS1061/CS0117 against Reactor types — the agent's typical mistake is
reaching for a WinUI-style name (`.VerticalAlignment`,
`Theme.AppBackground`) whose Reactor replacement is too far in
edit-distance to find, so we pick a wrong sibling. All per-code
thresholds held; systematic fix is Phase-3 rule authoring. Top three
rule targets identified in the report.
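
The gate in item 1 can be sketched as below. Only `ShouldEmitSuggestions` and the threshold semantics come from this commit; the `Diagnostic` record, its (code, file, line, col) dedup key, and the `REACTOR_001` ID are assumptions for illustration, since the actual key is whatever `EmitDiagnostics` uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var diagnostics = new[]
{
    new Diagnostic("CS1061", "App.cs", 10, 5),
    new Diagnostic("CS1061", "App.cs", 10, 5),      // duplicate row, counted once
    new Diagnostic("CS0103", "App.cs", 22, 9),
    new Diagnostic("REACTOR_001", "App.cs", 3, 1),  // hypothetical ID; not CS-prefixed
};
Console.WriteLine(Gate.ShouldEmitSuggestions(diagnostics, suggestThreshold: 3)); // False
Console.WriteLine(Gate.ShouldEmitSuggestions(diagnostics, suggestThreshold: 0)); // True

// Record value equality stands in for the dedup key in this sketch.
record Diagnostic(string Code, string File, int Line, int Col);

static class Gate
{
    public static bool ShouldEmitSuggestions(
        IEnumerable<Diagnostic> diagnostics, int suggestThreshold)
    {
        if (suggestThreshold <= 0) return true; // 0 disables the gate

        int unique = diagnostics
            .Where(d => d.Code.StartsWith("CS", StringComparison.Ordinal))
            .Distinct()
            .Count();
        return unique >= suggestThreshold;
    }
}
```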

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5×N re-run on `feat/038-mur-check` @ aaa4cce (gate enabled, default
--suggest-threshold 3), same matrix as the prior EC1 batch.

  reactor-calc-mur-check  : cost -4%  (was +21%) — PASS
  reactor-kanban-mur-check: cost -33% (was -24%) — PASS, grew the win
  First-build OK 5/5 both variant arms.

Calc neutralized; kanban win preserved and grew. Phase 1 acceptance bar
met. Spec §11 risk row + §14 #8 updated to mark the mitigation validated.

Watch-item carried into Phase 2: kanban CV widened (24% prior -> 54%
this batch) because one of five runs hit 0 firings and tracked the
long-tail base path. Gate is path-dependent on agent's exploration
order, not just the project's static shape. Below Phase-1 blocker
threshold; Phase 2 telemetry should track per-run firing counts.

No code change in this commit — eval results + spec doc updates only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Implements Spec 038 Phase 0/1 for mur check: JSONL trace instrumentation, a Tier-2 Roslyn semantic “did-you-mean” suggester (with per-code thresholds), an opt-in local telemetry append, and a diagnostic-count gate to avoid Tier-2 overhead on small builds.

Changes:

  • Add --trace JSONL output and opt-in local telemetry for suggestion emissions.
  • Add Tier-2 Roslyn-based suggestions for select CS diagnostics, including orchestration + per-code confidence thresholds and a per-invocation diagnostic-count gate.
  • Add extensive unit-test coverage plus a smoke integration fixture for mur check.
| File | Description |
| --- | --- |
| tests/Reactor.Tests/CheckCommandTests/Tuning/ThresholdTuningTests.cs | End-to-end + unit surfaces for the threshold tuning harness. |
| tests/Reactor.Tests/CheckCommandTests/Tuning/SuggesterTuner.cs | Drives SymbolSuggester over corpus rows to compute tuning summaries. |
| tests/Reactor.Tests/CheckCommandTests/Tuning/ReactorCorpusStubs.cs | Stub Reactor/WinUI surface for corpus compilation in tuning runs. |
| tests/Reactor.Tests/CheckCommandTests/Tuning/CorpusReaderTests.cs | Tests JSONL corpus parsing behavior (tolerant read, required fields). |
| tests/Reactor.Tests/CheckCommandTests/Tuning/CorpusReader.cs | JSONL corpus parser into CorpusFix structures. |
| tests/Reactor.Tests/CheckCommandTests/TraceWriterTests.cs | Validates trace schema, size cap, and path redaction behavior. |
| tests/Reactor.Tests/CheckCommandTests/TestCompilation.cs | In-memory Roslyn compilation builder excluding real Reactor/WinUI assemblies. |
| tests/Reactor.Tests/CheckCommandTests/TelemetryTests.cs | Verifies opt-in telemetry append and field-size constraints. |
| tests/Reactor.Tests/CheckCommandTests/Suggesters/ThresholdsTests.cs | Pins Thresholds contract and override behavior. |
| tests/Reactor.Tests/CheckCommandTests/Suggesters/SymbolSuggesterTests.cs | Unit tests for Tier-2 suggester behaviors across supported CS codes. |
| tests/Reactor.Tests/CheckCommandTests/Suggesters/SuggesterContractTests.cs | Contract/shape tests for suggester context/result types. |
| tests/Reactor.Tests/CheckCommandTests/Suggesters/StringSimilarityTests.cs | Tests for Jaro–Winkler similarity helper. |
| tests/Reactor.Tests/CheckCommandTests/SuggesterOrchestratorTests.cs | Orchestrator wiring tests, including Tier-1 vs Tier-2 precedence. |
| tests/Reactor.Tests/CheckCommandTests/FactoryIndexTests.cs | Tests indexing of Microsoft.UI.Reactor.Factories for suggestions. |
| tests/Reactor.Tests/CheckCommandTests/CompilationLoaderTests.cs | Tests compilation loading, caching, and exclusions (obj/bin). |
| tests/Reactor.Tests/CheckCommandTests/CheckCommandPipelineTests.cs | Pipeline tests for parsing, dedupe, trace emission, and gate behavior. |
| tests/Reactor.Tests/CheckCommandTests/CheckArgsTests.cs | Tests for the new mur check argument parsing/help text. |
| tests/Reactor.IntegrationTests/Reactor.IntegrationTests.csproj | Excludes broken fixture .cs from compilation; includes them as None. |
| tests/Reactor.IntegrationTests/MurCheck/MurCheckSmokeTest.cs | End-to-end smoke test invoking mur check on a broken fixture. |
| tests/Reactor.IntegrationTests/MurCheck/Fixtures/SmokeFixture/SmokeFixture.csproj | Minimal broken fixture project for smoke integration test. |
| tests/Reactor.IntegrationTests/MurCheck/Fixtures/SmokeFixture/Program.cs | Broken code to trigger CS1061 on a non-Reactor receiver. |
| src/Reactor.Cli/Check/TraceWriter.cs | JSONL trace writer with redaction + per-row size constraints. |
| src/Reactor.Cli/Check/Telemetry.cs | Opt-in local telemetry writer for suggestion emissions. |
| src/Reactor.Cli/Check/Suggesters/Thresholds.cs | Per-diagnostic-code confidence thresholds (async-local override for tests). |
| src/Reactor.Cli/Check/Suggesters/SymbolSuggester.cs | Tier-2 semantic suggester across CS1061/0103/0117/1503/7036. |
| src/Reactor.Cli/Check/Suggesters/StringSimilarity.cs | Jaro–Winkler similarity implementation used by Tier-2 matching. |
| src/Reactor.Cli/Check/Suggesters/README.md | Tier-2 suggester design/constraints documentation. |
| src/Reactor.Cli/Check/Suggesters/ISuggester.cs | Suggester contract + context/result types. |
| src/Reactor.Cli/Check/SuggesterOrchestrator.cs | Orchestrates suggesters against MSBuild diagnostics + Roslyn compilation. |
| src/Reactor.Cli/Check/Rules/README.md | Tier-3 rules overview and validation gate pointer. |
| src/Reactor.Cli/Check/FactoryIndex.cs | Factory method index for suggestions (name + overload metadata). |
| src/Reactor.Cli/Check/CompilationLoader.cs | Loads/caches Roslyn compilation from a project on disk and assets.json refs. |
| src/Reactor.Cli/Check/CheckCommand.cs | Adds args parsing/help, trace/telemetry wiring, suggestions + gate. |
| src/Reactor.Cli/Check/CheckArgs.cs | Parses --trace and --suggest-threshold, emits help text. |
| docs/specs/tasks/038-tuning-reports/2026-05-11-525run.md | Corpus analysis report for Data Checkpoint C. |
| docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/unresolved.jsonl | Mirrored corpus artifact for reproducibility (unresolved rows). |
| docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/README.md | Explains mirrored corpus files and intended uses. |
| docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/gate-distribution.py | Script to compute gate distribution from ranker-labels JSONL. |
| docs/specs/tasks/038-tuning-reports/2026-05-10-50run.md | Data Checkpoint B tuning report snapshot. |
| docs/specs/tasks/038-tuning-reports/2026-05-10-50run.json | Data Checkpoint B raw tuner JSON snapshot. |
| docs/specs/tasks/038-mur-check-did-you-mean-implementation.md | Implementation task tracker updated with Phase 0/1 status/results. |
| docs/specs/038-mur-check-did-you-mean-design.md | Spec updated for passthrough and gate rationale/resolution. |
| CHANGELOG.md | Documents new mur check features and calibration artifacts. |

Copilot's findings

Comments suppressed due to low confidence (1)

src/Reactor.Cli/Check/CheckCommand.cs:55

  • CheckArgs/CheckCommand documentation says <path> can be a single .cs file, but Run() ultimately invokes dotnet build <path>. dotnet build does not accept an arbitrary .cs file as a build target, so passing a .cs path will reliably fail. Suggest either resolving the nearest .csproj (and using that for both dotnet build and projectRoot/CompilationLoader), or rejecting .cs inputs and updating help text to match the supported contract.
        var path = parsed.Path;
        if (!File.Exists(path) && !Directory.Exists(path))
        {
            Console.Error.WriteLine($"mur check: '{path}' not found.");
            return 1;
        }
  • Files reviewed: 44/47 changed files
  • Comments generated: 7

Comment thread src/Reactor.Cli/Check/CheckCommand.cs Outdated
Comment thread src/Reactor.Cli/Check/SuggesterOrchestrator.cs Outdated
Comment thread src/Reactor.Cli/Check/Suggesters/SymbolSuggester.cs
Comment thread src/Reactor.Cli/Check/Telemetry.cs Outdated
Comment thread src/Reactor.Cli/Check/Suggesters/StringSimilarity.cs Outdated
Comment thread tests/Reactor.IntegrationTests/MurCheck/MurCheckSmokeTest.cs
Comment thread src/Reactor.Cli/Check/CheckArgs.cs
codemonkeychris and others added 4 commits May 11, 2026 05:40
Layered walkthrough of the system as shipped in this PR (Phase 0 + Phase 1
+ diagnostic-count gate): plain-language intro, four-tier architecture,
mining pipeline, threshold tuning, end-to-end recommendation flow, the
small-project gate, and a future-improvements section sketching Phases
2–4. Lives at docs/reference/ alongside the other developer-facing
references; points back to docs/specs/038… for decision history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses six review comments from Copilot reviewer on the Phase-1 PR:

1. Hoist CompilationLoader.Load to once per `mur check` invocation
   instead of once per emitted diagnostic. Even with the internal cache,
   Load() re-enumerates `.cs` files and recomputes the file-set hash on
   every call, so the prior wiring was O(diagnostics × files). Now a single
   Load (O(files)) per invocation via SuggesterOrchestrator.SuggestAgainst.

2. FindTreeFor suffix match now requires a path-separator boundary so a
   diagnostic on bare "Program.cs" no longer mis-binds to a sibling tree
   like "MyProgram.cs". Cross-platform: matches against both '/' and '\'
   prefixes since CSharpCompilation trees can carry either separator
   depending on the project's source-list. Added regression test.

3. Filter synthesized property/event accessors (get_X / set_X / add_X /
   remove_X) out of CollectStaticMembers + CollectInstanceMembers. The
   525-run calibration report flagged conf=0.88 emissions of
   `Theme.get_Background` — same hazard would surface on CS1061 against
   instance-member walks. Fixed at the source.

4. Telemetry.Truncate now enforces a true UTF-8 byte limit, not a char
   limit. Pure-ASCII content still hits the cheap fast path. New test
   covers a 200-glyph CJK string (600 bytes pre-truncation).

5. StringSimilarity header comment no longer claims "allocation-free on
   the hot path" — Jaro() allocates two bool[]s per call. Comment now
   reflects reality with a pointer to revisit via stackalloc if perf-
   trait tests show it on the hot path.

6. Drop the "`<path>` accepts a .cs file" claim from CheckArgs.HelpText
   and the CheckCommand.cs header. `dotnet build <single.cs>` doesn't
   work end-to-end so the help text was misleading. CompilationLoader
   still walks up to the nearest .csproj when seeded from a .cs file
   (tooling/test seam only).
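
The boundary check from item 2 can be sketched as follows. `TreePaths.MatchesAtSegmentBoundary` is a hypothetical name, not the PR's FindTreeFor, and the case-insensitive comparison is an assumption of this sketch.

```csharp
using System;

Console.WriteLine(TreePaths.MatchesAtSegmentBoundary("src/MyProgram.cs", "Program.cs"));  // False
Console.WriteLine(TreePaths.MatchesAtSegmentBoundary("src/Program.cs", "Program.cs"));    // True
Console.WriteLine(TreePaths.MatchesAtSegmentBoundary(@"src\Program.cs", "Program.cs"));   // True

// Hypothetical helper illustrating the separator-boundary rule.
static class TreePaths
{
    public static bool MatchesAtSegmentBoundary(string treePath, string diagnosticPath)
    {
        const StringComparison cmp = StringComparison.OrdinalIgnoreCase;
        if (string.Equals(treePath, diagnosticPath, cmp)) return true;
        if (!treePath.EndsWith(diagnosticPath, cmp)) return false;

        // A plain suffix is not enough: the character just before the suffix
        // must be a path separator, in either style.
        char before = treePath[treePath.Length - diagnosticPath.Length - 1];
        return before == '/' || before == '\\';
    }
}
```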

Skipped: Copilot comment #6 (MurCheckSmokeTest.cs build-on-demand for
mur.exe). It changes the integration-tests CI contract and is cleanly
scoped as a separate follow-up.
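
For item 4, a byte-accurate truncation might look like the sketch below. `Utf8.Truncate` is a hypothetical name, not the PR's Telemetry.Truncate, and the early return via `GetByteCount` stands in for whatever fast path the real code uses.

```csharp
using System;
using System.Text;

string cjk = new string('中', 200); // 3 bytes per glyph, 600 bytes in UTF-8
string cut = Utf8.Truncate(cjk, 256);
Console.WriteLine(cut.Length);                      // 85
Console.WriteLine(Encoding.UTF8.GetByteCount(cut)); // 255

// Hypothetical helper; cuts on rune boundaries so no character is split.
static class Utf8
{
    public static string Truncate(string s, int maxBytes)
    {
        if (Encoding.UTF8.GetByteCount(s) <= maxBytes) return s;

        int bytes = 0, chars = 0;
        foreach (Rune rune in s.EnumerateRunes())
        {
            int size = rune.Utf8SequenceLength;
            if (bytes + size > maxBytes) break;
            bytes += size;
            chars += rune.Utf16SequenceLength;
        }
        return s.Substring(0, chars);
    }
}
```

A char-count limit of 256 would have let this string through at 600 bytes; counting encoded bytes keeps the field within the 256-byte bound.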

Full suite: 7051/7097 pass (46 skipped — perf-trait + opt-in, unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…; resolve symbol-binding

Doc-only updates across spec, task list, and reference doc to reflect the
decision that `mur check`'s did-you-mean engine is load-bearing
infrastructure for the next 1-3 years, not a token-saving stopgap base-
model improvements will erode.

Why now: Reactor will keep churning faster than models retrain, and WinUI
3 is structurally weak / cross-confused with WPF/Silverlight/WinUI 1/2 in
training data. The 525-run corpus directly evidences the confusion shape
(`.VerticalAlignment` on Reactor `*Element` types, `Theme.AppBackground`,
`.Style(...)`). Tier-2 fuzzy match cannot bridge those by edit distance;
only deterministic vocabulary-translation rules can.

Spec changes (038-mur-check-did-you-mean-design.md):
- §1: new "Why this is load-bearing" subsection with the two-condition
  argument and explicit sunset criterion (≥12mo API stability +
  ≥90% first-build-OK on ≥2 vendor-distinct models).
- §6: rename "induced pattern rules" → "induced and authored pattern
  rules"; split into Class A (induced, frequency-justified) and Class B
  (vocabulary-translation, structurally justified, frequency bar waived).
  Add the resolved symbol-binding decision: rules bind to Roslyn ISymbol
  references via RuleSymbolResolver, not name strings, with a CI gate
  that fails the build on unresolved rule targets.
- §11: two new risks — Reactor API churn invalidating rules/corpus, and
  data-pipeline SLA/owner — both with mitigations.
- §13 Phase 5: reframe from "only if needed" to "scheduled, deferred"
  pending Data Checkpoint D. Add explicit sunset criterion subsection.
- §14: resolve question #8 (symbol-binding decision recorded with
  rationale); add question #9 on vocab-table provenance; renumber
  project-size gate to #10.

Task list (tasks/038-mur-check-did-you-mean-implementation.md):
- Top: new "Framing (read this first)" section with the load-bearing
  argument and sunset criterion link.
- Phase 3: split rule template into Class A / Class B variants. Add §3.0
  pre-phase prerequisites (corpus-pipeline owner; minor-release refresh
  cadence; in-repo `038-vocab-table.csv`). Add §3.1a symbol-binding
  contract with CI gate. Update quantity gates + exit criterion for the
  two-class split.
- Phase 4: status change from "optional" to "scheduled, deferred until
  Data Checkpoint D"; escape hatch remains as the unexpected outcome.
- New "Maintenance (load-bearing operation)" section under cross-cutting
  concerns: API-churn protocol per Reactor minor; corpus freshness rule;
  per-rule accept-rate monitoring; annual sunset-readiness check.

Reference doc (docs/reference/mur-check-did-you-mean.md):
- §1: new "Why this is load-bearing" subsection mirrored from the spec.
- §9 Future improvements: new "The recurring failure mode: WinUI/WPF
  vocabulary confusion" subsection that names the structural failure
  shape, then reframes Phase 3 as Class A / Class B and Phase 4 as
  scheduled-not-optional. Symbol-binding decision called out.
- Out-of-scope (small LLM generator): reasoning now explicitly linked to
  the load-bearing argument (weak training data is why we need the
  system AND why a smaller model trained on it won't fix the gap).
- Glossary: new entries for Class A / Class B rule, Load-bearing,
  Sunset criterion. Validation Gate entry updated for the Class-B
  frequency-bar waiver.
- Closing: pointer to the new task-doc Maintenance section.

No code changes in this commit; all infrastructure decisions documented
here will land in Phase-3 rule PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Quality and self-consistency pass on the Phase-1 branch. All changes are
contained to Phase-1 surface; broader CR concerns about CS1061/CS0117 fuzzy
behavior, MSBuild-accurate loading, and per-code precision gating remain
scoped to Phase 2/3 per the spec's existing plan.

- CR #10: drop CS8602 warning in CompilationLoader by routing EmptyCompilation
  through a named factory method with explicit Array.Empty<> coalescing.
- CR #11: diagnostic regex now uses a reluctant file capture anchored on the
  (line,col): suffix so MSBuild lines with parenthesized path segments parse
  correctly. Added pipeline test.
- CR #8: TraceWriter.SanitizePath normalizes in-root absolute paths to
  project-relative (forward-slash) form so traces no longer carry
  `C:\Users\<name>\...` prefixes. Tests strengthened to assert no row is
  ever an absolute path.
- CR #13: replace misleading CS7036 tiebreak comment with accurate
  description; full Hamming-vector ranker stays a deferred follow-up per
  spec §1.5.
- CR #14: 525-run report reworded — "500-pair volume bar met, cross-agent
  bar NOT met"; Theme.get_Background note marked "fixed in this branch"
  with code reference.
- CR #2 (narrowed): tighten CS1061 factory-argument receiver check to
  require receiver IS-A factory return type only (dropped reverse
  IsAssignableFrom direction). Full AST-anchored receiver verification
  deferred to Phase-3 RuleSymbolResolver per §3.1a.
- Implementation doc deferred-follow-ups list updated to include
  receiver-anchoring (e) and Phase-2 loader/precision-gate work (f).
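
The CR #8 behavior can be sketched outside the codebase. This is a minimal Python illustration, not the C# `TraceWriter.SanitizePath` from this branch: the function name `sanitize_path`, its signature, and the use of Windows-style path parsing are assumptions. The two rules it encodes come from this PR: in-root absolute paths become project-relative with forward slashes, and absolute paths outside the project root are redacted to `<external>`.

```python
from pathlib import PureWindowsPath

EXTERNAL = "<external>"  # redaction marker named in the PR summary


def sanitize_path(path: str, project_root: str) -> str:
    """Hypothetical stand-in for the trace path sanitizer.

    Assumption: paths are parsed with Windows rules (which also accept
    forward slashes) so the sketch behaves the same on any host OS.
    Note: ".." segments are not resolved here; a real implementation
    would probably need to normalize them first.
    """
    p = PureWindowsPath(path)
    root = PureWindowsPath(project_root)

    # No drive and no leading separator: already relative, only fix slashes.
    if not p.anchor:
        return p.as_posix()

    try:
        # In-root absolute path -> project-relative, forward-slash form.
        return p.relative_to(root).as_posix()
    except ValueError:
        # Anything rooted elsewhere must not leak into the trace.
        return EXTERNAL
```

With a project root of `C:\Users\alice\proj` (a made-up path), a file under that root comes back as `src/App/Program.cs`, while a file in a sibling directory such as a package cache comes back as `<external>`.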

Verification: 98 CheckCommand unit tests pass (was 97 + new parens-path
test); MurCheck integration test passes; CS8602 warning eliminated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codemonkeychris
Collaborator Author

Response to CR round 2

Triaged the CR feedback against the spec's phase plan. Goal: keep this PR high-quality and self-consistent without pulling Phase-2/3/4 work forward.

Addressed in this push (commit a4938fb)

| CR | Fix |
| --- | --- |
| #10 CS8602 warning in CompilationLoader | Routed EmptyCompilation through a named factory method with explicit `?? Array.Empty<MetadataReference>()`. Warning gone. |
| #11 Diagnostic regex drops paths with parens | File capture changed from `[^()]+` to reluctant `.+?`, anchored by the `(line,col):` suffix. New pipeline test for `C:\src\Reactor (test)\Program.cs(10,5):`. |
| #8 Trace path leaks absolute in-root paths | TraceWriter.SanitizePath now normalizes in-root absolute paths to project-relative (forward-slash). Test strengthened: no row may be an absolute path. |
| #13 CS7036 tiebreak comment is unimplemented | Comment replaced with an accurate description; full Hamming-vector ranker remains a deferred follow-up per spec §1.5. |
| #14 Doc inconsistencies | 525-run report: reworded "Checkpoint C bar met" → "500-pair volume bar met, cross-agent bar NOT met"; Theme.get_Background note marked "fixed in this branch" with code reference. |
| #2 (narrowed) CS1061 factory-argument receiver check | Dropped the reverse IsAssignableFrom direction; kept "receiver IS-A factory return type" only. Reduces over-fire surface without an AST walk. Full AST-anchored receiver verification deferred to Phase-3 RuleSymbolResolver per §3.1a. Code comment + impl-doc updated to reflect this. |
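
To make the #11 fix concrete, here is a minimal Python sketch of the before/after file capture. It is not the C# pattern from this branch: the named groups, the `parse_diagnostic` helper, and the severity/code/message tail of the pattern are assumptions about the shape of an MSBuild diagnostic line. Only the file-capture change (`[^()]+` to reluctant `.+?`, anchored on the `(line,col):` suffix) and the test path are taken from the table above.

```python
import re

# Assumed shape of everything after the file path; not taken from the PR.
_TAIL = (
    r"\((?P<line>\d+),(?P<col>\d+)\):\s*"
    r"(?P<severity>error|warning)\s+(?P<code>[A-Z]+\d+):\s*"
    r"(?P<message>.*)$"
)

# Old capture: no parentheses allowed in the path, so a directory such as
# "Reactor (test)" makes the whole line fail to parse.
OLD_PATTERN = re.compile(r"^(?P<file>[^()]+)" + _TAIL)

# New capture: reluctant match that keeps growing until the "(line,col):"
# suffix matches, so parenthesized path segments are accepted.
NEW_PATTERN = re.compile(r"^(?P<file>.+?)" + _TAIL)


def parse_diagnostic(text, pattern=NEW_PATTERN):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = pattern.match(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    fields["col"] = int(fields["col"])
    return fields


# Path from the new pipeline test; the message text is a placeholder.
SAMPLE = r"C:\src\Reactor (test)\Program.cs(10,5): error CS1061: sample message"
```

In this sketch, `parse_diagnostic(SAMPLE, OLD_PATTERN)` returns `None`, while the default pattern yields the full path including `Reactor (test)`. The reluctant quantifier is safe here because the first position where `(digits,digits):` matches is the end of the file path.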

Deferred to future phases (rationale, not pulling forward)

The implementation doc's deferred-follow-ups list was updated to call out items (e) receiver-anchoring and (f) Phase-2 loader / precision-gate work explicitly so reviewers can see them tracked.
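
For reviewers weighing the narrowed #2 check against the deferred receiver-anchoring work, the direction change can be sketched with plain subclass checks. This is a Python analogy, not the Roslyn code: the class names and both helper functions are hypothetical, and `issubclass` stands in for the assignability test.

```python
# Hypothetical stand-ins for element types; none of these names come from the PR.
class Element:
    pass


class Button(Element):
    pass


class TextBlock(Element):
    pass


def receiver_matches_bidirectional(receiver, factory_return):
    """Old shape: accept if either type is assignable to the other."""
    return issubclass(receiver, factory_return) or issubclass(factory_return, receiver)


def receiver_matches_narrowed(receiver, factory_return):
    """Narrowed shape: accept only if the receiver IS-A factory return type."""
    return issubclass(receiver, factory_return)
```

Under the bidirectional form, a very general receiver such as `Element` or `object` matches every factory whose return type derives from it, which is the over-fire surface the narrowing removes. Both forms still reject unrelated siblings such as `TextBlock` against `Button`.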

Verification

  • dotnet test … --filter ~CheckCommandTests: 98 passed (was 97 + new parens-path test)
  • dotnet test … --filter ~MurCheck (integration): 1 passed
  • dotnet build src/Reactor.Cli/…: 0 errors, 0 new warnings (CS8602 eliminated; remaining warning is a pre-existing NU1903 NuGet vulnerability advisory on Nerdbank.MessagePack unrelated to this branch)

🤖 Generated with Claude Code

@codemonkeychris codemonkeychris merged commit 5eec60d into main May 11, 2026
7 checks passed
@codemonkeychris codemonkeychris deleted the feat/038-mur-check branch May 11, 2026 14:45