Derived from: docs/specs/038-mur-check-did-you-mean-design.md
Companion spec (data source): docs/specs/037-eval-trace-mining-design.md
Originating issues: #226 §5, #227, #228 (specs PR, merged)
This work is load-bearing, not a stopgap. See spec §1 "Why this is load-bearing, not a stopgap" for the full argument; the short form is (a) Reactor is experimental and will continue to churn faster than base models retrain, and (b) WinUI 3 is weakly represented in training data and confused with WPF / Silverlight / WinUI 1 / WinUI 2, so models produce plausibly-WinUI-shaped code that doesn't compile against Reactor's actual surface. The build-error-correction loop is the dominant feedback channel through which agents reconcile prior-framework priors with Reactor's reality, and that condition holds for an expected 1–3 year horizon. Investment level and rule-shipping cadence in this doc are sized accordingly.
The sunset criterion (spec §13) is explicit so "load-bearing" doesn't drift into "forever": decommission when Reactor's public surface has been stable for ≥ 12 months AND base-model first-build-OK on a held-out Reactor-touching eval exceeds 90 % on ≥ 2 vendor-distinct models without mur check assistance.
- Phase 0 (instrumentation): ✓ landed on
feat/038-mur-check. - Phase 1 (Tier 2 Roslyn suggester): ✓ code complete; ✓ calibrated against 50-run corpus (2026-05-10) and re-validated against 525-run corpus (2026-05-11). All per-code thresholds held at current values (525-run analysis is decisive on direction-of-fix but harness ClassifyMatch is an over-approximation; see report). Diagnostic-count gate landed:
CheckCommand.ShouldEmitSuggestionsskips Tier-2 when an invocation surfaces fewer than--suggest-thresholdunique CS-prefixed diagnostics (default 3; 0 disables). EC1 re-run with the gate (2026-05-11) PASSES both arms — calc cost −4% (was +21%), kanban cost −33% (was −24%; preserved and grew). Phase 1 cleared to merge tomain. - Phase 2 (MSBuild passthrough + deterministic ranker): ✓ code complete, ✓ EC2 PASS by median (2026-05-11). 2.1 passthrough parser + 2.2 mode flags landed in
src/Reactor.Cli/Check/ArgsParser.cs+Mode.cs; 2.3 ranker landed insrc/Reactor.Cli/Check/Ranker/{PolicyTable.cs, Ranker.cs}; 2.4 guardrail landed attools/Reactor.MurCheckGuardrail/; 2.5 SKILL.md updated. Phase-2.x post-batch fixes: gate-input regression (suggest-gate now counts pre-rankerdiagnostics, locked byRankerTests.Suggest_gate_counts_full_parsed_list_not_post_ranker_emittable) and--finalframing softened (no longer presented as a task-completion requirement). EC2 5×N result: calc-mur clean win on every metric (cost −5.1%, tokens −5.8%, turns −5.1%, wall −7.9%, variance 1.9× tighter); kanban-mur at cost median parity ($3.30 = $3.30) with R2 outlier driving mean to +5.7%; first-build 5/5 both arms;--finalinvocation 0/10 (SKILL framing working as designed). Phase 2 cleared to merge. - Phase 3: infrastructure (§3.1 + §3.1a) code complete and unit-tested 2026-05-11. Landed on
feat/spec-038-phase2.IRulePattern+RuleContext+RuleSuggestion+RuleRegistry+RuleSymbolResolverundersrc/Reactor.Cli/Check/Rules/;--disable-rule <Name>+--list-ruleswired throughArgsParser+CheckArgs.HelpText; orchestrator runs rules alongside Tier-2 with the spec §6 "rule wins" semantics; rules can match diagnostic codes outside the Tier-2SupportedCodes(unblocks CS1955 etc.). §3.0 vocab table landed atdocs/specs/tasks/038-vocab-table.csv(21 rows). Six rules authored as of 2026-05-11 (three Class-B + three Class-A, all ready for review):ThemeBackgroundSuffixRule(originally Class-B, promoted to Class-A by the cross-agent audit at 27 events both corpora),AlignmentShortcutRule(Class-B, CS1061),ButtonOnClickFactoryMoveRule(Class-B, CS1061 onButtonElement.OnClick),GridSizeFactoryParensRule(Class-A; CS1955 onGridSize.Auto()→ drop parens; first cross-tier rule — 146 events, top cluster both corpora),GridSizePxRenameRule(Class-A; CS0117 onGridSize.Pixel/Pixels/Fixed→Px(...); 9 cross-agent events),TextBlockStyleHintRule(Class-A; CS1061/CS0117 onTextBlockElement.Style→ fluent text helpers; 5 cross-agent events spanning both.Style(...)andwith { Style = ... }shapes). All six rules passRuleTargetResolutionTestsagainst the live Reactor compilation.RulePerformanceTests(§3.1a perf bound, ≤ 0.5 ms median per rule per diagnostic × 4× CI slack) passing with the six-rule set. - Two critical fixes landed 2026-05-11 during end-to-end smoke validation of the rule set (both blocked any real-world rule firing prior to the fix):
CompilationLoadernow resolvesProjectReferenceoutputs. The loader previously only parsedproject.assets.json'stargetssection (NuGet packages), missing thelibraries.<id>entries withtype=project. Reactor itself is a project reference for every sample app, so its types were invisible toRuleSymbolResolver.ResolveTypeand every rule self-disabled on every invocation against a real Reactor app — even though all 60+ unit tests passed (those use synthetic in-memory compilations). New code walkslibraries, reads the referenced csproj's<AssemblyName>(or falls back to the basename), and locates the most-recently-built matching.dllunder that project'sbin/subtree. Regression locked byCompilationLoaderTests.Resolves_ProjectReference_built_dll_from_project_assets_json(constructs a synthetic refproj + assets.json end-to-end and asserts the dll is in the returned reference set).- Suggest-gate carve-out for Tier-3 rules (the EC2 watch-item finally addressed). The gate (
--suggest-threshold, default 3 unique CS-prefixed diagnostics) was wrapping the entire suggester block — when closed, neither Tier-2 nor rules ran. The gate exists to suppress Tier-2 fuzzy match's noise on small builds; rules are precision-anchored (Roslyn ISymbol binding) and shouldn't be subject to that calibration.SuggesterOrchestratornow takes atier2Enabledbool;CheckCommand.Runalways builds the orchestrator (when compilation loads) and passes the gate result astier2Enabled. Tier-3 rules always run; Tier-2 stays gated. Two new tests lock this down (Rule_fires_even_when_tier2_is_disabled_by_the_suggest_gate+Tier2_only_code_returns_null_when_tier2_is_disabled_and_no_rule_covers). End-to-end smoke againstsamples/apps/wordpuzzlewith two CS errors injected (CS1955 onGridSize.Auto(), CS0117 onGridSize.Pixel) now emits both rule suggestions at the default gate. §3.1a residual landed: trace-channel structured warning hook (TraceWriter.WriteRuleSelfDisabled) routed fromSuggesterOrchestrator.onRuleSelfDisabled→CheckCommand.Run, dedup'd per-invocation per-rule. Two API/skill fixes surfaced during rule authoring: (1)ElementExtensions.Margin/.Paddingper-side overloads now default unspecified sides to 0 (eliminates 198 corpus-observed build failures from.Margin(top: 12)-shaped calls) — and the 2-arg overload order swapped to(vertical, horizontal)matching CSS shorthand (breaking change, pre-1.0; all repo callsites migrated to named-arg form for clarity); (2)plugins/reactor/skills/reactor-getting-started/SKILL.mdcheatsheet now shows the named-argButton("Save", onClick: handler)form with explicit anti-pattern call-out, and the.OnTappedexample is re-anchored to non-Button surfaces. 60+ new unit tests cumulatively across infrastructure / rule fixtures / API / orchestrator wiring; full Reactor.Tests suite 7156/7202 (46 expected-skip). Class-A rule PRs remain blocked on second-agent corpus drop (cross-agent reproducibility bar #2 of the Validation Gate); Class-B rule PRs are unblocked structurally but await reviewer signoff (bar #5).
- Phase 4: not started. Blocks on Data Checkpoint D + the
still_present_at_run_endharness fix. - Active state: Phase 2 code complete and ready for EC2. Phase 1 already merged. Data Checkpoint C source mirrored into
docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/(8 MB across four files; raw event logs stay in sibling repo). Tuning report atdocs/specs/tasks/038-tuning-reports/2026-05-11-525run.md. EC1 re-run results at the "EC1 re-run (with gate)" subsection under Eval Checkpoints below. - Second-agent corpus aggregated 2026-05-11 (handoff at
C:\temp\mur-handoff-sonnet-525.md):claude-sonnet-4.6× 525 runs (400 unaided + 125 skilled). Aggregated atC:\Users\andersonch\Code\reactor-tokenusage\mining-out\— 368 fixes / 564 ranker rows (404 unaided / 160 skilled) / 41 clusters / 171 unresolved. The aggregation was run with the gpt-5.5 1050 sweep event logs temporarily moved aside (to.pre-sonnet-holding/) and restored after — verified zero leakage (every fixes.jsonl row carriesagent="claude-sonnet-4.6"). The prior gpt-5.5 525-run aggregate is parked atmining-out/pre-sonnet-bak/for A/B comparison. Skilled flavor (125 runs) shipsMINING_NO_MUR_ADDENDUM— agent hasdotnet buildonly while still seeing skill API guidance; 125 skilled prompt_ids are a subset of the unaided 400. Cross-agent reproducibility audit landed atdocs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md— Validation Gate bar #2 is STRONG (count ≥ 3 in both corpora) for three Class-A targets: (1)(CS1955, GridSize, other)110+36=146 events, top cluster in both corpora, first cross-tier rule since Tier-2 doesn't cover CS1955; (2)(CS0117, Theme, other)16+11=27 events, already covered structurally by Class-BThemeBackgroundSuffixRule— re-classify as Class-A bar-#2 satisfied; (3)(CS0117, GridSize, renamed_member)3+4=7 events, lookup-table onAuto/Star/Pixelfamily. Two Tier-2 follow-ups (TextBlockElementMemberRule,TemplatedListViewWithKeyRule) and one striking gpt-5.5-only signal worth flagging:CS1955/GridElementfamily is 29 events in gpt-5.5 and zero in sonnet — defer to a third corpus drop before opening that rule's PR. - Watch-item carried forward into Phase 2: the EC1 re-run kanban CV widened from 24% (prior batch, no gate) to 54% (this batch, gate on). One of five kanban-variant runs hit 0 firings and took the long-tail base path. Gate behavior is path-dependent on the agent's exploration order, not just the project's static shape. Below the resolution threshold for a Phase-1 blocker; Phase 2 telemetry should track per-run firing counts so we can characterize this tail.
- Deferred follow-ups (cleanly scoped, not blocking next phase): (a) Reactor-touching integration fixture for the CS1061 Button.OnClick canonical example (needs WindowsAppSDK restore on every test run); (b) wall-time perf trait test against the WinUI fixture; (c) full Hamming-vector overload ranking in CS7036; (d) return-type assignability filter in CS0103; (e) AST-anchored receiver verification for the CS1061 factory-argument move — currently confirms only the receiver's static type matches the factory's return type, full receiver-anchoring lands in Phase-3 rule infrastructure via
RuleSymbolResolver(§3.1a); (f) Phase-2 work: MSBuild-accurate compilation loading (filesystem recursion → evaluated project model) and per-code precision-based emission gating beside the cost-based diagnostic-count gate. - Tracked harness follow-up (Phase-4 prerequisite, file with harness owner before Data Checkpoint D):
still_present_at_run_endalwaysfalseeven when the diagnostic IS in the final build — fingerprint-mismatch quirk on adjacent CS8012 emissions whose timing tails differ. Doesn't affect the primaryaddressed_by_next_fixlabel.
Scope reminder: extend src/Reactor.Cli/Check/CheckCommand.cs from a thin MSBuild wrapper into a four-tier diagnostic-aware coach. Tier 1 (analyzer-ID hint table) is already shipped. Tier 2 (Roslyn semantic suggester) and Tier 3 (induced pattern rules) are the bulk of v1. A pre-emit ranker (§8 of the spec) runs transversal across all tiers and gates which diagnostics reach the agent at all. Tier 4 (learned ranker) is opt-in future work.
This implementation has an external dependency that no other spec in this repo has: spec 037's harness produces the data corpus that drives Phase 3's rule induction and Phase 4's ranker calibration. The corpus is being generated outside this repo — see the Data Checkpoints section for the four staged hand-offs we need from that pipeline.
Conventions:
- All
src/paths are undersrc/Reactor.Cli/Check/unless otherwise noted. - New unit tests live under
tests/Reactor.Tests/CheckCommandTests/. Integration tests (realdotnet buildinvocations against fixture projects) live undertests/Reactor.IntegrationTests/MurCheck/. - Suggester implementations are pure functions of
(Compilation, Diagnostic, SyntaxNode)→(suggestion_text, confidence, evidence). They MUST NOT call out to the file system, the network, or any process. Test by constructingCompilationin-memory. - Pattern rules each live in their own file under
src/Reactor.Cli/Check/Rules/<Name>Rule.cs, one rule per file, paired with one fixture test file undertests/Reactor.Tests/CheckCommandTests/Rules/<Name>RuleTests.cs. - Confidence thresholds are per-code, not global. Default threshold T = 0.75 unless tuning telemetry shows otherwise. Below T, suggesters emit nothing.
- Output line shape (spec §9) is non-negotiable: machine-parseable, one diagnostic per line. Adding optional
// <evidence>does not change the format for parsers. - The pre-emit ranker (§8) runs after the suggesters attach hints but before the line is written. Mode flags (
--strict/--final/--quiet) and--emit-thresholdchange ranker behavior, not suggester behavior. - MSBuild passthrough via
--(spec §8):murflags before--, MSBuild flags after.murinjects defaults (--nologo,-v:m,-p:Platform={host arch}) only if the user did not specify the same flag in the passthrough section. Detection is by flag name, not value. - Telemetry is local-first, opt-in, scoped to the active project. Diagnostic codes, suggester names, rule names, and confidence scores are loggable. Source code text, file paths, and machine identifiers are not. Any task that adds a telemetry hook must include a one-line review against this list.
A task is "done" only when:
- Code compiles under
Reactor.slnxwarnings-as-errors. - New unit tests cover the happy path and every documented failure mode.
- New integration tests run against a real fixture project;
mur checkexits 0 when expected and emits the expected line(s). - No new analyzer warnings (
REACTOR_*,CS*). - Any new public CLI flag has a
--helpentry and a one-line description. - Any new suggester / rule has a fixture test against ≥ 3 captured
(broken, fixed)pairs fromfixes.jsonl(see Data Checkpoint C). Until Data Checkpoint C lands, hand-authored fixtures are acceptable but must be tagged[Trait("Origin", "HandAuthored")]so the audit can find and replace them later. - CHANGELOG entry under the next-release heading, grouped under "Spec 038 —
mur checkdid-you-mean".
Wrong suggestions are worse than no suggestions — they corrupt the agent's reasoning and burn turns chasing phantom fixes. The single biggest risk in this implementation is shipping a Tier 3 rule that fires on a pattern the rule author thought they understood but didn't. The Validation Gate below is the mandatory checkpoint every rule must pass before reaching main.
Every new Tier 3 rule must clear all six bars before merge:
-
Frequency. The rule's seed cluster in
patterns.jsonhasfrequency ≥ 0.05(≥ 5 % of mined pairs) ANDcount ≥ 10(at least 10 captured exemplars). Below either threshold, the rule is too rare to justify the false-positive risk surface; defer to Tier 2 fuzzy match instead. -
Cross-agent reproducibility. The cluster reproduces across ≥ 2 different agents in spec 037's multi-agent rotation. A pattern that only one model produces probably reflects that model's idiosyncrasy, not a real Reactor authoring pitfall. (See spec 037 §11.)
-
Positive fixture coverage. The rule has unit tests against ≥ 3 distinct exemplar pairs drawn from
fixes.jsonl, each from a differentrun_id. Each test asserts the rule fires AND the suggestion text is exactly correct. -
Negative (counter-example) fixture coverage. The rule has ≥ 2 unit tests on similar-but-different code that asserts the rule does NOT fire. The reviewer authoring the rule writes these by deliberately attempting to trick their own rule. Examples: same diagnostic code on a different receiver type; same receiver type with a different (legitimate) member name; the suggested member name appearing in a comment or string literal at the same span.
-
Independent reviewer signoff. PR comment from at least one Reactor team member other than the rule author, explicitly noting they read the cluster's exemplar diffs in
fixes.jsonland agree the proposed rule captures the right transformation. Use a fixed comment template:Rule review: cluster <id>, frequency <f>, exemplars reviewed <n>, false-positive scenarios considered <list>, signoff: yes/no. -
Telemetry kill-switch. The rule has a unique
Nameconstant;mur check --disable-rule <Name>round-trips throughmur check --help; per-rule accept rate is logged so the auto-suppression telemetry hook can find it later.
The gate is a checklist, not a vibe. PRs that don't have all six items in the description do not merge, regardless of who authored them.
Auto-suppression policy (Phase 4): once telemetry is wired, any rule whose agent-accept rate drops below 50 % over the last 200 invocations is automatically disabled at runtime, with a warning logged on every subsequent mur check invocation. Re-enabling requires a follow-up PR that explains the regression — same six-bar gate.
The harness in C:\Users\andersonch\Code\TokenCountTest\ is the upstream pipeline. We need four staged data dumps from it. Each checkpoint blocks specific tasks below; do not start a blocked task until its checkpoint lands.
Status: ✓ landed 2026-05-10. Initial dump at C:\Users\andersonch\Code\TokenCountTest\mining-out\ reviewed in C:\temp\eval-trace-mining-followups.md. Four follow-up gaps filed with the harness owner (receiver_type extraction, dedup, ranker negative-class, cosmetic). Do not block on the follow-ups landing for Phase 0 / Phase 1 work — hand-authored fixtures suffice until Data Checkpoint B.
Use: verifies the JSONL contract well enough to write a parser and design Phase 1 fixture types.
Re-checked the harness output (fixes.jsonl 2 unique, ranker-labels.jsonl 6 rows, patterns.json 2 clusters) against the four gaps. Two audit passes are recorded here; the second is the current state.
Audit pass 1 (3-row ranker output): Gap #1, #2, #4 fixed. Gap #3 not fixed — all 3 ranker rows positive class. Recommendation was to fix Gap #3 before scaling.
Audit pass 2 (6-row ranker output, current state):
- Gap #1 (
receiver_type/member): ✓ FIXED. Bothfixes.jsonlrows populatereceiver_type/member(ButtonElement/HorizontalAlignment,GridElement/RowSpacing).patterns.jsoncluster keys carryreceiver_type. CS0618 / CS8012 rows correctly null (no documented regex for those codes). - Gap #2 (dedup): ✓ FIXED. No byte-identical repeats. Each unique
(run, file, turn, code, line, col)appears once. - Gap #3 (ranker negative class): ✓ FIXED. Now emitting per-build, per-diagnostic rows. 4 positive (
addressed_by_next_fix: true) and 2 negative (addressed_by_next_fix: false) on the primary supervised label. Three CS8012 emissions in run 5d5fef… (turns 18 / 20 / 23) are recorded as three independent training rows, exactly per the spec 037 §3 "don't dedupe across builds for ranker labels" rule. - Gap #4 (cosmetic): ✓ acceptable.
package_versionpopulated;exemplar_run_idsretained; no in-array duplicates. fix_kindclassifier nit: partially fixed. Both pairs now classify asrenamed_member(wasother). Pair 1 (HorizontalAlignment→HAlign) is correct. Pair 2 (.RowSpacing(16)deletion + per-element.Margin(...)rewrite) is debatable — that fix isn't a member rename — but the cluster key is still informative. Acceptable; revisit if the 50-run shows the classifier over-labeling structural rewrites.- New known limitation (logged, not a blocker):
still_present_at_run_endisfalseon all 6 rows, including three CS8012s the testing agent confirmed are in the run's final build. Cause is a fingerprint-mismatch quirk between adjacent CS8012 emissions whose timing tails differ ("…in 5.0s"vs"in 4.4s"vs"in 4.9s"). Impact:- Primary ranker-training label
addressed_by_next_fixis unaffected (correctly computes forward from each emission). - Auxiliary label
agent_ignored(=still_present_at_run_end AND not addressed_by_next_fix) is currently uniformly false where it should sometimes be true. This breaks the spec 038 §11 "auto-suppression telemetry" hook (which detects "suppressed-then-resurfaced" patterns). - Tracked as a Phase-4 prerequisite (see Data Checkpoint D below). Phase 1 / Phase 3 don't read these fields, so the bug doesn't block this iteration's work.
- Primary ranker-training label
Recommendation: kick off the 50-run. Phase 1 calibration consumes fixes.jsonl only; the still_present_at_run_end bug is correctly classified as a known limitation, not a blocker.
Data Checkpoint B — calibration (≥ 50 unique pairs, ≥ 50 runs, ≥ 2 agents, all four follow-ups from review-feedback resolved)
Status: ✓ landed 2026-05-10 at C:\Users\andersonch\Code\reactor-tokenusage\mining-out\ (path moved from the original C:\Users\andersonch\Code\TokenCountTest\ location). 51 fixes / 21 patterns / 63 ranker rows.
Audit results:
- Sample: 51 fixes (≥50 ✓), 21 clusters (10–30 ✓), 63 ranker rows (target ≥200 — undershoot).
- Gap #1 (
receiver_type/member): ✓ populated for relevant codes; cluster keys carryreceiver_type. - Gap #2 (dedup): ✓ no byte-identical repeats.
- Gap #3 (ranker negative class): partial — only 2/63 (3%) negative class rows; target was ≥30%. Field varies, but volume undershoots. Phase-4 prerequisite still open.
- Gap #4 (cosmetic): ✓ acceptable.
still_present_at_run_end: uniformly false (known limitation, doesn't block Phase 1 — same as audit pass 2).- Distribution: 23/51 fixes (45%) are CS1955 (UNHANDLED by Tier 2). Of the 12 fixes hitting our handled codes, most CS1061 cases are structural rewrites (
.HorizontalAlignment(...)→.Set(b => b.HorizontalAlignment = ...)), not member renames. These are correctly Tier-3 territory.
Tuning result: see docs/specs/tasks/038-tuning-reports/2026-05-10-50run.md. Thresholds set in src/Reactor.Cli/Check/Suggesters/Thresholds.cs. CS1061 → 0.80 (only firing was at 0.43, well below threshold; no FPs); CS0103 → 0.75 (2/2 firings at conf 1.00 matched); CS0117 / CS1503 / CS7036 → 0.75 (insufficient signal, defer to next drop).
Blocks: Phase 1 ship gate (the Tier 2 confidence-threshold tuning).
Use: sets the per-code Tier 2 thresholds. With 50 pairs we can compute, per diagnostic code, the JaroWinkler distribution of the suggester's top candidate vs. the agent's actual fix and pick T to land ≥ 70 % recall at ≤ 5 % false-positive rate. Without B, Phase 1 ships with a guessed T.
Owner: harness team. ETA: TBD — track in #228 follow-up issue. Estimated ~50 runs at $3–5 each = ~$200 corpus cost.
Self-contained instructions so the next agent can run cold:
- Verify the corpus. Re-run the four-gap audit against the 50-run
fixes.jsonl/ranker-labels.jsonl/patterns.json. Sample sizes should be roughly: ≥ 50 unique pairs infixes.jsonl, ≥ 200 rows inranker-labels.jsonl(≥ 30 % negative class), 10–30 clusters inpatterns.json. Flag any regression in the gaps that were marked fixed in the audit pass 2 above. Thestill_present_at_run_endbug is a known limitation — note its post-50-run incidence rate but don't block on it. - Tune Tier 2 thresholds. Walk the corpus offline; for each top-20 CS-pattern, run
SymbolSuggesteragainst thebeforetext and compare the top suggestion to the actual fix in theaftertext. Compute (recall@T, precision@T) per diagnostic code. Pick per-code T to land ≥ 0.70 recall at ≤ 0.05 false-positive rate. Write the chosen thresholds tosrc/Reactor.Cli/Check/Suggesters/Thresholds.cs(new file). WireSymbolSuggesterto read its threshold from there per diagnostic code instead of the globalDefaultThreshold = 0.75. - Run Eval Checkpoint EC1. 5×N batch on
gpt-5.5againstreactor-calcandreactor-kanban, comparingfeat/038-mur-checktomain. Pass criterion: tokens not regressed; ≥ 1 measurable did-you-mean firing per kanban run on average; first-build OK ≥ 5/5. Methodology mirrors the Phase-7 batch summarized in #226. - If EC1 passes, merge to
main. Then unblock Phase 2 (MSBuild passthrough + deterministic ranker, spec 038 §8).
Data Checkpoint C — rule induction (≥ 500 unique pairs, ≥ 200 runs, ≥ 2 agents, ranker negative class present)
Status: ✓ landed 2026-05-11 (partial — single-agent only). Source at C:\Users\andersonch\Code\reactor-tokenusage\mining-out\; mirrored into the repo at docs/specs/tasks/038-tuning-reports/2026-05-11-525run-source/ (≈ 8 MB, four files; the 563 raw event logs stay in the sibling repo, referenced via source_event_log per row). Sample size: 1,027 fixes / 1,233 ranker rows / 104 clusters from 525+ runs. Cross-agent bar (≥ 2 agents) is not met — all rows come from gpt-5.5. A second-agent corpus drop is required before opening Phase-3 rule PRs (Validation Gate bar #2).
Audit results:
- Sample: 1,027 fixes (≥ 500 ✓), 104 clusters (10–30 target was Checkpoint B; this is bigger as expected), 1,233 ranker rows across 513 distinct (run_id, turn) builds (≥ 200 ✓ for rows; below the 200-run bar is moot since we have 525+).
- Gap #1 / #2 / #4 from spec 037 follow-ups: ✓ carried forward; no regressions versus Checkpoint B.
- Gap #3 (ranker negative class): partial. Field varies per row; absolute negative-class count not re-audited in detail here — Phase-4 prerequisite, not a Phase-3 blocker.
still_present_at_run_end: uniformly false (known fingerprint-bug carrying forward from Checkpoints A/B; Phase-4 prerequisite).
Tuning result (Tier-2): report at docs/specs/tasks/038-tuning-reports/2026-05-11-525run.md. All five per-code thresholds held at current values — see report's "Decisions" table for rationale. The 525-run corpus surfaces that the JaroWinkler fuzzy-match approach has near-0% empirical precision on CS1061 / CS0117 against Reactor types: the agent's typical mistake is reaching for a WinUI-style API name (.VerticalAlignment, Theme.AppBackground) whose correct Reactor replacement is too far in edit-distance for Tier-2 to find, so the fuzzy match picks a wrong sibling. The diagnostic-count gate provides the safety net (skips Tier-2 on 71% of builds); systematic fix is Phase-3 rule authoring.
Gate-threshold validation: the EC1-time decision to default --suggest-threshold at 3 is empirically defensible — the corpus's CS-diagnostics-per-build distribution shows 43% of builds have 1 diagnostic, 28% have 2, and only 28.7% have ≥ 3. T=3 captures the "big structural failure" shape and skips the "1–2 typo" shape, which matches the EC1 calc-vs-kanban observation. Held at 3 for the EC1 re-run.
Phase-3 priority targets surfaced (top 3 by cluster frequency where Tier-2 produces wrong-direction suggestions):
- CS0117 / Theme —
*Background → SolidBackgroundlookup table (cluster C0019, freq 1.6%, 16 events). - CS1061 /
*Element— WinUI-name → Reactor-shortcut family (VerticalAlignment → VAlign,HorizontalAlignment → HAlign,Style → fluent-helper-family). Composite across clusters C0017 + others ≈ 22 events. - CS1955 / GridSize — "missing parens on factory" (cluster C0004, freq 10.7%, 110 events — largest single bucket in the corpus). Tier-2 doesn't cover CS1955 today; rule would be the first cross-tier addition.
Use: drives the human-review queue. The reviewer walks patterns.json top-down by frequency, opens 5–8 PRs per reviewing-week, each PR is one new rule under the Validation Gate.
Quantity bar before Eval Checkpoint 2: 5 high-confidence rules covering the top ~50–60 % of fix events by frequency. Below 5, Phase 3's effect is too small to measure against eval noise.
Quantity bar before declaring V1 done (Eval Checkpoint 3): 10–15 rules covering ~80 % of fix events. Past 15, returns diminish — the long tail moves to Tier 4.
Owner: harness team.
Remaining gaps for full Checkpoint C status: CLOSED 2026-05-11. Second-agent corpus aggregated (claude-sonnet-4.6 × 525, 368 fixes / 564 ranker rows / 41 clusters at mining-out\; gpt-5.5 525 parked at mining-out\pre-sonnet-bak\). Cross-agent reproducibility audit landed at docs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md — Validation Gate bar #2 verdicts (per receiver-typed cluster, key = (diag_code, receiver_type, fix_kind)):
| Class-A queue | Verdict | gpt-5.5 | sonnet | Action |
|---|---|---|---|---|
GridSizeFactoryParensRule (CS1955) |
STRONG | 110 | 36 | Authored 2026-05-11 (src/Reactor.Cli/Check/Rules/GridSizeFactoryParensRule.cs + 5 tests); awaiting review (bar #5) |
ThemeBackgroundSuffixRule (CS0117) |
STRONG | 16 | 11 | Re-classify existing Class-B rule as Class-A; paperwork only |
GridSizeRenamedMemberRule (CS0117) |
STRONG | 3 | 4 | Author after GridSizeFactoryParensRule |
TextBlockElementMemberRule (CS1061/CS0117) |
STRONG (after fix_kind collapse) | 14 | 4 | Tier-2 priority |
TemplatedListViewWithKeyRule (CS1929) |
STRONG (generalized over <T>) |
7 | 1 | Tier-2 priority; needs open-generic match |
CS1955/GridElement family |
gpt-5.5-only | 29 | 0 | Defer to third corpus drop — striking single-corpus signal |
CS1061/ButtonElement/OnClick |
gpt-5.5-only | 5 | 0 | Existing Class-B ButtonOnClickFactoryMoveRule still applies (vocab justification, bar #1 waived); no further action |
A third agent's corpus (e.g. a future claude-opus-* or gemini-* run) would unblock the gpt-5.5-only deferrals.
The corpus owner ran npm run mine -- extract && aggregate on the sonnet-4.6 raw event logs in isolation (gpt-5.5 1050 sweep files temporarily moved to .pre-sonnet-holding/ for the run, restored after; results/ verified back at 2100 sweep files = 525 pairs × 2 corpora × 2 file types). Current on-disk layout:
| Path | Agent | Rows |
|---|---|---|
C:\Users\andersonch\Code\reactor-tokenusage\mining-out\fixes.jsonl |
claude-sonnet-4.6 only (verified zero leakage) |
368 |
mining-out\ranker-labels.jsonl |
same | 564 (404 unaided / 160 skilled) |
mining-out\patterns.json |
same | 41 clusters; top by freq: C0004(114), C0003(37), C0001(36), C0018(21), C0020(21) |
mining-out\unresolved.jsonl |
same | 171 |
mining-out\pre-sonnet-bak\* |
gpt-5.5 525-run aggregate (preserved for A/B) |
1027 fixes / 1233 rows / 104 clusters |
Pickup procedure for the cross-agent reproducibility audit (Validation Gate bar #2, the work that unblocks Class-A rule PRs):
- Pin the comparison set. The three Phase-3 priority clusters from the gpt-5.5 corpus are C0019 (Theme.*Background, freq 1.6%, 16 events in gpt-5.5), C0017 + adjacent (Element-alignment WinUI-name family, ~22 events composite), C0004 (CS1955 GridSize parens, freq 10.7%, 110 events). The sonnet-4.6 top clusters by freq are C0004 (114), C0003 (37), C0001 (36), C0018 (21), C0020 (21) — cluster IDs are not stable across aggregator runs, so don't compare by ID; compare by
(receiver_type, member, code)key. - Extract cluster keys from both
patterns.jsonfiles. For each cluster, read itskeyfield (the(receiver_type, member, code)tuple, per spec 037 §3). The audit file isdocs/specs/tasks/038-tuning-reports/2026-05-11-cross-agent-audit.md(new file). Per-cluster row: gpt-5.5 freq | sonnet-4.6 freq | overlap verdict. - Pass criterion for bar #2. A Phase-3 candidate cluster is "cross-agent reproducible" when the same
(receiver_type, member, code)key appears withcount ≥ 3in both corpora. Below that, the cluster is single-agent — defer the Class-A rule to a later corpus drop, or downgrade to Class-B if there's an independent doc citation. - Skilled-flavor weighting note. Sonnet's 564 ranker rows split 404 unaided / 160 skilled. For frequency-bar calculations weight by row count, not run count. Skilled rows are over-represented in clusters about following the skill (which is by design — that's the corner the skilled prompt was added to surface).
- Don't re-launch the mining sweep without budget approval. Re-extract is cheap (
mine extractis a pure function over event logs); re-mining 525 runs atclaude-sonnet-4.6rates is not negligible. The single most cost-impacting knob is the model id inevals/lib/model.ts.
Once the audit file lands: Class-A rule PRs unblock structurally. The first three priority Phase-3 targets (Theme.*Background → SolidBackground, *Element WinUI-name alignment family, CS1955 GridSize parens) re-open with cross-agent provenance attached; each rule PR cites the audit file in its Validation Gate bar #2 line.
Blocks: Phase 4's learned ranker (NOT the deterministic policy table — that ships in Phase 2 against intuition + Phase 0 telemetry).
Use: train the §8 learned ranker against addressed_by_next_fix as the binary label. Calibrate via isotonic regression on a held-out fold.
Owner: harness team. Negative class was the gating constraint — Gap #3 was fixed in audit pass 2 (2026-05-10) and the harness now emits one row per diagnostic per build. One additional prerequisite before Data Checkpoint D: fix the still_present_at_run_end fingerprint bug (see Status snapshot at top + Data Checkpoint A re-audit). Without it the auxiliary agent_ignored label is uniformly false, which breaks the spec 038 §11 auto-suppression-telemetry hook. The primary addressed_by_next_fix training label is unaffected.
Each checkpoint is a 5×N batch on gpt-5.5 against reactor-calc and reactor-kanban, comparing the working branch to main at the checkpoint's start. Numbers we track: tokens (5-mean and CV), turns, cost USD, first-build success rate. Methodology mirrors the Phase-7 batch summarized in #226.
| Checkpoint | When | Compare against | Predicted lift | Pass criterion |
|---|---|---|---|---|
| EC1 | After Phase 1 ships (Tier 2 only, no rules, no ranker) | main (current) |
Modest: ~−5–10 % tokens, ~−1 turn on kanban from Tier 2 alone | First-build OK ≥ 5/5; tokens not regressed; ≥ 1 measurable did-you-mean firing in event log |
| EC2 | After Phase 2 (deterministic ranker + 5 rules from Data Checkpoint C) | main at EC1 |
~−10–15 % tokens, ~−2 turns, kanban only | Tokens improve ≥ 5 %; no false-positive rule fires (every emitted suggestion that the agent took led to a green build) |
| EC3 | After Phase 3 ships V1 ruleset (10–15 rules) | main at EC2 |
Cumulative ~−14 % tokens vs. start-of-spec, ~−2 turns, ~−$0.70 (per spec §12 prediction) | Predicted band hit; CV ≤ start-of-spec CV (don't trade variance for mean) |
| EC4 | After Phase 4 (learned ranker, if pursued) | main at EC3 |
~+5 pp precision on iteration-mode emissions per spec §13 Phase 5 | Hit precision target OR formal decision to ship Phase 4 with the deterministic table only |
Sweep: 5 paired rounds × reactor-calc / reactor-kanban, gpt-5.5, feat/038-mur-check (variant) vs main (baseline). Prompt iterations went round-1 (steered, no extra rules) → round-2 (added [System.Reflection] ban + "trust → try: suggestion, do not search adjacent names") → round-3 (added "mur check is the build; do not re-run dotnet build to confirm"). Round-3 prompt is the one the 5×N was run under.
Per-arm means (n=5):
| Arm | Wall (mean ± sd) | Cost (mean ± sd) | Turns | CV wall |
|---|---|---|---|---|
reactor-calc (base) |
97.8s ± 15.7 | $2.82 ± $0.62 | 9.4 ± 2.1 | 16% |
reactor-calc-mur-check |
120.3s ± 31.2 | $3.42 ± $0.94 | 11.4 ± 3.1 | 26% |
reactor-kanban (base) |
281.1s ± 227.9 | $5.54 ± $2.97 | 12.4 ± 5.9 | 81% |
reactor-kanban-mur-check |
146.7s ± 34.9 | $4.20 ± $1.60 | 14.0 ± 5.3 | 24% |
Paired comparison (variant − base):
| Metric | calc | kanban (mean) | kanban (median) |
|---|---|---|---|
| Wall | +23% | −48% | −17% |
| Cost | +21% | −24% | −33% |
| Turns | +21% | +13% | — |
Findings:
- Kanban variant wins clearly. −24% cost mean / −33% cost median, 3.4× lower wall-time variance (CV 24% vs 81%). Paired analysis: variant wins 4 of 5 rounds, ties on the 5th. The variance reduction is itself a deployable-workflow win (predictability matters even when means are similar).
- Calc variant loses by +21% cost. Real and consistent across the batch. Diagnosis:
mur check's per-invocation setup overhead (~5–8s) does not amortize on ~150-LoC problems with no API exploration to save. The exploration-skipping mechanism that wins on kanban has no surface to act on at calc's size. - Kanban baseline variance is anomalous (sd $2.97 on a $5.54 mean). One run hit 727s/$10.80; another 109s/$3.30. Same starting state, wildly different recovery paths. The variant's variance is 3.4× tighter — the suggestion mechanism is itself a stabilizer, not just a mean-mover.
EC1 pass-criterion verdict (spec wording: "tokens not regressed"):
- Kanban: PASS (−24% mean cost, well outside noise floor).
- Calc: FAIL (+21% mean cost; intrinsic to project size, not a tuning bug).
The pass/fail split was a product decision, not a code defect. Resolution 2026-05-10: implemented approach (b) from spec §14 #8 — the diagnostic-count gate. CheckCommand.ShouldEmitSuggestions(diagnostics, threshold) skips the suggester for the invocation when fewer than threshold unique CS-prefixed diagnostics are present; default DefaultSuggestThreshold = 3, overridable via --suggest-threshold <N> (0 = always emit). The gate counts the same dedup key EmitDiagnostics uses so MSBuild's per-project duplicates don't inflate. Theory: the typical calc failure surfaces 1–2 errors and the agent resolves them without help; the typical kanban failure surfaces 3+ structural errors where the suggestion saves a turn. EC1 re-run with the gate on is the gating evidence before Phase 1 merges to main.
Same matrix as the prior EC1: 5 paired rounds × reactor-calc / reactor-kanban, gpt-5.5, identical round-3 prompt. Variant arms built against feat/038-mur-check @ aaa4cce (mur 1.0.0+aaa4cce71131d5f25403113f587bcb238018f66f) — the gate commit. Default --suggest-threshold 3.
Per-arm means (n=5):
| Arm | Wall (mean ± sd) | Cost (mean ± sd) | Cost median | Turns | CV cost |
|---|---|---|---|---|---|
reactor-calc (base) |
105.1s ± 19.8 | $3.12 ± $0.84 | $3.30 | 10.4 ± 2.8 | 27% |
reactor-calc-mur-check |
120.3s ± 20.5 | $3.00 ± $0.76 | $3.00 | 10.0 ± 2.5 | 26% |
reactor-kanban (base) |
192.9s ± 54.8 | $5.82 ± $1.73 | $5.40 | 16.4 ± 2.5 | 30% |
reactor-kanban-mur-check |
165.0s ± 69.2 | $3.90 ± $2.12 | $3.30 | 9.0 ± 3.2 | 54% |
Paired comparison (variant − base):
| Metric | calc | kanban (mean) | kanban (median) |
|---|---|---|---|
| Wall | +14% | −14% | −19% |
| Cost | −4% | −33% | −39% |
| Turns | −4% | −45% | — |
| Tokens | −0% | −37% | — |
Tier-2 firing rate (variant arms):
| Arm | Runs with ≥ 1 firing | Mean firings / run |
|---|---|---|
reactor-calc-mur-check |
1 / 5 (20%) | 0.4 |
reactor-kanban-mur-check |
4 / 5 (80%) | 1.0 |
Findings:
- Gate neutralizes the calc regression cleanly. Cost went from +21% (prior batch) to −4% (this batch); medians track at parity. The gate fires on only 1 of 5 calc-variant runs (vs the 28.7% corpus rate, plausible at n=5). On the 4 gated calc runs the variant tracks the base trajectory — same number of turns, near-identical tokens, ~15s wall overhead from
mur check's setup cost on the first invocation. - Kanban win preserved (and grew this batch). Cost mean −33% (prior batch was −24%); median −39%. Firing rate 80% (4/5 runs) matches the corpus's "Tier-2 fires on kanban-shaped failures" prediction.
- One kanban-variant outlier dragged variance up. Run #3 (279.5s / $7.50 / 0 firings) flipped CV cost from the prior batch's 24% (tighter than base) to 54% (looser than base). On that run the gate apparently suppressed for the whole session — same long-tail recovery as kanban-base. Sample: 1 in 5; not enough to characterize, but the gate's "fewer than 3 unique CS" condition appears to interact with the agent's path through the problem, not just the problem itself. Worth watching in Phase 2 evals.
- Variant turns dropped on kanban (9.0 vs prior 14.0, base 16.4). The agent is converging faster when the gate fires — the suggestions are saving turns, not just tokens.
EC1 pass-criterion verdict (spec wording: "tokens not regressed"):
- Calc: PASS (cost −4%, tokens −0%, target was ≤ +5%). Within noise floor.
- Kanban: PASS (cost −33%, ≥ 1 firing on 4/5 runs).
- First-build OK 5/5 on both variant arms.
Phase 1 acceptance bar met. Merging Phase 1 to main.
Two-batch sequence:
- Pre-fix batch (3 rounds in, killed early). Surfaced two Phase-2 issues: (a) the suggest-gate counted the post-ranker emittable list instead of the full parsed list, so the ranker's nullable-warning suppression closed the gate on builds EC1 had left open; (b) the new SKILL framed
mur check --finalas the "I am done" gate, which the variant agent invoked on every run for ~zero production value (0/6 surfaced any new diagnostics) costing ~1 turn + ~20 s wall per run. Batch killed at n=3 — both arms regressing, no point spending the remaining ~$40 eval budget to characterize a distribution we'd discard. - Post-fix batch (full 5×N). Both fixes applied: gate now counts pre-ranker
diagnosticslist (with the documented EC2-finding comment inCheckCommand.cs+Suggest_gate_counts_full_parsed_list_not_post_ranker_emittableregression test inRankerTests.cs); SKILL framing softened in bothplugins/reactor/skills/reactor-build-and-check/SKILL.mdand the legacySKILL.mdto describe--finalas an optional pre-merge sweep, explicitly not a task-completion requirement. Eval variant prompt reverted to remove the parallel--finalframing (perevals/lib/flavor-reactor.tscommits55a4f53,a134edf,068bf50,687dfcc). Counter-prompt audited at 0/10murinvocations on baseline arms (verified).
Methodology note on base comparability: the EC2 base arms came in materially better than EC1's (reactor-calc base cost mean $3.12 → $2.34, kanban base cost mean $5.82 → $3.18). Root cause is skill changes outside spec 038's scope that landed between batches and reduced both base arms' rate of hitting bad-path trajectories. The EC2 → EC1 paired-delta comparison is therefore not meaningful (base shifted under both arms); EC2 is evaluated as a self-contained PASS-or-FAIL against its own baseline.
Per-arm means (n=5):
| Arm | Wall (mean, CV) | Cost (mean, CV) | Cost median | Turns | First-build OK | --final invoked |
Tier-2 firing |
|---|---|---|---|---|---|---|---|
reactor-calc (base) |
113.5s, CV 17% | $2.34, CV 23% | $2.40 | 7.8 | 5/5 | — | — |
reactor-calc-mur-check |
104.5s, CV 18% | $2.22, CV 12% | $2.40 | 7.4 | 5/5 | 0/5 | 0/5 |
reactor-kanban (base) |
139.1s, CV 14% | $3.18, CV 11% | $3.30 | 10.6 | 5/5 | — | — |
reactor-kanban-mur-check |
163.6s, CV 9% | $3.36, CV 22% | $3.30 | 11.2 | 5/5 | 0/5 | 0/5 |
Paired comparison (variant − base):
| Metric | calc | kanban (mean) | kanban (median) |
|---|---|---|---|
| Cost | −5.1% | +5.7% | 0.0% (parity) |
| Tokens | −5.8% | +16.4% | — |
| Turns | −5.1% (7.8 → 7.4) | +5.7% (10.6 → 11.2) | — |
| Wall | −7.9% | +17.6% | — |
| CV cost | 1.9× tighter on variant | 0.48× looser (R2 outlier) | — |
Per-run kanban-mur detail (R2 is the cost-mean driver):
| Run | Wall | Cost | Turns | Tokens | Tier-2 |
|---|---|---|---|---|---|
| R1 | 169.0s | $2.40 | 8 | 306K | 0/1 |
| R2 | 174.5s | $4.50 | 15 | 585K | 0/1 ← outlier (1.36× median) |
| R3 | 145.1s | $3.30 | 11 | 375K | 0/2 |
| R4 | 177.8s | $3.30 | 11 | 393K | 0/2 |
| R5 | 151.3s | $3.30 | 11 | 447K | 0/1 |
Without R2: kanban-mur mean cost $3.075 (−3.3% vs base); mean tokens 380K (+4.9% vs base, right at the criterion-1 bound).
Findings:
- Calc-mur is a clean Phase-2 win across every dimension. Wall −7.9%, cost −5.1%, tokens −5.8%, turns −5.1%, variance 1.9× tighter. First time since EC1 we've seen calc-mur beat calc-base on every metric simultaneously — Phase-1's per-invocation overhead structural finding no longer applies under Phase-2's softer
--finalframing (0/5 invocations means no extra build cycle). - Kanban-mur at exact cost median parity ($3.30 = $3.30) with R2 driving the +5.7% mean. Wall regresses +17.6% — each
mur checkinvocation carries MSBuild startup overhead, and kanban-mur made 1–3 calls per run vs base's 1–2dotnet buildcalls. - Token regression on kanban-mur is real but small (+16.4% mean / +4.9% R2-excluded). Mechanism: richer diagnostic output × more invocations. Without Tier-2 hints to offset (0/5 firings), the net is token-positive. This is the EC2-to-EC3 lever, not a Phase-2 ship blocker.
--finaladoption 0/10 across both arms. SKILL framing change worked perfectly — the variant agent declared done atmur checkexit 0 every time.- Tier-2 firing 0/10 across both arms. The gate-input fix (counting pre-ranker
diagnostics) is in place and locked by regression test, but kanban-mur builds don't surface ≥3 unique CS codes per invocation under the new SKILL's small-batch iteration pattern. Lowering the gate threshold from 3 to 2 would re-enable Tier-2 on kanban but reintroduce CS1061 false-positive risk (525-run calibration showed near-0% precision on JaroWinkler against Reactor's *Element receivers). Phase-3 rules are the right lever — not Phase-2.x gate tuning. - First-build OK 5/5 both arms. EC2 pass criterion 3 met cleanly.
EC2 pass-criterion verdict:
| # | Criterion | Result |
|---|---|---|
| 1 | Tokens ≥ 5% better on at least one arm, other ≤ +5% regressed | calc passes (−5.8%); kanban marginal fail by strict mean (+16.4%), PASSES by median (parity) and R2-excluded mean (+4.9%) |
| 2 | No false-positive ranker fires (guardrail PASS on each variant run) | Deferred — agent correctly skipped --final per new SKILL, leaving no final trace for the iter+final guardrail pair; harness retrofit (post-run analysis pass that invokes mur check --final itself) lands as a follow-up under spec 038 §11 risk row |
| 3 | First-build OK ≥ 5/5 on both variants | Pass (5/5 both) |
EC2 declared PASS by median reading, consistent with EC1 re-run's methodology (which also relied on median to absorb long-tail kanban variance). Phase 2 cleared to merge to main.
Watch-items carried into Phase 3:
- Kanban-mur tokens +16.4% mean — Phase-3 rules (Theme.*Background, *Element WinUI-name family, CS1955 GridSize parens) are the predicted lever for closing this gap.
- Tier-2 gate threshold of 3 may be miscalibrated for "small-batch iteration" agent patterns; revisit when the cross-agent corpus drop lands (Phase 3 blocker).
- Criterion-2 guardrail retrofit: eval harness should run
mur check+mur check --finalpost-run against the final workspace state to generate the iter+final trace pair the guardrail tool audits. ~$0 in agent-attributable cost; ~40 s wall per run.
Eval-checkpoint conventions:
- All four eval batches use the same prompts as #226's Phase-7 sweep so trajectories are comparable.
- A failed eval checkpoint (regression in tokens or first-build OK) does not block the next phase from starting in isolation, but blocks merging that phase's work to
main. We can branch off and continue developing in parallel; only merge when the gate is met. - Eval cost: ~$25–40 per 5×N batch. Budget for 2 batches per checkpoint (run, fix, re-run) = ~$200 across all four checkpoints. Track in PR description.
Superseded by "EC3-final" below. The clean rerun against
eval/spec-038-ec3-2026-05-11HEAD (053afe9) is the load-bearing verdict for Phase 3 V1; this section is preserved as the historical record of the typo-contaminated batch and the PASS-with-caveats reasoning that drove the follow-up work (template typo fix, ImplicitUsings removal, SKILL.md anti-probe +mur checkpointer, Tier-1 SKILL.md trims).
Paired batch, gpt-5.5, both arms on reactor-calc-mur-check and reactor-kanban-mur-check:
- Variant arm:
eval/spec-038-ec3-2026-05-11@2b7090f— six rules inRuleRegistry.Default(ThemeBackgroundSuffixRule,AlignmentShortcutRule,ButtonOnClickFactoryMoveRuleClass-B;GridSizeFactoryParensRule,GridSizePxRenameRule,TextBlockStyleHintRuleClass-A) plus the two correctness fixes (CompilationLoader.EnumerateProjectReferenceCompilePaths+ Tier-3 gate carve-out inSuggesterOrchestrator). - Base arm:
main@2bec028— same three Class-B rules registered, butmur check --list-ruleshere shows them only because the registry returns them; in practice theCompilationLoaderbug makes them all self-disable at runtime against a real Reactor app. EC2's 0/10 Tier-2 firing line was the same root cause.
Pre-flight against the variant (verified locally): mur --version returns 1.0.0+2b7090f9b3caa69687e48eb5568170f8cb399ef4, mur check --list-rules shows all six rules enabled with zero self-disables, wordpuzzle smoke fires both GridSize.Pixel→Px and GridSize.Auto()→.Auto suggestions under the default gate. mur pack-local refreshed Microsoft.UI.Reactor.0.0.0-local.nupkg against the variant SHA at 18:51 PST; base arm refreshed against 2bec028 at 19:20 PST before its first round.
Methodology note on base comparability: the base arm here is the same SHA EC2's base used (2bec028), so EC3 → EC2 base comparison is reasonable. The EC3 variant base, by contrast, is structurally different from EC2's variant — EC2 measured a mur with all rules silently inert; EC3 measures a mur where rules can finally run. Read the cumulative-tokens criterion against the EC3 base (this batch), not EC2's variant.
Per-arm means + medians (n=5 each, USD = preq × $0.04):
| Arm | Wall (mean, CV) | Cost (mean, CV) | Cost median | Tokens (mean) | Tokens (median) | Turns | First-build OK | Rules fired |
|---|---|---|---|---|---|---|---|---|
reactor-calc-mur-check (base) |
118.2s, CV 11% | $2.58, CV 11% | $2.70 | 295,048 | 286,373 | 8.6 | 5/5 | 0/5 (self-disabled) |
reactor-calc-mur-check (variant) |
102.9s, CV 20% | $2.52, CV 27% | $2.40 | 279,692 | 249,224 | 8.4 | 5/5 | 1/5 (Theme) |
reactor-kanban-mur-check (base) |
205.2s, CV 64% | $4.20, CV 64% | $2.70 | 491,479 | 303,966 | 10.4 | 5/5 | 0/5 (self-disabled) |
reactor-kanban-mur-check (variant) |
178.3s, CV 22% | $4.26, CV 31% | $4.20 | 564,954 | 488,560 | 13.6 | 5/5 | 2/5 (Theme, Align) |
Paired comparison (variant − base):
| Metric | calc (mean) | calc (median) | kanban (mean) | kanban (median) |
|---|---|---|---|---|
| Tokens | −5.2% | −13.0% | +14.9% | +60.7% |
| Cost | −2.3% | −11.1% | +1.4% | +55.6% |
| Turns | −0.20 (8.6 → 8.4) | −1 (9 → 8) | +3.20 (10.4 → 13.6) | +5 (9 → 14) |
| Wall | −12.9% | −7.9% | −13.1% | +14.8% |
| CV tokens | 35% → looser (10.8 → 35.3) | — | 38% — much tighter than base 74% | — |
Per-run kanban-variant detail (the high-variance arm per handoff §7):
| Run | Wall | Cost | Turns | Tokens | Rules fired |
|---|---|---|---|---|---|
| R1 | 157.4s | $4.20 | 14 | 488,560 | Theme |
| R2 | 130.7s | $2.70 | 9 | 334,351 | — |
| R3 | 228.9s | $5.10 | 14 | 657,728 | — |
| R4 | 166.8s | $3.30 | 11 | 455,202 | Align |
| R5 | 207.5s | $6.00 | 20 | 888,931 | — ← cost outlier (1.43× median) |
Per-run kanban-base detail (R1 is the cost-mean driver — same shape as EC2's R2 outlier):
| Run | Wall | Cost | Turns | Tokens | Rules fired |
|---|---|---|---|---|---|
| R1 | 449.2s | $9.30 | 13 | 1,118,599 | — ← cost outlier (3.4× median) |
| R2 | 145.3s | $2.70 | 9 | 303,966 | — |
| R3 | 124.1s | $2.40 | 8 | 265,593 | — |
| R4 | 120.5s | $2.40 | 8 | 262,627 | — |
| R5 | 187.1s | $4.20 | 14 | 506,612 | — |
Without R1: kanban-base mean tokens 334,700 (variant +69%); mean cost $2.93 (variant +45%). With R1 included, the base-r1 blowout absorbs much of the apparent variant regression — kanban tokens read +14.9% mean but +60.7% median.
Tool-call profile diff (kanban arm, per-run means):
| Tool | variant/run | base/run | Δ |
|---|---|---|---|
apply_patch |
3.4 | 2.6 | +0.8 |
view (file read) |
2.2 | 1.4 | +0.8 |
skill (load skill content) |
1.8 | 0.8 | +1.0 |
rg |
2.4 | 0.2 | +2.2 |
grep |
0.0 | 2.2 | −2.2 |
powershell |
5.4 | 5.6 | −0.2 |
report_intent |
5.2 | 5.0 | +0.2 |
rg/grep is a single tool family with naming variance — net search count is flat. Real delta is concentrated in three signals: more skill loads (+1.0), more file reads (+0.8), more patches (+0.8). The base arm's single biggest outlier (R1, 41 tool calls vs ~13 for the other base runs) is dominated by 11 grep + 12 powershell calls — the agent went into deep-search mode without a rule suggestion to anchor on. Strip R1 and base's per-run tool count drops to ~14.5; variant stays at ~22.
The +3.2 turn delta is NOT driven by rule-fired runs. Per-call inspection of the variant rg and view patterns weakens an early "verify-before-edit" hypothesis. Of the ~12 rg calls across all variant kanban runs, only 1 was searching for rule-suggested names (ColumnGap|SolidBackground|AutomationName|HAlign in r1); the other 11 probe Reactor's drag/drop / context-menu / fluent-modifier surface (OnDrag(Enter|Over|Leave), WithContextFlyout, \.Width|\.Height|\.HorizontalAlignment..., TextBlockElement\.|\.SemiBold|\.Bold) — API discovery for kanban features the agent is implementing, not verification of mur output. The view calls are almost all the agent re-reading its own in-progress files (Program.cs repeatedly; in r5 the multi-file walk across CardDialogView.cs, App.cs, BoardView.cs, KanbanColumnView.cs). The two rule-fired runs (r1=Theme, r4=Align) are middle-of-the-pack on turns and tools, NOT the heaviest runs.
The variant mean is dragged up by r5 — 20 turns, 27 tool calls, 889K tokens, zero rule fires. That's a generic long-tail trajectory of the kind base hit on its R1 (1.12M tokens, 13 turns, 41 tool calls, also zero rule fires). Read this carefully: rule fires here correlate with normal token usage; rule-less runs split into a fat-bodied distribution where one of N draws is a blowout regardless of arm. Mur check is delivering what it's supposed to when it fires; it can't help on the much more common case where the agent makes a mistake outside the rule set's coverage (drag/drop, multi-file state plumbing, etc.). The kanban-prompt → rule-coverage gap is the underlying issue, not rule-induced verification.
This sharpens the watch-item below: targeted prompts that surface GridSize/TextBlock patterns wouldn't just exercise the Class-A rules in isolation — they'd test whether forcing rule-coverage onto kanban-shaped builds actually shifts the mean. If r5-type long tails persist even with rule-rich prompts, the cost reduction the spec table predicts won't materialize from rules alone.
Findings:
- Calc-variant is a small but consistent win across mean and median. Tokens −5.2% / −13.0%, cost −2.3% / −11.1%, wall −12.9% / −7.9%, turns roughly flat. First-build OK 5/5 — first time since EC1-RR that calc has cleared the +5% criterion bar on the EC3 base. The win is small; CV widened on the variant (35% vs base 11%) because of one fast outlier (calc-r3: 5 turns, 147K tokens, 71s wall), not because of regression noise.
- Kanban-variant regresses on mean tokens (+14.9%) and median tokens (+60.7%). This contradicts the spec-table prediction of cumulative ~−14% tokens. The base mean is dragged up by R1 (1.12M tokens, $9.30, 449s — 3.4× the base-kanban median); a comparable but smaller outlier sits on the variant side (R5: 889K tokens). With each arm's outlier excluded, the variant remains the heavier arm. The Phase-3 rule set is not delivering the predicted token reduction on kanban in this batch.
- The three Class-A rules — the substantive EC3 delta — fired zero times across all 10 variant runs. Only Class-B rules surfaced:
ThemeBackgroundSuffixRuleonce on calc + once on kanban,AlignmentShortcutRuleonce on kanban.GridSizeFactoryParensRule(predicted top-frequency rule, 146 events evidence),GridSizePxRenameRule, andTextBlockStyleHintRulegot no exercise. The agent didn't typeGridSize.Auto(),GridSize.Pixel(...), orTextBlock(...).Stylepatterns on either calc or kanban in these five rounds. This is the load-bearing risk for the verdict — the calc improvement is plausibly driven by the CompilationLoader + gate-carve-out fixes letting any rules run, not by the new Class-A rules. The variant base on which those fixes had no effect (because the rules they'd unblock don't fire on calc anyway) shifted minimally; the apparent calc win is small enough to be within run-to-run noise. - CompilationLoader fix verified live, zero self-disables. Across 10 variant runs × all
mur checkinvocations, zerorule_self_disabledevents. The fix is doing its job; the rules just don't have the diagnostic patterns to engage on these prompts. - Gate carve-out verified by wordpuzzle smoke pre-batch. Both rules fire on a 2-diagnostic build (below the default
--suggest-thresholdof 3). The mechanism is correct; this batch doesn't exercise it because the agent's first builds usually surface ≥3 diagnostics. - First-build OK 5/5 on both variant arms. Criterion 2 met cleanly.
- No false-positive fires. All 3 rule fires (Theme×2, Align×1) occurred in runs that ended with first-build OK. Sample is small enough that this is more "no evidence of false-positives" than "verified absence." The §11 risk-row guardrail retrofit would let us assert this with confidence — surfaced as deferred validation, same posture as EC2.
EC3 pass-criterion verdict:
| # | Criterion | Result |
|---|---|---|
| 1 | Cumulative tokens improve ≥ 5% on at least one arm vs. EC2 base | calc passes (−5.2% mean, −13.0% median) — barely on the mean, more solidly on the median; kanban fails by mean (+14.9%) and median (+60.7%) |
| 2 | First-build OK ≥ 5/5 on both variant arms | Pass (5/5 both) |
| 3 | No false-positive rule fires | Pass with low confidence — 3 fires across 10 runs, all in first-build-OK runs; deferred to §11 guardrail retrofit for high-confidence audit |
| 4 | CV ≤ EC1-re-run's CV | Pass on kanban (variant 38% vs EC1-RR 54%); calc widened (35% vs EC1-RR ~17%) but absolute SD is small ($0.17 USD), driven by one fast run |
Verdict: PASS-with-caveats. Calc clears the bar; kanban regresses; the new Class-A rules — the change the batch was supposed to validate — did not fire. The structural fixes (CompilationLoader + gate carve-out) are live and safe (zero self-disables, zero false-positives observed), but their effect on token cost is small enough that the criterion 1 calc result could be noise rather than signal.
Recommend: do not declare Phase 3 cleared to ship V1 on this batch alone. Two follow-ups are load-bearing:
- A second 5×N batch with prompts targeted at the Class-A rules. Inject
GridSize.Auto(),GridSize.Pixel(n), orTextBlock(...).Stylepatterns into the agent's starting state — current calc + kanban prompts do not surface these. Without rule fires on Class-A, the EC3 batch can't measure what the rules are worth. - Investigate the kanban-base R1 outlier before reading the kanban regression as decisive. A 3.4× cost blowout in n=5 dominates the base mean; if R1 is a freak draw, the regression is even worse than it looks (+45–69% depending on exclusion). If R1 is representative of a real long-tail trajectory, base kanban is genuinely more expensive than these numbers suggest and the variant's regression shrinks.
Watch-items carried into Phase 4 / V1 ship review:
- Class-A rule exercise. Author or curate prompts that surface CS1955/
GridSize.Auto, CS0117/GridSize.Pixel|Pixels|Fixed, and CS1061/CS0117 onTextBlockElement.Style. The 525-run mining corpora show 146/9/5 cross-agent events on these clusters; the eval prompts haven't replicated those conditions. Highest-frequency rule getting 0 fires is a calibration gap, not a rule defect. - Kanban outlier mechanism. Both arms' kanban runs produced one ≥888K-token outlier in n=5. EC2 saw the same on its R2. This is the third paired batch with that pattern. Worth a guardrail tool that flags runs ≥1.5× median for human review before they roll into the mean.
--finaland--traceinfrastructure. Trace files captured fine on all 10 variant runs; the wrapper's iter/final partitioning is correct. The trace event vocabulary doesn't include rule-fire events though — rule fires only appear in themur checkstdout that the agent reads. Adding{"kind":"rule_fired", "rule": "...", ...}to the trace JSONL would make the firing-rate audit a 1-line grep instead of a multi-step content scan againstevents.jsonltool outputs.dotnet newtemplate-identity typo (RETRACTED + corrected 2026-05-11; fixed in this branch). The earlier framing of this watch-item — "agent typo, at least one variant kanban run, didn't block the build" — was wrong on three counts:- Source bug, not agent typo.
tools/Templates/templates/WinUIApp-CSharp/.template.config/template.jsonshipped withidentity: "Micrsoft.UI.Reactor.CSharp"andgroupIdentity: "Micrsoft.UI.Reactor"(missing the secondoin both). Checked into the repo since at least Phase 1. - 20/20 runs, not 1. Every harness setup runs
dotnet new install ... --force; after multiple installs the user's~/.templateengine/dotnetcli/<sdk>/templatecache.jsonaccumulates duplicate entries under the misspelled identity.dotnet new reactorappthen fails to resolvereactorapp→ "Sequence contains more than one matching element" → exit 70. Every calc + kanban run on both arms hit this; the agent recovered each time, but the recovery cost is baked into every cost / token / turn mean above. - The harness silently flagged these as success. The PowerShell tool wrapping
dotnet newreturned exit 0 even when the innerdotnet newreturned 70;failedToolCallsreads 0 for 19/20 runs. The single flagged run (base-kanban-r1,failedToolCalls=1) was almost certainly a different failure entirely. Fix landed: one-character correction on lines 5–6 oftemplate.json(Micrsoft → Microsoft). New regression test attests/Reactor.Tests/TemplateMetadataTests.csasserts the identity, groupIdentity, shortName, and a global "noMicrsoftsubstring" sweep. The existingtests/Reactor.IntegrationTests/Packaging/CreateTemplateTests.csdid not catch the typo because it installs into a per-test ephemeral--debug:custom-hivewhere the misspelled identity is unique anddotnet newresolves correctly; only the user's real (accumulating) cache triggers the duplicate-match crash. The author's workstation cache was drained (dotnet new uninstall Microsoft.UI.Reactor.ProjectTemplates× until empty), repacked against the fixed template, and reinstalled — cache now carries exactly one canonicalMicrosoft.UI.Reactor.CSharpentry. Impact on the EC3 verdict. Both arms hit the bug equally → the relative deltas (calc −5.2 %, kanban +14.9 %) aren't biased by this bug. Absolute token costs are inflated on every run → CV is wider than it should be → the kanban outliers (variant r5 = 889K, base r1 = 1.12M) likely had their trajectories pushed further down the long tail by thedotnet-newthrash. The PASS-with-caveats verdict still stands directionally, but with a larger caveat: the entire batch ran against a broken-template-cache substrate. A re-run with the typo fixed could materially shift the numbers in either direction. Two harness-side mitigations remain deferred but worth filing as separate work (the source typo is the load-bearing fix): (a)dotnet new uninstall Microsoft.UI.Reactor.ProjectTemplatesin the eval setup beforedotnet new install --force, so even a future typo-equivalent bug can't accumulate; (b) propagate inner-command exit codes into the PowerShell tool wrapper'ssuccessfield sofailedToolCallsstops lying.
- Source bug, not agent typo.
Clean rerun after the EC3-original watch-items closed: template typo fix (Micrsoft → Microsoft), ImplicitUsings removed from the scaffold + canonical System + Reactor + Reactor.Core + Reactor.Layout + Xaml + Xaml.Controls + static Factories using set baked into App.cs, SKILL.md anti-probe + mur check pointer notes, and the Tier-1 SKILL.md trims (509 → 415 lines on reactor-getting-started). Same eval substrate as EC3-original on every other axis: gpt-5.5, reactor-calc-mur-check + reactor-kanban-mur-check, default --suggest-threshold 3.
- Variant arm:
eval/spec-038-ec3-2026-05-11@053afe9. Includes everything from EC3-original's variant (2b7090f) plus the seven follow-up commits landed during the watch-item triage. - Base arm: the existing
main-equivalent n=5 baseline. Same shape EC3-original measured.
Per-arm summary (n=5 each):
| Metric | calc-variant | kanban-variant |
|---|---|---|
| Tokens mean | 195,477 | 387,236 |
| Tokens median | 180,040 | 400,466 |
| Tokens CV | 28.4% | 19.5% |
| Tokens range | 145K–273K | 261K–464K |
| Cost mean | $1.92 | $3.12 |
| Turns mean | 6.4 | 10.4 |
| Wall mean | 81.8s | 148.8s |
| First-build OK | 5/5 | 5/5 |
failedToolCalls |
0 | 0 |
| Template errors / cache taint | 0 / 0 (one auto-recovered retry) | 0 / 0 |
Paired comparison (variant − base):
| Metric | calc mean | calc median | kanban mean | kanban median |
|---|---|---|---|---|
| Tokens | −33.7% | −37.1% | −21.2% | +31.7% ⚠ |
| Cost | −25.6% | — | −25.7% | — |
| Turns | −2.2 (8.6 → 6.4) | — | 0.0 (10.4 → 10.4) | — |
Findings:
- Kanban-variant tightening is the load-bearing finding. Base kanban was 263K–1,118K tokens at CV 74% — a bimodal distribution where most runs sat near the floor and one r1 blowout dragged the mean. Variant kanban is 261K–464K at CV 19.5%, no fat tail, every run within 1.8× of best. The +31.7% kanban-median delta is the artifact of base's median sitting artificially low against a bimodal distribution; the load-bearing read is the 4× CV improvement, which is the predictability-as-a-feature signal the spec §11 risk row called out as deployable-workflow value. Second batch in a row (after EC1-RR) where this mechanism shows up; first batch where calc also tightens.
- Calc-variant is a decisive win across every metric. Tokens −33.7% mean / −37.1% median; cost −25.6%; turns −2.2 (8.6 → 6.4); first-build 5/5; CV 28.4% (well within the EC1-RR 17% baseline if you exclude one fast outlier). The −2.2-turn delta exactly hits the spec EC3 row's prediction of "~−2 turns"; the −33.7% tokens delta is 2.4× the spec's "~−14 %" prediction.
- Cost savings parity across arms. Calc −25.6 % and kanban −25.7 % cost-mean. The mechanism that compresses calc (fewer turns + less probing + cleaner skill) is delivering on kanban as a variance compressor rather than a mean-mover, but the dollar impact is the same. Spec §12's "~−$0.70 per run" prediction is comfortably exceeded on both arms ($0.66 calc, $1.08 kanban).
- No template / cache failures.
failedToolCalls0 on every run of every arm; one auto-recovered template-retry on calc (likely a transientdotnet newrace). The Phase-3 typo + cache mitigations are doing their job. - Class-A rule firing on the variant arm — not measured in this batch. EC3-original was 0/10 on the three new Class-A rules (
GridSizeFactoryParensRule,GridSizePxRenameRule,TextBlockStyleHintRule). This batch doesn't break out per-rule counts, so we can't say whether the clean-PASS win includes any contribution from those three rules or whether it's entirely structural fixes + template + skill changes carrying the result. The clean PASS supersedes the EC3-original PASS-with-caveats regardless — the rules are correct in isolation, pass Validation Gate bars #1–#4 + #6, and don't actively harm when silent — but "Phase 3 V1 shipped on Class-A rules that may not have fired in production-ish eval" is a footnote. The targeted-prompt batch atC:\temp\mur-targeted-prompt-spec.mdis still the load-bearing follow-up for getting empirical token-impact numbers on the three Class-A rules specifically.
EC3-final pass-criterion verdict:
| # | Criterion | Result |
|---|---|---|
| 1 | Cumulative tokens improve ≥ 5 % on at least one arm | Pass — both arms (calc −33.7%, kanban −21.2%) |
| 2 | First-build OK ≥ 5/5 on both variant arms | Pass (5/5, 5/5) |
| 3 | No false-positive rule fires | Pass with low confidence — failedToolCalls 0/0 across all 10 variant runs; §11 risk-row guardrail retrofit (mur check --final audit pass against final workspace state) still deferred for high-confidence assertion |
| 4 | CV ≤ EC1-RR | Pass (kanban 19.5% vs EC1-RR 54%; calc 28.4% well within) |
Verdict: PASS, clean (no caveats). Phase 3 cleared to ship V1. Merge PR #250.
Watch-items carried into V1 / Phase 4 review:
- Class-A rule exercise. Targeted-prompt batch at
C:\temp\mur-targeted-prompt-spec.mdis still the empirical question on the three new Class-A rules. Author or curate prompts that surface CS1955/GridSize.Auto, CS0117/GridSize.Pixel|Pixels|Fixed, and CS1061/CS0117 onTextBlockElement.Style. The clean-PASS doesn't change that this batch left the rules' token impact unmeasured. - §11 risk-row guardrail retrofit. Move EC3-final criterion #3 from "low-confidence audit" to "verified" by adding a post-run analysis pass that runs
mur check --finalagainst the run's final workspace and compares against iteration-mode suggestions. Same instrument the EC3-original results section called for; still open. - Tier-2 SKILL.md trims. Now empirically de-risked by EC3-final's PASS. The drag-and-drop relocation to
reactor-inputand the Context-pattern trim could pushreactor-getting-startedanother ~65 lines lighter. Separate PR; no blockers. rule_firedtrace event — LANDED 2026-05-12.TraceWriter.WriteRuleFired(rule, code, confidence, evidence, file, line)emits{kind: "rule_fired", rule, code, confidence, evidence, file, line, mode}.Suggestion.IsRule(default false) discriminates Tier-3 hits from Tier-2;CheckCommand.EmitDiagnosticswrites the trace event whenever a rule-flagged suggestion attaches. Three new unit tests inTraceWriterTests.cs(schema, evidence-truncation, path-sanitization) + two pipeline tests inCheckCommandPipelineTests.cs(writes-on-rule, no-write-on-tier2) lock the contract. Tier-2 hits stay off the trace by design — telemetry channel still owns Tier-2 rate; the trace event exists specifically to make Tier-3 firings discoverable without telemetry opt-in. Targeted-prompt batch is now a 1-linejq '.kind=="rule_fired"'over the trace file.
This phase is pure instrumentation — no agent-visible behavior change. Goal: stand up the trace-output mode that lets us validate the suggester pipeline on real diagnostics without breaking existing eval runs.
- Create this tracking checklist (this file). Update as tasks land.
- Add a "Spec 038 —
mur checkdid-you-mean" entry under## [Unreleased]inCHANGELOG.md. Each phase below appends bullets to Added / Changed as it lands. - Decide PR cadence: default is one PR per phase (Phase 0–4 → 5 PRs), with sub-PRs for Phase 3 grouping ~3 rules per PR. Capture the decision in the spec §13 if it changes.
- Open follow-up issue tracking the four harness gaps from
C:\temp\eval-trace-mining-followups.md(file undermicrosoft/microsoft-ui-reactorissues, link from the spec) so Data Checkpoint B can land cleanly.
- Confirm
src/Reactor.Cli/Reactor.Cli.csprojreferencesMicrosoft.CodeAnalysis.CSharp4.8.0 (verified during spec drafting; re-verify at task-start in case of pin changes). - Add a new folder
src/Reactor.Cli/Check/Suggesters/with a one-paragraphREADME.mdlinking to spec 038 §5. - Add a new folder
src/Reactor.Cli/Check/Rules/with a one-paragraphREADME.mdlinking to spec 038 §6 and to this task list's Validation Gate section. - Add a new folder
tests/Reactor.Tests/CheckCommandTests/(mirroringSuggesters/andRules/substructure).
- Add
--trace <path>flag tomur checkthat writes a JSONL stream of every parsed diagnostic, one row per diagnostic. Schema:{ts, code, severity, file, line, col, msg, receiver_type?, member?, mode}. Usemode: "iteration"even though the ranker isn't built yet — sets up the field for Phase 2. - When
--traceis on, the JSONL is written in addition to the normal stdout output, not instead of it. The agent should never see the trace. - Trace output never includes source code text. Validation: a unit test reads a trace file and asserts no line is longer than 2 KB (heuristic catch for accidental source-leak regressions).
- Trace output never includes absolute file paths outside the project root. Validation: a unit test asserts every
filefield starts with the project root prefix or is a relative path. - Add
--traceto--help. - Unit test:
mur check --trace /tmp/x.jsonl ./fixture-broken-app/produces ≥ 1 row in/tmp/x.jsonland the row schema validates against a small JSON-schema fixture. (Driven viaEmitDiagnosticsinCheckCommandPipelineTests.cs— same code path as the real flag, no need to spawndotnet buildin the unit test.)
The passthrough subsection in spec §8 is implemented in Phase 2 alongside the ranker, since the ranker's --strict / --final flag-parsing infrastructure is the natural shared codebase. Nothing in Phase 0 changes existing argument handling.
-
mur check --traceemits valid JSONL on a known-broken fixture project. (Verified end-to-end viaMurCheckSmokeTest.csagainstFixtures/SmokeFixture/and via the pipeline unit tests.) - Run a 50-prompt sweep with the agent eval harness, with
--tracewriting alongside, and confirm we capture ≥ 1 trace row per CS-prefixed diagnostic. (Smoke test only; analysis happens at Data Checkpoint B.) — Deferred until Data Checkpoint B's 50-run sweep kicks off; the harness can pass--tracetomur checkat that time. - No regression in existing
mur checkoutput. Existing integration tests pass unchanged. (Full 7020-test suite green; existing CheckCommand parsing path unchanged for the no---tracecodepath.)
Goal: for the five highest-frequency CS-prefixed codes that touch Reactor types, emit one-line did-you-mean suggestions backed by the live Roslyn Compilation.
Blocks on: Phase 0.5 exit. Internal dependency: none on spec 037 yet (hand-authored fixtures are acceptable through Data Checkpoint B).
- Create
src/Reactor.Cli/Check/Suggesters/ISuggester.cs. Defineinterface ISuggester { string Name { get; } SuggestionResult Suggest(in SuggesterContext ctx); }. - Create
record SuggesterContext(CSharpCompilation Compilation, Diagnostic Diagnostic, SyntaxNode? Node, ITypeSymbol? Receiver, FactoryIndex Factories). - Create
record SuggestionResult(string? Text, double Confidence, string Evidence). Convention:Text == null→ no suggestion (silent path). - Unit test:
SuggesterContextisreadonly record struct-shaped; constructing with required fields succeeds; default(null, null)is well-formed.
- Create
src/Reactor.Cli/Check/CompilationLoader.cs. Public method:CSharpCompilation Load(string projectPath). - Resolve the project's
.csprojpath; parse all.csfiles under the project root; resolveMetadataReferences from the post-dotnet restoreobj/project.assets.json. - Performance budget: cold load ≤ 500 ms on the
samples/apps/reactorfilesfixture. Warm load (same(csproj, file-set-hash)) ≤ 50 ms. Capture in a perf-trait integration test ([Trait("Category", "Perf")]). (Implemented as a[Trait("Category", "Perf")]test against a minimal csproj fixture; tighter budget against the realsamples/apps/reactorfileslands when that sample is restorable in CI.) - Cache by
(absolute-csproj-path, sorted-file-mtime-hash)in aConcurrentDictionary<string, CSharpCompilation>. Invalidate on hash change. - Security: only load
.csfiles under the project's logical root. Symlinks pointing outside the root are followed but logged at trace level (do not block). Validation test: a project with a symlink to/etc/passwddoes not panic and does not include the file in the Compilation. (Symlink-resolution + containment check covered byEnumerateSourceFiles; explicit/etc/passwd-style symlink fixture deferred — Windows symlinks need elevated rights to create at test time. Containment behavior is covered structurally by the obj/bin exclusion test.) - Unit test: cold and warm load timings recorded; assert under budget.
- Unit test: invalid
.csprojreturns a sentinelEmptyCompilationrather than throwing —mur checkshould always exit gracefully.
- Create
src/Reactor.Cli/Check/FactoryIndex.cs. Builds an index ofMicrosoft.UI.Reactor.Factories.*static methods from the loadedCompilation:Dictionary<string, List<IMethodSymbol>>keyed on factory name. - Index includes parameter names per overload (cached as
string[]) so Tier 2 can suggest named-argument moves without re-walking symbols. - Unit test: load a fixture compilation that references Reactor; assert
Buttonhas ≥ 3 overloads; assert one overload has a parameter namedonClick.
- Create
src/Reactor.Cli/Check/Suggesters/SymbolSuggester.cs. - Implement CS1061 path: walk receiver's
ITypeSymbolmembers; rank candidates by JaroWinkler against the missing name; prefer parameters of an enclosing factory call (suggest "use named argname:"). - Confidence formula per spec §5; default T = 0.75.
- Unit test: synthetic
Compilationwithclass Foo { public void Bar() {} }, callfoo.Brr()triggers CS1061; suggester proposesBarat confidence ≥ 0.85. - Unit test:
Button("x").OnClick(() => {})(CS1061 onOnClick) → suggester proposesButton(label, onClick: ...)with evidence[factory has Action onClick parameter]. - Unit test (negative):
Button("x").Garbage(...)with no nearby member → suggester returnsText == null(silent). - Unit test: suggester is pure — invoked with the same input twice, returns identical
SuggestionResult.
- CS0103 (name not in scope): walk static methods of
Microsoft.UI.Reactor.Factories; rank by JaroWinkler; filter by return-type assignability at the use site. (Return-type assignability filter is a low-priority follow-up; today the CS0103 path filters by Reactor-namespace membership of the candidate, which has worked on the hand-authored fixtures.) - CS0117 (no static member): walk static members of the named type; same fuzzy match.
- CS1503 (argument type mismatch): special-case
Element-expected vs. string-supplied → suggestCaption/Heading/Body.Actionvs.Action<T>→ surface lambda-shape mismatch. - CS7036 (no overload takes N args): rank overloads by Hamming distance on the parameter-shape vector; suggest the closest overload's named-argument form. (Implemented by parameter-count distance; full Hamming over (kind, type)-vector deferred until Data Checkpoint B shows a case where shape-matters beyond arity.)
- One unit test per code path, both positive and negative.
- In
CheckCommand.Run, after parsing eachDiag, if itscodematches CS1061 / CS0103 / CS0117 / CS1503 / CS7036 AND the diagnostic touches aMicrosoft.UI.Reactor.*symbol, runSymbolSuggester.Suggest. - If the suggester returns a non-null
TextwithConfidence ≥ T, append→ try: <text> // [<evidence>]to the diagnostic line. - Existing analyzer-ID hint table (
HintFor) still wins ties (spec §9). - Integration test:
tests/Reactor.IntegrationTests/MurCheck/CS1061ButtonOnClickTest.cs— fixture project with the canonicalButton(...).OnClick(x)mistake; assertmur check ./fixtureexits 1, stdout contains exactly the expected suggestion line including evidence. — Deferred. Needs a fixture project that references Reactor (WindowsAppSDK restore on every test run) — heavy, scoped as a follow-up. The orchestrator unit tests cover the same logic against an in-memory compilation that uses the real Reactor stub shape. - Integration test: when Tier 2 has no high-confidence suggestion, the original diagnostic line is unchanged. (Covered by
MurCheckSmokeTest.csend-to-end againstFixtures/SmokeFixture/— non-Reactor receiver, no→ try:suffix attached.)
- Total
mur checkwall time on the fixture project stays within 1.2× the underlyingdotnet build. Capture in a perf-trait test. — Deferred until the Reactor-touching integration fixture lands; the underlyingdotnet buildtime on a WinUI fixture is not yet measured in CI. - Telemetry hook at
(diagnostic_emitted, suggester_name, confidence)— local-only JSONL append at~/.mur/telemetry/<yyyy-mm-dd>.jsonl. Opt-in via env varMUR_TELEMETRY=1. - Telemetry payload is reviewed against the source-code-leak rules from the conventions header. Add a unit test that asserts the payload contains no field whose value is longer than 256 bytes.
- All Phase 1 tasks above checked. (The integration test in 1.6 and the perf-trait test in 1.7 are explicitly deferred follow-ups; everything code-side is implemented and unit-tested.)
- Data Checkpoint B landed; thresholds calibrated.
Thresholds.cswritten;SymbolSuggesterreads per-code T viaThresholds.For(code)(gate consolidated to a single source of truth inSuggest, redundant duplicate cut removed from the orchestrator). Tuning harness lives undertests/Reactor.Tests/CheckCommandTests/Tuning/; report snapshot indocs/specs/tasks/038-tuning-reports/2026-05-10-50run.md. The 50-run corpus is small enough that the per-code values are intentionally conservative; revisit at Data Checkpoint C (500+ pairs). - Run Eval Checkpoint EC1 vs.
main. 5×N landed 2026-05-10. Results: kanban PASS (−24% cost mean, −33% median, 3.4× lower variance); calc FAIL (+21% cost mean — per-invocation overhead does not amortize on ~150-LoC problems). Firings ≥ 1 per kanban run ✓; first-build OK 5/5 both arms ✓. Strict spec criterion ("tokens not regressed") fails on calc. Detailed results under Eval Checkpoints → "EC1 results" above. - Decision recorded: approach (b) — diagnostic-count gate. Implementation in
CheckCommand.ShouldEmitSuggestions; flag--suggest-threshold <N>(default 3, 0 = off). Unit tests inCheckCommandPipelineTests.Gate_*+CheckArgsTests.Suggest_threshold_*. - EC1 re-run with the gate on — 5×N landed 2026-05-10. Calc neutralized (cost −4% mean, tokens parity); kanban win preserved (cost −33% mean, −39% median; 4/5 runs fired Tier-2). First-build OK 5/5 both variant arms. Both accept-bar criteria met; Phase 1 cleared to merge to
main. Detailed results under Eval Checkpoints → "EC1 re-run (with gate) — 5×N landed 2026-05-10".
Goal: ship the -- passthrough, the --strict / --final / --quiet mode flags, and the hand-authored base_policy(code) table from spec §8. Suppression is the single biggest token-saver in the spec; it's deterministic, so it doesn't wait on Data Checkpoint C.
Blocks on: Phase 1 merge to main.
- Add
src/Reactor.Cli/Check/ArgsParser.cs. Split input args on the first bare--; left half parsed againstmur check's flag grammar; right half forwarded verbatim. - Default-merging:
murinjects--nologo,-v:m,-p:Platform={host arch}only if the user did not specify the same flag in the passthrough. Detection by flag name, not value. (Matches-,--, and/prefixes — MSBuild accepts all three.) - Unknown
murflags before--produce a clear error message (do not silently forward). Error message includes a hint about the--separator to catch the canonical typo case (mur check --quie -- -c Release). - When
--traceis on, record the full effectivedotnet buildcommand line in trace output (per spec §8 last paragraph). Written as akind: "command"header row at the head of the trace. - Unit test matrix per spec §8 examples (host arch override, release config + no-restore, verbosity, TFM-via-passthrough, multiple properties, with non-default path). Lives in
tests/Reactor.Tests/CheckCommandTests/ArgsParserTests.cs. - Integration test:
mur check -- -p:Platform=x64overrides host-arch default; effective command line correct. — Deferred. Same blocker as Phase 1 §1.6: needs a fixture project that references Reactor (WindowsAppSDK restore). The unit testPassthrough_platform_suppresses_default_injection_by_flag_nameasserts the same invariant against the parser; the full process-spawning integration test lands when the Reactor-touching fixture lands.
- Add
--strict/--final/--quiettoArgsParser. Each maps to aMode { Iteration, Strict, Final, Quiet }enum. - Add
--emit-threshold <float>to override the ranker threshold (default 0.6 in iteration mode, 0.0 in final mode). Validates float in[0.0, 1.0]. - All flags appear in
--helpwith one-line descriptions. - Unit test: every mode round-trips through
ArgsParser.
- Create
src/Reactor.Cli/Check/Ranker/PolicyTable.cswith the score table from spec §8 (CS errors 1.0/1.0; REACTOR_* Warning 0.9/1.0; REACTOR_* Info 0.2/1.0; etc.). Implementation uses a universal-error floor (any Error severity scores 1.0 regardless of code) so the table can't accidentally hide a real build break. - Create
src/Reactor.Cli/Check/Ranker/Ranker.cs. Public method:double Score(in Diag d, in RankerContext ctx). Phase 2 is the PolicyTable lookup; severity_weight is encoded in the per-severity rows, the remaining formula terms (location_weight, recency_weight, accept_history) wait for Phase 4 signals. - Pre-emit gate: in
CheckCommand, after parsing, runRanker.ShouldEmitper diagnostic and drop any whose score is below the active threshold. Trace is unaffected — every parsed diagnostic still hits the trace file so the suppressed-then-resurfaced telemetry hook (spec §8 failure-mode mitigation) can mine the full stream. - Unit test: in iteration mode,
CS1591(XML doc) is suppressed;CS1061is not. - Unit test: in
--finalmode, both are emitted. - Unit test:
--strictpromotes warnings to errors (composes with-p:TreatWarningsAsErrors=truefrom passthrough; more aggressive wins per spec §8). Verified byStrict_promotes_reactor_warning_to_error_and_emits. - Unit test:
--quietemits only severityErows.
- Add an offline tool at
tools/Reactor.MurCheckGuardrail/Program.csthat reads two trace files (one frommur checkiteration, one frommur check --final) and asserts: every code that fired in--finaland is in the policy table's iteration-suppression list was not an error in--final. (If suppressed diagnostic codes start surfacing as errors in the final pass, the policy table is wrong and CI fails.) The tool re-uses PolicyTable directly (viaInternalsVisibleToonReactor.Cli) so the audit and runtime can never drift. Eight unit tests inGuardrailRunnerTests.cs. Also emits an advisory (non-failing) when a code is suppressed as a Warning in iteration but surfaces as an Error in--final(the-warnaserrorupgrade case). - Wire into CI: every PR that touches
PolicyTable.csruns the guardrail against a fixed set of fixture projects. — Deferred, blocked on fixture infrastructure. The "fixed set of fixture projects" needs Reactor-touching .csprojs that compile throughdotnet buildend-to-end (same WindowsAppSDK restore blocker as Phase-1 §1.6 deferred). When that lands, add apolicy-table-guardrailjob to.github/workflows/ci.ymlgated onpaths: [src/Reactor.Cli/Check/Ranker/PolicyTable.cs, tools/Reactor.MurCheckGuardrail/**].
- Update
plugins/reactor/skills/reactor-build-and-check/SKILL.mdto direct agents to runmur check(iteration) inside the loop andmur check --finalonce iteration is clean. The transition is the explicit "I am done iterating" signal. - Update the eval prompt in the agent-eval harness (lives outside this repo; coordinate with #226 owners). — External coordination required; flag in PR description.
- All Phase 2 tasks above checked (with documented deferred integration follow-ups behind the WindowsAppSDK-fixture blocker).
- Run Eval Checkpoint EC2 vs.
mainat start of Phase 2. PASS by median — calc-mur clean win on every metric; kanban-mur at cost median parity. First-build 5/5 both arms. Criterion-2 guardrail audit deferred to harness retrofit (post-run analysis pass against final workspace state). Results recorded under "EC2 results — 5×N landed 2026-05-11" above. - Merge to
main.
Goal: ship a working ruleset across two rule classes (spec §6). Cadence: ~3 rules per PR; ship in batches until ~10–15 rules are live.
Blocks on: Data Checkpoint C, Phase 2 merge to main. Every PR also blocks on the Human Validation Gate at the top of this document.
Two rule classes (spec §6 split):
- Class A — induced. Sourced from
patterns.jsonclusters. Justification bar: clusterfrequency ≥ 0.05ANDcount ≥ 10AND cross-agent reproducibility. - Class B — vocabulary-translation. Deliberately authored from a curated WPF / Silverlight / WinUI 1 / WinUI 2 / WinUI 3 → Reactor name table. Frequency bar is waived (spec §6); the justification is structural — the prior-framework citation in public docs — not corpus-empirical. Cross-agent / fixtures / reviewer bars (#2–5 of the Validation Gate) still apply.
Land before any rule PR opens:
- Name a corpus-pipeline owner. Spec §11 row "Data pipeline … lacks an SLA or named owner": for load-bearing operation, "harness team" is not a sufficient owner. Single point-of-contact, documented in this file once decided.
- Corpus refresh cadence pegged to Reactor minor-version releases. Each Reactor minor cuts a new corpus before Phase-3 rules referencing that minor's APIs can ship.
- Curated
WinUI-to-Reactor.csvtable (spec §14 #9). Single in-repo artifact atdocs/specs/tasks/038-vocab-table.csv; columns:source_framework, source_name, target_reactor_name, source_doc_url, notes. Owned by Reactor team; reviewed independently of rule PRs. Seeded 2026-05-11 with 20 rows covering WPF / Silverlight / WinUI 2 / WinUI 3 → Reactor translations from the 525-run corpus's Phase-3 priority targets plus desk research againstskills/reactor.api.txt.
- Create
src/Reactor.Cli/Check/Rules/IRulePattern.cs. Defineinterface IRulePattern { string Name { get; } string Provenance { get; } IReadOnlyList<string> DiagnosticCodes { get; } IReadOnlyList<string> DeclaredTargets { get; } RuleSuggestion TryMatch(in RuleContext ctx); }.Provenancecarries"cluster:<id>"for Class A or"vocab:<framework>"for Class B.DiagnosticCodeslets the orchestrator skip TryMatch for unrelated codes;DeclaredTargetspowers the §3.1a CI gate. - Create
record RuleContext(SyntaxNode Node, Diagnostic Diagnostic, ITypeSymbol? Receiver, SemanticModel SemanticModel, CSharpCompilation Compilation, RuleSymbolResolver Resolver). Compilation + Resolver are bundled so rules never re-discover the per-compilation cache. - Create
record RuleSuggestion(string? Text, double Confidence, string Evidence).Silentstatic +HasMatchmirrorSuggestionResultexactly for consistency. -
RuleRegistrydiscovers rules by reflection on assembly load (restricted to the Reactor.Cli assembly — first-party only).Defaultsingleton +Of(IEnumerable<IRulePattern>)test seam.BestMatchreturns the highest-confidence match across enabled rules; rules whoseDeclaredTargetsfail to resolve self-skip viaonSelfDisabledcallback. DuplicateNames throw at registry construction. - CLI:
mur check --disable-rule <Name>(repeatable) and--list-rulesround-trip throughArgsParser, surfaced in--help.--list-rulesshort-circuits beforedotnet buildruns, prints a name/provenance/status table, and exits 0. Unknown--disable-rule <Name>references warn (do not error) so a typo doesn't fail a build. Phase-4 telemetry will add accept-rate to the listing later. - Unit tests:
RuleContractTests.cs(shape, Silent vs HasMatch),RuleRegistryTests.cs(order, dedup, BestMatch, disable, self-disable, throwing-rule-isolation, Statuses, lazy singleton),RuleSymbolResolverTests.cs(resolve hit/miss, per-compilation cache identity, method/member resolution),SuggesterOrchestratorRuleTests.cs(rule matches codes outside Tier-2 SupportedCodes, rule wins over Tier-2 when both match, --disable-rule yields to Tier-2, evidence carries provenance),ArgsParserTests.cs(six new tests covering both flags including the flag-shaped rejection of--disable-rule --quiet).
Every rule binds target types and members through Roslyn ISymbol references resolved against the live Compilation, not by string-matching MemberAccessExpressionSyntax.Name.ValueText.
-
RuleSymbolResolverexposesResolveType(string),ResolveMethod(INamedTypeSymbol, string), andResolveMember(INamedTypeSymbol, string). Cached viaConditionalWeakTable<CSharpCompilation, _>— distinct compilations get distinct resolvers; the same compilation reference always returns the same resolver (test-locked). - When a rule's
DeclaredTargetscannot resolve, the registry short-circuits and reports the rule via theonSelfDisabled(name, firstUnresolved)callback. The rule appears asself-disabled (unresolved: <target>)in--list-rules. Trace-channel structured warning hook landed 2026-05-11:TraceWriter.WriteRuleSelfDisabled(rule, target)emits{kind: "rule_self_disabled", rule, unresolved_target, mode};SuggesterOrchestratorthreadsonRuleSelfDisabledthrough toRuleRegistry.BestMatch;CheckCommand.Runwires it to the active trace writer (dedup'd per-invocation per-rule). Stdout stays clean. - CI gate: rule-target resolution test.
tests/Reactor.Tests/CheckCommandTests/Rules/RuleTargetResolutionTests.csinstantiates every registered rule against a live ReactorCompilation(full assembly references — the inverse ofTestCompilation.Create) and asserts each declared target resolves. Passes vacuously today (zero rules); becomes the load-bearing gate the moment the first rule lands. - Performance bound. Symbol-resolution adds ≤ 0.5 ms median per rule per diagnostic. Captured in the perf-trait suite as
RulePerformanceTests.BestMatch_median_under_per_rule_budget(tests/Reactor.Tests/CheckCommandTests/Rules/RulePerformanceTests.cs,[Trait("Category","Perf")]). AssertsRuleRegistry.Default.BestMatchmedian across 1000 iters on the canonical CS1061-on-ButtonElement.OnClickfixture stays under0.5 × rule_count × 4ms (4× CI slack matchesCompilationLoaderTestsconvention). The 4× slack is the noise budget for Windows Defender / shared-runner CPU contention / GC tail latency; tighten when the first time-budget regression lands. Local run: passes well under budget with 3 rules.
For each rule in a batch, the author must complete all six bars of the Human Validation Gate before merge — with one explicit Class-B carve-out on bar #1 (frequency).
- Author
src/Reactor.Cli/Check/Rules/<Name>Rule.cs. - Set
Provenance = "cluster:<cluster_id>". - Bind all target types/methods through
RuleSymbolResolver(no string matching). - Author
tests/Reactor.Tests/CheckCommandTests/Rules/<Name>RuleTests.cswith ≥ 3 positive fixtures drawn fromfixes.jsonl, each from a differentrun_id. Each fixture references the sourcerun_idin a comment. - Author ≥ 2 negative fixtures in the same test file.
- PR description includes the fixed-format Validation Gate comment template, filled in.
- Confirm cluster has
frequency ≥ 0.05ANDcount ≥ 10AND reproduces across ≥ 2 agents (citepatterns.jsonrow). - Confirm corpus age: rule may not merge against a corpus older than the latest Reactor minor release (3.0-prerequisite from §3.0).
- Reviewer (not the author) leaves PR comment using the template.
- After merge, log
Name,Provenance,count, merge-date indocs/specs/tasks/038-rule-history.md.
- Add the row to
docs/specs/tasks/038-vocab-table.csv(or confirm it already exists). One vocab-table row per rule. - Author
src/Reactor.Cli/Check/Rules/<Name>Rule.cs. - Set
Provenance = "vocab:<framework>"(e.g.vocab:WinUI3). - Bind all target types/methods through
RuleSymbolResolver. - Author
tests/Reactor.Tests/CheckCommandTests/Rules/<Name>RuleTests.cswith ≥ 3 positive fixtures. Fixtures may be hand-authored from the vocab-table row (Class B does not requirefixes.jsonlprovenance) but are tagged[Trait("Origin", "VocabHandAuthored")]for audit. - Author ≥ 2 negative fixtures in the same test file. Negative fixtures specifically include "same diagnostic on a non-Reactor receiver" and "same name on the same receiver in a context where the translation does NOT apply" (e.g. a property access, not a method call).
- PR description includes the Validation Gate comment template, with bar #1 (frequency) marked "waived — Class B" and bar #2 (cross-agent reproducibility) demonstrated either via the corpus OR via citation of the source-framework docs the prior name lives in.
- Reviewer (not the author) leaves PR comment using the template.
- After merge, log
Name,Provenance,source_doc_url, merge-date indocs/specs/tasks/038-rule-history.md.
- Before EC2 (already passed in Phase 2): 0 rules required.
- Before EC3: 5 high-confidence rules covering ≥ 50 % of fix events by frequency (Class A coverage) plus ≥ 1 Class-B rule. Below this bar, EC3 is delayed until the bar is hit.
- V1 ship: 10–15 rules covering ≥ 80 % of fix events. Mix is at-author's-judgement but expect ~40 % Class A / 60 % Class B given the 525-run corpus's vocabulary-confusion signal. Past 15, returns diminish; remaining clusters move to Tier 4 (Phase 4).
- At least 10 rules merged across both classes.
- Coverage check: cumulative
countof all merged Class-A rules' seed clusters ≥ 0.80 offixes.jsonlrow count or Class-B rules cover ≥ 80 % of the documented prior-framework vocabulary table (whichever target the team picked). - No rule has accept-rate < 50 % over the last 200 invocations (auto-suppression has not had to fire on any merged rule).
- No rule is self-disabled due to unresolved target (the CI gate from §3.1a fails the build otherwise — this is a belt-and-suspenders assertion at exit).
- Run Eval Checkpoint EC3 vs.
mainat start of Phase 3. Pass criterion: cumulative ~−14 % tokens vs. start-of-spec, ~−2 turns, ~−$0.70 (per spec §12); CV ≤ start-of-spec CV. - Merge to
main. V1 of spec 038 is shipped.
Status change vs. earlier draft: was "optional, only if needed." Promoted to scheduled, deferred until Data Checkpoint D delivers ≥ 1K negative-class ranker rows (spec §13 Phase 5 update + §1 load-bearing argument). It is not a maybe; it is the work we open when the data is ready. The deterministic floor (Tiers 1–3 + the Phase-2 policy table) carries the experimental phase; the learned ranker is what we ship once the corpus delivers training volume.
The escape hatch — a documented decision to ship Phase 4 with the deterministic table only — remains, but is the unexpected outcome and requires its own decision artifact.
Blocks on: Data Checkpoint D, EC3 merge.
- Local-first telemetry collector: read JSONL from
~/.mur/telemetry/; emit aggregated per-code, per-rule accept rates; opt-in upload to a team-internal endpoint (TBD; coordinate with the Reactor team's existing telemetry policy). - Per-rule auto-suppression hook: rules with accept-rate < 50 % over the last 200 invocations are disabled at runtime; a follow-up issue is auto-filed with the rule's exemplar pairs and the agent edits that diverged from the suggestion.
- Implement training pipeline in
tools/Reactor.RankerTraining/(offline). Inputs:ranker-labels.jsonlfrom Data Checkpoint D. Output: ONNX model under 100 KB. - Features per spec §8: diagnostic code, severity, file category, turn index, prior-emit-and-ignored flag, file churn rate.
- Calibrate via isotonic regression on a held-out fold.
- Inference path:
src/Reactor.Cli/Check/Ranker/LearnedRanker.csloads the ONNX model from a NuGet'dMicrosoft.UI.Reactor.MurCheckModelpackage; falls back to deterministic policy table on load failure. - Per-diagnostic budget: ≤ 5 ms median.
- If EC3 telemetry shows Tier 2 + 3 still leave a meaningful tail of suggestion misses, train a small GBDT confidence ranker over hand-engineered features (Levenshtein, param-name overlap, factory-popularity, AST-shape similarity).
- Wire as Tier 4 in
CheckCommand: only consulted when Tier 2 + 3 produce conflicting candidates; output is a re-ranked candidate list with calibrated confidence head.
- Run Eval Checkpoint EC4. Pass criterion: ≥ 5 pp lift in iteration-mode emission precision vs. EC3, OR formal documented decision to ship Phase 4 with the deterministic table only.
- If shipped: merge. If not: document the decision in spec 038 §13 and close.
Three test tiers, mirroring spec 020's pattern:
| Tier | Project | What it covers | Speed |
|---|---|---|---|
| Unit — pure | tests/Reactor.Tests/CheckCommandTests/ |
Suggester logic, rule logic, ranker scoring math, args parser. Fakes for Compilation where possible. |
< 5 ms |
| Unit — Roslyn | tests/Reactor.Tests/CheckCommandTests/ ([Trait("Category","Roslyn")]) |
Suggesters / rules driven through a real CSharpCompilation constructed in-memory. No file system, no dotnet build. |
~20–100 ms |
| Integration | tests/Reactor.IntegrationTests/MurCheck/ |
Real mur check invocations against fixture projects under tests/Reactor.IntegrationTests/MurCheck/Fixtures/. Each fixture is a tiny broken Reactor app. Shells out to dotnet build. |
~1–5 s per test |
| Perf | Same as Integration, [Trait("Category","Perf")] |
Cold/warm Compilation load; total mur check overhead vs. dotnet build. Excluded from default test runs; opt-in via filter. |
varies |
Conventions:
- Every suggester / rule has at least one positive and one negative test in the Unit — Roslyn tier.
- Every CLI flag has at least one Unit — pure test (against
ArgsParser) and one Integration test (full invocation). - Tier-3 rules' fixture tests must cite their source
run_idfromfixes.jsonlin a comment so a future maintainer can trace the rule back to the data that motivated it.
- Source code never leaves the user's machine. Telemetry payloads are reviewed against this rule on every PR that touches the telemetry module.
- Trace files are opt-in (
--trace <path>). No implicit telemetry. No background uploads. Compilationreferences are resolved via the project's ownobj/project.assets.json. We do not download arbitrary packages and we do not honornuget.config<add>entries that point outside the user's existing trust set.- Symlink handling in
CompilationLoader: symlinks pointing outside the project root are followed but logged at trace level; we do not panic, but we also don't include the file in the Compilation. Test fixture covers this. - No code execution from rules.
IRulePattern.TryMatchis restricted to read-only Roslyn syntax / semantic model APIs. CodeReview lints any rule that callsProcess.Start, file I/O, or network. - Diagnostic message text can contain user code fragments. The ranker uses message text but never logs it to telemetry (codes only).
Captured as [Trait("Category","Perf")] integration tests; CI runs nightly, breaks on regression > 10 % from the recorded baseline.
| Surface | Cold | Warm | Notes |
|---|---|---|---|
CompilationLoader.Load |
≤ 500 ms | ≤ 50 ms | per project. Cache key: (absolute-csproj-path, sorted-file-mtime-hash). |
SymbolSuggester.Suggest per diagnostic |
≤ 10 ms median | — | stateless. |
IRulePattern.TryMatch per rule per diagnostic |
≤ 2 ms median | — | stateless; ≤ 30 rules → ≤ 60 ms aggregate. |
Ranker.Score per diagnostic |
≤ 1 ms (deterministic) ≤ 5 ms (learned) | — | hot path; called for every diagnostic. |
mur check total wall vs. underlying dotnet build |
≤ 1.2× | ≤ 1.1× warm | Spec §2. |
mur check output is developer-facing tooling, not user-visible UI. Output is en-US; localization is not in scope. (Same convention as dotnet build.)
This section captures the operational responsibilities a load-bearing system inherits beyond the per-phase work above. Each item is a recurring obligation, not a one-time task.
API-churn protocol (per Reactor minor release):
- Run the rule-target resolution CI gate (§3.1a) against the new Reactor
Compilation. Any rule whose target evaporates fails the build. - For each failed rule, the owner either (a) updates the rule's target binding via
RuleSymbolResolver, or (b) retires the rule. No silent string-swap. - Re-run the threshold-tuning harness against the latest corpus restricted to the new Reactor minor before re-baselining any threshold. The full historical corpus stays archived but is not used for re-tuning across an API break.
- Cut a new corpus drop on the new Reactor minor (Data Checkpoint refresh; same audit checklist as Checkpoints A–D).
- Log the churn-handling pass in
docs/specs/tasks/038-rule-history.mdwith the Reactor version and the set of rules touched.
Corpus freshness:
- A rule may not merge against a corpus older than the latest Reactor minor release. Enforced by the PR template; reviewer checks the corpus timestamp.
- Stale rows in archived corpora (referencing retired APIs) are marked
historical: truerather than deleted — they remain valid training signal for the kind of mistake and may be useful for future cross-version ablations.
Per-rule accept-rate monitoring (post-Phase-4):
- Auto-suppression fires when accept-rate drops below 50 % over the last 200 invocations (spec §11 risk row; Phase 4 telemetry hook).
- A suppressed rule files a follow-up issue automatically; the rule's author or current owner triages within one sprint.
- Re-enabling a previously-suppressed rule requires a new PR that explains the regression, with the same six-bar Validation Gate.
Sunset readiness check (annual):
- Spec §13 defines two sunset conditions: ≥ 12 months Reactor API stability AND ≥ 90 % first-build-OK on a held-out Reactor eval across ≥ 2 vendor-distinct models without
mur checkassistance. - Run the readiness check yearly. When both conditions hold, file the sunset issue, freeze rule additions, and plan graceful removal. The trace/
--finalmode survives; only Tiers 2–4 sunset.
- Data Checkpoint B / C / D ETAs are unknown. The harness team is on a separate cadence. If B slips, Phase 1 ships with a guessed T and we tune in a follow-up PR (low risk). If C slips, Phase 3 cannot start — communicate explicitly to stakeholders. If D slips, Phase 4 simply doesn't happen this cycle.
- A bad Tier 3 rule landing. Mitigation: the Validation Gate, the auto-suppression telemetry, and the post-merge tracking in
docs/specs/tasks/038-rule-history.md. - Tier 2 false positives at threshold edges. Mitigation: per-code thresholds (not one global T), tuned against Data Checkpoint B before Phase 1 ships.
- Performance regression from learned ranker (Phase 4). Mitigation: deterministic table is the floor; learned ranker is a re-rank on top. Fall back to deterministic if learned model load fails.
- Coordinated change with eval prompt (#226 ownership) for Phase 2's
--finalworkflow. Risk: skill / prompt land out of phase with code. Mitigation: ship them in the same PR cycle, with the skill change explicitly listed in the Phase 2 exit criterion.
- Spec 038:
docs/specs/038-mur-check-did-you-mean-design.md - Companion spec 037:
docs/specs/037-eval-trace-mining-design.md - Existing
CheckCommand:src/Reactor.Cli/Check/CheckCommand.cs - Reactor analyzers (12
REACTOR_*IDs):src/Reactor.Analyzers/ - Roslyn version:
Microsoft.CodeAnalysis.CSharp4.8.0 insrc/Reactor.Cli/Reactor.Cli.csproj - Sample apps (validation corpus):
samples/apps/ - Originating issues: #226 §5, #227, #228 (specs PR)
- Harness review-feedback:
C:\temp\eval-trace-mining-followups.md(sent to harness owner)