swe-textbook: relabel completeness-grep + golden-test behavior gate

lukstafi · claude · lukstafi · commit 8de8b2395021 · 2026-06-05T16:23:45.000+02:00
From task-3170a13a retrospective (order reversal, PR ocannl-staging#27): two
captures — (1) a relabel's completeness check is a repo-wide negative grep
minus an allowlist (case-insensitive, plural-aware, exclude historical
records), not a positive sweep of planned files; (2) a label-printing golden
test staying all-PASS is the direct decision-set-unchanged gate. Via
/ludics-process-suggestions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/docs/swe-textbook.md b/docs/swe-textbook.md
@@ -681,3 +681,19 @@ Description: When an acceptance criterion names two places a value must appear 
 Precipitating retro: `task-35e74651` (PR #557, 2026-06-05). AC4 named both the briefing-lag stale-sentinel note and the health-check `outbound-staging-ff-stale:<project>` finding as carriers of the outbound-push cause+remedy annotation. The briefing-lag note had a clean `src/` seam (`src/briefing-lag.ts` reading `latestOutboundCauseRemedy`); the health-check finding is computed inline in `skills/ludics-health-check.md` bash with no `src/` read-boundary, and the AC forbade adding skill prose. The coder implemented the briefing-lag arm, left the skill markdown byte-unchanged, cited the proposal's own flagged-ambiguity lines (~209–223) authorizing the narrower scope, and the deferred surface was filed as follow-up `task-c4aedd6b`.
 
 Filter decision: Under the competent-SWE filter this is "obvious-to-experienced-engineer" scope discipline — knowing not to fabricate plumbing or violate a stated constraint to satisfy an AC's letter is general engineering judgment. Captured here rather than promoted to always-loaded prompts because the failure mode is shape-specific (a multi-surface AC where the surfaces have asymmetric implementability) and the always-loaded `feedback_self_contradicting_ac_revise_not_fabricate.md` already covers the broader "don't fabricate to satisfy a contradictory AC" family; the textbook entry preserves the specific "implement the seam, cite the proposal's authorization for the gap, file the rest as follow-up" resolution as a precedent for coders facing partially-implementable ACs.
+
+### A relabel's completeness check is a repo-wide negative grep minus an allowlist, not a positive sweep of the planned files
+
+Description: When a change relabels a vocabulary across a codebase (a rename, a terminology flip, an order reversal), the question "did I get all of it?" is not answered by grepping the files you planned to touch and finding them clean. "Clean over the swept files" is not "no stale vocabulary remains": the stale term can hide in files outside the plan (a `.mli` interface, an `AGENTS.md`, a paper `.tex`, a constructor name in a module you didn't list). The reliable completeness check inverts the method: run the term-list grep **repo-wide**, then **subtract an explicit allowlist** of known-OK hits, and require the remainder to be empty. Three refinements the naive grep misses: (a) make it **case-insensitive**: the term appears capitalized in prose and in type constructors (`Cur`/`Subr`, `Least Upper Bounds`), not just lowercased in code; (b) make it **plural- and morphology-aware** (`LUBs?`, not `LUB`): the plural form slips past a `\bLUB\b` anchor; (c) the allowlist is not just false-positive substrings (`cur_sh`, `occur`) but also **intentionally-correct-post-change strings**: under a structural transform like an order reversal, a surface phrase can read like stale vocab yet be genuinely correct in its new referent (after reversing the lattice, "any two dims have a least upper bound" is a true statement about the join-semilattice's joins, distinct from the maintained solver bound that is now the GLB). Distinguishing "stale" from "intentional dual" is part of building the allowlist, and is a judgment call the bare grep cannot make. A companion scope rule: **exclude historical-record files** from the sweep. Changelogs (`CHANGES.md`), decision logs, and superseded proposals legitimately preserve the old vocabulary because they describe what was true at a past version; rewriting them to the new terms falsifies the record. Current guidance and exposition (`AGENTS.md`, `docs/*.md`, `*.tex`) get reoriented; historical records do not.
+
+Precipitating retro: `task-3170a13a` (PR ocannl-staging#27, 2026-06-05) — the broadcast-order reversal (LUB→GLB, top↔bottom). The coder's round-1 checklist (`rg '\bLUB\b|least upper|\bcur\b|\bsubr\b'`, case-sensitive, over the planned files) reported clean but silently missed `Least Upper Bounds (LUBs)` in `shape.mli`, the capitalized `Cur`/`Subr` constructors in `row.ml`, and `LUB` in `AGENTS.md`. The round-2 fix re-ran `rg -in 'broadcast.{0,4}bottom|\bLUBs?\b|\blub\b|least upper bound|\b[Ss]ubr\b|\b[Cc]ur\b|…'` repo-wide minus an allowlist (`cur_sh`/`current`/`occur*`, the intentional join-semilattice duals, traversal `bottom-up`), excluding `docs/proposals/*` and `CHANGES.md` as historical records, and required zero remaining hits.
+
+Filter decision: Under the competent-SWE filter this is "obvious-to-experienced-engineer" sweep thoroughness — a competent engineer knows a grep can be too narrow. Captured here rather than promoted to always-loaded prompts because the failure mode is recognition-shaped and method-specific (the inversion to negative-grep-minus-allowlist, the case/plural sensitivity, the intentional-dual allowlist, the historical-record exclusion) and pairs with the existing "Config-key removal sweep covers all of `templates/`" and "a guard added to a shared validator reaches every transitive caller" entries — all three are "the sweep is wider than first enumerated", this one specialised for vocabulary relabels under a structural transform.
+
+### For a behavior-preserving relabel, a label-printing golden test that stays all-PASS is the direct decision-set-unchanged gate
+
+Description: When a change is supposed to preserve behavior while renaming things (a pure relabel, an order-reversal, a terminology flip), the cheapest and most direct proof that no decision actually moved is an existing **golden/`%expect`/cram test that prints the system's own accept/reject (or classify, or dispatch) labels** for a battery of cases. If that test's expected output stays byte-identical except for the deliberately-changed wording (every case still reads `PASS`, or the same branch label), then the accept/reject *set* is unchanged over every case the battery covers. Provided the battery exercises the boundary cases, a flipped comparison or reversed branch would move at least one case across the boundary and show up as a `PASS→FAIL` (or relabeled-decision) diff. This is stronger and cheaper than re-deriving invariance from the diff of the production code: you don't have to argue that no `<=` became `>=`; the golden test would have caught it. The discipline at relabel time: identify the test that enumerates the decision surface in printed form, freeze it, and treat *any* non-wording diff in it as the stop-and-flag signal that the relabel changed behavior. If no such test exists, writing one that prints the labels for the boundary cases is the highest-leverage safety net before starting the rename.
+
+Precipitating retro: `task-3170a13a` (PR ocannl-staging#27, 2026-06-05). The order reversal flipped the `⊑` operands and renamed `meet_dim→join_dim`. `test/einsum/test_basis_total_order.expected` prints `PASS`/`FAIL` for a battery of `d1 ⊑ d2` accept/reject cases (`5_rgb ⊑ 1_(bcast_if_1) accepts`, `5_rgb ⊑ 5_(bcast_if_1) rejects`, …); it stayed all-`PASS` after the relabel, directly proving the accept/reject set didn't move, while a `solve_dim_ineq` comparison reversal would have flipped at least one case. The only `.expected` diffs were wording/glyph-order promotes.
+
+Filter decision: Under the competent-SWE filter this is "obvious-to-experienced-engineer" test-strategy literacy — using a golden test as a behavior-invariance gate is general regression-testing sense. Captured here rather than promoted to always-loaded prompts because it is recognition-shaped (the cue is "I'm doing a behavior-preserving relabel — which existing test prints the decisions I must not move?") and complements the always-loaded post-edit-grep discipline with a test-side gate; pairs with the vacuous-assertion entries (a label-printing golden test is the *non-vacuous* counterpart — it fails under exactly the mutation a relabel risks).