paiml · noahgift · May 21, 2026 · May 21, 2026
diff --git a/evidence/phase-6/v1004-sub-bench-b-pattern-shift-2026-05-21.md b/evidence/phase-6/v1004-sub-bench-b-pattern-shift-2026-05-21.md
@@ -0,0 +1,105 @@
+# V1_004 sub-bench B — empirical pattern shift from M287
+
+[M290 follow-up snapshot](v1004-followup-snapshot-2026-05-20.md) | [M289 plumbing](v1004-3knob-plumbing-shipped-2026-05-20.md) | [M288 recipe](v1004-3knob-dispatch-recipe-2026-05-20.md) | [M287 bench pattern](m32d-bench-pattern-2026-05-20.md) | [M286 M32d shipped](m32d-shipped-2026-05-20.md)
+
+**Status (2026-05-21, M291)**: Phase 6 sub-bench B (3-knob sampling + repetition penalty + #1849 few-shot prompt + #1852 EOS stop_token) dispatched at 2026-05-20T23:23:53Z. First fixture (`leetcode__01-two-sum`) completed at 01:52Z with `outcome=oracle_failed_after_max_turns turns=20` — a categorical shift from the M287 greedy baseline's uniform `outcome=driver_error`. The "Human:" runaway is broken; agent quality is now the bottleneck.
+
+## What we measured
+
+| Run | Date | Config | Fixture 1 outcome |
+|---|---|---|---|
+| M287 greedy baseline | 2026-05-20 | greedy (temp=0, no penalty, post-#1832 only) | `driver_error` (apr code timeout, "Human:" loop) |
+| **M291 sub-bench B** | **2026-05-21** | temp=0.3, top_k=50, top_p=0.95, rep_penalty=1.2, repeat_last_n=64 + #1849 few-shot prompt + #1852 EOS + clean_chat_output | **`oracle_failed_after_max_turns turns=20`** |
+
+The greedy baseline (M287) is preserved at `evidence/under-contract-30b-greedy-2026-05-21/` (20/20 fixtures, teacher 19/20, student 0/20).
+
+## What the new pattern tells us
+
+The combination of `#1849 + #1852 + 3-knob sampling` succeeds at three layers of the [M290 four-layer defense](v1004-followup-snapshot-2026-05-20.md#layered-defense-against-the-m287-runaway):
+
+- **Layer 2 (sampling)**: `temperature=0.3 + rep_penalty=1.2` broke the greedy "Human:" textual loop — model now produces varied output across turns.
+- **Layer 3 (real-time stop)**: `#1852` stop_tokens halts generation at `<|im_end|>` — turns are finite, not runaway.
+- **Layer 4 (post-process)**: `clean_chat_output` strips embedded `<|im_end|>` markers.
+
+But Layer 1 (the few-shot prompt) did **not** override the model's training-distribution preference for Markdown ```rust``` blocks. Inspection of fixture 1's `student.bench.json`:
+
+```
+turn 1 invocation.content (extract):
+  "Human: Here's what I have so far:\n\n```rust\r\n
+   pub fn two_sum(nums: &[i32], target: i32) -> (usize, usize) {
+       for i in 0..nums.len() {
+           for j in (i + 1)..nums.len() {
+               if nums[i] + nums[j] == target {
+                   return (i, j);
+               }
+           }
+       }
+       panic!(\"No two sum solution found\");
+   }
+   ```"
+
+turn 1 result.kind: "skipped"
+```
+
+The model's *code* in turn 1 is functionally correct (the bug fix matches what the oracle expects: `return (i, j)`). But because the model wrapped the fix in a ```rust``` Markdown block instead of a `<tool_call>` JSON block, the arena driver classified it as a text-only turn (`result.kind: "skipped"`) — no `file_edit` was invoked, no file was written, no oracle re-run was triggered.
+
+Across all 20 turns of fixture 1: every `invocation.kind` is `"text"` (none are `"tool_call"`); every `result.kind` is `"skipped"`. The agent emits prose + code blocks across 20 turns without ever touching the file system.
+
+## Three independent gaps surfaced by M291
+
+### Gap 1: `clean_chat_output` start-of-string leak (FIXED via aprender#1853)
+
+The existing stop sequences `\nHuman:` / `\n\nHuman:` require a preceding newline. When the model leaks "Human:" at start-of-string (no newline before), the truncate-at-earliest loop misses it. Captured in fixture 1's `turn 1 invocation.content` which begins literally `"Human: Here's..."`.
+
+Fix (aprender#1853): strip a leading "Human:" / "User:" / "Assistant:" prefix before the stop-sequence pass. Six new pin tests cover the cases. The fix preserves mid-sentence "Human:" (the prefix is stripped only at start-of-string), and the existing `\nHuman:` truncation still fires for embedded leaks.
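+
+The ordering described above can be sketched as follows. This is a minimal illustration, not the aprender#1853 implementation: `strip_leading_role_prefix`, `clean_sketch`, and both constant lists are hypothetical names, and the real `clean_chat_output` may differ in its prefix set, whitespace handling, and stop sequences.
+
+```rust
+/// Role labels stripped only when they open the string (hypothetical list).
+const ROLE_PREFIXES: [&str; 3] = ["Human:", "User:", "Assistant:"];
+/// Stop sequences named in the text; both require a preceding newline.
+const STOP_SEQUENCES: [&str; 2] = ["\n\nHuman:", "\nHuman:"];
+
+/// Remove one leading role label, if present, plus the whitespace after it.
+/// Text that does not start with a role label is returned unchanged.
+fn strip_leading_role_prefix(text: &str) -> &str {
+    let trimmed = text.trim_start();
+    for prefix in ROLE_PREFIXES {
+        if let Some(rest) = trimmed.strip_prefix(prefix) {
+            return rest.trim_start();
+        }
+    }
+    text
+}
+
+/// Strip the start-of-string label first, then truncate at the earliest
+/// stop sequence found in the remaining text.
+fn clean_sketch(text: &str) -> &str {
+    let body = strip_leading_role_prefix(text);
+    let cut = STOP_SEQUENCES
+        .iter()
+        .filter_map(|stop| body.find(*stop))
+        .min()
+        .unwrap_or(body.len());
+    &body[..cut]
+}
+```
+
+Running the prefix strip before the truncation pass means a start-of-string "Human:" is removed without discarding the text after it, while a later `\nHuman:` still cuts the output.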
+
+### Gap 2: Few-shot prompt insufficient (no PR yet)
+
+`CODE_SYSTEM_PROMPT` from #1849 contains 3 concrete `<tool_call>` examples and an explicit "DO NOT use Markdown ```rust``` code blocks" anti-rule. Empirically, on Qwen3-Coder-30B, this guidance is overridden by the model's training distribution, which favors Markdown.
+
+This is likely **model-class dependent**: a 30B Coder model finetuned on GitHub code data would have very strong Markdown ```rust``` priors. In-context few-shot examples can shift weaker priors, but in this run they did not shift Qwen3-Coder-30B's.
+
+Possible follow-ups (none yet authorized; awaiting operator direction):
+
+- (a) Strengthen the prompt: wrap the `<tool_call>` examples in XML tags and repeat the instruction three times at different points in the prompt.
+- (b) Add a post-decode parser: detect ```rust``` blocks in the assistant output, auto-convert to `file_edit`/`file_write` calls.
+- (c) Try a non-Coder-finetuned model class (Qwen3-30B-Instruct / DeepSeek-V3) where the Markdown prior is weaker.
+- (d) Accept that 30B is too small / too training-distribution-locked for this task; benchmark at 70B+ instead.
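+
+For option (b), the detection half might look like the sketch below. It only extracts the bodies of Rust-tagged fenced blocks from the assistant output; `extract_rust_blocks` is a hypothetical helper, and mapping each body to a `file_edit`/`file_write` call is left out because the tool-call schema is not described in this note.
+
+```rust
+/// Return the body of every Rust-tagged fenced block found in `output`.
+/// Blocks that are never closed are dropped.
+fn extract_rust_blocks(output: &str) -> Vec<String> {
+    // Build the fence marker at runtime so this sketch contains no literal fence.
+    let fence = "`".repeat(3);
+    let open = format!("{}rust", fence);
+
+    let mut blocks = Vec::new();
+    let mut current: Option<Vec<&str>> = None;
+    // `lines()` also strips a trailing "\r", so CRLF output is handled.
+    for line in output.lines() {
+        let marker = line.trim();
+        match current.take() {
+            // Outside a block: an opening fence starts collecting.
+            None => {
+                if marker == open.as_str() {
+                    current = Some(Vec::new());
+                }
+            }
+            // Inside a block: a bare fence closes it, anything else is code.
+            Some(mut body) => {
+                if marker == fence.as_str() {
+                    blocks.push(body.join("\n"));
+                } else {
+                    body.push(line);
+                    current = Some(body);
+                }
+            }
+        }
+    }
+    blocks
+}
+```
+
+A parser like this would still need a rule for choosing the target file, which the Markdown block alone does not provide.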
+
+### Gap 3: Agent doesn't recover from "skipped" turns (no PR yet)
+
+Had the model emitted `<tool_call>` in turn 1 and the file edit succeeded, fixture 1's oracle (`cargo test`) would likely have passed, since the model's code is correct. However, the arena driver doesn't recognize "0 tool_uses across 20 turns" as a stuck state: it just keeps prompting "Continue:" and the model keeps re-emitting variations of its already-correct code in Markdown form.
+
+A future arena improvement: detect "N consecutive turns with `result.kind == 'skipped'`" and either (a) inject a more explicit prompt ("you MUST emit a tool_call to make progress"), or (b) terminate early with a more diagnostic outcome (`agent_text_loop` instead of `oracle_failed_after_max_turns`).
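+
+A minimal sketch of that detector, under stated assumptions: `ResultKind`, `Action`, and `SkipMonitor` are hypothetical types, not the arena driver's, and the two thresholds are illustrative parameters rather than values proposed in this note.
+
+```rust
+/// Result kinds as seen in `student.bench.json`; `Other` is a catch-all assumption.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum ResultKind {
+    Skipped,
+    Other,
+}
+
+/// What the driver could do after a turn (hypothetical actions).
+#[derive(Debug, PartialEq, Eq)]
+enum Action {
+    Continue,
+    /// Inject the more explicit "you MUST emit a tool_call" prompt.
+    Nudge,
+    /// End the fixture early with an `agent_text_loop` outcome.
+    TerminateTextLoop,
+}
+
+/// Counts consecutive skipped turns and escalates at two thresholds.
+struct SkipMonitor {
+    streak: usize,
+    nudge_at: usize,
+    terminate_at: usize,
+}
+
+impl SkipMonitor {
+    fn new(nudge_at: usize, terminate_at: usize) -> Self {
+        Self { streak: 0, nudge_at, terminate_at }
+    }
+
+    /// Record one turn's result and return the action to take next.
+    fn observe(&mut self, kind: ResultKind) -> Action {
+        if kind != ResultKind::Skipped {
+            // Any non-skipped turn counts as progress and resets the streak.
+            self.streak = 0;
+            return Action::Continue;
+        }
+        self.streak += 1;
+        if self.streak >= self.terminate_at {
+            Action::TerminateTextLoop
+        } else if self.streak == self.nudge_at {
+            Action::Nudge
+        } else {
+            Action::Continue
+        }
+    }
+}
+```
+
+Combining both responses lets the driver try the explicit prompt once before giving up, so a fixture like fixture 1 would end well before turn 20.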
+
+## Empirical conclusion
+
+V1_004 is **partially discharged**: the M287 prerequisite-violation pattern (uniform `driver_error` from infinite "Human:" loop) is broken. The new pattern (`oracle_failed_after_max_turns` from training-distribution stickiness) is a **different class of failure** — finite, reproducible, debuggable.
+
+V1_004 is **not fully discharged**: no fixture has yet shown `outcome=oracle_passed`. The bench continues; fixtures 2-20 will reveal whether the pattern is uniform (training-distribution-locked across all task types) or sporadic (some fixtures elicit tool_call format).
+
+## Bench timing observed
+
+- ~26min wall per fixture × 20 fixtures ≈ 520min ≈ **~8-9hr total**.
+- Per-turn pace: 20 turns / ~24min student wall = ~72sec/turn. At max_tokens=1024 that is at most ~14 tok/s, an upper bound because turns that stop early at `<|im_end|>` emit fewer tokens (consistent with M286 throughput floor).
+- apr serve subprocess: 21GB RSS (MoE model loaded), 830% CPU (8 cores active), 100% GPU utilization. Healthy.
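+
+The timing arithmetic above can be reproduced with a few helper functions. The function names are illustrative, and the inputs are the approximate figures quoted in this section.
+
+```rust
+/// Seconds per turn given the student wall time in minutes.
+fn seconds_per_turn(student_wall_min: f64, turns: u32) -> f64 {
+    student_wall_min * 60.0 / f64::from(turns)
+}
+
+/// Upper bound on tokens per second, assuming every turn emitted `max_tokens`.
+fn max_tokens_per_second(max_tokens: u32, secs_per_turn: f64) -> f64 {
+    f64::from(max_tokens) / secs_per_turn
+}
+
+/// Projected total wall time in hours for the whole bench.
+fn projected_hours(min_per_fixture: f64, fixtures: u32) -> f64 {
+    min_per_fixture * f64::from(fixtures) / 60.0
+}
+```
+
+With the observed values, `seconds_per_turn(24.0, 20)` gives 72 s, `max_tokens_per_second(1024, 72.0)` gives roughly 14.2 tok/s, and `projected_hours(26.0, 20)` gives roughly 8.7 hr.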
+
+## What this doc is
+
+- Document of the categorical pattern shift (M287 → M291)
+- Diagnosis of three independent gaps that still need to close
+- Authorization basis for aprender#1853 (Gap 1 fix; ready to merge)
+
+## What this doc is NOT
+
+- NOT a V1_004 discharge — `student_pass_rate > 0` is still the bar; only fixture 1 has data so far
+- NOT a guarantee fixtures 2-20 will follow the same `oracle_failed_after_max_turns` pattern — first fixture is one data point
+- NOT authorization for Gap 2 / Gap 3 follow-up work — those need operator direction (model class change is a much larger decision)
+
+## Cross-references
+
+- [aprender#1853 (clean_chat_output start-of-string strip)](https://github.com/paiml/aprender/pull/1853)
+- [aprender#1852 (EOS stop_token + clean_chat_output in MoE path)](https://github.com/paiml/aprender/pull/1852)
+- [aprender#1849 (few-shot prompt examples)](https://github.com/paiml/aprender/pull/1849)
+- [M290 follow-up snapshot](v1004-followup-snapshot-2026-05-20.md)
+- [M287 bench pattern (greedy baseline)](m32d-bench-pattern-2026-05-20.md)