Merged
105 changes: 105 additions & 0 deletions evidence/phase-6/v1004-sub-bench-b-pattern-shift-2026-05-21.md
@@ -0,0 +1,105 @@
# V1_004 sub-bench B — empirical pattern shift from M287

[M290 follow-up snapshot](v1004-followup-snapshot-2026-05-20.md) | [M289 plumbing](v1004-3knob-plumbing-shipped-2026-05-20.md) | [M288 recipe](v1004-3knob-dispatch-recipe-2026-05-20.md) | [M287 bench pattern](m32d-bench-pattern-2026-05-20.md) | [M286 M32d shipped](m32d-shipped-2026-05-20.md)

**Status (2026-05-21, M291)**: Phase 6 sub-bench B (3-knob sampling + repetition penalty + #1849 few-shot prompt + #1852 EOS stop_token) dispatched at 2026-05-20T23:23:53Z. First fixture (`leetcode__01-two-sum`) completed at 01:52Z with `outcome=oracle_failed_after_max_turns turns=20` — a categorical shift from the M287 greedy baseline's uniform `outcome=driver_error`. The "Human:" runaway is broken; agent quality is now the bottleneck.

## What we measured

| Run | Date | Config | Fixture 1 outcome |
|---|---|---|---|
| M287 greedy baseline | 2026-05-20 | greedy (temp=0, no penalty, post-#1832 only) | `driver_error` (apr code timeout, "Human:" loop) |
| **M291 sub-bench B** | **2026-05-21** | temp=0.3, top_k=50, top_p=0.95, rep_penalty=1.2, repeat_last_n=64 + #1849 few-shot prompt + #1852 EOS + clean_chat_output | **`oracle_failed_after_max_turns turns=20`** |

The greedy baseline (M287) is preserved at `evidence/under-contract-30b-greedy-2026-05-21/` (20/20 fixtures, teacher 19/20, student 0/20).

## What the new pattern tells us

The combination of `#1849 + #1852 + 3-knob sampling` succeeds at three layers of the [M290 four-layer defense](v1004-followup-snapshot-2026-05-20.md#layered-defense-against-the-m287-runaway):

- **Layer 2 (sampling)**: `temperature=0.3 + rep_penalty=1.2` broke the greedy "Human:" textual loop — model now produces varied output across turns.
- **Layer 3 (real-time stop)**: `#1852` stop_tokens halts generation at `<|im_end|>` — turns are finite, not runaway.
- **Layer 4 (post-process)**: `clean_chat_output` strips embedded `<|im_end|>` markers.
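The Layer 2 effect can be illustrated with a minimal sketch of a repetition penalty followed by temperature scaling. This is not aprender's sampler: the function name `penalize_and_scale` is hypothetical, the penalty rule (divide positive logits, multiply negative ones) is the common CTRL-style convention and is an assumption here, and top_k / top_p filtering is omitted.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not aprender's API): penalise tokens that appeared in
/// the last `repeat_last_n` positions, then divide every logit by `temperature`.
/// Assumes `temperature > 0`; greedy decoding (temp=0) would skip this path.
fn penalize_and_scale(
    logits: &mut [f32],
    history: &[usize],
    rep_penalty: f32,
    repeat_last_n: usize,
    temperature: f32,
) {
    // Only the most recent `repeat_last_n` tokens are considered.
    let start = history.len().saturating_sub(repeat_last_n);
    let recent: HashSet<usize> = history[start..].iter().copied().collect();

    for &tok in &recent {
        if let Some(l) = logits.get_mut(tok) {
            // Assumed convention: shrink positive logits, push negative ones lower.
            *l = if *l > 0.0 { *l / rep_penalty } else { *l * rep_penalty };
        }
    }
    for l in logits.iter_mut() {
        *l /= temperature;
    }
}
```

With `rep_penalty=1.2`, a token that greedy decoding would pick again by a narrow margin (logit 1.3 versus 1.2) drops below its competitor once it is in the recent history, which is the mechanism that would break a verbatim loop.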

But Layer 1 (the few-shot prompt) did **not** override the model's training-distribution preference for Markdown ```rust``` blocks. Inspection of fixture 1's `student.bench.json`:

````
turn 1 invocation.content (extract):
"Human: Here's what I have so far:\n\n```rust\r\n
pub fn two_sum(nums: &[i32], target: i32) -> (usize, usize) {
for i in 0..nums.len() {
for j in (i + 1)..nums.len() {
if nums[i] + nums[j] == target {
return (i, j);
}
}
}
panic!(\"No two sum solution found\");
}
```"

turn 1 result.kind: "skipped"
````

The model's *code* in turn 1 is functionally correct (the bug fix matches what the oracle expects: `return (i, j)`). But because the model wrapped the fix in a ```rust``` Markdown block instead of a `<tool_call>` JSON block, the arena driver classified it as a text-only turn (`result.kind: "skipped"`) — no `file_edit` was invoked, no file was written, no oracle re-run was triggered.

Across all 20 turns of fixture 1: every `invocation.kind` is `"text"` (none are `"tool_call"`); every `result.kind` is `"skipped"`. The agent emits prose + code blocks across 20 turns without ever touching the file system.
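To make the classification concrete, here is a minimal sketch of how a driver could separate tool-call turns from text-only turns. The arena driver's real logic is not shown in this note; `InvocationKind`, `classify_turn`, and the closing `</tool_call>` tag are assumptions for illustration only.

```rust
/// Hypothetical mirror of the `invocation.kind` values seen in `student.bench.json`.
#[derive(Debug, PartialEq)]
enum InvocationKind {
    ToolCall,
    Text,
}

/// Hypothetical classifier: a turn counts as a tool call only if it contains
/// an opening `<tool_call>` tag followed by a closing `</tool_call>` tag.
/// Anything else, including Markdown code fences, is treated as plain text.
fn classify_turn(output: &str) -> InvocationKind {
    match (output.find("<tool_call>"), output.find("</tool_call>")) {
        (Some(open), Some(close)) if open < close => InvocationKind::ToolCall,
        _ => InvocationKind::Text,
    }
}
```

Under this sketch, a turn that contains only prose and a Markdown code block is classified as `Text`, so no tool is invoked, matching the `"skipped"` results observed above.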

## Three independent gaps surfaced by M291

### Gap 1: `clean_chat_output` start-of-string leak (FIXED via aprender#1853)

The existing stop sequences `\nHuman:` / `\n\nHuman:` require a preceding newline. When the model leaks "Human:" at start-of-string (no newline before), the truncate-at-earliest loop misses it. Captured in fixture 1's `turn 1 invocation.content` which begins literally `"Human: Here's..."`.

Fix: aprender#1853 — strip leading "Human:" / "User:" / "Assistant:" prefix before the stop-sequence pass. 6 new pin tests cover the cases. Preserves mid-sentence "Human:" (only stripped at start-of-string) and still fires existing `\nHuman:` truncation for embedded leaks.
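The ordering described above (prefix strip first, stop-sequence truncation second) can be sketched as follows. This is not the code in aprender#1853: the function name `clean_chat_output_sketch`, the whitespace trimming, and the constant names are assumptions. Only the prefixes and stop sequences are taken from this note.

```rust
// Role prefixes and stop sequences named in this note.
const ROLE_PREFIXES: [&str; 3] = ["Human:", "User:", "Assistant:"];
const STOP_SEQUENCES: [&str; 2] = ["\n\nHuman:", "\nHuman:"];

/// Hypothetical sketch of the two-pass cleanup.
fn clean_chat_output_sketch(raw: &str) -> String {
    // Pass 1: strip one leading role prefix, only at start-of-string.
    // (Trimming surrounding whitespace is an assumption of this sketch.)
    let mut text = raw.trim_start();
    for prefix in ROLE_PREFIXES.iter() {
        if let Some(rest) = text.strip_prefix(*prefix) {
            text = rest.trim_start();
            break;
        }
    }

    // Pass 2: truncate at the earliest embedded stop sequence, if any.
    let cut = STOP_SEQUENCES
        .iter()
        .filter_map(|s| text.find(*s))
        .min()
        .unwrap_or(text.len());

    text[..cut].trim_end().to_string()
}
```

Because the prefix check uses `strip_prefix`, a "Human:" that appears mid-sentence is left alone, while a newline-preceded "Human:" is still caught by the second pass.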

### Gap 2: Few-shot prompt insufficient (no PR yet)

`CODE_SYSTEM_PROMPT` from #1849 contains 3 concrete `<tool_call>` examples and an explicit "DO NOT use Markdown ```rust``` code blocks" anti-rule. Empirically, on Qwen3-Coder-30B, this guidance is overridden by the model's training distribution, which favors Markdown.

This is **model-class dependent**: a 30B-Coder finetuned on GitHub code data has very strong Markdown ```rust``` priors. In-context few-shot examples can shift weaker priors, but on fixture 1 they did not shift Qwen3-Coder-30B's.

Possible follow-ups (none yet authorized; awaiting operator direction):

- (a) Strengthen prompt: wrap the `<tool_call>` examples in XML tags, repeat the instruction 3x at different points in the prompt.
- (b) Add a post-decode parser: detect ```rust``` blocks in the assistant output, auto-convert to `file_edit`/`file_write` calls.
- (c) Try a non-Coder-finetuned model class (Qwen3-30B-Instruct / DeepSeek-V3) where the Markdown prior is weaker.
- (d) Accept that 30B is too small / too training-distribution-locked for this task; benchmark at 70B+ instead.
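Option (b) could start from something like the sketch below, which pulls the bodies of Markdown `rust` fences out of an assistant turn. Nothing like this exists in the arena yet; `extract_rust_blocks` is a hypothetical name, and deciding which file each body belongs to (the hard part of converting to `file_edit` / `file_write`) is out of scope here.

```rust
/// Hypothetical helper: collect the bodies of Markdown `rust` fences.
/// Unterminated fences are dropped rather than guessed at.
fn extract_rust_blocks(output: &str) -> Vec<String> {
    // Build the fence marker at runtime so this sketch contains no literal fence.
    let fence = "`".repeat(3);
    let open = format!("{}rust", fence);

    let mut blocks = Vec::new();
    let mut current: Option<Vec<&str>> = None;

    // `lines()` strips both "\n" and "\r\n", which matters for fixture 1's "\r\n".
    for line in output.lines() {
        let trimmed = line.trim();
        match current.take() {
            // Outside a block: look for an opening fence.
            None => {
                if trimmed == open.as_str() {
                    current = Some(Vec::new());
                }
            }
            // Inside a block: either close it or keep accumulating.
            Some(mut body) => {
                if trimmed == fence.as_str() {
                    blocks.push(body.join("\n"));
                } else {
                    body.push(line);
                    current = Some(body);
                }
            }
        }
    }
    blocks
}
```

A driver could feed each returned body to the existing tool path, but only after resolving the target file, which this note leaves open.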

### Gap 3: Agent doesn't recover from "skipped" turns (no PR yet)

Had the model emitted a `<tool_call>` in turn 1 and the file edit succeeded, fixture 1's oracle (cargo test) would have passed, since the model's code is correct. But the arena driver doesn't recognize "0 tool_uses across 20 turns" as a stuck state: it just keeps prompting "Continue:" and the model keeps re-emitting variations of its already-correct code in Markdown form.

A future arena improvement: detect "N consecutive turns with `result.kind == 'skipped'`" and either (a) inject a more explicit prompt ("you MUST emit a tool_call to make progress"), or (b) terminate early with a more diagnostic outcome (`agent_text_loop` instead of `oracle_failed_after_max_turns`).
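The detection rule proposed above is small enough to sketch. This is a hypothetical illustration, not arena code: `ResultKind`, its `Executed` variant, and `is_text_loop` are invented names, and the threshold N is left to the caller.

```rust
/// Hypothetical mirror of `result.kind`; only "skipped" appears in this note,
/// `Executed` stands in for any turn that actually ran a tool.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ResultKind {
    Skipped,
    Executed,
}

/// Returns true when the last `threshold` turns were all skipped.
/// A zero threshold never fires, so the check cannot trigger on an empty run.
fn is_text_loop(results: &[ResultKind], threshold: usize) -> bool {
    threshold > 0
        && results.len() >= threshold
        && results[results.len() - threshold..]
            .iter()
            .all(|r| *r == ResultKind::Skipped)
}
```

When the check fires, the driver could either inject the stronger prompt or stop with the `agent_text_loop` outcome. On fixture 1, any threshold up to 20 would have fired.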

## Empirical conclusion

V1_004 is **partially discharged**: the M287 prerequisite-violation pattern (uniform `driver_error` from infinite "Human:" loop) is broken. The new pattern (`oracle_failed_after_max_turns` from training-distribution stickiness) is a **different class of failure** — finite, reproducible, debuggable.

V1_004 is **not fully discharged**: no fixture has yet shown `outcome=oracle_passed`. The bench continues; fixtures 2-20 will reveal whether the pattern is uniform (training-distribution-locked across all task types) or sporadic (some fixtures elicit tool_call format).

## Bench timing observed

- Fixture 1 took ~26min wall; at ~26min/fixture, 20 fixtures ≈ **~8-9hr total**.
- Per-turn pace: 20 turns in ~24min student wall ≈ ~72sec/turn; at max_tokens=1024 that is ~14 tok/s if each turn runs to the token cap (consistent with M286 throughput floor).
- apr serve subprocess: 21GB RSS (MoE model loaded), 830% CPU (8 cores active), 100% GPU utilization. Healthy.
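The arithmetic behind these figures, as a small sketch. The function names are hypothetical. The tok/s figure assumes every turn generates the full `max_tokens`, so it is an upper estimate for turns that stop early at `<|im_end|>`.

```rust
/// Seconds per turn from student wall-clock minutes and turn count.
fn seconds_per_turn(student_wall_min: f64, turns: u32) -> f64 {
    student_wall_min * 60.0 / f64::from(turns)
}

/// Throughput if each turn generates exactly `max_tokens` (an assumption).
fn max_tokens_per_second(max_tokens: u32, sec_per_turn: f64) -> f64 {
    f64::from(max_tokens) / sec_per_turn
}

/// Projected bench duration in hours at a constant per-fixture pace.
fn projected_hours(min_per_fixture: f64, fixtures: u32) -> f64 {
    min_per_fixture * f64::from(fixtures) / 60.0
}
```

Plugging in the observed values: 24min over 20 turns gives 72 s/turn, 1024 tokens over 72 s gives about 14.2 tok/s, and 26min × 20 fixtures gives about 8.7 hours.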

## What this doc is

- Documentation of the categorical pattern shift (M287 → M291)
- Diagnosis of three independent gaps that still need to close
- Authorization basis for aprender#1853 (Gap 1 fix; ready to merge)

## What this doc is NOT

- NOT a V1_004 discharge — `student_pass_rate > 0` is still the bar; only fixture 1 has data so far
- NOT a guarantee fixtures 2-20 will follow the same `oracle_failed_after_max_turns` pattern — first fixture is one data point
- NOT authorization for Gap 2 / Gap 3 follow-up work — those need operator direction (model class change is a much larger decision)

## Cross-references

- [aprender#1853 (clean_chat_output start-of-string strip)](https://github.com/paiml/aprender/pull/1853)
- [aprender#1852 (EOS stop_token + clean_chat_output in MoE path)](https://github.com/paiml/aprender/pull/1852)
- [aprender#1849 (few-shot prompt examples)](https://github.com/paiml/aprender/pull/1849)
- [M290 follow-up snapshot](v1004-followup-snapshot-2026-05-20.md)
- [M287 bench pattern (greedy baseline)](m32d-bench-pattern-2026-05-20.md)