docs(M287): M32d bench pattern — agent-quality bottleneck observed mid-run by noahgift · Pull Request #255 · paiml/claude-code-parity-apr

noahgift · 2026-05-20T08:01:32Z

Summary

Mid-run evidence doc capturing the empirical pattern from the V1_004 bench dispatched at 05:58Z with the post-M32d apr binary (`4a04aaef9`). After 3 fixtures completed, every one shows the same result: `driver_error, turns_before_error: 4`. M32d raised completed turns per fixture from 0 to 4, but the 30B-MoE student is too verbose to land a passing edit within budget.
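A minimal sketch of how the "uniform pattern" claim could be checked mechanically. The record shape and fixture names here are hypothetical placeholders, not the bench's actual output format; only the `driver_error` and `turns_before_error: 4` values come from this PR.

```python
from collections import Counter

# Hypothetical per-fixture records. The fixture names and dict layout are
# illustrative assumptions; only the outcome and turn values mirror the PR.
results = [
    {"fixture": "fixture_a", "outcome": "driver_error", "turns_before_error": 4},
    {"fixture": "fixture_b", "outcome": "driver_error", "turns_before_error": 4},
    {"fixture": "fixture_c", "outcome": "driver_error", "turns_before_error": 4},
]


def uniform_pattern(records):
    """Return the (outcome, turns) pair shared by every record, else None."""
    counts = Counter((r["outcome"], r["turns_before_error"]) for r in records)
    # Exactly one distinct pair means all fixtures failed the same way.
    return next(iter(counts)) if len(counts) == 1 else None


pattern = uniform_pattern(results)
print(pattern)
```

If a later fixture reports a different outcome (for example `oracle_passed`), the function returns `None`, which would be the signal to update the doc.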

Headline

M32d converted V1_004's blocker from infrastructure to agent quality. Pre-M32d: 0 turns per fixture (engine too slow). Post-M32d: 4 turns per fixture (engine fast; model verbose). Higher-quality problem class.

What this doc captures

- 3-fixture pattern table (teacher passes, student gets 4 turns then timeout)
- Turn-1 content excerpt showing model rambles instead of tool-calling
- Empirical shift table (pre-M32d vs post-M32d)
- 6 paths forward (sampling via aprender#1837, lower max_tokens, prompt scaffolding, etc.)
- Status reconciliation across V1_001..V1_004

What this is NOT

- NOT a V1_004 discharge celebration: bench is still running; no fixture has passed yet
- NOT a regression: M32d's purpose was met
- NOT final: doc will be updated if any fixture surprises with `oracle_passed`

Mechanical evidence doc. M-counter NOT bumped per discipline doctrine.

Test plan

- `bash scripts/test-doc-drift.sh` (if reachable): clean
- CI gate + workspace-test (pre-existing inherited failure)

🤖 Generated with Claude Code

…d-run V1_004 treatment bench dispatched at 05:58Z with post-M32d apr binary (4a04aaef9). After 3 fixtures completed: uniform pattern of `driver_error, turns_before_error: 4`. M32d enabled 4× turn throughput vs pre-M32d (0 turns), but the 30B-MoE student is too verbose to land a passing edit within 4 turns × 900s per-turn timeout.

## Empirical shift

| | Pre-M32d | Post-M32d |
|---|---|---|
| Turns completed before exit 124 | 0 | 4 |
| Token-rate via HTTP smoke | ~0.5 tok/s | 7.3 tok/s |
| **Bottleneck** | infrastructure (full-prefill) | agent quality (verbosity) |

M32d converted the V1_004 blocker from "engine too slow" to "agent doesn't converge fast enough." Higher-quality problem class.

## What this doc captures

- Empirical evidence from 3 completed fixtures + 1 in-progress
- Turn-1 content excerpt showing model rambles instead of tool-calling
- 6 paths forward (lowering temperature via #1837, lowering max_tokens cap, prompt template edit, repetition penalty, different student, prompt scaffolding)
- Status reconciliation across V1_001..V1_004

## What this is NOT

- NOT a V1_004 discharge celebration: the bench is still running and no fixture has passed yet
- NOT a regression: M32d's purpose was met; V1_004 was always conditional on student model capability
- NOT a final result: doc will be updated if any fixture surprises with oracle_passed

Mechanical evidence doc. M-counter NOT bumped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
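As a rough back-of-the-envelope check on the numbers in the commit message, the sketch below estimates how many tokens fit inside one 900 s turn at each measured rate. It assumes the HTTP-smoke token rate also holds during bench decoding and ignores prefill time; both are simplifications, and the constant and function names are illustrative only.

```python
# Figures quoted in the commit message above.
PER_TURN_TIMEOUT_S = 900   # per-turn timeout
TURNS_OBSERVED = 4         # turns completed before exit 124 (post-M32d)
TOK_RATE_PRE = 0.5         # approx. tok/s via HTTP smoke, pre-M32d
TOK_RATE_POST = 7.3        # tok/s via HTTP smoke, post-M32d


def max_tokens_per_turn(tok_per_s, timeout_s=PER_TURN_TIMEOUT_S):
    """Upper bound on tokens decoded in one turn, assuming a constant rate."""
    return round(tok_per_s * timeout_s)


pre = max_tokens_per_turn(TOK_RATE_PRE)
post = max_tokens_per_turn(TOK_RATE_POST)
# Wall-clock ceiling if every observed turn ran to its full timeout (assumption).
wall_clock_ceiling_s = TURNS_OBSERVED * PER_TURN_TIMEOUT_S

print(f"pre-M32d:  <= {pre} tokens per turn")
print(f"post-M32d: <= {post} tokens per turn")
print(f"4 turns at full timeout: {wall_clock_ceiling_s} s")
```

Under these assumptions a single post-M32d turn can emit several thousand tokens before timing out, which is consistent with the "lower max_tokens cap" path listed above: capping generation well below that ceiling would leave room for more turns in the same wall-clock budget.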

noahgift merged commit 39b2a27 into main May 20, 2026
1 check failed

noahgift deleted the m287-m32d-empirical-pattern branch May 20, 2026 08:01

noahgift merged 1 commit into main from m287-m32d-empirical-pattern
