
docs(M287): M32d bench pattern — agent-quality bottleneck observed mid-run #255

Merged
noahgift merged 1 commit into main from m287-m32d-empirical-pattern
May 20, 2026


@noahgift


Summary

Mid-run evidence doc capturing the empirical pattern from the V1_004 bench dispatched at 05:58Z with the post-M32d apr binary (`4a04aaef9`). After 3 fixtures completed, the pattern is uniform: `driver_error, turns_before_error: 4`. M32d raised turn throughput from 0 to 4 turns per fixture, but the 30B-MoE student is too verbose to land a passing edit within budget.

Headline

M32d converted V1_004's blocker from infrastructure to agent quality. Pre-M32d: 0 turns per fixture (engine too slow). Post-M32d: 4 turns per fixture (engine fast; model verbose). Higher-quality problem class.

What this doc captures

  • 3-fixture pattern table (teacher passes, student gets 4 turns then timeout)
  • Turn-1 content excerpt showing model rambles instead of tool-calling
  • Empirical shift table (pre-M32d vs post-M32d)
  • 6 paths forward (sampling via aprender#1837, lower max_tokens, prompt scaffolding, etc.)
  • Status reconciliation across V1_001..V1_004

What this is NOT

  • NOT a V1_004 discharge celebration — bench is still running; no fixture has passed yet
  • NOT a regression — M32d's purpose was met
  • NOT final — doc will be updated if any fixture surprises with `oracle_passed`

Mechanical evidence doc. M-counter NOT bumped per discipline doctrine.

Test plan

  • `bash scripts/test-doc-drift.sh` (if reachable) — clean
  • CI gate + workspace-test (pre-existing inherited failure)

🤖 Generated with Claude Code

…d-run

V1_004 treatment bench dispatched at 05:58Z with post-M32d apr binary
(4a04aaef9). After 3 fixtures completed, the pattern is uniform:
`driver_error, turns_before_error: 4`. M32d raised turn throughput from
0 turns (pre-M32d) to 4 turns per fixture, but the 30B-MoE student is
too verbose to land a passing edit within 4 turns × 900s per-turn
timeout.

## Empirical shift

| | Pre-M32d | Post-M32d |
|---|---|---|
| Turns completed before exit 124 | 0 | 4 |
| Token-rate via HTTP smoke | ~0.5 tok/s | 7.3 tok/s |
| **Bottleneck** | infrastructure (full-prefill) | agent quality (verbosity) |

M32d converted the V1_004 blocker from "engine too slow" to "agent
doesn't converge fast enough." Higher-quality problem class.
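
As a rough illustration of the shift, the figures quoted above can be turned into a per-turn token budget. This is only a back-of-the-envelope sketch: it assumes the HTTP smoke-test rates hold for the whole turn and ignores prefill time, neither of which this doc confirms. The function and variable names are made up for the example.

```python
# Figures quoted in the table and summary above.
PRE_TOK_PER_S = 0.5    # approximate pre-M32d rate from the HTTP smoke test
POST_TOK_PER_S = 7.3   # post-M32d rate from the HTTP smoke test
TURN_TIMEOUT_S = 900   # per-turn timeout
TURNS_OBSERVED = 4     # turns completed before the driver error


def tokens_per_turn(tok_per_s: float, timeout_s: int = TURN_TIMEOUT_S) -> float:
    """Upper bound on tokens emitted in one turn at a steady rate.

    Assumption (not from the bench): the rate is constant and prefill is free.
    """
    return tok_per_s * timeout_s


speedup = POST_TOK_PER_S / PRE_TOK_PER_S
pre_budget = tokens_per_turn(PRE_TOK_PER_S)
post_budget = tokens_per_turn(POST_TOK_PER_S)
wall_clock_s = TURNS_OBSERVED * TURN_TIMEOUT_S

print(f"token-rate speedup: {speedup:.1f}x")
print(f"per-turn token ceiling: {pre_budget:.0f} -> {post_budget:.0f}")
print(f"wall clock for {TURNS_OBSERVED} full turns: {wall_clock_s} s")
```

Under these assumptions the per-turn ceiling grows from a few hundred tokens to several thousand, which is consistent with the reading that verbosity, not engine speed, is now what exhausts the budget.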

## What this doc captures

- Empirical evidence from 3 completed fixtures + 1 in-progress
- Turn-1 content excerpt showing model rambles instead of tool-calling
- 6 paths forward (lowering temperature via #1837, lowering max_tokens
  cap, prompt template edit, repetition penalty, different student,
  prompt scaffolding)
- Status reconciliation across V1_001..V1_004

## What this is NOT

- NOT a V1_004 discharge celebration — the bench is still running and
  no fixture has passed yet
- NOT a regression — M32d's purpose was met; V1_004 was always
  conditional on student model capability
- NOT a final result — doc will be updated if any fixture surprises
  with oracle_passed

Mechanical evidence doc. M-counter NOT bumped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift merged commit 39b2a27 into main on May 20, 2026
1 check failed
noahgift deleted the m287-m32d-empirical-pattern branch on May 20, 2026 08:01