docs(M287): M32d bench pattern - agent-quality bottleneck observed mid-run #255
Merged
…d-run V1_004 treatment bench dispatched at 05:58Z with post-M32d apr binary (4a04aaef9). After 3 fixtures completed: uniform pattern of `driver_error, turns_before_error: 4`. M32d raised the number of turns completed per fixture from 0 (pre-M32d) to 4, but the 30B-MoE student is too verbose to land a passing edit within 4 turns × 900s per-turn timeout.

## Empirical shift

| | Pre-M32d | Post-M32d |
|---|---|---|
| Turns completed before exit 124 | 0 | 4 |
| Token-rate via HTTP smoke | ~0.5 tok/s | 7.3 tok/s |
| **Bottleneck** | infrastructure (full-prefill) | agent quality (verbosity) |

M32d converted the V1_004 blocker from "engine too slow" to "agent doesn't converge fast enough." Higher-quality problem class.

## What this doc captures

- Empirical evidence from 3 completed fixtures + 1 in-progress
- Turn-1 content excerpt showing the model rambles instead of tool-calling
- 6 paths forward (lowering temperature via #1837, lowering max_tokens cap, prompt template edit, repetition penalty, different student, prompt scaffolding)
- Status reconciliation across V1_001..V1_004

## What this is NOT

- NOT a V1_004 discharge celebration: the bench is still running and no fixture has passed yet
- NOT a regression: M32d's purpose was met; V1_004 was always conditional on student model capability
- NOT a final result: doc will be updated if any fixture surprises with oracle_passed

Mechanical evidence doc. M-counter NOT bumped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Mid-run evidence doc capturing the empirical pattern from the V1_004 bench dispatched at 05:58Z with the post-M32d apr binary (`4a04aaef9`). After 3 fixtures completed: uniform `driver_error, turns_before_error: 4` pattern. M32d raised the number of turns completed per fixture from 0 to 4, but the 30B-MoE student is too verbose to land a passing edit within the budget of 4 turns at a 900s per-turn timeout.
Headline
M32d converted V1_004's blocker from infrastructure to agent quality. Pre-M32d: 0 turns per fixture (engine too slow). Post-M32d: 4 turns per fixture (engine fast; model verbose). Higher-quality problem class.
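As a rough back-of-envelope check on the "engine fast; model verbose" reading, the sketch below combines the figures quoted in this PR (900s per-turn timeout, 4 turns before error, ~0.5 and 7.3 tok/s from the HTTP smoke). It assumes the smoke-test rate carries over to the bench and ignores prefill and tool-execution time, so the results are upper bounds only. The function names and the 1024-token cap are hypothetical and not taken from the bench harness.

```python
# Figures quoted in this PR; everything derived from them is an estimate.
PER_TURN_TIMEOUT_S = 900   # per-turn timeout
TURNS_BEFORE_ERROR = 4     # turns_before_error on each completed fixture
RATE_PRE_TOK_S = 0.5       # approximate HTTP smoke rate before M32d
RATE_POST_TOK_S = 7.3      # HTTP smoke rate after M32d


def decode_ceiling(rate_tok_s: float, seconds: float = PER_TURN_TIMEOUT_S) -> int:
    """Upper bound on tokens generated in `seconds` at a steady decode rate."""
    return round(rate_tok_s * seconds)


def turn_seconds(max_tokens: int, rate_tok_s: float) -> float:
    """Decode-only time for a turn that runs all the way to a max_tokens cap."""
    return max_tokens / rate_tok_s


if __name__ == "__main__":
    print("wall-clock ceiling per fixture (s):",
          TURNS_BEFORE_ERROR * PER_TURN_TIMEOUT_S)
    print("pre-M32d token ceiling per turn:", decode_ceiling(RATE_PRE_TOK_S))
    print("post-M32d token ceiling per turn:", decode_ceiling(RATE_POST_TOK_S))
    # Hypothetical cap of 1024 tokens, chosen only for illustration.
    print("decode time at a 1024-token cap (s):",
          round(turn_seconds(1024, RATE_POST_TOK_S)))
```

Under these assumptions a single turn can spend thousands of tokens before the timeout fires, which is consistent with verbosity rather than engine speed consuming the budget. The same arithmetic gives a rough way to size a max_tokens cap against a target turn duration.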
What this doc captures
What this is NOT
Mechanical evidence doc. M-counter NOT bumped per discipline doctrine.
Test plan
🤖 Generated with Claude Code