feat(M292): Agent-Text-Loop detector - closes M291 Gap 3 #260
Merged
…opt-in cap
Closes M291 Gap 3 (arena driver doesn't recover from skipped turns).
Motivated by V1_004 sub-bench B fixture-1 pattern on Qwen3-Coder-30B-A3B:
20 consecutive text-only turns (every turn invocation.kind = "text",
every result.kind = "skipped") — the agent never invoked any tool.
This PR adds:
1. `ArenaOutcome::AgentTextLoop { consecutive_text_turns, last_text_excerpt }`
variant — captures the "talking but not acting" failure class
distinctly from `OracleFailedAfterMaxTurns`.
2. `ArenaSession::with_max_consecutive_text_turns(cap)` builder. cap=0
(default) disables the detector — preserves M287/M291 baseline.
3. `AgentTextLoopState` rolling counter (parallel to ComplianceTrapState):
text invocation increments, non-text resets, cap triggers AgentTextLoop.
4. `--max-consecutive-text-turns` CLI flag on ccpa-arena-bench (default 0).
5. 7 new tests in session::tests.
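The counter described in item 3 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the field names `consecutive_text_turns` and `last_text_excerpt` come from the variant above, while `InvocationKind`, `observe`, the field types, and the 80-character excerpt limit are assumptions made for the example.

```rust
/// Hypothetical stand-in for what a turn produced (not the crate's real type).
#[derive(Debug, Clone, PartialEq)]
enum InvocationKind {
    Text(String),
    Tool,
}

/// Only the variant named in this PR; the real enum has other variants.
#[derive(Debug, Clone, PartialEq)]
enum ArenaOutcome {
    AgentTextLoop {
        consecutive_text_turns: u32,
        last_text_excerpt: String,
    },
}

/// Rolling counter: text turns increment, any non-text turn resets.
#[derive(Debug, Default)]
struct AgentTextLoopState {
    cap: u32, // 0 disables the detector
    consecutive_text_turns: u32,
    last_text_excerpt: String,
}

impl AgentTextLoopState {
    fn new(cap: u32) -> Self {
        Self { cap, ..Self::default() }
    }

    /// Feed one turn; returns `Some(outcome)` once the cap is reached.
    /// `observe` is a hypothetical method name.
    fn observe(&mut self, invocation: &InvocationKind) -> Option<ArenaOutcome> {
        if self.cap == 0 {
            return None; // opt-in: disabled by default
        }
        match invocation {
            InvocationKind::Text(text) => {
                self.consecutive_text_turns += 1;
                // Assumed excerpt length of 80 characters.
                self.last_text_excerpt = text.chars().take(80).collect();
            }
            InvocationKind::Tool => {
                self.consecutive_text_turns = 0;
                self.last_text_excerpt.clear();
            }
        }
        if self.consecutive_text_turns >= self.cap {
            Some(ArenaOutcome::AgentTextLoop {
                consecutive_text_turns: self.consecutive_text_turns,
                last_text_excerpt: self.last_text_excerpt.clone(),
            })
        } else {
            None
        }
    }
}
```

With a cap of 3, a tool call between text turns restarts the count, so only three text turns in a row produce the outcome. With a cap of 0 the state never fires, which matches the "preserves baseline" default described above.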
Opt-in by design: enabling by default would shift outcome distributions
for existing evidence comparisons. Operator decides per-run whether to
trade off early-bailout savings (~6hr × 20 fixtures for V1_004 future
runs) vs uniform 20-turn dispatch baseline.
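The savings estimate can be reproduced with simple arithmetic. The sketch below assumes every turn costs the same wall-clock time and that every fixture hits the text loop, so it is an upper bound rather than a measurement. The 8hr / 20-turn and turn-5 figures are the ones quoted in the PR description; the function names are hypothetical.

```rust
/// Wall-clock hours for one fixture, assuming a uniform per-turn cost.
fn fixture_hours(turns_run: u32, hours_per_turn: f64) -> f64 {
    f64::from(turns_run) * hours_per_turn
}

/// Hours saved across a bench if every fixture bails out at `cap` turns
/// instead of running the full `max_turns` budget. `cap == 0` means the
/// detector is disabled, so nothing is saved.
fn bench_hours_saved(fixtures: u32, max_turns: u32, cap: u32, hours_per_turn: f64) -> f64 {
    let turns_run = if cap == 0 { max_turns } else { cap.min(max_turns) };
    let per_fixture =
        fixture_hours(max_turns, hours_per_turn) - fixture_hours(turns_run, hours_per_turn);
    f64::from(fixtures) * per_fixture
}
```

At 8hr per 20 turns (0.4hr per turn), a cap of 5 ends a looping fixture after about 2hr instead of 8hr, saving about 6hr per fixture, or about 120hr over 20 fixtures in the worst case where all of them loop.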
What this does NOT do:
- Auto-enable in scripts/phase-6-bench.sh (operator-coordinated decision)
- Change compliance_cost_ratio / recovery_rate aggregate semantics
- Discharge V1_004 (still requires student_pass_rate > 0)
- Bump M-counter on cross-reference surfaces (Phase 6 in active bench run)
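To illustrate why the detector cannot discharge V1_004 on its own: the PR description states that aggregates treat `AgentTextLoop` as "not oracle_passed". A rough sketch of that treatment is below. The `OraclePassed` variant, the `oracle_passed` method, and the exact definition of `student_pass_rate` as passed / total are assumptions for illustration; only `AgentTextLoop` and `OracleFailedAfterMaxTurns` are named in this PR.

```rust
#[allow(dead_code)] // fields are carried for reporting, not read in this sketch
#[derive(Debug, Clone, PartialEq)]
enum ArenaOutcome {
    OraclePassed, // hypothetical name for the passing outcome
    OracleFailedAfterMaxTurns,
    AgentTextLoop {
        consecutive_text_turns: u32,
        last_text_excerpt: String,
    },
}

impl ArenaOutcome {
    /// Only an actual oracle pass counts; an early text-loop bailout does not.
    fn oracle_passed(&self) -> bool {
        matches!(self, ArenaOutcome::OraclePassed)
    }
}

/// Assumed definition: fraction of fixtures whose oracle passed.
fn student_pass_rate(outcomes: &[ArenaOutcome]) -> f64 {
    if outcomes.is_empty() {
        return 0.0;
    }
    let passed = outcomes.iter().filter(|o| o.oracle_passed()).count();
    passed as f64 / outcomes.len() as f64
}
```

Under this reading, replacing an `OracleFailedAfterMaxTurns` result with `AgentTextLoop` changes the outcome label but leaves the pass rate untouched.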
All 146 ccpa-arena lib tests pass. Doc-drift detector: 17/17.
Refs:
- M291 evidence: evidence/phase-6/v1004-sub-bench-b-pattern-shift-2026-05-21.md
- aprender#1853 (M291 Gap 1 fix; in flight)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 22, 2026
Adds:
- book/: mdBook source for paiml.github.io/claude-code-parity-apr
- .github/workflows/book.yml: CI build + GitHub Pages auto-deploy
- README.md restructured for professional landing (badges row, book callout, empirical highlight section, deep-links to book chapters)
- .gitignore: book/book/ (generated artifact)
Book structure (28 chapters):
- Introduction
- Overview: what is CCPA, methodology, two paths, architecture
- Static path: trace schema, differ, fixtures, bidirectional sensitivity
- Arena: overview, phase 5, phase 6, outcome variants
- Falsification gates: 20 gates, source-of-truth, behavioral parity, status flow
- Empirical findings: V1_004 chain (M286, M287, M291, M292, M294)
- Reference: CLI, trace schema, contract YAML, gate IDs
- Appendix: academic basis, milestone history, glossary
Build locally: mdbook build book/ -> book/book/index.html
Deploy: GitHub Pages auto-deploys on push to main when book/ changes.
Doc-drift detector: 17/17 drift classes pass.
Refs:
- evidence/phase-6/v1004-*.md (all sourced into book chapters)
- CCPA#259 M291, #260 M292, #261 M293, #262 M294 scope
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
- `ArenaOutcome::AgentTextLoop { consecutive_text_turns, last_text_excerpt }` variant: captures the "talking but not acting" failure class distinctly from `OracleFailedAfterMaxTurns`.
- `ArenaSession::with_max_consecutive_text_turns(cap)` builder + `--max-consecutive-text-turns` CLI flag. Opt-in (default `0` = disabled, preserves M287/M291 baseline).
- `AgentTextLoopState` rolling counter, parallel to `ComplianceTrapState`.
- 7 new tests in `session::tests` (state machine + integration via MockDriver).
Why
V1_004 sub-bench B fixture 1 (M291) recorded 20 consecutive text-only turns on Qwen3-Coder-30B: every `invocation.kind = "text"`, every `result.kind = "skipped"`. The agent emitted prose + Markdown blocks for the entire 20-turn budget without ever touching the file system.
Without M292, bench operators pay ~8hr of bench wall to discover a pattern that could be diagnosed at turn 5 (~2hr, a 4× speedup). The post-hoc `OracleFailedAfterMaxTurns` outcome also conflates "agent worked but produced wrong output" with "agent never engaged the toolchain." M292 separates them.
What this does NOT do
- Auto-enable in `scripts/phase-6-bench.sh`: operator decides per-run.
- Change `compliance_cost_ratio` / `recovery_rate` semantics: `AgentTextLoop` is a new variant; aggregates treat it as "not oracle_passed."
- Discharge V1_004: `student_pass_rate > 0` is still the bar.
Test plan
- `cargo test -p ccpa-arena --lib agent_text_loop`: 7/7 new tests pass
- `cargo test -p ccpa-arena --lib`: all 146 lib tests pass
- `cargo clippy -p ccpa-arena --lib --tests --bins -- -D warnings`: clean
- `cargo fmt --all -- --check`: clean
- `bash scripts/check-doc-drift.sh`: 17/17 drift classes
Cross-references
- `evidence/phase-6/v1004-sub-bench-b-pattern-shift-2026-05-21.md`
- `evidence/phase-6/v1004-agent-text-loop-detector-2026-05-21.md`
🤖 Generated with Claude Code