20 changes: 20 additions & 0 deletions scripts/phase-6-bench.sh
@@ -28,6 +28,10 @@
# - PHASE6_MAX_TURNS caps multi-turn dialog (defaults 20)
# - PHASE6_WALL_SECONDS caps session wall (defaults 3600s)
# - PHASE6_MAX_CONSECUTIVE_COMPLIANCE_FAILURES Compliance-Trap cap (defaults 3)
# - PHASE6_MAX_CONSECUTIVE_TEXT_TURNS Agent-Text-Loop cap (defaults 0 = disabled).
# Opt-in: when >0, session terminates with ArenaOutcome::AgentTextLoop after
# N consecutive text-only turns (no tool_call). See M292 evidence doc.
# Leave at 0 for historical evidence comparisons (M270/M280/M287/M291).
# - PHASE6_COMPLIANCE_ENFORCED (M276): 1 (default, treatment, writes to
# evidence/under-contract/) OR 0 (control baseline, --compliance-enforced
# DISABLED, writes to evidence/under-contract-control/). The treatment
@@ -72,6 +76,12 @@ PHASE6_MAX_TURNS="${PHASE6_MAX_TURNS:-20}"
PHASE6_WALL_SECONDS="${PHASE6_WALL_SECONDS:-3600}"
PHASE6_ORACLE_INTERVAL="${PHASE6_ORACLE_INTERVAL:-3}"
PHASE6_MAX_CONSECUTIVE_COMPLIANCE_FAILURES="${PHASE6_MAX_CONSECUTIVE_COMPLIANCE_FAILURES:-3}"
# M292: Agent-Text-Loop detector cap (0 = disabled, default). When >0,
# session terminates with ArenaOutcome::AgentTextLoop after N consecutive
# text-only turns (no tool_call). See evidence/phase-6/v1004-agent-text-loop-detector-2026-05-21.md.
# Operator-opt-in: changes outcome distributions, so existing bench evidence
# comparisons (M270, M280, M287, M291) must stay at 0 for apples-to-apples.
PHASE6_MAX_CONSECUTIVE_TEXT_TURNS="${PHASE6_MAX_CONSECUTIVE_TEXT_TURNS:-0}"

CLAUDE_BIN="$(command -v claude 2>/dev/null || command -v claude-code 2>/dev/null || true)"
APR_BIN="$(command -v apr 2>/dev/null || command -v apr-cli 2>/dev/null || true)"
@@ -280,6 +290,15 @@ for fixture_dir in $(ls -d "${CORPUS_DIR}"/*/[0-9]*/ 2>/dev/null | sort); do
"--max-consecutive-compliance-failures=${PHASE6_MAX_CONSECUTIVE_COMPLIANCE_FAILURES}"
)
fi
# M292: opt-in Agent-Text-Loop detector. Threaded through only when
# PHASE6_MAX_CONSECUTIVE_TEXT_TURNS > 0 to preserve historical
# outcome distributions for any run that leaves the env var unset.
text_loop_args=()
if [[ "${PHASE6_MAX_CONSECUTIVE_TEXT_TURNS}" -gt 0 ]]; then
text_loop_args+=(
"--max-consecutive-text-turns=${PHASE6_MAX_CONSECUTIVE_TEXT_TURNS}"
)
fi
"${BENCH_BIN}" \
--cwd="${side_cwd}" \
--prompt-file="${prompt_file}" \
@@ -290,6 +309,7 @@ for fixture_dir in $(ls -d "${CORPUS_DIR}"/*/[0-9]*/ 2>/dev/null | sort); do
--oracle-check-interval="${PHASE6_ORACLE_INTERVAL}" \
--driver-per-turn-timeout="${APR_TIMEOUT_S}" \
"${compliance_args[@]}" \
"${text_loop_args[@]}" \
"${driver_args[@]}" > "${bench_out}" 2> "${fixture_evidence}/${side}.bench.stderr" || bench_status=$?
if [[ -s "${bench_out}" ]]; then
outcome_kind=$(jq -r '.outcome.kind // "unknown"' "${bench_out}" 2>/dev/null || echo "parse_error")
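The opt-in threading in this change can be sketched in isolation. The snippet below is a minimal illustration, not code from the PR: `build_text_loop_args` is a hypothetical helper name, and the explicit integer check is an added assumption (the script itself compares the value with `-gt` directly). Only the `PHASE6_MAX_CONSECUTIVE_TEXT_TURNS` variable and the `--max-consecutive-text-turns` flag come from the diff.

```shell
# Hypothetical helper (not in the PR): prints the extra flag only when the
# cap is a positive integer, mirroring the "0 = disabled" default above.
build_text_loop_args() {
  _cap="${1:-0}"
  # Assumption: reject non-numeric input up front instead of relying on
  # the arithmetic comparison to fail.
  case "${_cap}" in
    *[!0-9]*)
      echo "invalid text-turn cap: ${_cap}" >&2
      return 1
      ;;
  esac
  if [ "${_cap}" -gt 0 ]; then
    printf '%s\n' "--max-consecutive-text-turns=${_cap}"
  fi
}

# With the env var unset, nothing is emitted, so the bench invocation is
# unchanged and historical outcome distributions stay comparable.
extra="$(build_text_loop_args "${PHASE6_MAX_CONSECUTIVE_TEXT_TURNS:-0}")"
echo "extra flag: ${extra:-<none>}"
```

Under these assumptions, a run such as `PHASE6_MAX_CONSECUTIVE_TEXT_TURNS=5 scripts/phase-6-bench.sh` would pass the flag through, while a run that leaves the variable unset would not.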