Skip to content

fix(#221): return actual conversation in ScenarioResult.messages instead of judge context#553

Merged
drewdrewthis merged 3 commits into
mainfrom
issue221/fix-scenario-result-messages
Jun 11, 2026
Merged

fix(#221): return actual conversation in ScenarioResult.messages instead of judge context#553
drewdrewthis merged 3 commits into
mainfrom
issue221/fix-scenario-result-messages

Conversation

@drewdrewthis

Copy link
Copy Markdown
Collaborator

Problem

Regression introduced in 0.7.15: ScenarioResult.messages returns the judge's internal LLM context (system prompt + transcript-as-text) instead of the actual conversation between the user simulator and the agent under test.

This breaks any downstream code that inspects result.messages for tool calls, assistant responses, or conversation logging. Closes #221.

Root Cause

_parse_response, _run_discovery_loop, and _force_verdict all passed the judge's local messages list (system prompt + synthesised transcript text) to ScenarioResult. The actual conversation was in AgentInput.messages, accessible only in the top-level call() method.

Fix

Thread input_messages (the real conversation) from call() down through:

  • _run_discovery_loop(input_messages=...)
  • _force_verdict(input_messages=...)
  • _parse_response(input_messages=...) ← used here in ScenarioResult

The judge's internal messages list is unchanged for LLM calls; only ScenarioResult.messages changes.

Test

test_judge_result_messages_is_conversation_not_judge_context in tests/test_judge_agent.py verifies a 3-message conversation (user/assistant/user) appears in result.messages — not the judge's 2-message internal context.

All 37 existing judge tests continue to pass (updated 2 test call-sites for _force_verdict and _parse_response that called these methods directly with the new required keyword arg).

Checklist

  • Regression test added
  • All existing judge tests pass
  • ScenarioResult.messages is now the actual conversation in all code paths (standard, large-trace discovery, force-verdict fallback)

🤖 Generated with Claude Code

…e context

Before this fix (regression introduced in 0.7.15), ScenarioResult.messages
contained the judge's internal LLM context — the system prompt and the
transcript-as-text — instead of the actual conversation messages
(input.messages).

The root cause: _parse_response, _run_discovery_loop, and _force_verdict all
passed the judge's local `messages` list (system prompt + transcript text) to
ScenarioResult. The actual conversation was in AgentInput.messages, which was
only accessible in the top-level call() method.

Fix: thread input_messages through call() → _run_discovery_loop →
_force_verdict → _parse_response and use it in every ScenarioResult
constructor. The judge's internal messages list is preserved for LLM calls;
only the ScenarioResult changes.

Regression test: test_judge_result_messages_is_conversation_not_judge_context
verifies that a 3-message real conversation (user/assistant/user) appears in
result.messages, not the judge's internal 2-message context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@drewdrewthis drewdrewthis added grinding Grinder is actively managing this PR low-risk-change PR qualifies as low-risk per policy and can be merged without manual review labels May 25, 2026
@github-actions github-actions Bot removed the low-risk-change PR qualifies as low-risk per policy and can be merged without manual review label May 25, 2026
…ages params

Pyright flags List[dict] as incompatible with List[ChatCompletionMessageParam]
(invariant parameter mismatch). Switch all three input_messages parameters
(_run_discovery_loop, _force_verdict, _parse_response) to Sequence[Any]
which is covariant and accepts any sequence of messages.

Also cast real_conversation to Any in the test to satisfy Pyright — the
test uses plain dict literals but AgentInput.messages expects
List[ChatCompletionMessageParam].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@drewdrewthis drewdrewthis changed the title fix(#221): ScenarioResult.messages returns conversation, not judge's internal context fix(#221): return actual conversation in ScenarioResult.messages instead of judge context May 25, 2026
@drewdrewthis

Copy link
Copy Markdown
Collaborator Author

[grinder] READY for human review

CI: green (zero failing, zero pending)
ACs: met — ScenarioResult.messages now contains the actual conversation (input.messages) instead of the judge's internal LLM context (system prompt + transcript text); fix applied across all code paths: standard call, large-trace discovery loop, and force-verdict fallback; regression test verifies 3-message conversation appears in result
Threads: zero unresolved
Note: LLM evaluator declined auto-approve (modifies runtime behavior in judge_agent.py) — human review required

Verified by:
`command gh pr checks 553` → all checks pass or skipping; test (3.12) pass 5m37s, Validate PR Title pass, evaluate pass
37 judge tests pass in worktree (.worktrees/issue221)

@drewdrewthis drewdrewthis added pr-ready and removed grinding Grinder is actively managing this PR labels May 25, 2026
…-result-messages

# Conflicts:
#	python/tests/test_judge_agent.py
@github-actions github-actions Bot added the low-risk-change PR qualifies as low-risk per policy and can be merged without manual review label Jun 10, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure.

  • Scope: Thread input_messages from call() through _run_discovery_loop, _force_verdict, and _parse_response; set ScenarioResult.messages to the real conversation (input_messages) instead of the judge's internal LLM messages; update tests and call sites accordingly.
  • Exclusions confirmed: no changes to auth, security settings, database schema, business-critical logic, or external integrations.
  • Classification: low-risk-change under the documented policy.

The change threads the actual conversation (input_messages) through JudgeAgent internals and sets ScenarioResult.messages to that conversation instead of the judge's internal LLM context; it also updates tests and a few internal call signatures. The diff is limited to message handling and test updates and does not touch authentication, secrets, database schemas/migrations, business‑critical logic, or external integrations.

An approving review has been submitted by automation. The PR may merge once required CI checks pass.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved by automation: PR qualifies as low-risk-change under the documented policy.

@drewdrewthis drewdrewthis added the slack-requested Slack PR review request posted label Jun 11, 2026
@drewdrewthis drewdrewthis merged commit b32125f into main Jun 11, 2026
21 checks passed
@drewdrewthis drewdrewthis deleted the issue221/fix-scenario-result-messages branch June 11, 2026 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

low-risk-change PR qualifies as low-risk per policy and can be merged without manual review pr-ready slack-requested Slack PR review request posted

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] ScenarioResult.messages returns judge's internal messages instead of full conversation in 0.7.15

1 participant