fix(#221): return actual conversation in ScenarioResult.messages instead of judge context#553
Conversation
…e context Before this fix (regression introduced in 0.7.15), ScenarioResult.messages contained the judge's internal LLM context — the system prompt and the transcript-as-text — instead of the actual conversation messages (input.messages). The root cause: _parse_response, _run_discovery_loop, and _force_verdict all passed the judge's local `messages` list (system prompt + transcript text) to ScenarioResult. The actual conversation was in AgentInput.messages, which was only accessible in the top-level call() method. Fix: thread input_messages through call() → _run_discovery_loop → _force_verdict → _parse_response and use it in every ScenarioResult constructor. The judge's internal messages list is preserved for LLM calls; only the ScenarioResult changes. Regression test: test_judge_result_messages_is_conversation_not_judge_context verifies that a 3-message real conversation (user/assistant/user) appears in result.messages, not the judge's internal 2-message context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ages params Pyright flags List[dict] as incompatible with List[ChatCompletionMessageParam] (invariant parameter mismatch). Switch all three input_messages parameters (_run_discovery_loop, _force_verdict, _parse_response) to Sequence[Any] which is covariant and accepts any sequence of messages. Also cast real_conversation to Any in the test to satisfy Pyright — the test uses plain dict literals but AgentInput.messages expects List[ChatCompletionMessageParam]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
[grinder] READY for human review CI: green (zero failing, zero pending) Verified by: |
…-result-messages # Conflicts: # python/tests/test_judge_agent.py
|
Automated low-risk assessment This PR was evaluated against the repository's Low-Risk Pull Requests procedure.
An approving review has been submitted by automation. The PR may merge once required CI checks pass. |
Problem
Regression introduced in 0.7.15:
ScenarioResult.messagesreturns the judge's internal LLM context (system prompt + transcript-as-text) instead of the actual conversation between the user simulator and the agent under test.This breaks any downstream code that inspects
result.messagesfor tool calls, assistant responses, or conversation logging. Closes #221.Root Cause
_parse_response,_run_discovery_loop, and_force_verdictall passed the judge's localmessageslist (system prompt + synthesised transcript text) toScenarioResult. The actual conversation was inAgentInput.messages, accessible only in the top-levelcall()method.Fix
Thread
input_messages(the real conversation) fromcall()down through:_run_discovery_loop(input_messages=...)_force_verdict(input_messages=...)_parse_response(input_messages=...)← used here inScenarioResultThe judge's internal
messageslist is unchanged for LLM calls; onlyScenarioResult.messageschanges.Test
test_judge_result_messages_is_conversation_not_judge_contextintests/test_judge_agent.pyverifies a 3-message conversation (user/assistant/user) appears inresult.messages— not the judge's 2-message internal context.All 37 existing judge tests continue to pass (updated 2 test call-sites for
_force_verdictand_parse_responsethat called these methods directly with the new required keyword arg).Checklist
ScenarioResult.messagesis now the actual conversation in all code paths (standard, large-trace discovery, force-verdict fallback)🤖 Generated with Claude Code