fix(#221): return actual conversation in ScenarioResult.messages instead of judge context by drewdrewthis · Pull Request #553 · langwatch/scenario

drewdrewthis · 2026-05-25T07:41:39Z

Problem

Regression introduced in 0.7.15: ScenarioResult.messages returns the judge's internal LLM context (system prompt + transcript-as-text) instead of the actual conversation between the user simulator and the agent under test.

This breaks any downstream code that inspects result.messages for tool calls, assistant responses, or conversation logging. Closes #221.

Root Cause

_parse_response, _run_discovery_loop, and _force_verdict all passed the judge's local messages list (system prompt + synthesised transcript text) to ScenarioResult. The actual conversation was in AgentInput.messages, accessible only in the top-level call() method.

Fix

Thread input_messages (the real conversation) from call() down through:

_run_discovery_loop(input_messages=...)
_force_verdict(input_messages=...)
_parse_response(input_messages=...) ← used here in ScenarioResult

The judge's internal messages list is unchanged for LLM calls; only ScenarioResult.messages changes.

Test

test_judge_result_messages_is_conversation_not_judge_context in tests/test_judge_agent.py verifies a 3-message conversation (user/assistant/user) appears in result.messages — not the judge's 2-message internal context.

All 37 existing judge tests continue to pass (updated 2 test call-sites for _force_verdict and _parse_response that called these methods directly with the new required keyword arg).

Checklist

Regression test added
All existing judge tests pass
ScenarioResult.messages is now the actual conversation in all code paths (standard, large-trace discovery, force-verdict fallback)

🤖 Generated with Claude Code

…e context Before this fix (regression introduced in 0.7.15), ScenarioResult.messages contained the judge's internal LLM context — the system prompt and the transcript-as-text — instead of the actual conversation messages (input.messages). The root cause: _parse_response, _run_discovery_loop, and _force_verdict all passed the judge's local `messages` list (system prompt + transcript text) to ScenarioResult. The actual conversation was in AgentInput.messages, which was only accessible in the top-level call() method. Fix: thread input_messages through call() → _run_discovery_loop → _force_verdict → _parse_response and use it in every ScenarioResult constructor. The judge's internal messages list is preserved for LLM calls; only the ScenarioResult changes. Regression test: test_judge_result_messages_is_conversation_not_judge_context verifies that a 3-message real conversation (user/assistant/user) appears in result.messages, not the judge's internal 2-message context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ages params Pyright flags List[dict] as incompatible with List[ChatCompletionMessageParam] (invariant parameter mismatch). Switch all three input_messages parameters (_run_discovery_loop, _force_verdict, _parse_response) to Sequence[Any] which is covariant and accepts any sequence of messages. Also cast real_conversation to Any in the test to satisfy Pyright — the test uses plain dict literals but AgentInput.messages expects List[ChatCompletionMessageParam]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

drewdrewthis · 2026-05-25T07:58:22Z

[grinder] READY for human review

CI: green (zero failing, zero pending)
ACs: met — ScenarioResult.messages now contains the actual conversation (input.messages) instead of the judge's internal LLM context (system prompt + transcript text); fix applied across all code paths: standard call, large-trace discovery loop, and force-verdict fallback; regression test verifies 3-message conversation appears in result
Threads: zero unresolved
Note: LLM evaluator declined auto-approve (modifies runtime behavior in judge_agent.py) — human review required

Verified by:
`command gh pr checks 553` → all checks pass or skipping; test (3.12) pass 5m37s, Validate PR Title pass, evaluate pass
37 judge tests pass in worktree (.worktrees/issue221)

…-result-messages # Conflicts: # python/tests/test_judge_agent.py

github-actions · 2026-06-10T22:47:31Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure.

Scope: Thread input_messages from call() through _run_discovery_loop, _force_verdict, and _parse_response; set ScenarioResult.messages to the real conversation (input_messages) instead of the judge's internal LLM messages; update tests and call sites accordingly.
Exclusions confirmed: no changes to auth, security settings, database schema, business-critical logic, or external integrations.
Classification: low-risk-change under the documented policy.

The change threads the actual conversation (input_messages) through JudgeAgent internals and sets ScenarioResult.messages to that conversation instead of the judge's internal LLM context; it also updates tests and a few internal call signatures. The diff is limited to message handling and test updates and does not touch authentication, secrets, database schemas/migrations, business‑critical logic, or external integrations.

An approving review has been submitted by automation. The PR may merge once required CI checks pass.

github-actions

Approved by automation: PR qualifies as low-risk-change under the documented policy.

drewdrewthis added grinding Grinder is actively managing this PR low-risk-change PR qualifies as low-risk per policy and can be merged without manual review labels May 25, 2026

github-actions Bot removed the low-risk-change PR qualifies as low-risk per policy and can be merged without manual review label May 25, 2026

drewdrewthis changed the title ~~fix(#221): ScenarioResult.messages returns conversation, not judge's internal context~~ fix(#221): return actual conversation in ScenarioResult.messages instead of judge context May 25, 2026

drewdrewthis added pr-ready and removed grinding Grinder is actively managing this PR labels May 25, 2026

Merge remote-tracking branch 'origin/main' into issue221/fix-scenario…

42dfaaf

…-result-messages # Conflicts: # python/tests/test_judge_agent.py

github-actions Bot added the low-risk-change PR qualifies as low-risk per policy and can be merged without manual review label Jun 10, 2026

github-actions Bot approved these changes Jun 10, 2026

View reviewed changes

drewdrewthis requested a review from rogeriochaves June 11, 2026 09:26

drewdrewthis added the slack-requested Slack PR review request posted label Jun 11, 2026

drewdrewthis merged commit b32125f into main Jun 11, 2026
21 checks passed

drewdrewthis deleted the issue221/fix-scenario-result-messages branch June 11, 2026 09:29

rogeriochaves mentioned this pull request Jun 11, 2026

chore(main): release python 0.7.31 #656

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(#221): return actual conversation in ScenarioResult.messages instead of judge context#553

fix(#221): return actual conversation in ScenarioResult.messages instead of judge context#553
drewdrewthis merged 3 commits into
mainfrom
issue221/fix-scenario-result-messages

drewdrewthis commented May 25, 2026

Uh oh!

drewdrewthis commented May 25, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

drewdrewthis commented May 25, 2026

Problem

Root Cause

Fix

Test

Checklist

Uh oh!

drewdrewthis commented May 25, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant