[evals] Port agent trace scoring to labeled spans #5724
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0452f839b3
    role = message.get("role")
    if role == "assistant":
Honor include_role_tags before assigning role labels
When TraceChatEvaluationFormat(include_role_tags=False) is used, untagged assistant/tool messages are still labeled here because _loss_labels never checks self.include_role_tags after failing to find an explicit message label. That makes datasets intended to score only explicit loss_tags still include assistant/tool/observation tokens in the labeled metrics, so those losses are wrong for that configuration.
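One way the check described above could look, as a minimal sketch rather than the PR's actual code. `TraceChatEvaluationFormatSketch` is a hypothetical stand-in for `TraceChatEvaluationFormat`, and using the role name itself as the fallback label is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TraceChatEvaluationFormatSketch:
    """Hypothetical stand-in for TraceChatEvaluationFormat (fields assumed)."""

    include_role_tags: bool = True

    def _loss_labels(self, message: dict[str, Any]) -> list[str]:
        # Explicit per-message labels always take precedence.
        explicit = message.get("loss_tags")
        if explicit:
            return list(explicit)
        # With role tags disabled, untagged messages contribute no labels.
        if not self.include_role_tags:
            return []
        # Assumption: the fallback label is simply the role name.
        role = message.get("role")
        if role in ("assistant", "tool", "observation"):
            return [role]
        return []
```

With `include_role_tags=False`, an untagged assistant message yields no labels, while a message carrying explicit `loss_tags` is still scored.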
    messages.append(
        {
            "role": "assistant",
            "content": label,
            "loss_tags": [row_adapter.outcome_loss_tag],
Insert the outcome prompt before scoring the label
TraceRowAdapterConfig exposes outcome_prompt and even defaults it to an instruction for judging CORRECT/INCORRECT, but the adapted row appends only the assistant label message. As a result, outcome evaluations measure the model's next-token probability of CORRECT/INCORRECT immediately after the trace/patch instead of after the configured judge prompt, and custom outcome_prompt values have no effect.
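A hedged sketch of the suggested ordering: the judge prompt is appended as its own message before the scored label. `RowAdapterSketch` and `append_outcome` are hypothetical names, the default values are placeholders rather than the real defaults, and sending the prompt with the `user` role is an assumption.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class RowAdapterSketch:
    """Hypothetical stand-in for TraceRowAdapterConfig (placeholder defaults)."""

    outcome_prompt: str = "Was the outcome CORRECT or INCORRECT?"  # placeholder text
    outcome_loss_tag: str = "outcome"  # placeholder tag


def append_outcome(
    messages: list[dict[str, Any]], row_adapter: RowAdapterSketch, label: str
) -> list[dict[str, Any]]:
    """Return a copy of messages with the judge prompt and the scored label appended."""
    out = list(messages)
    # Insert the configured prompt first so the label is conditioned on it.
    if row_adapter.outcome_prompt:
        # Assumption: the prompt is delivered as a user turn.
        out.append({"role": "user", "content": row_adapter.outcome_prompt})
    # Only the label message carries the outcome loss tag.
    out.append(
        {
            "role": "assistant",
            "content": label,
            "loss_tags": [row_adapter.outcome_loss_tag],
        }
    )
    return out
```

Under this arrangement a custom `outcome_prompt` changes the context the label is scored in, and an empty prompt falls back to appending only the label.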
0452f83 to bda192c
Port the agent trace scoring path onto the labeled-token evaluation framework. Trace chat preprocessing now emits exclusive loss labels, and Marin gets a trace_labeled_eval runner that reuses LabeledEvaluator instead of maintaining a parallel mask evaluator.