
[evals] Port agent trace scoring to labeled spans #5724

Open
dlwh wants to merge 4 commits into main from codex/agent-trace-labeled-eval

Conversation

@dlwh
Member

@dlwh dlwh commented May 14, 2026

Port the agent trace scoring path onto the labeled-token evaluation framework. Trace chat preprocessing now emits exclusive loss labels and Marin gets a trace_labeled_eval runner that reuses LabeledEvaluator instead of maintaining a parallel mask evaluator.

@dlwh dlwh added the agent-generated Created by automation/agent label May 14, 2026
@dlwh
Member Author

dlwh commented May 14, 2026

🤖 Specification

Problem
PR 5248 explored trace-masked agent scoring, but it introduced trace-specific overlapping masks and a parallel evaluator path. With the new labeled eval framework, agent traces can use exclusive token labels and named aggregate metrics instead of carrying a separate mask evaluator. The main surfaces are lib/levanter/src/levanter/data/text/trace_chat.py, lib/levanter/src/levanter/data/text/datasets.py, and lib/marin/src/marin/evaluation/trace_labeled_eval.py.

Approach
Add tokenizer message-span recovery so trace preprocessing can label rendered message regions. TraceChatProcessor emits one loss_labels array using exclusive categories for assistant text, assistant tool calls, observations, patches, outcomes, and final assistant responses, then TraceChatEvaluationFormat exposes a LossLabelSpec with aggregate names matching the old metric shape. Marin trace_labeled_eval builds trace caches, converts them to LabeledLmExample, and scores them with LabeledEvaluator.
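To make the exclusive-category idea concrete, here is a minimal sketch of how one message might be mapped to a single loss label. The label ids, the `loss_tags` key, and the helper name are all hypothetical; the real `TraceChatProcessor` internals may differ, but the priority order matches the description above (explicit tags first, then role-based labels).

```python
# Hypothetical exclusive label ids; 0 means "no loss" at that position.
LABELS = {
    "assistant_text": 1,
    "assistant_tool_call": 2,
    "observation": 3,
    "patch": 4,
    "outcome": 5,
    "final_response": 6,
}

def label_for_message(message: dict) -> int:
    """Return exactly one label id for a rendered message (sketch only)."""
    # Explicit message loss_tags (e.g. derived patch/outcome targets) win.
    for tag in message.get("loss_tags") or []:
        if tag in LABELS:
            return LABELS[tag]
    # Otherwise fall back to role-based labels for normal trace messages.
    role = message.get("role")
    if role == "assistant":
        return LABELS["assistant_tool_call"] if message.get("tool_calls") else LABELS["assistant_text"]
    if role == "tool":
        return LABELS["observation"]
    return 0
```

Because each token gets exactly one label, aggregate metrics can be computed by simple grouping, with no overlapping masks to reconcile.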

Key code
TraceChatDataset rolls token labels to next-token loss positions and clears labels across packed segment boundaries. TraceChatProcessor gives explicit message loss_tags priority for derived patch/outcome targets, while role-based labels cover normal assistant/tool trace messages. This preserves the useful agent trace metrics without reintroducing overlapping masks.
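The roll-and-clear behavior can be sketched as follows. This is an illustrative NumPy version, not the actual `TraceChatDataset` code: position i scores the prediction of token i+1, so labels shift left by one, and any position whose target token sits in a different packed segment gets its label cleared.

```python
import numpy as np

def roll_labels_for_next_token(labels: np.ndarray, segment_ids: np.ndarray) -> np.ndarray:
    """Shift token labels to next-token loss positions (sketch).

    labels[i] becomes the label of token i+1, and positions where the
    next token belongs to a different packed segment are zeroed so loss
    never crosses a segment boundary.
    """
    rolled = np.roll(labels, -1)
    rolled[-1] = 0  # no next token at the end of the sequence
    next_seg = np.roll(segment_ids, -1)
    next_seg[-1] = -1  # sentinel: the last position never has a target
    rolled[segment_ids != next_seg] = 0
    return rolled
```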

Tests
lib/levanter/tests/test_text_chat.py covers tokenizer message spans and exclusive trace labels. tests/evals/test_trace_labeled_eval.py covers executor-step tracker wiring, limited row reads, and row adaptation for patch/outcome labels. The branch passed ./infra/pre-commit.py --all-files --fix, the Levanter chat/eval tests, and the Marin trace_labeled_eval tests.

Base automatically changed from codex/labeled-eval-spans to main May 14, 2026 23:58
@dlwh dlwh marked this pull request as ready for review May 15, 2026 19:21

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0452f839b3


Comment on lines +299 to +300
role = message.get("role")
if role == "assistant":

P2: Honor include_role_tags before assigning role labels

When TraceChatEvaluationFormat(include_role_tags=False) is used, untagged assistant/tool messages are still labeled here because _loss_labels never checks self.include_role_tags after failing to find an explicit message label. That makes datasets intended to score only explicit loss_tags still include assistant/tool/observation tokens in the labeled metrics, so those losses are wrong for that configuration.
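A minimal sketch of the suggested fix, under an assumed label scheme (the ids, tag names, and function shape are hypothetical, not the real `_loss_labels`): explicit `loss_tags` always apply, but role-derived labels are gated on `include_role_tags`.

```python
TAG_LABELS = {"patch": 4, "outcome": 5}      # hypothetical explicit-tag ids
ROLE_LABELS = {"assistant": 1, "tool": 3}    # hypothetical role-based ids

def resolve_label(message: dict, include_role_tags: bool) -> int:
    """Sketch: gate role-based fallback labels on include_role_tags."""
    for tag in message.get("loss_tags") or []:
        if tag in TAG_LABELS:
            return TAG_LABELS[tag]
    if not include_role_tags:
        return 0  # only explicitly tagged messages contribute loss
    return ROLE_LABELS.get(message.get("role"), 0)
```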


Comment on lines +381 to +385
messages.append(
{
"role": "assistant",
"content": label,
"loss_tags": [row_adapter.outcome_loss_tag],

P2: Insert the outcome prompt before scoring the label

TraceRowAdapterConfig exposes outcome_prompt and even defaults it to an instruction for judging CORRECT/INCORRECT, but the adapted row appends only the assistant label message. As a result, outcome evaluations measure the model's next-token probability of CORRECT/INCORRECT immediately after the trace/patch instead of after the configured judge prompt, and custom outcome_prompt values have no effect.
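The suggested fix could look roughly like this sketch: emit the judge prompt as a user message before the scored assistant label, so the model conditions on `outcome_prompt`. The helper name and message shape here are hypothetical, not the real row adapter.

```python
def adapt_outcome_row(trace_messages: list, outcome_prompt: str, label: str,
                      outcome_loss_tag: str = "outcome") -> list:
    """Sketch: append the judge prompt, then the scored outcome label."""
    messages = list(trace_messages)
    # Condition the model on the configured judge instruction first.
    messages.append({"role": "user", "content": outcome_prompt})
    # Only the label message carries a loss tag, so only it is scored.
    messages.append({
        "role": "assistant",
        "content": label,
        "loss_tags": [outcome_loss_tag],
    })
    return messages
```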


@dlwh dlwh force-pushed the codex/agent-trace-labeled-eval branch from 0452f83 to bda192c Compare May 15, 2026 19:26