
[evals] Port agent trace scoring to labeled spans #5724

Open
dlwh wants to merge 4 commits into main from codex/agent-trace-labeled-eval

Conversation

@dlwh
Member

@dlwh dlwh commented May 14, 2026

Port the agent trace scoring path onto the labeled-token evaluation framework. Trace chat preprocessing now emits exclusive loss labels and Marin gets a trace_labeled_eval runner that reuses LabeledEvaluator instead of maintaining a parallel mask evaluator.

@dlwh dlwh added the agent-generated Created by automation/agent label May 14, 2026
@dlwh
Member Author

dlwh commented May 14, 2026

🤖 Specification

Problem
PR 5248 explored trace-masked agent scoring, but it introduced trace-specific overlapping masks and a parallel evaluator path. With the new labeled eval framework, agent traces can use exclusive token labels and named aggregate metrics instead of carrying a separate mask evaluator. The main surfaces are lib/levanter/src/levanter/data/text/trace_chat.py, lib/levanter/src/levanter/data/text/datasets.py, and lib/marin/src/marin/evaluation/trace_labeled_eval.py.

Approach
Add tokenizer message-span recovery so trace preprocessing can label rendered message regions. TraceChatProcessor emits one loss_labels array using exclusive categories for assistant text, assistant tool calls, observations, patches, outcomes, and final assistant responses, then TraceChatEvaluationFormat exposes a LossLabelSpec with aggregate names matching the old metric shape. Marin trace_labeled_eval builds trace caches, converts them to LabeledLmExample, and scores them with LabeledEvaluator.
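To make the exclusive-category idea concrete, here is a minimal sketch of how one message might be mapped to a single loss label. The label ids, the `loss_tags` key, and the helper name are all hypothetical; the real `TraceChatProcessor` internals may differ, but the priority order matches the description above (explicit tags first, then role-based labels).

```python
# Hypothetical exclusive label ids; 0 means "no loss" at that position.
LABELS = {
    "assistant_text": 1,
    "assistant_tool_call": 2,
    "observation": 3,
    "patch": 4,
    "outcome": 5,
    "final_response": 6,
}

def label_for_message(message: dict) -> int:
    """Return exactly one label id for a rendered message (sketch only)."""
    # Explicit message loss_tags (e.g. derived patch/outcome targets) win.
    for tag in message.get("loss_tags") or []:
        if tag in LABELS:
            return LABELS[tag]
    # Otherwise fall back to role-based labels for normal trace messages.
    role = message.get("role")
    if role == "assistant":
        return LABELS["assistant_tool_call"] if message.get("tool_calls") else LABELS["assistant_text"]
    if role == "tool":
        return LABELS["observation"]
    return 0
```

Because each token gets exactly one label, aggregate metrics can be computed by simple grouping, with no overlapping masks to reconcile.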

Key code
TraceChatDataset rolls token labels to next-token loss positions and clears labels across packed segment boundaries. TraceChatProcessor gives explicit message loss_tags priority for derived patch/outcome targets, while role-based labels cover normal assistant/tool trace messages. This preserves the useful agent trace metrics without reintroducing overlapping masks.
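The roll-and-clear behavior can be sketched as follows. This is an illustrative NumPy version, not the actual `TraceChatDataset` code: position i scores the prediction of token i+1, so labels shift left by one, and any position whose target token sits in a different packed segment gets its label cleared.

```python
import numpy as np

def roll_labels_for_next_token(labels: np.ndarray, segment_ids: np.ndarray) -> np.ndarray:
    """Shift token labels to next-token loss positions (sketch).

    labels[i] becomes the label of token i+1, and positions where the
    next token belongs to a different packed segment are zeroed so loss
    never crosses a segment boundary.
    """
    rolled = np.roll(labels, -1)
    rolled[-1] = 0  # no next token at the end of the sequence
    next_seg = np.roll(segment_ids, -1)
    next_seg[-1] = -1  # sentinel: the last position never has a target
    rolled[segment_ids != next_seg] = 0
    return rolled
```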

Tests
lib/levanter/tests/test_text_chat.py covers tokenizer message spans and exclusive trace labels. tests/evals/test_trace_labeled_eval.py covers executor-step tracker wiring, limited row reads, and row adaptation for patch/outcome labels. The branch passed ./infra/pre-commit.py --all-files --fix, the Levanter chat/eval tests, and the Marin trace_labeled_eval tests.

Base automatically changed from codex/labeled-eval-spans to main May 14, 2026 23:58
@dlwh dlwh marked this pull request as ready for review May 15, 2026 19:21

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0452f839b3


Comment on lines +299 to +300
role = message.get("role")
if role == "assistant":

P2: Honor include_role_tags before assigning role labels

When TraceChatEvaluationFormat(include_role_tags=False) is used, untagged assistant/tool messages are still labeled here because _loss_labels never checks self.include_role_tags after failing to find an explicit message label. That makes datasets intended to score only explicit loss_tags still include assistant/tool/observation tokens in the labeled metrics, so those losses are wrong for that configuration.
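A minimal sketch of the suggested fix, under an assumed label scheme (the ids, tag names, and function shape are hypothetical, not the real `_loss_labels`): explicit `loss_tags` always apply, but role-derived labels are gated on `include_role_tags`.

```python
TAG_LABELS = {"patch": 4, "outcome": 5}      # hypothetical explicit-tag ids
ROLE_LABELS = {"assistant": 1, "tool": 3}    # hypothetical role-based ids

def resolve_label(message: dict, include_role_tags: bool) -> int:
    """Sketch: gate role-based fallback labels on include_role_tags."""
    for tag in message.get("loss_tags") or []:
        if tag in TAG_LABELS:
            return TAG_LABELS[tag]
    if not include_role_tags:
        return 0  # only explicitly tagged messages contribute loss
    return ROLE_LABELS.get(message.get("role"), 0)
```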


Comment on lines +381 to +385
messages.append(
{
"role": "assistant",
"content": label,
"loss_tags": [row_adapter.outcome_loss_tag],

P2: Insert the outcome prompt before scoring the label

TraceRowAdapterConfig exposes outcome_prompt and even defaults it to an instruction for judging CORRECT/INCORRECT, but the adapted row appends only the assistant label message. As a result, outcome evaluations measure the model's next-token probability of CORRECT/INCORRECT immediately after the trace/patch instead of after the configured judge prompt, and custom outcome_prompt values have no effect.
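The suggested fix could look roughly like this sketch: emit the judge prompt as a user message before the scored assistant label, so the model conditions on `outcome_prompt`. The helper name and message shape here are hypothetical, not the real row adapter.

```python
def adapt_outcome_row(trace_messages: list, outcome_prompt: str, label: str,
                      outcome_loss_tag: str = "outcome") -> list:
    """Sketch: append the judge prompt, then the scored outcome label."""
    messages = list(trace_messages)
    # Condition the model on the configured judge instruction first.
    messages.append({"role": "user", "content": outcome_prompt})
    # Only the label message carries a loss tag, so only it is scored.
    messages.append({
        "role": "assistant",
        "content": label,
        "loss_tags": [outcome_loss_tag],
    })
    return messages
```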


@dlwh dlwh force-pushed the codex/agent-trace-labeled-eval branch from 0452f83 to bda192c Compare May 15, 2026 19:26