
[evals] Add capability raw validation slices #4995

Closed
dlwh wants to merge 9 commits into main from agent/20260420-fix-4963

Conversation

@dlwh
Member

@dlwh dlwh commented Apr 21, 2026

Normalize curated chat, agent-trace, and reasoning sources into reusable OpenAI-chat artifacts, then expose raw-text projections for perplexity-gap runs. Adds trace-masked evaluation for assistant/tool/observation/final-patch slices plus outcome-label probes over labeled SWE/OpenHands/SWE-agent/CoderForge trajectory datasets. This is stacked on #4962 so the gap finder remains a separate PR.

Part of #4963

dlwh added 3 commits April 19, 2026 15:08
This adds a raw-document comparison path that can score models with different tokenizers on the same corpora and attribute the gap back to datasets and byte-level patterns. The Marin wrapper wires in default raw eval sets and a concrete Marin-vs-Llama experiment so the report can run on Iris.
Top literal rows now carry representative token-boundary displays for model A and model B, including tokens that cross the literal edge. The scoring path threads the aligned token spans through the report builder so the display reflects real document context instead of isolated retokenization.
Make perplexity-gap steps rerun when model or tokenizer inputs change, and fail fast on incompatible checkpoint/tokenizer pairs. Also upload the HTML/JSON report bundle to W&B so completed runs are easier to inspect.
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 21, 2026
@claude
Contributor

claude Bot commented Apr 21, 2026

Claude finished @dlwh's task in 3m 1s


Code review

  • Check if PR needs review
  • Identify relevant CLAUDE.md/AGENTS.md files
  • Summarize PR changes
  • Review changes (4 parallel agents)
  • Validate flagged issues
  • Post review findings

Reviewed for bugs and CLAUDE.md/AGENTS.md compliance. One high-signal issue found and posted as an inline comment:

  • `experiments/evals/raw_capability_eval_sets.py:527`: `metadata.get("model_patch", "")` returns `None` (not `""`) when the key is present with value `None` (common for unresolved SWE-rebench trajectories). `str(None)` then produces the literal string `"None"`, causing a spurious `Final Patch:\nNone` block in the perplexity-scored raw text.

Other flagged candidates (local imports in experiments/defaults.py, tautological tests, a swallowed JSONDecodeError, datetime.now() in a row builder) were evaluated but did not clear the high-signal bar: the local-imports group is legitimized by a real circular dependency via exp1600_uncheatable_evals, the except json.JSONDecodeError clause is intentional control flow (string fallback), and the tests carry enough behavioral assertions on tags and formats.

Branch: agent/20260420-fix-4963

Comment on lines +525 to +529
```python
metadata = chat_row.get("metadata", {})
if isinstance(metadata, Mapping):
    model_patch = _normalize_text(str(metadata.get("model_patch", "")))
    if model_patch:
        sections.append("Final Patch:\n" + model_patch)
```
Contributor


Bug: `model_patch=None` renders as the literal string `"None"` in the raw-text target.

At line 411, `"model_patch": row.get("model_patch")` stores `None` in metadata whenever the source row has a null patch (common for unresolved SWE-rebench trajectories). Here, `metadata.get("model_patch", "")` returns that `None` (the `""` default is not used because the key exists), then `str(None)` yields `"None"`, which is truthy after `_normalize_text`, so a spurious `Final Patch:\nNone` block is appended to the perplexity-scored text for every unresolved trajectory.

Suggested fix:

```diff
  metadata = chat_row.get("metadata", {})
  if isinstance(metadata, Mapping):
-     model_patch = _normalize_text(str(metadata.get("model_patch", "")))
+     raw_patch = metadata.get("model_patch")
+     model_patch = _normalize_text(raw_patch) if isinstance(raw_patch, str) else ""
      if model_patch:
          sections.append("Final Patch:\n" + model_patch)
```
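The flagged behavior and the fix can be demonstrated in isolation. This is a minimal sketch, assuming `_normalize_text` does roughly whitespace normalization (its real definition is in the PR, not reproduced here):

```python
# Stand-in for the module's _normalize_text; assumed to strip surrounding whitespace.
def _normalize_text(text: str) -> str:
    return text.strip()

# Unresolved trajectory: the "model_patch" key exists but its value is None.
metadata = {"model_patch": None}

# Buggy path: .get's "" default is ignored because the key is present,
# so str(None) produces the truthy literal string "None".
buggy = _normalize_text(str(metadata.get("model_patch", "")))
assert buggy == "None"  # would append a spurious "Final Patch:\nNone" section

# Fixed path: only accept string-valued patches.
raw_patch = metadata.get("model_patch")
fixed = _normalize_text(raw_patch) if isinstance(raw_patch, str) else ""
assert fixed == ""  # falsy, so no "Final Patch" section is appended
```

The key point is that `dict.get`'s default only applies when the key is absent, not when it maps to `None`.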

@dlwh force-pushed the agent/20260420-fix-4963 branch from b42c66a to 2693852 on April 21, 2026 08:46
Base automatically changed from codex/perplexity-gap-finder to main April 21, 2026 21:08
dlwh added 5 commits April 22, 2026 11:05
Trace-masked evals can lose all completed metrics when a late dataset or Hugging Face request fails. Checkpoint completed datasets, retry dataset evaluation, and resume from existing results so agent-trace probes can survive preemption and transient failures.
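The checkpoint-and-resume pattern described in that commit can be sketched as follows. This is an illustrative sketch only: `evaluate_fn`, the `results.json` filename, and the JSON checkpoint format are assumptions, not the actual implementation in the PR.

```python
import json
import os


def evaluate_datasets(datasets, evaluate_fn, checkpoint_path="results.json"):
    """Evaluate each dataset, checkpointing after each one so a rerun resumes."""
    results = {}
    # Resume from prior partial results if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)
    for name, data in datasets.items():
        if name in results:
            continue  # completed before a preemption or transient failure
        results[name] = evaluate_fn(name, data)
        # Persist immediately so a later failure does not lose this result.
        with open(checkpoint_path, "w") as f:
            json.dump(results, f)
    return results
```

A retry wrapper around `evaluate_fn` (for transient Hugging Face errors) composes naturally with this loop, since a failed dataset simply stays absent from the checkpoint and is retried on the next run.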
@dlwh
Member Author

dlwh commented Apr 23, 2026

Superseded.

@dlwh dlwh closed this Apr 23, 2026

Labels

agent-generated Created by automation/agent
