
[evals] Add capability raw validation slices #4995

Closed
dlwh wants to merge 9 commits into main from agent/20260420-fix-4963

Conversation

@dlwh
Member

@dlwh dlwh commented Apr 21, 2026

Normalize curated chat, agent-trace, and reasoning sources into reusable OpenAI-chat artifacts, then expose raw-text projections for perplexity-gap runs. Adds trace-masked evaluation for assistant/tool/observation/final-patch slices plus outcome-label probes over labeled SWE/OpenHands/SWE-agent/CoderForge trajectory datasets. This is stacked on #4962 so the gap finder remains a separate PR.

Part of #4963

dlwh added 3 commits April 19, 2026 15:08
This adds a raw-document comparison path that can score models with different tokenizers on the same corpora and attribute the gap back to datasets and byte-level patterns. The Marin wrapper wires in default raw eval sets and a concrete Marin-vs-Llama experiment so the report can run on Iris.
Top literal rows now carry representative token-boundary displays for model A and model B, including tokens that cross the literal edge. The scoring path threads the aligned token spans through the report builder so the display reflects real document context instead of isolated retokenization.
Make perplexity-gap steps rerun when model or tokenizer inputs change, and fail fast on incompatible checkpoint/tokenizer pairs. Also upload the HTML/JSON report bundle to W&B so completed runs are easier to inspect.
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 21, 2026
@claude
Contributor

claude Bot commented Apr 21, 2026

Claude finished @dlwh's task in 3m 1s


Code review

  • Check if PR needs review
  • Identify relevant CLAUDE.md/AGENTS.md files
  • Summarize PR changes
  • Review changes (4 parallel agents)
  • Validate flagged issues
  • Post review findings

Reviewed for bugs and CLAUDE.md/AGENTS.md compliance. One high-signal issue found and posted as an inline comment:

  • `experiments/evals/raw_capability_eval_sets.py:527`: `metadata.get("model_patch", "")` returns `None` (not `""`) when the key is present with value `None` (common for unresolved SWE-rebench trajectories). `str(None)` then produces the literal string `"None"`, causing a spurious `Final Patch:\nNone` block in the perplexity-scored raw text.

Other flagged candidates (local imports in experiments/defaults.py, tautological tests, a swallowed JSONDecodeError, datetime.now() in a row builder) were evaluated but did not clear the high-signal bar: the local-imports group is legitimized by a real circular dependency via exp1600_uncheatable_evals, the except json.JSONDecodeError clause is intentional control flow (string fallback), and the tests carry enough behavioral assertions on tags and formats.

Branch: agent/20260420-fix-4963

Comment on lines +525 to +529
```python
metadata = chat_row.get("metadata", {})
if isinstance(metadata, Mapping):
    model_patch = _normalize_text(str(metadata.get("model_patch", "")))
    if model_patch:
        sections.append("Final Patch:\n" + model_patch)
```
Contributor


Bug: `model_patch=None` renders as the literal string `"None"` in the raw-text target.

At line 411, `"model_patch": row.get("model_patch")` stores `None` in metadata whenever the source row has a null patch (common for unresolved SWE-rebench trajectories). Here, `metadata.get("model_patch", "")` returns that `None` (the `""` default is not used because the key exists), then `str(None)` yields `"None"`, which is truthy after `_normalize_text`, so a spurious `Final Patch:\nNone` block is appended to the perplexity-scored text for every unresolved trajectory.

Suggested fix:

```diff
  metadata = chat_row.get("metadata", {})
  if isinstance(metadata, Mapping):
-     model_patch = _normalize_text(str(metadata.get("model_patch", "")))
+     raw_patch = metadata.get("model_patch")
+     model_patch = _normalize_text(raw_patch) if isinstance(raw_patch, str) else ""
      if model_patch:
          sections.append("Final Patch:\n" + model_patch)
```
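The flagged behavior and the fix can be demonstrated in isolation. This is a minimal sketch, assuming `_normalize_text` does roughly whitespace normalization (its real definition is in the PR, not reproduced here):

```python
# Stand-in for the module's _normalize_text; assumed to strip surrounding whitespace.
def _normalize_text(text: str) -> str:
    return text.strip()

# Unresolved trajectory: the "model_patch" key exists but its value is None.
metadata = {"model_patch": None}

# Buggy path: .get's "" default is ignored because the key is present,
# so str(None) produces the truthy literal string "None".
buggy = _normalize_text(str(metadata.get("model_patch", "")))
assert buggy == "None"  # would append a spurious "Final Patch:\nNone" section

# Fixed path: only accept string-valued patches.
raw_patch = metadata.get("model_patch")
fixed = _normalize_text(raw_patch) if isinstance(raw_patch, str) else ""
assert fixed == ""  # falsy, so no "Final Patch" section is appended
```

The key point is that `dict.get`'s default only applies when the key is absent, not when it maps to `None`.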

@dlwh force-pushed the agent/20260420-fix-4963 branch from b42c66a to 2693852 on April 21, 2026 08:46
Base automatically changed from codex/perplexity-gap-finder to main April 21, 2026 21:08
dlwh added 5 commits April 22, 2026 11:05
Trace-masked evals can lose all completed metrics when a late dataset or Hugging Face request fails. Checkpoint completed datasets, retry dataset evaluation, and resume from existing results so agent-trace probes can survive preemption and transient failures.
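The checkpoint-and-resume pattern described in that commit can be sketched as follows. This is an illustrative sketch only: `evaluate_fn`, the `results.json` filename, and the JSON checkpoint format are assumptions, not the actual implementation in the PR.

```python
import json
import os


def evaluate_datasets(datasets, evaluate_fn, checkpoint_path="results.json"):
    """Evaluate each dataset, checkpointing after each one so a rerun resumes."""
    results = {}
    # Resume from prior partial results if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            results = json.load(f)
    for name, data in datasets.items():
        if name in results:
            continue  # completed before a preemption or transient failure
        results[name] = evaluate_fn(name, data)
        # Persist immediately so a later failure does not lose this result.
        with open(checkpoint_path, "w") as f:
            json.dump(results, f)
    return results
```

A retry wrapper around `evaluate_fn` (for transient Hugging Face errors) composes naturally with this loop, since a failed dataset simply stays absent from the checkpoint and is retried on the next run.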
@dlwh
Member Author

dlwh commented Apr 23, 2026

Superseded.

@dlwh dlwh closed this Apr 23, 2026

Labels

agent-generated Created by automation/agent
