This adds a raw-document comparison path that can score models with different tokenizers on the same corpora and attribute the gap back to datasets and byte-level patterns. The Marin wrapper wires in default raw eval sets and a concrete Marin-vs-Llama experiment so the report can run on Iris.
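The usual way to make models with different tokenizers comparable is to normalize each model's summed negative log-likelihood by the document's UTF-8 byte count rather than its token count. The helper below is a minimal sketch of that bits-per-byte normalization; the PR's exact metric isn't shown in this thread.

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Tokenizer-independent score: a model's summed negative log-likelihood
    (in nats) over a document, normalized by the document's UTF-8 byte count."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

# gap = bits_per_byte(nll_marin, doc) - bits_per_byte(nll_llama, doc)
# A positive gap means the Marin model assigns the document less probability per byte.
```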
Top literal rows now carry representative token-boundary displays for model A and model B, including tokens that cross the literal edge. The scoring path threads the aligned token spans through the report builder so the display reflects real document context instead of isolated retokenization.
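Selecting which tokens to display for a literal reduces to an interval-overlap query over per-token character offsets (which fast Hugging Face tokenizers expose via `return_offsets_mapping=True`). A hypothetical sketch, not the PR's actual implementation:

```python
from typing import Sequence, Tuple

def tokens_overlapping_literal(
    offsets: Sequence[Tuple[int, int]],  # per-token (start, end) char offsets
    lit_start: int,
    lit_end: int,
) -> Tuple[int, int]:
    """Half-open [first, last) token indices whose spans overlap the literal,
    including tokens that straddle either edge. Assumes offsets are sorted
    and contain no (0, 0) special-token entries."""
    first = next(i for i, (_, e) in enumerate(offsets) if e > lit_start)
    last = max(i for i, (s, _) in enumerate(offsets) if s < lit_end) + 1
    return first, last
```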
Make perplexity-gap steps rerun when model or tokenizer inputs change, and fail fast on incompatible checkpoint/tokenizer pairs. Also upload the HTML/JSON report bundle to W&B so completed runs are easier to inspect.
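One plausible shape for the fail-fast check is comparing the checkpoint's embedding-table size against the tokenizer's vocabulary before any scoring starts; the function and condition below are assumptions, since the PR's validation code isn't shown here:

```python
def check_tokenizer_compat(embedding_rows: int, vocab_size: int) -> None:
    """Refuse to score when the tokenizer can emit ids the checkpoint has
    no embedding rows for (a mismatched checkpoint/tokenizer pair)."""
    if vocab_size > embedding_rows:
        raise ValueError(
            f"tokenizer vocab ({vocab_size}) exceeds checkpoint embedding "
            f"rows ({embedding_rows}); incompatible checkpoint/tokenizer pair"
        )
```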
Claude finished @dlwh's task in 3m 1s. Code review:
Reviewed for bugs and CLAUDE.md/AGENTS.md compliance. One high-signal issue found and posted as an inline comment:
Other flagged candidates (local imports in …
```python
metadata = chat_row.get("metadata", {})
if isinstance(metadata, Mapping):
    model_patch = _normalize_text(str(metadata.get("model_patch", "")))
    if model_patch:
        sections.append("Final Patch:\n" + model_patch)
```
Bug: `model_patch=None` renders as the literal string `"None"` in the raw-text target.
At line 411, `"model_patch": row.get("model_patch")` stores `None` in `metadata` whenever the source row has a null patch (common for unresolved SWE-rebench trajectories). Here, `metadata.get("model_patch", "")` returns that `None` (the `""` default is not used because the key exists), then `str(None)` → `"None"`, which is still truthy after `_normalize_text`, so a spurious `Final Patch:\nNone` block is appended to the perplexity-scored text for every unresolved trajectory.
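A minimal repro of the failure mode:

```python
metadata = {"model_patch": None}              # key present, value null
patch = str(metadata.get("model_patch", ""))  # default unused -> str(None) == "None"
assert patch == "None" and bool(patch)        # truthy, so the block is appended
```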
Suggested fix:
```diff
 metadata = chat_row.get("metadata", {})
 if isinstance(metadata, Mapping):
-    model_patch = _normalize_text(str(metadata.get("model_patch", "")))
+    raw_patch = metadata.get("model_patch")
+    model_patch = _normalize_text(raw_patch) if isinstance(raw_patch, str) else ""
     if model_patch:
         sections.append("Final Patch:\n" + model_patch)
```
Force-pushed from `b42c66a` to `2693852`.
Trace-masked evals can lose all completed metrics when a late dataset or Hugging Face request fails. Checkpoint completed datasets, retry dataset evaluation, and resume from existing results so agent trace probes can survive preemption and transient failures.
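The general pattern is one persisted result per dataset plus a skip-if-present check on startup; a sketch under that assumption (the `evaluate` callable and the on-disk layout are hypothetical):

```python
import json
from pathlib import Path

def eval_with_resume(datasets, evaluate, out_dir: Path, max_retries: int = 3):
    """Evaluate each dataset, writing one result file per dataset so a
    preempted or failed run resumes from completed work instead of
    recomputing everything."""
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for name in datasets:
        marker = out_dir / f"{name}.json"
        if marker.exists():  # already checkpointed: reuse the stored result
            results[name] = json.loads(marker.read_text())
            continue
        for attempt in range(max_retries):
            try:
                results[name] = evaluate(name)
                marker.write_text(json.dumps(results[name]))
                break
            except Exception:  # transient failure: retry, then give up
                if attempt == max_retries - 1:
                    raise
    return results
```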
Superseded.
Normalize curated chat, agent-trace, and reasoning sources into reusable OpenAI-chat artifacts, then expose raw-text projections for perplexity-gap runs. Adds trace-masked evaluation for assistant/tool/observation/final-patch slices plus outcome-label probes over labeled SWE/OpenHands/SWE-agent/CoderForge trajectory datasets. This is stacked on #4962 so the gap finder remains a separate PR.
Part of #4963
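A raw-text projection of this kind typically flattens the chat into one string while recording per-role character spans, so trace-masked evaluation can later score only the assistant, tool, or observation slices. The rendering below (role-prefixed lines) is an illustrative assumption, not the PR's format:

```python
from typing import Iterable, List, Tuple

def project_chat(messages: Iterable[dict]) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Flatten OpenAI-style chat messages into one raw string while recording
    (start, end, role) character spans for later trace-masked scoring."""
    parts: List[str] = []
    spans: List[Tuple[int, int, str]] = []
    pos = 0
    for msg in messages:
        text = f"{msg['role']}: {msg['content']}\n"
        spans.append((pos, pos + len(text), msg["role"]))
        parts.append(text)
        pos += len(text)
    return "".join(parts), spans
```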