[evals] Add base-model PPL eval sets for chat and agent data #4963

@dlwh

Description

Follow-on from #4934 and complementary to #4961.

Add raw eval datasets for three missing capability families where base-model perplexity is a useful proxy: RL readiness / agent traces, chat, and numeracy / ICL / reasoning. Raw technical text belongs in #4961.

RL Readiness / Agent Traces

Use these for PPL on next action, tool call, bash command, or patch text.

  • SWE-bench / SWE-bench Verified: best source for issue -> patch and repo-level bugfix readiness. Good for patch prediction and code-edit distributions.
  • OpenHands trajectories on SWE-rebench: direct Codex-like traces with assistant messages, tool calls, bash execution, observations, and final diffs. Best direct source for assistant/tool/bash/patch trace PPL. Caveat: generated by one agent scaffold/model family.
  • ToolBench: large tool-use corpus with solution paths and API calls. Best for function/tool calling and planning. Caveat: heavily synthetic.
  • Terminal-Bench: best source for bash/terminal task distributions. It ships tasks, oracle solutions, and harnesses, not a big trace corpus.
  • τ-Bench / τ²-Bench: good for multi-turn tool-agent-user interaction outside coding. Better for tool use in dialogue than for bash.

If we want one minimal starter set here, use SWE-bench + OpenHands trajectories + ToolBench + Terminal-Bench.
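To make "PPL on next action, tool call, bash command, or patch text" concrete, here is a minimal sketch of segment-restricted perplexity: given per-token log-probabilities from any scorer, compute PPL only over tokens belonging to target roles. The role labels (`assistant`, `tool_call`, `bash`, `patch`) and the `(role, logprob)` pairing are assumptions for illustration, not a fixed schema.

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    if not logprobs:
        raise ValueError("need at least one scored token")
    return math.exp(-sum(logprobs) / len(logprobs))

def segment_perplexity(scored_tokens, target_roles):
    """PPL restricted to tokens from the given roles.

    scored_tokens: list of (role, logprob) pairs; role labels such as
    'assistant', 'tool_call', 'bash', 'patch' are hypothetical here.
    """
    lps = [lp for role, lp in scored_tokens if role in target_roles]
    return perplexity(lps)
```

This lets one trace corpus (e.g. OpenHands trajectories) yield separate assistant-message, tool-call, bash, and patch PPL numbers from a single scoring pass.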

Chat

Use these for flattened turn-by-turn chat PPL.

  • WildChat: best chat-in-the-wild source. Real user-chatbot conversations, broad and noisy.
  • LMSYS-Chat-1M: 1M real-world conversations across many models. Good broad assistant-style distribution.
  • OpenAssistant / oasst1: cleaner conversation trees and prompts; good for high-quality assistant-style chat.
  • LIMA train split: 1K high-quality prompt/response pairs from the LIMA paper. Good small high-signal assistant-style likelihood slice for alignment-style chat. Caveat: gated; for our base-model PPL work we want train.jsonl, not the paper's 300-prompt held-out test.jsonl.
  • Anthropic HH-RLHF: useful for helpful/harmless/refusal/safety-ish chat registers.
  • MT-Bench prompts in FastChat: small held-out chat eval, not a corpus.

Minimal starter set: WildChat + LMSYS-Chat-1M + OASST1, then add HH-RLHF and the LIMA train split for safety/helpfulness style and a small high-quality alignment slice.
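For "flattened turn-by-turn chat PPL", a sketch of the flattening step: serialize a conversation into one string while recording each turn's character span, so a scorer can restrict the loss to assistant turns. The `role: content` template is an assumption; the real format should be whatever preserves the surface form we want to evaluate.

```python
def flatten_chat(messages, fmt="{role}: {content}\n"):
    """Flatten a list of {'role', 'content'} dicts into one eval string,
    recording (role, start, end) character spans per turn so a scorer
    can mask loss to assistant tokens. The template is illustrative."""
    text, spans = "", []
    for m in messages:
        turn = fmt.format(role=m["role"], content=m["content"])
        spans.append((m["role"], len(text), len(text) + len(turn)))
        text += turn
    return text, spans
```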

Numeracy / ICL / Reasoning

Use these as prompted conditional-likelihood evals: demonstrations + query -> target.

  • GSM8K: basic multistep arithmetic and word-problem numeracy.
  • MGSM: multilingual GSM8K-style numeracy. Good immediate multilingual reasoning add.
  • MATH: harder math with step-by-step solutions. Very useful, but note the Hugging Face copy is currently disabled behind a DMCA notice.
  • PRM800k: step-level labels on MATH solutions. Best source if we want reasoning-trace quality rather than only final answers.
  • Natural Instructions / Super-NaturalInstructions: strongest open source for task-conditioned generalization and instruction-shaped ICL.
  • P3 / Promptsource: broad prompted-task collection. Best broad source for few-shot and prompted conditional-likelihood evals.
  • BIG-bench: good held-out reasoning/ICL task pool, though more benchmark-y than corpus-y.

Minimal starter set: GSM8K + MGSM + Natural Instructions + P3, then add MATH / PRM800k for harder reasoning.
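A sketch of the "demonstrations + query -> target" prompting described above: assemble a few-shot prompt from demonstrations, then score only the target's log-likelihood conditioned on that prompt. The `Q:`/`A:` template and separator are assumptions chosen for illustration.

```python
def build_fewshot(demos, query, sep="\n\n"):
    """Build a few-shot conditional-likelihood prompt.

    demos: list of (question, answer) demonstration pairs.
    Returns the prompt string; only the target completion after the
    final 'A:' should be scored. The Q:/A: template is hypothetical.
    """
    prompt = sep.join(f"Q: {q}\nA: {a}" for q, a in demos)
    prompt += sep + f"Q: {query}\nA:"
    return prompt
```

For GSM8K-style items, the target would be the gold answer text appended after the returned prompt, with log-likelihood summed over target tokens only.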

My recommended first pass is the three minimal starter sets above: SWE-bench + OpenHands trajectories + ToolBench + Terminal-Bench for agent traces; WildChat + LMSYS-Chat-1M + OASST1 for chat; and GSM8K + MGSM + Natural Instructions + P3 for numeracy/ICL.

That gives strong coverage of patching, tools, bash-ish agents, chat, numeracy, multilingual reasoning, and ICL without trying to boil the ocean in one shot.

Definition of Done

  • Register an initial eval suite covering agent_traces, chat, and reasoning_icl.
  • Each dataset preserves the surface form needed for likelihood evaluation and records source and task metadata.
  • The raw eval flow exposes at least one dataset from each family behind an explicit opt-in or curated default.
  • Pairwise perplexity-gap runs can report gaps by family and dataset for these sources.
  • Document source selection, licensing constraints, and preprocessing needed to keep evaluation comparable across models.
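The pairwise perplexity-gap reporting in the last bullet could be aggregated as in this sketch. The input shape, the log2-ratio gap convention, and the per-family mean are all assumptions, not a committed interface.

```python
import math
from collections import defaultdict

def ppl_gaps(results):
    """Pairwise perplexity gaps by dataset and family.

    results: {(family, dataset): (ppl_a, ppl_b)} for two models.
    Gap = log2(ppl_a / ppl_b); negative means model A is better.
    Returns (per-dataset gaps, per-family mean gap). All names and
    the gap convention are illustrative assumptions.
    """
    per_dataset = {k: math.log2(a / b) for k, (a, b) in results.items()}
    by_family = defaultdict(list)
    for (family, _), gap in per_dataset.items():
        by_family[family].append(gap)
    family_mean = {f: sum(g) / len(g) for f, g in by_family.items()}
    return per_dataset, family_mean
```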

Metadata

    Labels

    agent-generated (Created by automation/agent), p2 (Do before next release)
