Description
Follow-on from #4934 and complementary to #4961.
Add raw eval datasets for three missing capability families where base-model perplexity is a useful proxy: RL readiness / agent traces, chat, and numeracy / ICL / reasoning. Raw technical text belongs in #4961.
RL Readiness / Agent Traces
Use these for PPL on next action, tool call, bash command, or patch text.
- SWE-bench / SWE-bench Verified: best source for issue -> patch and repo-level bugfix readiness. Good for patch prediction and code-edit distributions.
- OpenHands trajectories on SWE-rebench: direct Codex-like traces with assistant messages, tool calls, bash execution, observations, and final diffs. Best direct source for assistant/tool/bash/patch trace PPL. Caveat: generated by one agent scaffold/model family.
- ToolBench: large tool-use corpus with solution paths and API calls. Best for function/tool calling and planning. Caveat: heavily synthetic.
- Terminal-Bench: best source for bash/terminal task distributions. Caveat: it ships tasks, oracle solutions, and harnesses rather than a large trace corpus.
- τ-Bench / τ²-Bench: good for multi-turn tool-agent-user interaction outside coding. Better for tool use in dialogue than for bash.
If we want one minimal starter set here, use SWE-bench + OpenHands trajectories + ToolBench + Terminal-Bench.
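For the patch-PPL slice, a minimal sketch of turning a SWE-bench-style record into a (context, target) pair — conditioning on the issue and scoring only the patch tokens. The field names (`repo`, `problem_statement`, `patch`) match the public SWE-bench schema, but treat the exact formatting as an assumption, not the final preprocessing spec:

```python
def swebench_to_ppl_pair(record: dict) -> tuple[str, str]:
    """Format a SWE-bench-style record as (context, target) for patch PPL.

    We condition on the repo name and issue text and score only the patch
    tokens, so the perplexity reflects patch prediction, not issue modeling.
    """
    context = (
        f"Repository: {record['repo']}\n"
        f"Issue:\n{record['problem_statement'].strip()}\n\n"
        f"Patch:\n"
    )
    target = record["patch"].strip() + "\n"
    return context, target


# Toy record with the SWE-bench-style fields assumed above.
example = {
    "repo": "astropy/astropy",
    "problem_statement": "Unit conversion raises on empty arrays.",
    "patch": "--- a/astropy/units/core.py\n+++ b/astropy/units/core.py\n@@ ...",
}
ctx, tgt = swebench_to_ppl_pair(example)
```

The same context/target split generalizes to the trace sources: condition on everything up to the assistant message, tool call, or bash command, and score only that span.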
Chat
Use these for flattened turn-by-turn chat PPL.
- WildChat: best chat-in-the-wild source. Real user-chatbot conversations, broad and noisy.
- LMSYS-Chat-1M: 1M real-world conversations across many models. Good broad assistant-style distribution.
- OpenAssistant / oasst1: cleaner conversation trees and prompts; good for high-quality assistant-style chat.
- LIMA train split: 1K high-quality prompt/response pairs from the LIMA paper. Good small high-signal likelihood slice for alignment-style chat. Caveat: gated; for our base-model PPL work we want train.jsonl, not the paper's 300-prompt held-out test.jsonl.
- Anthropic HH-RLHF: useful for helpful/harmless/refusal/safety-ish chat registers.
- MT-Bench prompts in FastChat: small held-out chat eval, not a corpus.
Minimal starter set: WildChat + LMSYS-Chat-1M + OASST1, then add HH-RLHF and the LIMA train split for safety/helpfulness style and a small high-quality alignment slice.
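The flattened turn-by-turn PPL above could be sketched as follows. The role-tag rendering is a plain-text convention for illustration, not any particular model's chat template; by default only assistant turns are scored:

```python
from typing import Iterable


def flatten_chat(turns: Iterable[dict], score_roles=("assistant",)) -> list[tuple[str, str]]:
    """Flatten a conversation into (context, target) pairs for turn-by-turn PPL.

    Each scored turn yields one pair: the context is every prior turn plus the
    current role tag, and the target is that turn's text. Unscored turns only
    extend the context.
    """
    pairs = []
    prefix = ""
    for turn in turns:
        if turn["role"] in score_roles:
            pairs.append((prefix + f"{turn['role']}: ", turn["content"] + "\n"))
        prefix += f"{turn['role']}: {turn['content']}\n"
    return pairs


conv = [
    {"role": "user", "content": "What is PPL?"},
    {"role": "assistant", "content": "Perplexity, the exp of mean NLL."},
    {"role": "user", "content": "Thanks!"},
]
pairs = flatten_chat(conv)
```

Keeping the flattening template fixed across datasets is what makes the per-family gaps comparable.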
Numeracy / ICL / Reasoning
Use these as prompted conditional-likelihood evals: demonstrations + query -> target.
- GSM8K: basic multistep arithmetic and word-problem numeracy.
- MGSM: multilingual GSM8K-style numeracy. Good immediate multilingual reasoning add.
- MATH: harder math with step-by-step solutions. Very useful, but note that the Hugging Face copy is currently disabled behind a DMCA notice.
- PRM800k: step-level labels on MATH solutions. Best source if we want reasoning-trace quality rather than only final answers.
- Natural Instructions / Super-NaturalInstructions: strongest open source for task-conditioned generalization and instruction-shaped ICL.
- P3 / Promptsource: broad prompted-task collection. Best broad source for few-shot and prompted conditional-likelihood evals.
- BIG-bench: good held-out reasoning/ICL task pool, though more benchmark-y than corpus-y.
Minimal starter set: GSM8K + MGSM + Natural Instructions + P3, then add MATH / PRM800k for harder reasoning.
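A minimal sketch of the demonstrations + query -> target setup, plus the perplexity computed from per-token log-probs over the scored span. The Q:/A: template and the example log-prob values are illustrative assumptions:

```python
import math


def build_icl_prompt(demos: list[tuple[str, str]], query: tuple[str, str], sep: str = "\n\n") -> tuple[str, str]:
    """Assemble demonstrations + query into (context, target) for conditional likelihood.

    Only the answer of the final query is scored; the demonstrations and the
    query itself are conditioning context.
    """
    shots = sep.join(f"Q: {q}\nA: {a}" for q, a in demos)
    context = f"{shots}{sep}Q: {query[0]}\nA: "
    return context, query[1]


def conditional_ppl(target_logprobs: list[float]) -> float:
    """Perplexity over the target span, given per-token log-probs (natural log)."""
    return math.exp(-sum(target_logprobs) / len(target_logprobs))


demos = [("2 + 2", "4"), ("3 + 5", "8")]
ctx, tgt = build_icl_prompt(demos, ("7 + 6", "13"))
# Hypothetical per-token log-probs for the target, as returned by a scored model.
ppl = conditional_ppl([-0.1, -0.3])
```

The same harness covers zero-shot by passing an empty demo list, which is useful for measuring how much the demonstrations themselves buy.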
My recommended first pass is the three minimal starter sets above: SWE-bench + OpenHands trajectories + ToolBench + Terminal-Bench for agent traces; WildChat + LMSYS-Chat-1M + OASST1 for chat; and GSM8K + MGSM + Natural Instructions + P3 for numeracy/ICL. That gives strong coverage of patching, tools, bash-ish agents, chat, numeracy, multilingual reasoning, and ICL without trying to boil the ocean in one shot.
Definition of Done
- Register an initial eval suite covering `agent_traces`, `chat`, and `reasoning_icl`.
- Each dataset preserves the surface form needed for likelihood evaluation and records source and task metadata.
- The raw eval flow exposes at least one dataset from each family behind an explicit opt-in or curated default.
- Pairwise perplexity-gap runs can report gaps by family and dataset for these sources.
- Document source selection, licensing constraints, and preprocessing needed to keep evaluation comparable across models.
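The by-family and by-dataset gap reporting could be sketched as below. The record schema (`family`, `dataset`, `ppl_a`, `ppl_b`) is an assumption about the eval suite's output, not its actual format:

```python
from collections import defaultdict


def gap_report(runs: list[dict]) -> tuple[dict, dict]:
    """Aggregate pairwise PPL gaps (model B minus model A) by family and dataset.

    Returns (mean gap per family, gap per dataset). Positive gaps mean model B
    has higher perplexity, i.e. fits that distribution worse than model A.
    """
    by_family = defaultdict(list)
    by_dataset = {}
    for r in runs:
        gap = r["ppl_b"] - r["ppl_a"]
        by_family[r["family"]].append(gap)
        by_dataset[r["dataset"]] = gap
    family_means = {f: sum(g) / len(g) for f, g in by_family.items()}
    return family_means, by_dataset


# Toy per-dataset results under the assumed schema.
runs = [
    {"family": "chat", "dataset": "wildchat", "ppl_a": 8.0, "ppl_b": 9.5},
    {"family": "chat", "dataset": "oasst1", "ppl_a": 7.0, "ppl_b": 7.5},
    {"family": "agent_traces", "dataset": "swe_bench", "ppl_a": 4.0, "ppl_b": 3.5},
]
fam, ds = gap_report(runs)
```

Family-level means make the "gaps by family" requirement a one-line rollup over the per-dataset numbers, so both views come from the same run records.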