[evals] Add base-model PPL eval sets for chat and agent data #4963

@dlwh

Description

Follow-on from #4934 and complementary to #4961.

Add raw eval datasets for three missing capability families where base-model perplexity is a useful proxy: RL readiness / agent traces, chat, and numeracy / ICL / reasoning. Raw technical text belongs in #4961.

RL Readiness / Agent Traces

Use these for PPL on next action, tool call, bash command, or patch text.

  • SWE-bench / SWE-bench Verified: best source for issue -> patch and repo-level bugfix readiness. Good for patch prediction and code-edit distributions.
  • OpenHands trajectories on SWE-rebench: direct Codex-like traces with assistant messages, tool calls, bash execution, observations, and final diffs. Best direct source for assistant/tool/bash/patch trace PPL. Caveat: generated by one agent scaffold/model family.
  • ToolBench: large tool-use corpus with solution paths and API calls. Best for function/tool calling and planning. Caveat: heavily synthetic.
  • Terminal-Bench: best source for bash/terminal task distributions. It ships tasks, oracle solutions, and harnesses, not a big trace corpus.
  • τ-Bench / τ²-Bench: good for multi-turn tool-agent-user interaction outside coding. Better for tool use in dialogue than for bash.

If we want one minimal starter set here, use SWE-bench + OpenHands trajectories + ToolBench + Terminal-Bench.
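To make "PPL on next action, tool call, bash command, or patch text" concrete, here is a minimal sketch of segment-restricted perplexity: given per-token log-probabilities from any scorer, compute PPL only over tokens belonging to target roles. The role labels (`assistant`, `tool_call`, `bash`, `patch`) and the `(role, logprob)` pairing are assumptions for illustration, not a fixed schema.

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    if not logprobs:
        raise ValueError("need at least one scored token")
    return math.exp(-sum(logprobs) / len(logprobs))

def segment_perplexity(scored_tokens, target_roles):
    """PPL restricted to tokens from the given roles.

    scored_tokens: list of (role, logprob) pairs; role labels such as
    'assistant', 'tool_call', 'bash', 'patch' are hypothetical here.
    """
    lps = [lp for role, lp in scored_tokens if role in target_roles]
    return perplexity(lps)
```

This lets one trace corpus (e.g. OpenHands trajectories) yield separate assistant-message, tool-call, bash, and patch PPL numbers from a single scoring pass.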

Chat

Use these for flattened turn-by-turn chat PPL.

  • WildChat: best chat-in-the-wild source. Real user-chatbot conversations, broad and noisy.
  • LMSYS-Chat-1M: 1M real-world conversations across many models. Good broad assistant-style distribution.
  • OpenAssistant / oasst1: cleaner conversation trees and prompts; good for high-quality assistant-style chat.
  • LIMA train split: 1K high-quality prompt/response pairs from the LIMA paper. Good small high-signal assistant-style likelihood slice for alignment-style chat. Caveat: gated; for our base-model PPL work we want train.jsonl, not the paper's 300-prompt held-out test.jsonl.
  • Anthropic HH-RLHF: useful for helpful/harmless/refusal/safety-ish chat registers.
  • MT-Bench prompts in FastChat: small held-out chat eval, not a corpus.

Minimal starter set: WildChat + LMSYS-Chat-1M + OASST1, then add HH-RLHF and the LIMA train split for safety/helpfulness style and a small high-quality alignment slice.
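For "flattened turn-by-turn chat PPL", a sketch of the flattening step: serialize a conversation into one string while recording each turn's character span, so a scorer can restrict the loss to assistant turns. The `role: content` template is an assumption; the real format should be whatever preserves the surface form we want to evaluate.

```python
def flatten_chat(messages, fmt="{role}: {content}\n"):
    """Flatten a list of {'role', 'content'} dicts into one eval string,
    recording (role, start, end) character spans per turn so a scorer
    can mask loss to assistant tokens. The template is illustrative."""
    text, spans = "", []
    for m in messages:
        turn = fmt.format(role=m["role"], content=m["content"])
        spans.append((m["role"], len(text), len(text) + len(turn)))
        text += turn
    return text, spans
```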

Numeracy / ICL / Reasoning

Use these as prompted conditional-likelihood evals: demonstrations + query -> target.

  • GSM8K: basic multistep arithmetic and word-problem numeracy.
  • MGSM: multilingual GSM8K-style numeracy. Good immediate multilingual reasoning add.
  • MATH: harder math with step-by-step solutions. Very useful, but note the Hugging Face copy is currently disabled behind a DMCA notice.
  • PRM800k: step-level labels on MATH solutions. Best source if we want reasoning-trace quality rather than only final answers.
  • Natural Instructions / Super-NaturalInstructions: strongest open source for task-conditioned generalization and instruction-shaped ICL.
  • P3 / Promptsource: broad prompted-task collection. Best broad source for few-shot and prompted conditional-likelihood evals.
  • BIG-bench: good held-out reasoning/ICL task pool, though more benchmark-y than corpus-y.

Minimal starter set: GSM8K + MGSM + Natural Instructions + P3, then add MATH / PRM800k for harder reasoning.
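A sketch of the "demonstrations + query -> target" prompting described above: assemble a few-shot prompt from demonstrations, then score only the target's log-likelihood conditioned on that prompt. The `Q:`/`A:` template and separator are assumptions chosen for illustration.

```python
def build_fewshot(demos, query, sep="\n\n"):
    """Build a few-shot conditional-likelihood prompt.

    demos: list of (question, answer) demonstration pairs.
    Returns the prompt string; only the target completion after the
    final 'A:' should be scored. The Q:/A: template is hypothetical.
    """
    prompt = sep.join(f"Q: {q}\nA: {a}" for q, a in demos)
    prompt += sep + f"Q: {query}\nA:"
    return prompt
```

For GSM8K-style items, the target would be the gold answer text appended after the returned prompt, with log-likelihood summed over target tokens only.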

My recommended first pass is the three minimal starter sets above: SWE-bench + OpenHands trajectories + ToolBench + Terminal-Bench for agent traces; WildChat + LMSYS-Chat-1M + OASST1 for chat; and GSM8K + MGSM + Natural Instructions + P3 for numeracy/ICL.

That gives strong coverage of patching, tools, bash-ish agents, chat, numeracy, multilingual reasoning, and ICL without trying to boil the ocean in one shot.

Definition of Done

  • Register an initial eval suite covering agent_traces, chat, and reasoning_icl.
  • Each dataset preserves the surface form needed for likelihood evaluation and records source and task metadata.
  • The raw eval flow exposes at least one dataset from each family behind an explicit opt-in or curated default.
  • Pairwise perplexity-gap runs can report gaps by family and dataset for these sources.
  • Document source selection, licensing constraints, and preprocessing needed to keep evaluation comparable across models.
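The pairwise perplexity-gap reporting in the last bullet could be aggregated as in this sketch. The input shape, the log2-ratio gap convention, and the per-family mean are all assumptions, not a committed interface.

```python
import math
from collections import defaultdict

def ppl_gaps(results):
    """Pairwise perplexity gaps by dataset and family.

    results: {(family, dataset): (ppl_a, ppl_b)} for two models.
    Gap = log2(ppl_a / ppl_b); negative means model A is better.
    Returns (per-dataset gaps, per-family mean gap). All names and
    the gap convention are illustrative assumptions.
    """
    per_dataset = {k: math.log2(a / b) for k, (a, b) in results.items()}
    by_family = defaultdict(list)
    for (family, _), gap in per_dataset.items():
        by_family[family].append(gap)
    family_mean = {f: sum(g) / len(g) for f, g in by_family.items()}
    return per_dataset, family_mean
```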

Metadata

    Labels

    agent-generated (Created by automation/agent), p2 (Do before next release)
