Identify a soft proxy for agentic benchmarks to support data-mixture studies #4389

@AlienKevin

Description

Summary

This issue proposes replacing expensive agentic evals with a cheaper soft proxy: score models on benchmark-aligned traces, then fit that score to downstream SWE-bench or Terminal-Bench performance. The idea is still plausible, but the current evidence is mixed rather than encouraging: a same-family MATH study over about 15 models reportedly found weak or non-monotonic relationships for both positive-trace loss and success-failure gap, with the gap doing worse than plain loss. The right current read is therefore "interesting hypothesis, not yet validated"; the next useful step is a narrow reproduction on SWE-bench Verified or Terminal-Bench Lite before treating this as a reliable target for data-mixture search.

Motivation

Data-mixture search is hard to scale to agentic targets like SWE-bench and Terminal-Bench because those benchmarks are expensive, slow, and often near-floor for smaller models. This creates a gap in today’s training stack: we increasingly know how to vary data mixtures, but we do not have a cheap, dense, benchmark-aligned signal that tells us which mixtures are likely to improve downstream agentic performance. In other words, there is a missing bridge between data decisions and agentic evals.

Recent work on soft metrics suggests a possible way to build that bridge. Conditional loss has been shown in several non-agentic settings to track downstream task performance, and Composer2 gives early evidence that continued-pretraining loss can relate to later RL reward. Marin experiment #1427 also points in this direction. If an analogous soft proxy exists for agentic tasks, then we could run large swarms of small-model mixture experiments directly targeting the benchmarks we care about.

There is also a complementary line of work that tries to make hard evals cheaper rather than replacing them. For example, OpenThoughts-TBLite constructs a curated 100-task subset of Terminal-Bench that tracks TB2 while running much faster, making it more practical for debugging and ablations on smaller models. More broadly, benchmark-subset selection asks whether a carefully chosen small subset of items or benchmarks can preserve much of the ranking signal of the full suite.

Our proposed direction is more aggressive: instead of evaluating a smaller generative benchmark, we want to evaluate a soft proxy that may be cheaper still. Even a reduced hard benchmark like TBLite still requires full rollout generation and environment interaction, whereas a soft trace-based metric can be computed much more cheaply once the benchmark-aligned traces are fixed.

Core idea

Instead of evaluating every candidate mixture on full agentic benchmarks, evaluate a soft trace score on benchmark-aligned trajectories, then fit a mapping from that score to final benchmark performance.

The key hypothesis is:

For agentic tasks, there exists a monotone, likely sigmoidal relationship between a model’s loss on successful trajectories and its downstream performance on SWE-bench / Terminal-Bench, as illustrated in the figure below from the Llama paper.

[Figure: sigmoidal relationship between loss on successful trajectories and downstream benchmark performance, from the Llama paper]

What we should test

We will test two proxy ideas: (1) is common practice in prior work, and (2) could boost the signal, as suggested by @Helw150.

  1. Positive-trace loss: conditional loss on successful trajectories only
  2. Success–failure gap: loss(failed) − loss(successful)
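The two proxies can be sketched as follows, assuming per-trajectory token logprobs are available. The trace format here (dicts with a `success` flag and a `logprobs` list) is a hypothetical structure for illustration, not a fixed interface:

```python
from statistics import mean

def trace_loss(token_logprobs):
    """Mean negative log-likelihood per token for one trajectory."""
    return -mean(token_logprobs)

def positive_trace_loss(traces):
    """Proxy 1: mean loss over successful trajectories only."""
    return mean(trace_loss(t["logprobs"]) for t in traces if t["success"])

def success_failure_gap(traces):
    """Proxy 2: loss(failed) - loss(successful). A larger gap means the
    model assigns relatively higher likelihood to successful behavior."""
    succ = [trace_loss(t["logprobs"]) for t in traces if t["success"]]
    fail = [trace_loss(t["logprobs"]) for t in traces if not t["success"]]
    return mean(fail) - mean(succ)
```

Note that proxy 2 requires paired successful and failed traces from the same tasks, which connects directly to the trace-availability risk below.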

Minimum viable experiment

Use this as the smallest useful experiment:

  • Target benchmark: Start with SWE-bench Verified (a random 100-task subset), which is simpler and more stable than Terminal-Bench (Lite).
  • Models: Gather open models with both public benchmark scores and logprob access, e.g. from the OT-agent leaderboard.
  • Data: Collect matched successful trajectories for all 100 tasks from the strongest model on the leaderboard.
  • Fit: Regress the hard benchmark score against each soft proxy.
  • Validate: Hold out some models and predict their hard benchmark scores from the soft metric alone.
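The fit and validate steps could look like the sketch below, which fits the hypothesized sigmoidal mapping from soft-proxy loss to hard benchmark score and evaluates it leave-one-out. The coarse grid search is a stdlib-only stand-in for a proper optimizer such as `scipy.optimize.curve_fit`, and the grid ranges are assumptions:

```python
import math
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_sigmoid(losses, scores, a_grid=None, b_grid=None):
    """Fit score ~ sigmoid(a * loss + b) by coarse grid search,
    minimizing squared error over (a, b)."""
    if a_grid is None:
        a_grid = [a / 10 for a in range(-50, 51)]
    if b_grid is None:
        b_grid = [b / 10 for b in range(-50, 51)]
    best = None
    for a, b in product(a_grid, b_grid):
        err = sum((sigmoid(a * l + b) - s) ** 2 for l, s in zip(losses, scores))
        if best is None or err < best[0]:
            best = (err, a, b)
    return best[1], best[2]

def loo_predictions(losses, scores):
    """Leave-one-out validation: fit on all models but one, then
    predict the held-out model's hard benchmark score."""
    preds = []
    for i in range(len(losses)):
        a, b = fit_sigmoid(losses[:i] + losses[i + 1:], scores[:i] + scores[i + 1:])
        preds.append(sigmoid(a * losses[i] + b))
    return preds
```

With only ~15 models available, leave-one-out is probably the right validation regime; a fixed train/test split would leave too few points to fit the curve.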

Success criterion

This proposal is successful if we find a soft proxy that:

  1. correlates strongly with SWE-bench/TB performance across models and
  2. generalizes to held-out models
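Since what ultimately matters for mixture search is whether the proxy ranks models the same way the hard benchmark does, a rank correlation is a natural success metric. A stdlib-only Spearman sketch (equivalent to `scipy.stats.spearmanr` when there are no ties, which this simple ranking assumes):

```python
import math

def _ranks(values):
    """Rank values from 1..n; assumes no exact ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

A strong proxy would show a high-magnitude Spearman correlation between soft scores and hard benchmark scores across models, both in-sample and on the held-out models.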

Main risks

  • Multi-turn agentic competence may not be captured well by token loss alone
  • Positive-trace loss may be too weak; contrastive proxies may be necessary
  • The mapping may hold within model families but break across families
  • Public trace availability may be the main bottleneck, especially if we are looking for groups of successful and failed traces for the same task from the same model

Credits

@Helw150 came up with this experiment. @Calvin-Xu inspired me to dig deeper into this idea.
