Identify a soft proxy for agentic benchmarks to support data-mixture studies #4389

@AlienKevin

Description

Summary

This issue proposes replacing expensive agentic evals with a cheaper soft proxy: score models on benchmark-aligned traces, then fit that score to downstream SWE-bench or Terminal-Bench performance. The idea is still plausible, but the current evidence is mixed rather than encouraging: a same-family MATH study over about 15 models reportedly found weak or non-monotonic relationships for both positive-trace loss and success-failure gap, with the gap doing worse than plain loss. The right current read is therefore "interesting hypothesis, not yet validated"; the next useful step is a narrow reproduction on SWE-bench Verified or Terminal-Bench Lite before treating this as a reliable target for data-mixture search.

Motivation

Data-mixture search is hard to scale to agentic targets like SWE-bench and Terminal-Bench because those benchmarks are expensive, slow, and often near-floor for smaller models. This creates a gap in today’s training stack: we increasingly know how to vary data mixtures, but we do not have a cheap, dense, benchmark-aligned signal that tells us which mixtures are likely to improve downstream agentic performance. In other words, there is a missing bridge between data decisions and agentic evals.

Recent work on soft metrics suggests a possible way to build that bridge. Conditional loss has been shown in several non-agentic settings to track downstream task performance, and Composer2 gives early evidence that continued-pretraining loss can relate to later RL reward. Marin experiment #1427 also points in this direction. If an analogous soft proxy exists for agentic tasks, then we could run large swarms of small-model mixture experiments directly targeting the benchmarks we care about.

There is also a complementary line of work that tries to make hard evals cheaper rather than replacing them. For example, OpenThoughts-TBLite constructs a curated 100-task subset of Terminal-Bench that tracks TB2 while running much faster, making it more practical for debugging and ablations on smaller models. More broadly, benchmark-subset selection asks whether a carefully chosen small subset of items or benchmarks can preserve much of the ranking signal of the full suite.

Our proposed direction is more aggressive: instead of evaluating a smaller generative benchmark, we want to evaluate a soft proxy that may be cheaper still. Even a reduced hard benchmark like TBLite still requires full rollout generation and environment interaction, whereas a soft trace-based metric can be computed much more cheaply once the benchmark-aligned traces are fixed.

Core idea

Instead of evaluating every candidate mixture on full agentic benchmarks, evaluate a soft trace score on benchmark-aligned trajectories, then fit a mapping from that score to final benchmark performance.

The key hypothesis is:

For agentic tasks, there exists a monotone, likely sigmoidal relationship between a model’s loss on successful trajectories and its downstream performance on SWE-bench / Terminal-Bench, as illustrated in the figure below from the Llama paper.

[Figure: sigmoidal relationship between loss on successful trajectories and downstream benchmark performance, from the Llama paper]

What we should test

We will test two proxy ideas: (1) is common practice in prior work, and (2) could boost the signal, as suggested by @Helw150.

  1. Positive-trace loss: conditional loss on successful trajectories only
  2. Success–failure gap: loss(failed) − loss(successful)
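The two proxies can be sketched as follows, assuming per-trajectory token logprobs are available. The trace format here (dicts with a `success` flag and a `logprobs` list) is a hypothetical structure for illustration, not a fixed interface:

```python
from statistics import mean

def trace_loss(token_logprobs):
    """Mean negative log-likelihood per token for one trajectory."""
    return -mean(token_logprobs)

def positive_trace_loss(traces):
    """Proxy 1: mean loss over successful trajectories only."""
    return mean(trace_loss(t["logprobs"]) for t in traces if t["success"])

def success_failure_gap(traces):
    """Proxy 2: loss(failed) - loss(successful). A larger gap means the
    model assigns relatively higher likelihood to successful behavior."""
    succ = [trace_loss(t["logprobs"]) for t in traces if t["success"]]
    fail = [trace_loss(t["logprobs"]) for t in traces if not t["success"]]
    return mean(fail) - mean(succ)
```

Note that proxy 2 requires paired successful and failed traces from the same tasks, which connects directly to the trace-availability risk below.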

Minimum viable experiment

Use this as the smallest useful experiment:

  • Target benchmark: Start with SWE-bench Verified (a random 100-task subset), which is simpler and more stable than Terminal-Bench (Lite).
  • Models: Gather open models with both public benchmark scores and logprob access, e.g. from the OT-agent leaderboard.
  • Data: Collect matched successful trajectories for all 100 tasks from the strongest model on the leaderboard.
  • Fit: Regress the hard benchmark score against each soft proxy.
  • Validate: Hold out some models and predict their hard benchmark scores from the soft metric alone.
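The fit and validate steps could look like the sketch below, which fits the hypothesized sigmoidal mapping from soft-proxy loss to hard benchmark score and evaluates it leave-one-out. The coarse grid search is a stdlib-only stand-in for a proper optimizer such as `scipy.optimize.curve_fit`, and the grid ranges are assumptions:

```python
import math
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_sigmoid(losses, scores, a_grid=None, b_grid=None):
    """Fit score ~ sigmoid(a * loss + b) by coarse grid search,
    minimizing squared error over (a, b)."""
    if a_grid is None:
        a_grid = [a / 10 for a in range(-50, 51)]
    if b_grid is None:
        b_grid = [b / 10 for b in range(-50, 51)]
    best = None
    for a, b in product(a_grid, b_grid):
        err = sum((sigmoid(a * l + b) - s) ** 2 for l, s in zip(losses, scores))
        if best is None or err < best[0]:
            best = (err, a, b)
    return best[1], best[2]

def loo_predictions(losses, scores):
    """Leave-one-out validation: fit on all models but one, then
    predict the held-out model's hard benchmark score."""
    preds = []
    for i in range(len(losses)):
        a, b = fit_sigmoid(losses[:i] + losses[i + 1:], scores[:i] + scores[i + 1:])
        preds.append(sigmoid(a * losses[i] + b))
    return preds
```

With only ~15 models available, leave-one-out is probably the right validation regime; a fixed train/test split would leave too few points to fit the curve.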

Success criterion

This proposal is successful if we find a soft proxy that:

  1. correlates strongly with SWE-bench/TB performance across models and
  2. generalizes to held-out models
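Since what ultimately matters for mixture search is whether the proxy ranks models the same way the hard benchmark does, a rank correlation is a natural success metric. A stdlib-only Spearman sketch (equivalent to `scipy.stats.spearmanr` when there are no ties, which this simple ranking assumes):

```python
import math

def _ranks(values):
    """Rank values from 1..n; assumes no exact ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

A strong proxy would show a high-magnitude Spearman correlation between soft scores and hard benchmark scores across models, both in-sample and on the held-out models.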

Main risks

  • Multi-turn agentic competence may not be captured well by token loss alone
  • Positive-trace loss may be too weak; contrastive proxies may be necessary
  • The mapping may hold within model families but break across families
  • Public trace availability may be the main bottleneck, especially if we are looking for groups of successful and failed traces for the same task from the same model

Credits

@Helw150 came up with this experiment. @Calvin-Xu inspired me to dig deeper into this idea.
