
[tts] Add sample-only reasoning TTS math slice#4774

Draft
taivu1998 wants to merge 3 commits into marin-community:main from taivu1998:codex/tts-pr1-math

Conversation

@taivu1998

Add the first marin.test_time_scaling vertical slice with replayable candidate generation, sample-only selectors, and stable artifact logging for math reasoning runs. This adds a standalone runner and focused tests so we can measure first-sample, majority-vote, and logprob baselines on a shared candidate pool.

Author

🤖 Specification

Problem
Marin did not have a standalone test-time scaling package for reusable candidate generation, selection, artifact logging, and offline replay. The first roadmap milestone called for a vertical slice that proves the package boundary with a real workload instead of a framework-only PR.

Approach
This change adds marin.test_time_scaling as a standalone package and implements the minimum end-to-end math sample-only path. The core package owns prompt manifests, candidate generation, selectors, scoring, artifacts, and summary analysis. The experiment entrypoint is experiments/evals/run_reasoning_tts.py, which can talk to an existing OpenAI-compatible server or launch VllmEnvironment locally.

Key code
generate.py defines an OpenAI-compatible completion provider and produces replayable CandidateRecord rows with prompt token counts, completion token counts, latency, seed, finish reason, and logprob statistics.
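A minimal sketch of what such a replayable record could look like; the field names here are illustrative assumptions, not the actual `CandidateRecord` schema in generate.py:

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CandidateRecord:
    """One sampled completion, with enough metadata to replay selectors offline."""

    prompt_id: str
    sample_index: int
    text: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float
    seed: int
    finish_reason: str
    sum_logprob: float  # sum of token logprobs, used by logprob reranking

    @property
    def mean_logprob(self) -> float:
        # Length-normalized logprob; avoids biasing selection toward short completions.
        return self.sum_logprob / max(self.completion_tokens, 1)

    def to_row(self) -> dict:
        # Flat dict suitable for JSONL artifact logging.
        return asdict(self)
```

Because every record carries its seed, token counts, and logprob statistics, selectors can be rerun later against the saved pool without touching the model server.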

selectors.py implements the three PR 1 sample-only baselines on the same saved pool: first sample, majority vote over extracted answers, and normalized-logprob reranking.
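The three baselines can be sketched roughly as below, operating on plain dict rows from a saved pool; the function and key names are hypothetical stand-ins for the actual selectors.py API:

```python
from collections import Counter


def select_first(candidates):
    """First-sample baseline: the candidate with the lowest sample index."""
    return min(candidates, key=lambda c: c["sample_index"])


def select_majority(candidates, extract_answer):
    """Majority vote over extracted answers; ties break toward the earliest answer."""
    answers = [extract_answer(c["text"]) for c in candidates]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        # No parseable answers: fall back to the first sample.
        return select_first(candidates)
    winner, _ = votes.most_common(1)[0]
    for cand, ans in zip(candidates, answers):
        if ans == winner:
            return cand


def select_by_logprob(candidates):
    """Rerank by length-normalized logprob, highest wins."""
    return max(
        candidates,
        key=lambda c: c["sum_logprob"] / max(c["completion_tokens"], 1),
    )
```

Running all three against the same saved pool is what makes the comparison fair: every selector sees identical candidates.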

analysis.py replays selectors over saved candidates and builds summary.json, including selector accuracy, oracle accuracy, oracle gap rate, duplicate rate, parse-valid rate, prompt/completion token totals, and request latency totals.
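The core accounting for a few of those metrics might look like the following sketch (the helper names and the `summary.json` keys shown are assumptions for illustration):

```python
def summarize(prompts, select, grade):
    """Replay one selector over saved candidate pools and compute pool-level stats.

    `prompts` maps prompt_id -> (candidates, gold_answer); `select` picks one
    candidate dict from a pool; `grade(text, gold)` returns True when correct.
    """
    n = len(prompts)
    selected_correct = 0
    oracle_correct = 0
    for candidates, gold in prompts.values():
        if grade(select(candidates)["text"], gold):
            selected_correct += 1
        # Oracle: was *any* candidate in the pool correct?
        if any(grade(c["text"], gold) for c in candidates):
            oracle_correct += 1
    return {
        "selector_accuracy": selected_correct / n,
        "oracle_accuracy": oracle_correct / n,
        # Prompts where a correct candidate existed but the selector missed it.
        "oracle_gap_rate": (oracle_correct - selected_correct) / n,
    }
```

The oracle gap rate is the headline number for test-time scaling work: it bounds how much accuracy better selectors could recover from the same candidate pool.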

scorers.py reuses Marin’s existing math utilities to extract and grade boxed answers for Tier 1 math prompts.

Tests
Targeted tests were added under tests/test_time_scaling/ for manifest round-trip, selector replay, boxed-answer scoring, summary accounting, and an end-to-end math vertical slice that generates a shared candidate pool and replays all selectors against it.

Validation run locally on the PR branch:
./infra/pre-commit.py --all-files --fix
PYTHONPATH=lib/fray/src:lib/iris/src:lib/levanter/src:lib/marin/src:lib/rigging/src:lib/zephyr/src /Users/vuductai/Documents/Projects/marin/.venv/bin/python -m pytest tests/test_time_scaling -q

The targeted `uv run pytest tests/test_time_scaling -m 'not slow' -q` invocation hit a pre-existing import-path mismatch in tests/conftest.py in the temporary worktree environment, so the final targeted validation used the repo's existing .venv with explicit source roots instead.

This lands a standalone test_time_scaling package so candidate generation, selector replay, and artifact accounting live outside benchmark-specific evaluation code. It also adds the first math-focused runner and regression tests so we can iterate on sample-only reasoning TTS before moving on to code and verifier stages.
@taivu1998 taivu1998 force-pushed the codex/tts-pr1-math branch from d1d4d2b to 9a0c247 Compare April 22, 2026 17:45
