
[tts] Add sample-only reasoning TTS math slice#4774

Draft
taivu1998 wants to merge 3 commits into marin-community:main from taivu1998:codex/tts-pr1-math

Conversation

@taivu1998

Add the first marin.test_time_scaling vertical slice with replayable candidate generation, sample-only selectors, and stable artifact logging for math reasoning runs. This adds a standalone runner and focused tests so we can measure first-sample, majority-vote, and logprob baselines on a shared candidate pool.

Author

🤖 Specification

Problem
Marin did not have a standalone test-time scaling package for reusable candidate generation, selection, artifact logging, and offline replay. The first roadmap milestone called for a vertical slice that proves the package boundary with a real workload instead of a framework-only PR.

Approach
This change adds marin.test_time_scaling as a standalone package and implements the minimum end-to-end math sample-only path. The core package owns prompt manifests, candidate generation, selectors, scoring, artifacts, and summary analysis. The experiment entrypoint is experiments/evals/run_reasoning_tts.py, which can talk to an existing OpenAI-compatible server or launch VllmEnvironment locally.

Key code
generate.py defines an OpenAI-compatible completion provider and produces replayable CandidateRecord rows with prompt token counts, completion token counts, latency, seed, finish reason, and logprob statistics.
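A minimal sketch of what such a replayable record could look like; the field names here are illustrative assumptions, not the actual `CandidateRecord` schema in generate.py:

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CandidateRecord:
    """One sampled completion, with enough metadata to replay selectors offline."""

    prompt_id: str
    sample_index: int
    text: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float
    seed: int
    finish_reason: str
    sum_logprob: float  # sum of token logprobs, used by logprob reranking

    @property
    def mean_logprob(self) -> float:
        # Length-normalized logprob; avoids biasing selection toward short completions.
        return self.sum_logprob / max(self.completion_tokens, 1)

    def to_row(self) -> dict:
        # Flat dict suitable for JSONL artifact logging.
        return asdict(self)
```

Because every record carries its seed, token counts, and logprob statistics, selectors can be rerun later against the saved pool without touching the model server.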

selectors.py implements the three PR 1 sample-only baselines on the same saved pool: first sample, majority vote over extracted answers, and normalized-logprob reranking.
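The three baselines can be sketched roughly as below, operating on plain dict rows from a saved pool; the function and key names are hypothetical stand-ins for the actual selectors.py API:

```python
from collections import Counter


def select_first(candidates):
    """First-sample baseline: the candidate with the lowest sample index."""
    return min(candidates, key=lambda c: c["sample_index"])


def select_majority(candidates, extract_answer):
    """Majority vote over extracted answers; ties break toward the earliest answer."""
    answers = [extract_answer(c["text"]) for c in candidates]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        # No parseable answers: fall back to the first sample.
        return select_first(candidates)
    winner, _ = votes.most_common(1)[0]
    for cand, ans in zip(candidates, answers):
        if ans == winner:
            return cand


def select_by_logprob(candidates):
    """Rerank by length-normalized logprob, highest wins."""
    return max(
        candidates,
        key=lambda c: c["sum_logprob"] / max(c["completion_tokens"], 1),
    )
```

Running all three against the same saved pool is what makes the comparison fair: every selector sees identical candidates.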

analysis.py replays selectors over saved candidates and builds summary.json, including selector accuracy, oracle accuracy, oracle gap rate, duplicate rate, parse-valid rate, prompt/completion token totals, and request latency totals.
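The core accounting for a few of those metrics might look like the following sketch (the helper names and the `summary.json` keys shown are assumptions for illustration):

```python
def summarize(prompts, select, grade):
    """Replay one selector over saved candidate pools and compute pool-level stats.

    `prompts` maps prompt_id -> (candidates, gold_answer); `select` picks one
    candidate dict from a pool; `grade(text, gold)` returns True when correct.
    """
    n = len(prompts)
    selected_correct = 0
    oracle_correct = 0
    for candidates, gold in prompts.values():
        if grade(select(candidates)["text"], gold):
            selected_correct += 1
        # Oracle: was *any* candidate in the pool correct?
        if any(grade(c["text"], gold) for c in candidates):
            oracle_correct += 1
    return {
        "selector_accuracy": selected_correct / n,
        "oracle_accuracy": oracle_correct / n,
        # Prompts where a correct candidate existed but the selector missed it.
        "oracle_gap_rate": (oracle_correct - selected_correct) / n,
    }
```

The oracle gap rate is the headline number for test-time scaling work: it bounds how much accuracy better selectors could recover from the same candidate pool.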

scorers.py reuses Marin’s existing math utilities to extract and grade boxed answers for Tier 1 math prompts.

Tests
Targeted tests were added under tests/test_time_scaling/ for manifest round-trip, selector replay, boxed-answer scoring, summary accounting, and an end-to-end math vertical slice that generates a shared candidate pool and replays all selectors against it.

Validation run locally on the PR branch:
./infra/pre-commit.py --all-files --fix
PYTHONPATH=lib/fray/src:lib/iris/src:lib/levanter/src:lib/marin/src:lib/rigging/src:lib/zephyr/src /Users/vuductai/Documents/Projects/marin/.venv/bin/python -m pytest tests/test_time_scaling -q

The targeted `uv run pytest tests/test_time_scaling -m 'not slow' -q` invocation hit a pre-existing import-path mismatch in tests/conftest.py in the temporary worktree environment, so the final targeted validation used the repo's existing .venv with explicit source roots instead.

This lands a standalone test_time_scaling package so candidate generation, selector replay, and artifact accounting live outside benchmark-specific evaluation code. It also adds the first math-focused runner and regression tests so we can iterate on sample-only reasoning TTS before moving on to code and verifier stages.
@taivu1998 taivu1998 force-pushed the codex/tts-pr1-math branch from d1d4d2b to 9a0c247 Compare April 22, 2026 17:45
