LLM Test-Time Scaling

This is the code repository for the paper ROC-n-reroll: How verifier imperfection affects test-time scaling (ICLR 2026).

This repository contains code to reproduce test-time scaling (TTS) results using Best-of-N and Rejection Sampling.

Repository Purpose

This codebase supports three workflows:

Generate answers for a specific question with configurable generator parameters.
Verify saved generations with a verifier model.
Evaluate and plot best-of-n and rejection-sampling results from saved generations and verifications.

All core logic lives in src/llm_tts/. The scripts in src/scripts/ are thin CLI wrappers.

Supported Datasets

GSM8K (gsm8k)
MATH500 (math500)
GPQA (gpqa_main_cot_n_shot)

Installation

git clone https://github.com/socialfoundations/roc-n-reroll.git && cd roc-n-reroll
pip install .          # core dependencies
pip install ".[dev]"   # adds matplotlib, seaborn (needed for plotting)

Scripts

Main script: run_generator_specific_question.py

Generate multiple completions for a single question.

Key flags:

--task, --question-idx
--generator, --num-samples

Main script: run_verifier_on_saved_generations.py

Score previously generated answers with a verifier model.

Key flags:

--generations-path
--verifier

Helper script: run_tts_on_single_question_saved_results.py

Use saved generation and verification outputs to compute best-of-n and/or rejection-sampling results for a single question.

Key flags:

--verified-path
--best-of-n and/or --verifier-threshold
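
The two selection rules behind --best-of-n and --verifier-threshold can be sketched as follows. This is a minimal illustration, not the repository's implementation. It assumes each saved generation reduces to a hypothetical (score, is_correct) pair, that best-of-n keeps the highest-scoring of the first n samples, and that rejection sampling accepts the first sample whose score reaches the threshold. The function names are made up for this example; the actual JSON layout is described in docs/data-formats.md.

```python
from typing import List, Optional, Tuple

# Hypothetical (verifier_score, is_correct) pairs; the repository's real
# JSON schema is documented in docs/data-formats.md and is not assumed here.
Sample = Tuple[float, bool]


def best_of_n(samples: List[Sample], n: int) -> Sample:
    """Return the highest-scoring sample among the first n."""
    if n < 1 or n > len(samples):
        raise ValueError("n must be between 1 and len(samples)")
    return max(samples[:n], key=lambda s: s[0])


def rejection_sample(samples: List[Sample], threshold: float) -> Optional[Sample]:
    """Return the first sample whose score reaches the threshold, else None."""
    for sample in samples:
        if sample[0] >= threshold:
            return sample
    return None


samples = [(0.40, False), (0.97, True), (0.90, False)]
print(best_of_n(samples, 3))            # (0.97, True)
print(rejection_sample(samples, 0.95))  # (0.97, True)
print(rejection_sample(samples, 0.99))  # None: no sample is accepted
```

With only a few samples, a high threshold can reject everything, which is why the rejection-sampling sketch returns None rather than falling back to the best available sample.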

Helper plotting script: plot_tts_using_saved_results.py

Plot evaluation outputs (ROC, BoN, RS) from saved verified results.

Key flags:

--results-dir
--task, --question-idx, --generator
--output-dir
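
For readers unfamiliar with the ROC curve named above, the sketch below shows how such a curve can be computed from verifier scores and correctness labels by sweeping the acceptance threshold. It is a plain-Python illustration under the assumption that higher scores mean "more likely correct"; the function names are hypothetical and the plotting script may compute its curves differently.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[bool]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by sweeping the acceptance threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one correct and one incorrect sample")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is accepted
    for threshold in sorted(set(scores), reverse=True):
        accepted = [label for score, label in zip(scores, labels) if score >= threshold]
        true_pos = sum(accepted)
        false_pos = len(accepted) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.3]
labels = [True, False, True, False]
points = roc_points(scores, labels)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```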

Examples

1) Generate 3 samples for a single GSM8K question

python src/scripts/run_generator_specific_question.py \
  --task gsm8k \
  --question-idx 7 \
  --generator Qwen/Qwen3-1.7B \
  --num-samples 3 \
  --results-dir results \
  --models-dir ~/hf_models \
  --chat-template

This produces results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42.json.

2) Verify saved generations

python src/scripts/run_verifier_on_saved_generations.py \
  --generations-path results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42.json \
  --verifier Qwen/Qwen3-4B \
  --results-dir results \
  --models-dir ~/hf_models \
  --chat-template

This produces results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42_verified-by-qwen3_4b_scoring_1x_seed-42.json.
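
The file names in steps 1 and 2 appear to follow a regular pattern, and the short model name in them (qwen3_1_7b) seems to be derived from the Hugging Face model ID (Qwen/Qwen3-1.7B). The sketch below reconstructs that pattern purely from the examples shown here. The helper names, the slug rule, and the default seed of 42 are assumptions rather than the repository's code; docs/data-formats.md and docs/cli.md are the authoritative references.

```python
import re


def model_slug(model_id: str) -> str:
    """Hypothetical helper: 'Qwen/Qwen3-1.7B' -> 'qwen3_1_7b'."""
    name = model_id.split("/")[-1].lower()
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", name).strip("_")


def generations_filename(task: str, model_id: str, num_samples: int,
                         question_idx: int, seed: int = 42) -> str:
    """Rebuild the generation file name pattern seen in the examples.

    The default seed of 42 is inferred from the example file names.
    """
    return (f"{task}_{model_slug(model_id)}_{num_samples}"
            f"-generations-for-question-idx-{question_idx}_seed-{seed}.json")


print(model_slug("Qwen/Qwen3-4B"))
# qwen3_4b
print(generations_filename("gsm8k", "Qwen/Qwen3-1.7B", 3, 7))
# gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42.json
```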

3) Best-of-n and rejection sampling from saved results

python src/scripts/run_tts_on_single_question_saved_results.py \
  --verified-path results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42_verified-by-qwen3_4b_scoring_1x_seed-42.json \
  --best-of-n 3 \
  --verifier-threshold 0.95 \
  --results-dir results

4) TTS evaluation plots (using tracked example results)

The tracked results under results/ have 10,000 generations and multiple verifiers, which is enough data to produce meaningful plots:

python src/scripts/plot_tts_using_saved_results.py \
  --results-dir results \
  --task gsm8k \
  --question-idx 7 \
  --generator qwen3_1_7b \
  --output-dir imgs \
  --use-labels
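
One way to see why a large pool of generations matters: best-of-n accuracy at a given n can be estimated by repeatedly drawing n samples from the pool and checking whether the top-scored one is correct. The sketch below is a Monte Carlo illustration under that assumption, using hypothetical (score, is_correct) pairs and a made-up function name; it is not necessarily how the plotting script computes its BoN curves.

```python
import random
from typing import List, Tuple

# Hypothetical (verifier_score, is_correct) pairs, not the repository's schema.
Sample = Tuple[float, bool]


def estimate_bon_accuracy(samples: List[Sample], n: int,
                          trials: int = 1000, seed: int = 0) -> float:
    """Estimate best-of-n accuracy by drawing n samples without replacement."""
    if n < 1 or n > len(samples):
        raise ValueError("n must be between 1 and len(samples)")
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    hits = 0
    for _ in range(trials):
        subset = rng.sample(samples, n)
        best = max(subset, key=lambda s: s[0])
        hits += best[1]  # True counts as 1, False as 0
    return hits / trials


pool = [(0.2, False), (0.9, True), (0.5, False), (0.7, True)]
for n in range(1, len(pool) + 1):
    print(n, estimate_bon_accuracy(pool, n))
```

In this toy pool the two highest-scored samples are correct, so the estimate reaches 1.0 once n is 3 or more. With an imperfect verifier, where some incorrect samples score highly, the curve can plateau below 1.0.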

Example Results

Tracked results under results/ cover GSM8K question indices 2 and 7 for generators qwen3_1_7b and qwen3_4b, each verified by five Qwen3 verifier sizes (1.7B, 4B, 8B, 14B, 32B). Generate any additional outputs using the CLI scripts.

Plotting requires matplotlib and seaborn (install via pip install ".[dev]" or requirements/dev.txt).

Docs

docs/cli.md -- CLI reference
docs/data-formats.md -- JSON data formats
docs/repro.md -- Reproduction guide
docs/example_end_to_end_commands.md -- End-to-end examples

Citation

If you find this code useful, please cite:

@inproceedings{
dorner2026rocnreroll,
title={{ROC}-n-reroll: How verifier imperfection affects test-time scaling},
author={Florian E. Dorner and Yatong Chen and Andr{\'e} F Cruz and Fanny Yang},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=3Gy5mmyuxn}
}
