This is the code repository for the paper ROC-n-reroll: How verifier imperfection affects test-time scaling (ICLR 2026).
This repository contains code to reproduce test-time scaling (TTS) results using Best-of-N and Rejection Sampling.
This codebase supports three workflows:
- Generate answers for a specific question with configurable generator parameters.
- Verify saved generations with a verifier model.
- Evaluate & plot best-of-n and rejection-sampling results from saved generations/verifications.
All core logic lives in src/llm_tts/. The scripts in src/scripts/ are thin CLI wrappers.
Supported tasks:
- GSM8K (gsm8k)
- MATH500 (math500)
- GPQA (gpqa_main_cot_n_shot)
git clone https://github.com/socialfoundations/roc-n-reroll.git && cd roc-n-reroll
pip install . # core dependencies
pip install ".[dev]" # adds matplotlib, seaborn (needed for plotting)
Scripts
Generate multiple completions for a single question.
Key flags:
- --task, --question-idx
- --generator, --num-samples
Score previously generated answers with a verifier model.
Key flags:
- --generations-path
- --verifier
Use saved generate+verify outputs to compute best-of-n or rejection sampling for a single question.
Key flags:
- --verified-path
- --best-of-n and/or --verifier-threshold
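To make the two selection rules concrete, here is a minimal sketch of Best-of-N and rejection sampling over a list of verifier scores. It is an illustration under stated assumptions, not the code in src/llm_tts/: the function names are hypothetical, scores are assumed to be floats where higher means more likely correct, and an answer is assumed to be accepted when its score is at least the threshold.

```python
from typing import Optional, Sequence


def best_of_n(scores: Sequence[float], n: int) -> int:
    """Return the index of the highest-scoring answer among the first n."""
    if not 1 <= n <= len(scores):
        raise ValueError("n must be between 1 and len(scores)")
    # Ties resolve to the earliest index because max() keeps the first maximum.
    return max(range(n), key=lambda i: scores[i])


def rejection_sampling(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first answer whose score meets the threshold.

    Returns None if every answer is rejected (assumed behaviour).
    """
    for i, score in enumerate(scores):
        if score >= threshold:
            return i
    return None


# Hypothetical verifier scores for three generations of one question.
example_scores = [0.40, 0.97, 0.90]
bon_choice = best_of_n(example_scores, n=3)
rs_choice = rejection_sampling(example_scores, threshold=0.95)
```

Best-of-N always returns an answer after exactly n generations, while rejection sampling stops at the first accepted answer and may return nothing if the pool is exhausted.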
Plot evaluation outputs (ROC, BoN, RS) from saved verified results.
Key flags:
- --results-dir
- --task, --question-idx, --generator
- --output-dir
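For readers unfamiliar with verifier ROC curves, the following sketch shows one way to compute ROC points from verifier scores and correctness labels. It is a plain-Python illustration rather than the repository's plotting code: the function names are hypothetical, and it assumes an answer is accepted at threshold t when its score is at least t.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], correct: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(1 for c in correct if c)
    negatives = len(correct) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one correct and one incorrect answer")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is accepted
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, c in zip(scores, correct) if s >= t and c)
        fp = sum(1 for s, c in zip(scores, correct) if s >= t and not c)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores and labels for four generations.
example_points = roc_points([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
example_auc = auc(example_points)
```

Each point corresponds to one acceptance threshold, so sweeping --verifier-threshold traces out the same curve.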
Examples
python src/scripts/run_generator_specific_question.py \
--task gsm8k \
--question-idx 7 \
--generator Qwen/Qwen3-1.7B \
--num-samples 3 \
--results-dir results \
--models-dir ~/hf_models \
--chat-template
This produces results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42.json.
python src/scripts/run_verifier_on_saved_generations.py \
--generations-path results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42.json \
--verifier Qwen/Qwen3-4B \
--results-dir results \
--models-dir ~/hf_models \
--chat-template
This produces results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42_verified-by-qwen3_4b_scoring_1x_seed-42.json.
python src/scripts/run_tts_on_single_question_saved_results.py \
--verified-path results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42_verified-by-qwen3_4b_scoring_1x_seed-42.json \
--best-of-n 3 \
--verifier-threshold 0.95 \
--results-dir results
The tracked results under results/ have 10,000 generations and multiple verifiers,
which is enough data to produce meaningful plots:
python src/scripts/plot_tts_using_saved_results.py \
--results-dir results \
--task gsm8k \
--question-idx 7 \
--generator qwen3_1_7b \
--output-dir imgs \
--use-labels
Tracked results under results/ cover GSM8K question indices 2 and 7 for
generators qwen3_1_7b and qwen3_4b, each verified by five Qwen3 verifier
sizes (1.7B, 4B, 8B, 14B, 32B). Generate any additional outputs using the CLI
scripts.
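With a large pool of verified generations for one question, Best-of-N accuracy at a given N can be estimated by repeatedly drawing N generations from the pool and checking whether the top-scored one is correct. The sketch below illustrates that idea only. The function name, the sampling without replacement, and the in-memory lists of scores and labels are assumptions, and the repository's evaluation code may use a different estimator.

```python
import random
from typing import Sequence


def estimate_bon_accuracy(
    scores: Sequence[float],
    correct: Sequence[bool],
    n: int,
    trials: int = 1000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of Best-of-N accuracy from a fixed pool."""
    if not 1 <= n <= len(scores):
        raise ValueError("n must be between 1 and the pool size")
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    indices = range(len(scores))
    hits = 0
    for _ in range(trials):
        subset = rng.sample(indices, n)  # N generations, without replacement
        best = max(subset, key=lambda i: scores[i])
        hits += bool(correct[best])
    return hits / trials
```

Calling this for increasing n on the same pool gives one scaling curve per verifier.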
Plotting requires matplotlib and seaborn (install via requirements/dev.txt).
- docs/cli.md -- CLI reference
- docs/data-formats.md -- JSON data formats
- docs/repro.md -- Reproduction guide
- docs/example_end_to_end_commands.md -- End-to-end examples
If you find this code useful, please cite:
@inproceedings{
dorner2026rocnreroll,
title={{ROC}-n-reroll: How verifier imperfection affects test-time scaling},
author={Florian E. Dorner and Yatong Chen and Andr{\'e} F Cruz and Fanny Yang},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=3Gy5mmyuxn}
}