socialfoundations/roc-n-reroll

LLM Test-Time Scaling

This is the code repository for the paper ROC-n-reroll: How verifier imperfection affects test-time scaling (ICLR 2026).

This repository contains code to reproduce test-time scaling (TTS) results using Best-of-N and Rejection Sampling.

Repository Purpose

This codebase supports three workflows:

  1. Generate answers for a specific question with configurable generator parameters.
  2. Verify saved generations with a verifier model.
  3. Evaluate & plot best-of-n and rejection-sampling results from saved generations/verifications.

All core logic lives in src/llm_tts/. The scripts in src/scripts/ are thin CLI wrappers.

Supported Datasets

  • GSM8K (gsm8k)
  • MATH500 (math500)
  • GPQA (gpqa_main_cot_n_shot)

Installation

git clone https://github.com/socialfoundations/roc-n-reroll.git && cd roc-n-reroll
pip install .          # core dependencies
pip install ".[dev]"   # adds matplotlib, seaborn (needed for plotting)

Scripts

Main script: run_generator_specific_question.py

Generate multiple completions for a single question.

Key flags:

  • --task, --question-idx
  • --generator, --num-samples

Main script: run_verifier_on_saved_generations.py

Score previously generated answers with a verifier model.

Key flags:

  • --generations-path
  • --verifier

Helper script: run_tts_on_single_question_saved_results.py

Use saved generation and verification outputs to compute best-of-n or rejection-sampling results for a single question.

Key flags:

  • --verified-path
  • --best-of-n and/or --verifier-threshold

Helper plotting script: plot_tts_using_saved_results.py

Plot evaluation outputs (ROC, BoN, RS) from saved verified results.

Key flags:

  • --results-dir
  • --task, --question-idx, --generator
  • --output-dir

Examples

1) Generate 3 samples for a single GSM8K question

python src/scripts/run_generator_specific_question.py \
  --task gsm8k \
  --question-idx 7 \
  --generator Qwen/Qwen3-1.7B \
  --num-samples 3 \
  --results-dir results \
  --models-dir ~/hf_models \
  --chat-template

This produces results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42.json.
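
To locate this file from another script, the name can be rebuilt from the command-line arguments. The sketch below is inferred only from the example filename above and is not taken from the repository: `model_slug` and `generations_path` are hypothetical helpers, and the default seed of 42 is an assumption based on the `seed-42` suffix.

```python
from pathlib import Path


def model_slug(model_id: str) -> str:
    """Guess the filename slug for a model id, e.g. 'Qwen/Qwen3-1.7B' -> 'qwen3_1_7b'.

    Assumption: the organisation prefix is dropped, the name is lowercased,
    and '-' and '.' become '_'. The repository may do this differently.
    """
    name = model_id.split("/")[-1].lower()
    return name.replace("-", "_").replace(".", "_")


def generations_path(results_dir: str, task: str, model_id: str,
                     num_samples: int, question_idx: int, seed: int = 42) -> Path:
    """Build the expected path of a saved generations file (hypothetical helper)."""
    filename = (
        f"{task}_{model_slug(model_id)}_{num_samples}"
        f"-generations-for-question-idx-{question_idx}_seed-{seed}.json"
    )
    return Path(results_dir) / filename


path = generations_path("results", "gsm8k", "Qwen/Qwen3-1.7B", 3, 7)
print(path.as_posix())
```

If the printed path does not match the file written by the generator script, prefer the path reported by the script itself.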

2) Verify saved generations

python src/scripts/run_verifier_on_saved_generations.py \
  --generations-path results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42.json \
  --verifier Qwen/Qwen3-4B \
  --results-dir results \
  --models-dir ~/hf_models \
  --chat-template

This produces results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42_verified-by-qwen3_4b_scoring_1x_seed-42.json.

3) Best-of-n and rejection sampling from saved results

python src/scripts/run_tts_on_single_question_saved_results.py \
  --verified-path results/gsm8k_qwen3_1_7b_3-generations-for-question-idx-7_seed-42_verified-by-qwen3_4b_scoring_1x_seed-42.json \
  --best-of-n 3 \
  --verifier-threshold 0.95 \
  --results-dir results
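
For orientation, the two selection rules used here can be sketched in a few lines of Python. This is a minimal illustration, not the implementation in src/llm_tts/: the function names are hypothetical, and it assumes each saved sample comes with a verifier score and a ground-truth correctness label. How the repository handles the case where no sample passes the threshold is not stated, so the sketch simply returns `None`.

```python
from typing import Optional, Sequence, Tuple


def best_of_n(scores: Sequence[float], correct: Sequence[bool], n: int) -> bool:
    """Best-of-n: among the first n samples, pick the one with the highest
    verifier score and report whether that sample is correct."""
    if not 1 <= n <= len(scores):
        raise ValueError("n must be between 1 and the number of samples")
    best_idx = max(range(n), key=lambda i: scores[i])
    return correct[best_idx]


def rejection_sampling(scores: Sequence[float], correct: Sequence[bool],
                       threshold: float) -> Tuple[Optional[bool], int]:
    """Rejection sampling: walk through samples in order and accept the first
    one whose verifier score reaches the threshold.

    Returns (correctness of accepted sample, number of samples drawn), or
    (None, number of samples) if no sample is accepted.
    """
    for draws, (score, is_correct) in enumerate(zip(scores, correct), start=1):
        if score >= threshold:
            return is_correct, draws
    return None, len(scores)


# Toy data for three samples (made up for illustration).
scores = [0.40, 0.97, 0.90]
correct = [False, True, False]

print(best_of_n(scores, correct, n=3))                     # picks the sample scored 0.97
print(rejection_sampling(scores, correct, threshold=0.95))  # accepts the second sample
```

Best-of-n fixes the compute budget in advance, while rejection sampling uses a variable number of draws that depends on the threshold and the verifier's score distribution.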

4) TTS evaluation plots (using tracked example results)

The tracked results under results/ have 10,000 generations and multiple verifiers, which is enough data to produce meaningful plots:

python src/scripts/plot_tts_using_saved_results.py \
  --results-dir results \
  --task gsm8k \
  --question-idx 7 \
  --generator qwen3_1_7b \
  --output-dir imgs \
  --use-labels
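
As a reminder of what the ROC plot represents, the sketch below computes ROC points and the area under the curve from verifier scores and correctness labels. It is a plain-Python illustration under the assumption that each sample has one score and one binary label; the function names are hypothetical and the plotting script may compute these quantities differently.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], correct: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per distinct
    score threshold, starting from (0, 0). A sample counts as accepted when
    its score is at least the threshold."""
    num_pos = sum(1 for c in correct if c)
    num_neg = len(correct) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("need at least one correct and one incorrect sample")
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, c in zip(scores, correct) if c and s >= t)
        fp = sum(1 for s, c in zip(scores, correct) if not c and s >= t)
        points.append((fp / num_neg, tp / num_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (made up for illustration): two correct and two incorrect answers.
scores = [0.9, 0.8, 0.7, 0.3]
correct = [True, False, True, False]

points = roc_points(scores, correct)
print(points)
print(auc(points))  # 0.75: three of the four correct/incorrect pairs are ranked correctly
```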

Example Results

Tracked results under results/ cover GSM8K question indices 2 and 7 for generators qwen3_1_7b and qwen3_4b, each verified by five Qwen3 verifier sizes (1.7B, 4B, 8B, 14B, 32B). Generate any additional outputs using the CLI scripts.

Plotting requires matplotlib and seaborn, which the dev extras provide (pip install ".[dev]", or install from requirements/dev.txt).

Citation

If you find this code useful, please cite:

@inproceedings{
dorner2026rocnreroll,
title={{ROC}-n-reroll: How verifier imperfection affects test-time scaling},
author={Florian E. Dorner and Yatong Chen and Andr{\'e} F Cruz and Fanny Yang},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=3Gy5mmyuxn}
}
