🚀 JETTS: Judge Evaluation for Test-Time-Scaling

🎉 Accepted to ICML 2025! Read the paper on arXiv: https://arxiv.org/abs/2504.15253

Authors: Yilun Zhou*, Austin Xu*, Peifeng Wang, Caiming Xiong, Shafiq Joty

This repository contains the source code for the JETTS benchmark, introduced in the paper Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators.

📋 Table of Contents

  🛠️ Setup
    📥 Install JETTS
    📦 Download Model Response Data
    🏁 Launch a Judge Model
  🎯 Running JETTS Tasks
    Response Reranking
    Step-Level Beam Search
    Critique-Based Refinement
  📊 Evaluating Refinement Result
  ❓ Questions?
  📝 Citation

🛠️ Setup

📥 Install JETTS

We recommend installing this package in a fresh conda environment, as follows:

conda create -n jetts python=3.12 -y
conda activate jetts
pip install uv
git clone https://github.com/SalesforceAIResearch/jetts-benchmark
cd jetts-benchmark
uv pip install -e .

Alternatively, if you do not need to modify any source code, you can skip cloning and directly run pip install uv; uv pip install jetts in the command line after creating and activating the conda environment.

📦 Download Model Response Data

We generate and evaluate responses from a set of generator models on our benchmark tasks, so that, except for critique-based refinement, you do not need to run any generator models yourself. These data are stored on Google Cloud and are freely accessible to anyone, but you need the gcloud command-line tool to download them; see Google's gcloud CLI installation instructions for download and setup.

If you are working in this folder, we recommend creating a subfolder for these data files.

mkdir jetts_data
cd jetts_data

# data for reranking and refinement (143MB zipped, 650MB extracted)
gcloud storage cp gs://sfr-jetts-benchmark-data/reranking_and_refinement.tar.gz .
tar xzf reranking_and_refinement.tar.gz
rm reranking_and_refinement.tar.gz

# data for beam search (6.7GB zipped, 51GB extracted)
gcloud storage cp gs://sfr-jetts-benchmark-data/beam_search.tar.gz .
# this can take a while; to see a progress bar, you can use "pv beam_search.tar.gz | tar xz" after installing "pv"
tar xzf beam_search.tar.gz
rm beam_search.tar.gz

If everything works correctly, you should see the following folder structure:

jetts_data
├─ beam_search
│  └─ (more subfolders)
└─ reranking_and_refinement
   └─ (jsonl files)

Note: We are working on uploading the data files to Hugging Face so that they can be downloaded automatically on the fly. Please stay tuned! 🚀

In reranking_and_refinement, each file, named {dataset}_{generator_model}.jsonl, contains the responses generated by one model for a particular dataset, with up to 10 responses per query.

In beam_search, each subfolder, named {dataset}_{N}_{M}_{d}_{generator_model}, contains the fully expanded beam search trees generated by one model for a particular dataset, with files 0.jsonl to {L-1}.jsonl corresponding to the L queries in the dataset. N, M, and d are the number of initial step samples, the beam width, and the maximum depth of the search tree, respectively, as detailed in Sec. 3.3 of the paper.
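
As a quick sanity check on the downloaded data, a minimal Python sketch such as the one below can tally records and trees. It assumes only the file and folder layout described above; the fields inside each .jsonl line are not documented here, so the script simply reports the top-level keys it finds.

import json
from pathlib import Path

data_root = Path("jetts_data")

# reranking_and_refinement: one {dataset}_{generator_model}.jsonl file per pair
rr_file = next((data_root / "reranking_and_refinement").glob("*.jsonl"))
records = [json.loads(line) for line in rr_file.open()]
print(f"{rr_file.name}: {len(records)} records; keys of first record: {sorted(records[0])}")

# beam_search: one subfolder per configuration, one numbered tree file per query
for subfolder in sorted((data_root / "beam_search").iterdir()):
    num_trees = len(list(subfolder.glob("*.jsonl")))
    print(f"{subfolder.name}: {num_trees} query trees")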

🏁 Launch a Judge Model

To benchmark a specific judge, JETTS supports two methods of defining a judge model instance:

  1. An OpenAI-compatible server (e.g., OpenAI models, Together AI models, and model servers launched by vllm serve).
  2. A vllm.LLM object, which is wrapped in jetts.judge.vllm_judge.VllmJudge (see the sketch below).
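
For the second method, the sketch below shows the intended shape of the wrapping; the VllmJudge constructor arguments shown are an assumption made for illustration, so check jetts/judge/vllm_judge.py for the actual signature.

from vllm import LLM
from jetts.judge.vllm_judge import VllmJudge

# Load any vllm-supported judge model as a vllm.LLM object.
llm = LLM(model="Skywork/Skywork-Critic-Llama-3.1-8B")

# Assumption: VllmJudge takes the vllm.LLM instance directly; the real
# constructor may require additional arguments (see jetts/judge/vllm_judge.py).
judge = VllmJudge(llm)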

In the demos below, we use vllm serve to launch a model server following the first method. We provide a helper script to make this process easy:

python scripts/launch_judge.py --judge-model [JUDGE]

where [JUDGE] is one of the short or full names in the table below, or the Hugging Face model ID of any vllm-supported model.

Short Name   Full Name
prom7b       prometheus-eval/prometheus-7b-v2.0
sc8b         Skywork/Skywork-Critic-Llama-3.1-8B
ob8b         NCSOFT/Llama-3-OffsetBias-8B
thm8b        PKU-ONELab/Themis
prom8x7b     prometheus-eval/prometheus-8x7b-v2.0
sc70b        Skywork/Skywork-Critic-Llama-3.1-70B
ste70b       facebook/Self-taught-evaluator-llama3.1-70B
llama8b      meta-llama/Llama-3.1-8B-Instruct

The judge will be served on localhost:8000, as expected by the script for each task.
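
Once the server is up, you can sanity-check it with any OpenAI-compatible client before starting a task. The snippet below is one way to do so; the placeholder API key works because vllm serve does not require one by default, and the model name is whatever the server was launched with rather than anything defined by JETTS.

from openai import OpenAI

# Point an OpenAI-compatible client at the locally served judge.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Discover the served model ID and send a trivial chat request.
model_id = client.models.list().data[0].id
reply = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5,
)
print(model_id, reply.choices[0].message.content)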

Note: Due to company policy, we are not able to release weights for the SFR-Judge family of models or provide API access. We hope to do so in the future and will update this instruction accordingly.

🎯 Running JETTS Tasks

Response Reranking

With the reranking_and_refinement data downloaded and a judge launched, reranking can be run with

python scripts/reranking.py --data-file [DATA_FILE]

where [DATA_FILE] is the path to one of the .jsonl files in the reranking_and_refinement data folder. The script automatically computes the performance at the end and writes a file containing the ranked responses for the dataset. The output file is named {judge_model}_{data_file_name}_{reranking_method}.jsonl and is placed inside the folder specified by --output-dir, which defaults to the outputs/reranking folder. Please consult the script or run it with the -h flag for optional arguments to customize the run.
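
If you rerank several data files in one sweep, a short tally like the sketch below, which assumes nothing beyond the .jsonl output layout described above, can confirm that every run finished and wrote its output:

from pathlib import Path

# Default --output-dir of scripts/reranking.py
output_dir = Path("outputs/reranking")
for output_file in sorted(output_dir.glob("*.jsonl")):
    num_records = sum(1 for _ in output_file.open())
    print(f"{output_file.name}: {num_records} records")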

Step-Level Beam Search

With the beam_search data downloaded and a judge launched, beam search can be run with

python scripts/beam_search.py --input-dir [INPUT_DIR]

where [INPUT_DIR] is the path to one of the subfolders inside the beam_search data folder. The script automatically computes the performance at the end and creates a result folder with files 0.jsonl to {L-1}.jsonl containing the beam search decisions for each tree. The result folder is named {judge_model}_{input_dir_name}_{beam_selection_reranking_method}_{final_selection_reranking_method} and is placed inside the folder specified by --output-dir, which defaults to the outputs/beam_search folder. Please consult the script or run it with the -h flag for optional arguments to customize the run.

Critique-Based Refinement

In addition to downloading the reranking_and_refinement data and launching the judge, you also need to launch a generator, since refinement performs live response generation following the judge's critiques. The generator can be launched in a similar manner to the judge, with

python scripts/launch_generator.py --generator-name [GENERATOR]

where [GENERATOR] is one of the short or full names in the table below, or the Hugging Face model ID of any vllm-supported model.

Short Name   Full Name
llama8b      meta-llama/Llama-3.1-8B-Instruct
llama70b     meta-llama/Llama-3.1-70B-Instruct
qwen32b      Qwen/Qwen2.5-32B-Instruct
qwen72b      Qwen/Qwen2.5-72B-Instruct

The generator will be served on localhost:8001, as expected by the refinement script, which can be executed as

python scripts/refinement.py --data-file [DATA_FILE]

where [DATA_FILE] is the path to one of the .jsonl files in the reranking_and_refinement data folder. The script does not automatically compute the performance at the end, but it does write a file containing all refined responses in the final reranking order for the dataset. The output file is named {judge_model}_{refiner_model}_{data_file_name}_{final_reranking_method}.jsonl and is placed inside the folder specified by --output-dir, which defaults to the outputs/refinement folder. Please consult the script or run it with the -h flag for optional arguments to customize the run.

📊 Evaluating Refinement Result

Note: You do not need to run manual evaluation for reranking and beam search, as we have pre-computed the scores for all model responses (including every leaf node in the search trees) and saved them with the data files. You only need to run this evaluation for refinement, since its responses are freshly generated by the generator model.

In order not to interfere with the packages needed to run the actual benchmarking, we strongly recommend running the evaluation in a dedicated conda environment. Furthermore, since BigCodeBench requires relatively old versions of many packages (e.g., numpy==1.21.2, released in August 2021), its evaluation interferes with that of the other datasets. Thus, we recommend following the steps below to create two environments.

To run evaluations for CHAMP and AlpacaEval, you need to have OPENAI_API_KEY saved as an environment variable (e.g., export OPENAI_API_KEY=[YOUR_API_KEY]). For CHAMP, you also need to keep the generator server (i.e., scripts/launch_generator.py) running at port 8001 during evaluation; the judge server can be terminated.

Note: All evaluations need to be run inside the jetts_eval directory.

Everything except for BigCodeBench

cd jetts_eval
conda create -n jetts-eval python=3.10 -y
conda activate jetts-eval
pip install uv

# You only need to run the line(s) below for the dataset(s) that you wish to evaluate
uv pip install math-verify[antlr4_13_2]  # for GSM8k and MATH
uv pip install champ_dataset datasets openai  # for CHAMP
uv pip install alpaca-eval  # for AlpacaEval
uv pip install numpy absl-py langdetect nltk immutabledict  # for IFEval

# run all lines below for HumanEval+ and MBPP+
cd local_code_eval
tar xzf evalplus.tar.gz
cd evalplus
uv pip install -e .
cd ../..

BigCodeBench

conda create -n jetts-eval-bcb python=3.10 -y
conda activate jetts-eval-bcb
pip install uv
cd local_code_eval
tar xzf bigcodebench.tar.gz
cd bigcodebench
uv pip install -e .
uv pip install -r requirements-eval.txt
cd ../..

Running the evaluation

After you have installed the necessary packages and activated the correct environment, you can run

python evaluate_refinement.py --refinement-output-file [REFINEMENT_OUTPUT_FILE]

where [REFINEMENT_OUTPUT_FILE] is the path to the output file generated by scripts/refinement.py. By default, the file name contains the dataset name. However, if you provide a custom output file without the dataset name in it, you need to additionally specify --dataset [DATASET] from the list [gsm8k, math, champ, humaneval, mbpp, bigcodebench, alpacaeval, ifeval]. The program will print out the score at the end.

❓ Questions?

If you have any questions, feel free to open an issue or contact the authors at [email protected] and [email protected].

πŸ“ Citation

If you use this codebase in a paper, please cite this work as

@article{zhou2025evaluating,
  title={Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators},
  author={Zhou, Yilun and Xu, Austin and Wang, Peifeng and Xiong, Caiming and Joty, Shafiq},
  journal={arXiv preprint arXiv:2504.15253},
  year={2025}
}
