🎉 Accepted to ICML 2025! Read the paper on arXiv
Authors: Yilun Zhou*, Austin Xu*, Peifeng Wang, Caiming Xiong, Shafiq Joty
This repository contains the source code for the JETTS benchmark, introduced in the paper Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators.
We recommend installing this package in a fresh conda environment, as follows:
conda create -n jetts python=3.12 -y
conda activate jetts
pip install uv
git clone https://github.com/SalesforceAIResearch/jetts-benchmark
cd jetts-benchmark
uv pip install -e .
Alternatively, if you do not need to modify any source code, you can directly run `pip install uv; uv pip install jetts` in the command line after creating the conda environment.
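After either installation route, a quick import check confirms that the package is visible in your environment; the snippet below is only a sanity check and does not exercise any model or data dependencies.

```python
# Minimal installation check: succeeds only if the jetts package is importable.
import jetts

print("jetts installed at:", jetts.__file__)
```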
We have already generated and evaluated the responses of a set of generator models on our benchmark tasks, so that, except for critique-based refinement, you do not need to run any generator models yourself. These data are stored on Google Cloud and freely accessible to anyone, but you need the `gcloud` command-line tool to download them; follow the official gcloud installation guide to download and install it.
If you are working in this folder, we recommend creating a subfolder for these data files.
mkdir jetts_data
cd jetts_data
# data for reranking and refinement (143MB zipped, 650MB extracted)
gcloud storage cp gs://sfr-jetts-benchmark-data/reranking_and_refinement.tar.gz .
tar xzf reranking_and_refinement.tar.gz
rm reranking_and_refinement.tar.gz
# data for beam search (6.7GB zipped, 51GB extracted)
gcloud storage cp gs://sfr-jetts-benchmark-data/beam_search.tar.gz .
# this can take a while; to see a progress bar, you can use "pv beam_search.tar.gz | tar xz" after installing "pv"
tar xzf beam_search.tar.gz
rm beam_search.tar.gz
If everything works correctly, you should see the following folder structure:
jetts_data
├── beam_search
│   └── (more subfolders)
└── reranking_and_refinement
    └── (jsonl files)
Note: We are working on uploading the data files to Huggingface so that they can be downloaded automatically on the fly. Please stay tuned!
In `reranking_and_refinement`, each file is named `{dataset}_{generator_model}.jsonl` and contains the responses (up to 10 per query) generated by that model for that dataset.
In `beam_search`, each subfolder is named `{dataset}_{N}_{M}_{d}_{generator_model}` and contains the fully expanded beam search trees generated by that model for that dataset, with `0.jsonl` to `{L-1}.jsonl` corresponding to the `L` queries in the dataset. `N`, `M`, and `d` are the number of initial step samples, the beam width, and the max depth of the search tree, as detailed in Sec. 3.3 of the paper.
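Each data file is standard JSON Lines, so you can inspect it before running anything. The snippet below is a minimal sketch that prints the top-level keys of the first record; the file name is illustrative (substitute any `.jsonl` file you downloaded), and the exact schema is best discovered this way rather than assumed.

```python
# Peek at the first record of a reranking/refinement data file.
# The path below is a hypothetical example -- use any downloaded .jsonl file.
import json

path = "jetts_data/reranking_and_refinement/gsm8k_llama8b.jsonl"
with open(path) as f:
    first_record = json.loads(f.readline())
print(sorted(first_record.keys()))
```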
To benchmark a specific judge, JETTS supports two methods of defining a judge model instance:

- An OpenAI-compatible server (e.g., OpenAI models, Together AI models, and model servers launched by `vllm serve`).
- A `vllm.LLM` object, which is wrapped in `jetts.judge.vllm_judge.VllmJudge` (see the sketch after this list).
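For the second method, the wrapping step might look like the minimal sketch below. The `VllmJudge` constructor arguments are an assumption (the class is only referenced by its module path here), so consult `jetts.judge.vllm_judge` for the actual signature; the judge model name is likewise just an example from the table below.

```python
# Sketch only: the VllmJudge constructor call is an assumption, not the
# documented API -- check jetts.judge.vllm_judge for the real signature.
from vllm import LLM
from jetts.judge.vllm_judge import VllmJudge

llm = LLM(model="Skywork/Skywork-Critic-Llama-3.1-8B")  # any vllm-supported judge model
judge = VllmJudge(llm)  # hypothetical wrapping of the vllm.LLM object
```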
In the demos below, we use `vllm serve` to launch a model server for the first method. We provide a helper script to make this process easy:
python scripts/launch_judge.py --judge-model [JUDGE]
where `[JUDGE]` is one of the short or full names in the table below, or the Hugging Face model ID of any vllm-supported model.
| Short Name | Full Name |
|---|---|
| prom7b | prometheus-eval/prometheus-7b-v2.0 |
| sc8b | Skywork/Skywork-Critic-Llama-3.1-8B |
| ob8b | NCSOFT/Llama-3-OffsetBias-8B |
| thm8b | PKU-ONELab/Themis |
| prom8x7b | prometheus-eval/prometheus-8x7b-v2.0 |
| sc70b | Skywork/Skywork-Critic-Llama-3.1-70B |
| ste70b | facebook/Self-taught-evaluator-llama3.1-70B |
| llama8b | meta-llama/Llama-3.1-8B-Instruct |
The judge will be served on `localhost:8000`, as expected by the script for each task.
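Once the server is up, you can sanity-check it from Python with any OpenAI-compatible client. The snippet below assumes the helper script uses vLLM's default OpenAI-compatible settings (API under `/v1` on port 8000, no real API key required); it is only a connectivity check and is not part of the benchmark scripts.

```python
# Connectivity check against the judge server (assumes default vLLM OpenAI-compatible
# settings on localhost:8000; "EMPTY" is a placeholder API key accepted by vLLM).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model_id = client.models.list().data[0].id  # the judge model currently being served
reply = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
    max_tokens=10,
)
print(model_id, reply.choices[0].message.content)
```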
Note: Due to company policy, we are not able to release weights for the SFR-Judge family of models or provide API access. We hope to do so in the future and will update this instruction accordingly.
With the `reranking_and_refinement` data downloaded and the judge launched, reranking can be run with
python scripts/reranking.py --data-file [DATA_FILE]
where `[DATA_FILE]` is the path to one of the `.jsonl` files in the `reranking_and_refinement` data folder. The script automatically computes the performance at the end and writes a file containing the ranked responses for the dataset. The output file is named `{judge_model}_{data_file_name}_{reranking_method}.jsonl` and placed inside the folder specified by `--output-dir`, which defaults to the `outputs/reranking` folder. Please consult the script or run it with the `-h` flag for optional arguments to customize the run.
With the `beam_search` data downloaded and the judge launched, beam search can be run with
python scripts/beam_search.py --input-dir [INPUT_DIR]
where `[INPUT_DIR]` is the path to one of the subfolders inside the `beam_search` data folder. The script automatically computes the performance at the end and creates a folder with files `0.jsonl` to `{L-1}.jsonl` containing the beam search decisions for each tree. The result folder is named `{judge_model}_{input_dir_name}_{beam_selection_reranking_method}_{final_selection_reranking_method}.jsonl` and placed inside the folder specified by `--output-dir`, which defaults to the `outputs/beam_search` folder. Please consult the script or run it with the `-h` flag for optional arguments to customize the run.
In addition to downloading the `reranking_and_refinement` data and launching the judge, you also need to launch the generator, since we need to perform live response generation following judge critiques. The generator can be launched in a similar manner as the judge, with
python scripts/launch_generator.py --generator-name [GENERATOR]
where `[GENERATOR]` is one of the short or full names in the table below, or the Hugging Face model ID of any vllm-supported model.
| Short Name | Full Name |
|---|---|
| llama8b | meta-llama/Llama-3.1-8B-Instruct |
| llama70b | meta-llama/Llama-3.1-70B-Instruct |
| qwen32b | Qwen/Qwen2.5-32B-Instruct |
| qwen72b | Qwen/Qwen2.5-72B-Instruct |
The generator will be served on `localhost:8001`, as expected by the refinement script, which can be executed as
python scripts/refinement.py --data-file [DATA_FILE]
where `[DATA_FILE]` is the path to one of the `.jsonl` files in the `reranking_and_refinement` data folder. The script does not automatically compute the performance at the end, but it does write a file containing all refined responses in the final reranking order for the dataset. The output file is named `{judge_model}_{refiner_model}_{data_file_name}_{final_reranking_method}.jsonl` and placed inside the folder specified by `--output-dir`, which defaults to the `outputs/refinement` folder. Please consult the script or run it with the `-h` flag for optional arguments to customize the run.
Note: You do not need to run manual evaluation for reranking and beam search, as we have pre-computed the scores for all model responses (including every leaf node in the search trees) and saved them with the data files. You only need to run this evaluation for refinement, as those responses are freshly generated by the generator model.
In order not to interfere with the packages needed to run the actual benchmarking, we strongly recommend running the evaluation in a dedicated conda environment. Furthermore, since BigCodeBench requires many packages at relatively old versions (e.g., `numpy==1.21.2`, released in August 2021), its evaluation interferes with that of the other datasets. Thus, we recommend following the steps below to create two environments.
To run evaluations for CHAMP and AlpacaEval, you need to have `OPENAI_API_KEY` saved as an environment variable (e.g., `export OPENAI_API_KEY=[YOUR_API_KEY]`). For CHAMP, you also need to keep the generator server (i.e., `scripts/launch_generator.py`) running on port 8001 during evaluation; the judge server can be terminated.
Note: All evaluations need to be run inside the `jetts_eval` directory.
cd jetts_eval
conda create -n jetts-eval python=3.10 -y
conda activate jetts-eval
pip install uv
# You only need to run the line(s) below for the dataset(s) that you wish to evaluate
uv pip install math-verify[antlr4_13_2] # for GSM8k and MATH
uv pip install champ_dataset datasets openai # for CHAMP
uv pip install alpaca-eval # for AlpacaEval
uv pip install numpy absl-py langdetect nltk immutabledict # for IFEval
# run all lines below for HumanEval+ and MBPP+
cd local_code_eval
tar xzf evalplus.tar.gz
cd evalplus
uv pip install -e .
cd ../..
conda create -n jetts-eval-bcb python=3.10 -y
conda activate jetts-eval-bcb
pip install uv
cd local_code_eval
tar xzf bigcodebench.tar.gz
cd bigcodebench
uv pip install -e .
uv pip install -r requirements-eval.txt
cd ../..
After you have installed the necessary packages and activated the correct environment, you can run
python evaluate_refinement.py --refinement-output-file [REFINEMENT_OUTPUT_FILE]
where `[REFINEMENT_OUTPUT_FILE]` is the path to the output file generated by `scripts/refinement.py`. By default, the file name contains the dataset name. However, if you provide a custom output file name that does not contain the dataset name, you need to additionally specify `--dataset [DATASET]`, where `[DATASET]` is one of `[gsm8k, math, champ, humaneval, mbpp, bigcodebench, alpacaeval, ifeval]`. The program will print out the score at the end.
If you have any questions, feel free to open an issue or contact the authors at [email protected] and [email protected].
If you use this codebase in a paper, please cite this work as:
@article{zhou2025evaluating,
title={Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators},
author={Zhou, Yilun and Xu, Austin and Wang, Peifeng and Xiong, Caiming and Joty, Shafiq},
journal={arXiv preprint arXiv:2504.15253},
year={2025}
}