Are We on the Right Way for Assessing LLM as a Judge?

Welcome to the repository!🌟 Here we provide the implementation of our paper 📚"Are We on the Right Way for Assessing LLM as a Judge?". Our work focuses on a novel metric that evaluates the performance of LLMs on judge tasks without human annotation, and on the pipeline for building a benchmark suited to that metric. We invite the community to engage with our findings and methodology.

Repository Overview:

Environment Setup - Instructions for setting up the environment needed to run our code.
Benchmark Curation - How we curated our benchmark, and how to build your own benchmark with our pipeline.
Run Benchmark - How to test a model on our benchmark, and how to run and compare against other related benchmarks.

Environment Setup

CUDA Dependencies

Our project does not require a specific version of the CUDA Toolkit. However, if you want to deploy a local LLM to curate or run the benchmark through vLLM, Ollama, or another serving framework, choose the CUDA version that your model and framework require.

Python Library Dependencies

Start by creating a conda environment:

conda create -n sage python=3.10
conda activate sage

We offer two ways to install the necessary Python packages:

conda

conda install -c conda-forge numpy scipy scikit-learn matplotlib pandas tqdm requests jupyter notebook ijson
conda install -c pytorch pytorch torchvision torchaudio pytorch-cuda=11.8
conda install -c huggingface transformers tokenizers
conda install -c conda-forge openai google-generativeai umap-learn

If you have a compatible NVIDIA GPU and want to use GPU-accelerated libraries:

conda install -c rapidsai -c nvidia -c conda-forge cuml cupy

pip

pip install -r requirement.txt
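After installing with either method, it can help to confirm that the packages are visible to the active environment. The following is a minimal sketch, not part of the repository: the helper name `missing_modules` and the module list are assumptions derived from the install commands above. It only checks top-level import names, so `google-generativeai` (imported as `google.generativeai`) is left out.

```python
from importlib.util import find_spec

# Import names for the packages installed above. Some import names differ
# from the install names (scikit-learn -> sklearn, umap-learn -> umap,
# pytorch -> torch). This list is an assumption, not taken from the repo.
REQUIRED_MODULES = [
    "numpy", "scipy", "sklearn", "matplotlib", "pandas", "tqdm",
    "requests", "ijson", "torch", "transformers", "tokenizers",
    "openai", "umap",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run the script inside the activated `sage` environment; any name it reports can be installed with the conda or pip commands above.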

Benchmark Curation

We draw questions from two distinct, high-quality sources. First, we extract questions from five core categories of the RewardBench2 dataset (Factuality, Focus, Precise Instruction Following, Mathematics, and Safety) to establish a foundation of structured evaluation problems. Second, we sample a large volume of queries from the WildChat-1m corpus according to their distribution on a t-SNE map. The questions from both sources are then merged to form the final set of 650 questions.

To complete the benchmark, each question needs candidate answers to compare. We offer three ways to generate answers from models: vLLM, the OpenRouter API, and the Gemini API. You can find them in sage/code/generate-datasets. After generating answers from the various models, combine them with sage/code/generate-datasets/combine-jsons.py.
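To illustrate the combining step, here is a minimal sketch of merging per-model answer files into one structure keyed by question. It is not the logic of combine-jsons.py: the function name `combine_answers` and the record schema (`question_id`, `answer`) are hypothetical, chosen only to show the idea.

```python
import json
from collections import defaultdict
from pathlib import Path


def combine_answers(model_files):
    """Merge per-model answer files into one mapping.

    model_files: dict of model name -> path to a JSON file holding a list
    of {"question_id": ..., "answer": ...} records (hypothetical schema).
    Returns {question_id: {model_name: answer}}.
    """
    combined = defaultdict(dict)
    for model_name, path in model_files.items():
        records = json.loads(Path(path).read_text(encoding="utf-8"))
        for record in records:
            combined[record["question_id"]][model_name] = record["answer"]
    return dict(combined)
```

Questions that a model did not answer simply lack that model's key, so downstream code can decide whether to skip or regenerate them.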

Run Benchmark

We have prepared our benchmark and the processed related benchmarks in sage/datasets. The code for running each benchmark is in sage/code: for example, sage/code/Sage-Easy runs Sage-Easy and sage/code/Reward-Bench-2 runs Reward-Bench-2. After getting the results, analyze them with the tools in sage/code/analyze-result to compute their IPI and TOV, as well as the average IPI and TOV. To run the error estimation experiment, use the code in sage/code/error-estimation and analyze the output with the multiturn tools in sage/code/analyze-result.
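The averaging step can be pictured with a short sketch. This is not the code in sage/code/analyze-result: it assumes, hypothetically, that each run has already been reduced to a dict with numeric `IPI` and `TOV` entries, and the helper name `average_metrics` is made up for illustration.

```python
from statistics import fmean


def average_metrics(runs, keys=("IPI", "TOV")):
    """Average each metric over a list of per-run result dicts.

    runs: list of dicts such as {"IPI": 0.25, "TOV": 1.0} (assumed format).
    """
    if not runs:
        raise ValueError("need at least one run to average")
    return {key: fmean(run[key] for run in runs) for key in keys}
```

For example, `average_metrics([{"IPI": 0.25, "TOV": 1.0}, {"IPI": 0.75, "TOV": 3.0}])` returns `{"IPI": 0.5, "TOV": 2.0}`.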

Thanks for taking the time and effort to check out our repository! We really appreciate your patience in looking through our code🥳!
