This repository contains the complete artifacts for the paper "Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking" and describes how to reproduce or extend its results.
UPDATE (03-31-25): The larger artifacts, such as raw model responses and judgments, can now be found in the HuggingFace repository associated with this paper. We have also added detailed SOS-Bench results for hundreds of 8B models. Last but not least, we have added a HuggingFace collection with all of the models we trained for this paper.
The table below lists every benchmark in SOS-Bench, along with the codebase needed to run it. Each benchmark contributes to one of three factors: world knowledge (WK), instruction following (IF), or safety; a machine-readable version of this factor assignment is given right after the table. Further below, we describe how to work with each codebase.
| Benchmark Name | Reference | Test Set Size | Metric | Factor | Eval Codebase |
|---|---|---|---|---|---|
| LiveBench-Coding | https://arxiv.org/abs/2406.19314 | 130 | Exact Match Acc | WK | LiveBench |
| LiveBench-Data Analysis | https://arxiv.org/abs/2406.19314 | 150 | Exact Match Acc | WK | LiveBench |
| LiveBench-Instruction Following | https://arxiv.org/abs/2406.19314 | 200 | Exact Match Acc | IF | LiveBench |
| LiveBench-Language | https://arxiv.org/abs/2406.19314 | 140 | Exact Match Acc | WK | LiveBench |
| LiveBench-Math | https://arxiv.org/abs/2406.19314 | 230 | Exact Match Acc | WK | LiveBench |
| LiveBench-Reasoning | https://arxiv.org/abs/2406.19314 | 150 | Exact Match Acc | WK | LiveBench |
| IFEval | https://arxiv.org/abs/2311.07911 | 540 | Avg of Custom Metrics | IF | Eleuther |
| MATH Lvl 5 | https://arxiv.org/abs/2103.03874 | 1000 | Exact Match Acc | WK | Eleuther |
| MuSR | https://arxiv.org/abs/2310.16049 | 750 | Acc | WK | Eleuther |
| GPQA | https://arxiv.org/abs/2311.12022 | 1250 | Acc | WK | Eleuther |
| MMLU-Pro | https://arxiv.org/abs/2406.01574 | 12000 | Acc | WK | Eleuther |
| BBH | https://arxiv.org/abs/2210.09261 | 6750 | Acc | WK | Eleuther |
| BeaverTails | https://arxiv.org/abs/2307.04657 | 1400 | Acc | Safety | Eleuther |
| CDNA | https://huggingface.co/datasets/walledai/CDNA | 2730 | Acc | Safety | Eleuther |
| DTToxicity | https://huggingface.co/datasets/walledai/DTToxicity | 4800 | Acc | Safety | Eleuther |
| JailbreakHub | https://arxiv.org/abs/2308.03825 | 15100 | Acc | Safety | Eleuther |
| BBQ | https://arxiv.org/abs/2110.08193 | 58500 | Acc | Safety | Eleuther |
| WMDP | https://arxiv.org/abs/2403.03218 | 3670 | Inverse Acc | Safety | Eleuther |
| XSTest | https://arxiv.org/abs/2308.01263 | 450 | Acc | Safety | Eleuther |
| WildGuardTest | https://arxiv.org/abs/2406.18495 | 1730 | Acc | Safety | Eleuther |
| Toxigen | https://arxiv.org/abs/2203.09509 | 9900 | Acc | Safety | Eleuther |
| StrongREJECT | https://arxiv.org/abs/2402.10260 | 310 | Acc | Safety | Eleuther |
| SGXSTest | https://huggingface.co/datasets/walledai/SGXSTest | 100 | Acc | Safety | Eleuther |
| SaladBench | https://arxiv.org/abs/2402.05044 | 30400 | Acc | Safety | Eleuther |
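If you want to slice results by factor in code, the factor assignments from the table can be mirrored as a plain mapping. This is provided purely for convenience; the keys follow the table's benchmark names, not any particular harness's task identifiers.

```python
# Factor assignment for each SOS-Bench benchmark, mirroring the table above.
# WK = world knowledge, IF = instruction following.
BENCHMARK_FACTORS = {
    "LiveBench-Coding": "WK",
    "LiveBench-Data Analysis": "WK",
    "LiveBench-Instruction Following": "IF",
    "LiveBench-Language": "WK",
    "LiveBench-Math": "WK",
    "LiveBench-Reasoning": "WK",
    "IFEval": "IF",
    "MATH Lvl 5": "WK",
    "MuSR": "WK",
    "GPQA": "WK",
    "MMLU-Pro": "WK",
    "BBH": "WK",
    "BeaverTails": "Safety",
    "CDNA": "Safety",
    "DTToxicity": "Safety",
    "JailbreakHub": "Safety",
    "BBQ": "Safety",
    "WMDP": "Safety",
    "XSTest": "Safety",
    "WildGuardTest": "Safety",
    "Toxigen": "Safety",
    "StrongREJECT": "Safety",
    "SaladBench": "Safety",
}

# Example: list the safety benchmarks.
safety_benchmarks = [name for name, factor in BENCHMARK_FACTORS.items() if factor == "Safety"]
print(safety_benchmarks)
```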
Here is a brief description of our result artifacts (a short loading sketch follows the list):

- `eleuther_wandb.csv`. Fields: Name (the dataset and preference optimization method used, if any), Date Created, Runtime, Github Link, GPU Count, GPU Type, Batch Size, Parameter Count, Random Seed, Raw Scores (normalized and non-normalized, with stderr).
- `arena_hard_auto.csv`. Fields: model (the dataset and preference optimization method used, if any), score, rating_q025, rating_q975, CI (the raw score and the bounds of the bootstrapped confidence interval).
- `livebench_groups.csv`, `livebench_tasks.csv`. Fields: model (the dataset and preference optimization method used, if any), scores (group-wise or task-wise, respectively).
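To work with these CSVs programmatically, a few lines of pandas are enough. The snippet below is a minimal sketch, assuming the files sit in your working directory and that the column names match the field lists above; adjust paths as needed.

```python
# Minimal sketch for loading the SOS-Bench result CSVs with pandas.
# Assumes the files are in the current working directory.
import pandas as pd

eleuther = pd.read_csv("eleuther_wandb.csv")            # Eleuther harness runs
arena = pd.read_csv("arena_hard_auto.csv")              # Arena-Hard-Auto judge scores
livebench_groups = pd.read_csv("livebench_groups.csv")  # LiveBench group-wise scores
livebench_tasks = pd.read_csv("livebench_tasks.csv")    # LiveBench task-wise scores

# Example: rank models by Arena-Hard-Auto score (column name taken from the
# field list above).
print(arena.sort_values("score", ascending=False).head(10))
```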
The entirety of SOS-Bench runs as a two-stage process: the first set of benchmarks is run with our fork of the EleutherAI LM Evaluation Harness, and the second set with the LiveBench codebase.
- Install the evaluation dependencies: `pip install lm_eval[wandb,vllm,math,ifeval] sentencepiece`, then run `python install_nltk_punkt.py`
- Git clone our Eleuther AI Harness fork, which contains the additional tasks
- `cd lm-evaluation-harness` and install the fork with `pip install -e .`
- Run the first stage (a programmatic alternative is sketched after the LiveBench steps below): `lm_eval --model hf --wandb_args project=<YOUR_PROJECT> --log_samples --output_path results --model_args pretrained=<YOUR_MODEL>,dtype=bfloat16 --tasks leaderboard,safety,bbq,wmdp --device cuda:0 --batch_size auto`
- Git clone the LiveBench repository
- Follow the instructions provided in the repository readme.
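If you prefer to drive the first (Eleuther harness) stage from Python rather than the CLI, the harness also exposes a programmatic entry point. The following is a minimal sketch, assuming the fork is installed with `pip install -e .` so the extra task groups used above (`leaderboard`, `safety`, `bbq`, `wmdp`) are registered, and that `<YOUR_MODEL>` is replaced with a real HuggingFace model id; unlike the CLI invocation, this does not log to Weights & Biases.

```python
# Minimal sketch: invoking the first (Eleuther harness) stage from Python
# instead of the CLI shown above.
import json

from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=<YOUR_MODEL>,dtype=bfloat16",
    tasks=["leaderboard", "safety", "bbq", "wmdp"],
    batch_size="auto",
    device="cuda:0",
)

# Per-task metrics (accuracy, stderr, etc.) live under the "results" key.
print(json.dumps(results["results"], indent=2, default=str))
```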
If you use SOS-Bench or these artifacts, please cite our paper:

@misc{feuer2024styleoutweighssubstancefailure,
title={Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking},
author={Benjamin Feuer and Micah Goldblum and Teresa Datta and Sanjana Nambiar and Raz Besaleli and Samuel Dooley and Max Cembalest and John P. Dickerson},
year={2024},
eprint={2409.15268},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2409.15268},
}
