This repository contains the complete artifacts for the paper "Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking" and describes how to reproduce or extend its results.
UPDATE (03-31-25): The larger artifacts, such as raw model responses and judgments, can now be found in the HuggingFace repository associated with this paper. We have also added detailed SOS-Bench results for hundreds of 8B models. Last but not least, we have added a HuggingFace collection with all of the models we trained for this paper.
The table below lists every benchmark in SOS-Bench, along with the codebase needed to run it. Each benchmark contributes to one of three factors: world knowledge (WK), instruction following (IF), or safety; a machine-readable version of this factor assignment is given right after the table. Further below, we describe how to work with each codebase.
| Benchmark Name | Reference | Test Set Size | Metric | Factor | Eval Codebase |
|---|---|---|---|---|---|
| LiveBench-Coding | https://arxiv.org/abs/2406.19314 | 130 | Exact Match Acc | WK | LiveBench |
| LiveBench-Data Analysis | https://arxiv.org/abs/2406.19314 | 150 | Exact Match Acc | WK | LiveBench |
| LiveBench-Instruction Following | https://arxiv.org/abs/2406.19314 | 200 | Exact Match Acc | IF | LiveBench |
| LiveBench-Language | https://arxiv.org/abs/2406.19314 | 140 | Exact Match Acc | WK | LiveBench |
| LiveBench-Math | https://arxiv.org/abs/2406.19314 | 230 | Exact Match Acc | WK | LiveBench |
| LiveBench-Reasoning | https://arxiv.org/abs/2406.19314 | 150 | Exact Match Acc | WK | LiveBench |
| IFEval | https://arxiv.org/abs/2311.07911 | 540 | Avg of Custom Metrics | IF | Eleuther |
| MATH Lvl 5 | https://arxiv.org/abs/2103.03874 | 1000 | Exact Match Acc | WK | Eleuther |
| MuSR | https://arxiv.org/abs/2310.16049 | 750 | Acc | WK | Eleuther |
| GPQA | https://arxiv.org/abs/2311.12022 | 1250 | Acc | WK | Eleuther |
| MMLU-Pro | https://arxiv.org/abs/2406.01574 | 12000 | Acc | WK | Eleuther |
| BBH | https://arxiv.org/abs/2210.09261 | 6750 | Acc | WK | Eleuther |
| BeaverTails | https://arxiv.org/abs/2307.04657 | 1400 | Acc | Safety | Eleuther |
| CDNA | https://huggingface.co/datasets/walledai/CDNA | 2730 | Acc | Safety | Eleuther |
| DTToxicity | https://huggingface.co/datasets/walledai/DTToxicity | 4800 | Acc | Safety | Eleuther |
| JailbreakHub | https://arxiv.org/abs/2308.03825 | 15100 | Acc | Safety | Eleuther |
| BBQ | https://arxiv.org/abs/2110.08193 | 58500 | Acc | Safety | Eleuther |
| WMDP | https://arxiv.org/abs/2403.03218 | 3670 | Inverse Acc | Safety | Eleuther |
| XSTest | https://arxiv.org/abs/2308.01263 | 450 | Acc | Safety | Eleuther |
| WildGuardTest | https://arxiv.org/abs/2406.18495 | 1730 | Acc | Safety | Eleuther |
| Toxigen | https://arxiv.org/abs/2203.09509 | 9900 | Acc | Safety | Eleuther |
| StrongREJECT | https://arxiv.org/abs/2402.10260 | 310 | Acc | Safety | Eleuther |
| SGXSTest | https://huggingface.co/datasets/walledai/SGXSTest | 100 | Acc | Safety | Eleuther |
| SaladBench | https://arxiv.org/abs/2402.05044 | 30400 | Acc | Safety | Eleuther |
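If you want to slice results by factor in code, the factor assignments from the table can be mirrored as a plain mapping. This is provided purely for convenience; the keys follow the table's benchmark names, not any particular harness's task identifiers.

```python
# Factor assignment for each SOS-Bench benchmark, mirroring the table above.
# WK = world knowledge, IF = instruction following.
BENCHMARK_FACTORS = {
    "LiveBench-Coding": "WK",
    "LiveBench-Data Analysis": "WK",
    "LiveBench-Instruction Following": "IF",
    "LiveBench-Language": "WK",
    "LiveBench-Math": "WK",
    "LiveBench-Reasoning": "WK",
    "IFEval": "IF",
    "MATH Lvl 5": "WK",
    "MuSR": "WK",
    "GPQA": "WK",
    "MMLU-Pro": "WK",
    "BBH": "WK",
    "BeaverTails": "Safety",
    "CDNA": "Safety",
    "DTToxicity": "Safety",
    "JailbreakHub": "Safety",
    "BBQ": "Safety",
    "WMDP": "Safety",
    "XSTest": "Safety",
    "WildGuardTest": "Safety",
    "Toxigen": "Safety",
    "StrongREJECT": "Safety",
    "SaladBench": "Safety",
}

# Example: list the safety benchmarks.
safety_benchmarks = [name for name, factor in BENCHMARK_FACTORS.items() if factor == "Safety"]
print(safety_benchmarks)
```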
Here is a brief description of our result artifacts (a short loading sketch follows the list):

- `eleuther_wandb.csv`. Fields: Name (the dataset and preference optimization method used, if any), Date Created, Runtime, Github Link, GPU Count, GPU Type, Batch Size, Parameter Count, Random Seed, Raw Scores (normalized and non-normalized, with stderr).
- `arena_hard_auto.csv`. Fields: model (the dataset and preference optimization method used, if any), score, rating_q025, rating_q975, CI (the raw score and the bounds of the bootstrapped confidence interval).
- `livebench_groups.csv`, `livebench_tasks.csv`. Fields: model (the dataset and preference optimization method used, if any), scores (group-wise or task-wise, respectively).
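To work with these CSVs programmatically, a few lines of pandas are enough. The snippet below is a minimal sketch, assuming the files sit in your working directory and that the column names match the field lists above; adjust paths as needed.

```python
# Minimal sketch for loading the SOS-Bench result CSVs with pandas.
# Assumes the files are in the current working directory.
import pandas as pd

eleuther = pd.read_csv("eleuther_wandb.csv")            # Eleuther harness runs
arena = pd.read_csv("arena_hard_auto.csv")              # Arena-Hard-Auto judge scores
livebench_groups = pd.read_csv("livebench_groups.csv")  # LiveBench group-wise scores
livebench_tasks = pd.read_csv("livebench_tasks.csv")    # LiveBench task-wise scores

# Example: rank models by Arena-Hard-Auto score (column name taken from the
# field list above).
print(arena.sort_values("score", ascending=False).head(10))
```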
The entirety of SOS-Bench runs as a two-stage process: the first set of benchmarks is run with our fork of the EleutherAI LM Evaluation Harness, and the second set with the LiveBench codebase.
- Install the evaluation dependencies: `pip install lm_eval[wandb,vllm,math,ifeval] sentencepiece`, then run `python install_nltk_punkt.py`
- Git clone our Eleuther AI Harness fork, which contains the additional tasks
- `cd lm-evaluation-harness` and install the fork with `pip install -e .`
- Run the first stage (a programmatic alternative is sketched after the LiveBench steps below): `lm_eval --model hf --wandb_args project=<YOUR_PROJECT> --log_samples --output_path results --model_args pretrained=<YOUR_MODEL>,dtype=bfloat16 --tasks leaderboard,safety,bbq,wmdp --device cuda:0 --batch_size auto`
- Git clone the LiveBench repository
- Follow the instructions provided in the repository readme.
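If you prefer to drive the first (Eleuther harness) stage from Python rather than the CLI, the harness also exposes a programmatic entry point. The following is a minimal sketch, assuming the fork is installed with `pip install -e .` so the extra task groups used above (`leaderboard`, `safety`, `bbq`, `wmdp`) are registered, and that `<YOUR_MODEL>` is replaced with a real HuggingFace model id; unlike the CLI invocation, this does not log to Weights & Biases.

```python
# Minimal sketch: invoking the first (Eleuther harness) stage from Python
# instead of the CLI shown above.
import json

from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=<YOUR_MODEL>,dtype=bfloat16",
    tasks=["leaderboard", "safety", "bbq", "wmdp"],
    batch_size="auto",
    device="cuda:0",
)

# Per-task metrics (accuracy, stderr, etc.) live under the "results" key.
print(json.dumps(results["results"], indent=2, default=str))
```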
If you use SOS-Bench or these artifacts, please cite our paper:

@misc{feuer2024styleoutweighssubstancefailure,
title={Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking},
author={Benjamin Feuer and Micah Goldblum and Teresa Datta and Sanjana Nambiar and Raz Besaleli and Samuel Dooley and Max Cembalest and John P. Dickerson},
year={2024},
eprint={2409.15268},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2409.15268},
}
