LitmusValues: An Evaluation Pipeline to Reveal AI Value Preference

Link: Paper | HuggingFace Data

Pre-requisites

Install the dependencies with pip install -r requirements.txt, ideally inside a conda or venv environment.
Have an API key ready for the provider of the model you want to evaluate: openai, anthropic, togetherai, xai, or openrouter.

Run generation on AI Risk Dilemmas

Given a set of AI risk dilemmas, we ask the model to choose one of two possible actions for each dilemma.

Arguments:

--api_provider, -ap (required): Choose from openai, anthropic, togetherai, xai, or openrouter.
--api_key, -ak (required): API key for the selected provider.
--model, -m (required): Name of the model to use.
--generations_dir, -g (optional): Directory to save output generations. Default is generations.
--num_parallel_request, -n (optional): Number of parallel requests to make. Default is 1.
--debug, -d (optional): Run in debug mode with only 5 examples.

Example:

python run_ai_risk_dilemmas.py --api_provider openai --model gpt-4o --api_key sk-...
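
For readers who want to see how the flags above fit together, here is a minimal argparse sketch rebuilt from the argument list. It is an illustration, not the code in run_ai_risk_dilemmas.py: treating --debug as a boolean switch and --num_parallel_request as an integer are assumptions, and sk-placeholder is a dummy key.

```python
import argparse

# Providers named in the argument list above.
PROVIDERS = ["openai", "anthropic", "togetherai", "xai", "openrouter"]


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented CLI. Illustrative only, not the repository's code."""
    parser = argparse.ArgumentParser(description="Run a model on AI risk dilemmas")
    parser.add_argument("--api_provider", "-ap", required=True, choices=PROVIDERS)
    parser.add_argument("--api_key", "-ak", required=True)
    parser.add_argument("--model", "-m", required=True)
    parser.add_argument("--generations_dir", "-g", default="generations")
    # Assumption: the request count is parsed as an integer.
    parser.add_argument("--num_parallel_request", "-n", type=int, default=1)
    # Assumption: debug is an on/off switch that takes no value.
    parser.add_argument("--debug", "-d", action="store_true")
    return parser


# Parse the same flags as the example command, with a dummy key.
args = build_parser().parse_args(
    ["-ap", "openai", "-m", "gpt-4o", "-ak", "sk-placeholder"]
)
print(args.generations_dir, args.num_parallel_request, args.debug)
```

Omitted optional flags fall back to the defaults listed above, so the example command writes to generations and sends one request at a time.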

Calculate ELO rating for value preference and win rate of value battles

Based on the model's action choices in the AI risk dilemmas above, we construct battles between values and compute an ELO rating for each value to identify which values the model prioritizes over others.

Arguments:

--model, -m (required): Name of the model to evaluate.
--generations_dir, -g (optional): Directory where generated outputs are saved. Default is generations.
--elo_rating_dir, -e (optional): Directory to save ELO rating results. Default is elo_rating.

Example:

python calculate_elo_rating.py --model gpt-4o
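
The idea of rating values through battles can be sketched as follows. This is a rough illustration under stated assumptions, not the logic of calculate_elo_rating.py: the value names are made up, the way battles are derived from a choice (every value behind the chosen action beats every value behind the rejected one) is an assumption, and the K-factor of 4 and starting rating of 1000 are arbitrary.

```python
from collections import defaultdict
from itertools import product


def battles_from_choice(chosen_values, rejected_values):
    """Pair every value behind the chosen action against every value behind
    the rejected one (assumed construction; the winner is listed first)."""
    return list(product(chosen_values, rejected_values))


def expected_score(r_a, r_b):
    """Standard Elo probability that the side rated r_a beats the side rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(battles, k=4.0, initial=1000.0):
    """Apply sequential Elo updates over (winner, loser) pairs."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in battles:
        # The winner gains what the loser gives up, so the total stays constant.
        gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += gain
        ratings[loser] -= gain
    return dict(ratings)


# Hypothetical choices: two dilemmas with made-up value labels.
battles = []
battles += battles_from_choice(["honesty"], ["care", "privacy"])
battles += battles_from_choice(["privacy"], ["care"])

ratings = elo_ratings(battles)
for value, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{value}: {rating:.1f}")
```

Sequential updates depend on the order of the battles, so shuffling the battle list and averaging over several passes is a common way to reduce that sensitivity.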

Optional: Visualization of ELO rating on value preference per model

Visualize the model's revealed value preferences from the ELO rating calculation above. The script plots the values with a 95% confidence interval, as well as the win rate between pairs of values.

Arguments:

--model, -m (required): Name of the model to evaluate.
--generations_dir, -g (optional): Directory containing generated outputs. Default is generations.
--output_elo_fig_dir, -f (optional): Directory to save ELO score figures. Default is output_elo_figs.
--output_win_rate_fig_dir, -w (optional): Directory to save win-rate figures. Default is output_win_rate_figs.

Example:

python visualize_elo_rating.py --model gpt-4o
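
To make the two plotted quantities concrete, the sketch below computes a pairwise win rate and a 95% interval around it. How visualize_elo_rating.py derives its interval is not stated here, so the percentile bootstrap is an assumption, and the value names and battle counts are invented for illustration.

```python
import random


def pair_outcomes(battles, a, b):
    """Return True for each battle between a and b that a won, False otherwise."""
    return [winner == a for winner, loser in battles if {winner, loser} == {a, b}]


def win_rate(battles, a, b):
    """Fraction of the battles between a and b that a won."""
    outcomes = pair_outcomes(battles, a, b)
    if not outcomes:
        raise ValueError("no battles between these values")
    return sum(outcomes) / len(outcomes)


def bootstrap_win_rate_ci(battles, a, b, n_resamples=1000, seed=0):
    """Percentile bootstrap 95% interval for a's win rate over b (assumed method)."""
    outcomes = pair_outcomes(battles, a, b)
    if not outcomes:
        raise ValueError("no battles between these values")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the outcomes with replacement and recompute the rate each time.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    return rates[int(0.025 * n_resamples)], rates[int(0.975 * n_resamples) - 1]


# Hypothetical data: (winner, loser) pairs for two made-up values.
battles = [("honesty", "care")] * 14 + [("care", "honesty")] * 6

point = win_rate(battles, "honesty", "care")
low, high = bootstrap_win_rate_ci(battles, "honesty", "care")
print(f"win rate {point:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The same resampling loop can wrap an ELO computation instead of a win rate to get an interval per value.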

If you find this code useful, please cite the following work:

@misc{chiu2025aitellliessave,
      title={Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas}, 
      author={Yu Ying Chiu and Zhilin Wang and Sharan Maiya and Yejin Choi and Kyle Fish and Sydney Levine and Evan Hubinger},
      year={2025},
      eprint={2505.14633},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.14633}, 
}

