Link: Paper | HuggingFace Data
- Please run pip install -r requirements.txt, ideally in a conda/venv environment.
- Have a relevant API_KEY ready for the model you would like to evaluate, from one of the supported providers: openai, anthropic, togetherai, xai, or openrouter.
- Given a set of AI risk dilemmas, we ask the model to choose one of two actions.
--api_provider, -ap (required): Choose from openai, anthropic, togetherai, xai, or openrouter.
--api_key, -ak (required): API key for the selected provider.
--model, -m (required): Name of the model to use.
--generations_dir, -g (optional): Directory to save output generations. Default is generations.
--num_parallel_request, -n (optional): Number of parallel requests to make. Default is 1.
--debug, -d (optional): Run in debug mode with only 5 examples.
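A minimal sketch of this step, assuming each dilemma is posed as a prompt offering "Action 1" and "Action 2" and that the model's reply names one of them. The reply format and the names parse_choice, run_dilemmas, ask_model, and fake_model are hypothetical, not the internals of run_ai_risk_dilemmas.py; the thread pool only illustrates what --num_parallel_request and --debug might control.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def parse_choice(reply: str) -> Optional[int]:
    """Return 1 or 2 if the reply names exactly one action, otherwise None.

    Hypothetical format: the model is asked to answer "Action 1" or "Action 2".
    """
    found = set(re.findall(r"\bAction\s*([12])\b", reply, flags=re.IGNORECASE))
    return int(found.pop()) if len(found) == 1 else None


def run_dilemmas(
    prompts: List[str],
    ask_model: Callable[[str], str],  # hypothetical stand-in for a provider API call
    num_parallel_request: int = 1,
    debug: bool = False,
) -> List[Optional[int]]:
    """Query the model on every dilemma and return parsed choices in input order."""
    if debug:
        prompts = prompts[:5]  # the documented --debug mode uses only 5 examples
    with ThreadPoolExecutor(max_workers=num_parallel_request) as pool:
        replies = list(pool.map(ask_model, prompts))  # map() keeps input order
    return [parse_choice(r) for r in replies]


# Usage with a fake model instead of a real API client.
def fake_model(prompt: str) -> str:
    return "Action 2" if "lie" in prompt else "I choose Action 1."


choices = run_dilemmas(["Do you lie to ...?", "Do you report ...?"], fake_model, num_parallel_request=2)
print(choices)  # [2, 1]
```

Replies that name both actions or neither are mapped to None here, so they can be counted or dropped rather than silently treated as a choice.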
python run_ai_risk_dilemmas.py --api_provider openai --model gpt-4o --api_key sk-...

- Based on the model's action choices in the AI risk dilemmas above, we construct battles between values and identify which values it prioritizes over others, using an Elo rating for each value.
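To illustrate how value battles can feed an Elo rating, here is a minimal sketch. It assumes (hypothetically) that each dilemma lists the values supporting each action, and that every value behind the chosen action wins a battle against every value behind the rejected action. The value names, the K-factor of 4, and the initial rating of 1000 are illustrative choices, not necessarily those used by calculate_elo_rating.py.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Battle = Tuple[str, str]  # (winner_value, loser_value)


def battles_from_choice(
    values_action_1: Sequence[str], values_action_2: Sequence[str], choice: int
) -> List[Battle]:
    """Turn one dilemma's choice (1 or 2) into value-vs-value battles."""
    if choice == 1:
        winners, losers = values_action_1, values_action_2
    else:
        winners, losers = values_action_2, values_action_1
    # Skip self-battles when the same value supports both actions.
    return [(w, l) for w, l in product(winners, losers) if w != l]


def elo_ratings(
    battles: Sequence[Battle], k: float = 4.0, init: float = 1000.0, scale: float = 400.0
) -> Dict[str, float]:
    """Sequential Elo updates over (winner, loser) battles."""
    ratings: Dict[str, float] = {}
    for winner, loser in battles:
        rw = ratings.setdefault(winner, init)
        rl = ratings.setdefault(loser, init)
        # Expected score of the winner under the logistic Elo model.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / scale))
        ratings[winner] = rw + k * (1.0 - expected_w)
        ratings[loser] = rl - k * (1.0 - expected_w)
    return ratings


# Usage: the model chose action 2, which is supported by "Care" and "Protection".
battles = battles_from_choice(["Honesty"], ["Care", "Protection"], choice=2)
print(battles)  # [('Care', 'Honesty'), ('Protection', 'Honesty')]
ratings = elo_ratings(battles)
print(sorted(ratings, key=ratings.get, reverse=True))  # ['Care', 'Protection', 'Honesty']
```

Sequential Elo depends on the order in which battles are processed; averaging over shuffled orders or bootstrapped resamples is a common way to reduce that sensitivity.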
--model, -m (required): Name of the model to evaluate.
--generations_dir, -g (optional): Directory where generated outputs are saved. Default is generations.
--elo_rating_dir, -e (optional): Directory to save Elo rating results. Default is elo_rating.
python calculate_elo_rating.py --model gpt-4o

- We visualize the model's revealed value preferences from the Elo ratings calculated above: a plot of each value's rating with its 95% confidence interval, and the win rates between pairs of values.
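The 95% confidence intervals and pairwise win rates could be computed roughly as in the sketch below. It assumes battles are (winner, loser) pairs of value names and uses a percentile bootstrap over resampled battles, a common choice that is assumed here rather than confirmed for visualize_elo_rating.py. A compact sequential Elo update is repeated so the block is self-contained, and plotting is left out; the function names are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Battle = Tuple[str, str]  # (winner_value, loser_value)


def elo(battles: Sequence[Battle], k: float = 4.0, init: float = 1000.0) -> Dict[str, float]:
    """Sequential logistic Elo updates over (winner, loser) battles."""
    r: Dict[str, float] = {}
    for w, l in battles:
        rw, rl = r.setdefault(w, init), r.setdefault(l, init)
        delta = k * (1.0 - 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0)))
        r[w], r[l] = rw + delta, rl - delta
    return r


def bootstrap_ci(
    battles: Sequence[Battle], rounds: int = 1000, seed: int = 0
) -> Dict[str, Tuple[float, float]]:
    """Approximate percentile 95% CI of each value's Elo rating."""
    rng = random.Random(seed)
    samples: Dict[str, List[float]] = {}
    for _ in range(rounds):
        # Resample the battles with replacement and recompute all ratings.
        resampled = [battles[rng.randrange(len(battles))] for _ in battles]
        for value, rating in elo(resampled).items():
            samples.setdefault(value, []).append(rating)
    ci: Dict[str, Tuple[float, float]] = {}
    for value, xs in samples.items():
        xs.sort()
        ci[value] = (xs[int(0.025 * (len(xs) - 1))], xs[int(0.975 * (len(xs) - 1))])
    return ci


def win_rates(battles: Sequence[Battle]) -> Dict[Tuple[str, str], float]:
    """Fraction of head-to-head battles in which value a beats value b."""
    wins = Counter(battles)
    pairs = set(wins) | {(l, w) for w, l in wins}
    return {(a, b): wins[(a, b)] / (wins[(a, b)] + wins[(b, a)]) for a, b in pairs}


# Usage on toy battles: "Care" beats "Honesty" 7 times out of 10.
battles = [("Care", "Honesty")] * 7 + [("Honesty", "Care")] * 3
rates = win_rates(battles)
print(rates[("Care", "Honesty")])  # 0.7
ci = bootstrap_ci(battles, rounds=200)
print(ci["Care"])  # (lower, upper) bounds of Care's rating
```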
--model, -m (required): Name of the model to evaluate.
--generations_dir, -g (optional): Directory containing generated outputs. Default is generations.
--output_elo_fig_dir, -f (optional): Directory to save Elo score figures. Default is output_elo_figs.
--output_win_rate_fig_dir, -w (optional): Directory to save win-rate figures. Default is output_win_rate_figs.
python visualize_elo_rating.py --model gpt-4o

If you find this code useful, please cite the following work:
@misc{chiu2025aitellliessave,
title={Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas},
author={Yu Ying Chiu and Zhilin Wang and Sharan Maiya and Yejin Choi and Kyle Fish and Sydney Levine and Evan Hubinger},
year={2025},
eprint={2505.14633},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.14633},
}