This repository contains the code and data referenced in the paper "Role-Playing Evaluation for Large Language Models".
Large Language Models (LLMs) demonstrate a notable capacity for adopting personas and engaging in role-playing. However, evaluating this ability presents significant challenges, as human assessments are resource-intensive and automated evaluations can be biased. To address this, we introduce Role-Playing Eval (RPEval), a novel benchmark designed to assess LLM role-playing capabilities across four key dimensions: emotional understanding, decision-making, moral alignment, and in-character consistency.
Clone the repository and install the dependencies:
git clone https://github.com/yelboudouri/RPEval.git
cd RPEval
pip install -r requirements.txt

To reproduce the evaluation results from the paper:
python eval.py --responses-file=data/responses_gpt_4o_2024_08_06.jsonl

To test other models, change the --responses-file argument to the appropriate file under the data/ directory.
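For example, to list the bundled responses files you can pass to the command above (assuming they all sit under data/ with a .jsonl extension, like the file shown):

ls data/*.jsonl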
To run RPEval on a different model:
python eval.py --provider="<provider_name>" --model="<model_name>"

RPEval uses SwitchAI under the hood. Ensure your API key is properly configured and that the target model is supported.
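As a sketch, assuming the chosen provider reads its credentials from an environment variable (for instance OPENAI_API_KEY, the usual convention for OpenAI models; the exact variable, provider name, and model name below are illustrative and depend on your setup):

export OPENAI_API_KEY="<your_api_key>"
python eval.py --provider="openai" --model="gpt-4o"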
If you use this code in your research, please cite the following paper:
@misc{boudouri2025roleplayingevaluationlargelanguage,
title={Role-Playing Evaluation for Large Language Models},
author={Yassine El Boudouri and Walter Nuninger and Julian Alvarez and Yvan Peter},
year={2025},
eprint={2505.13157},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.13157},
}