ProactiveBench is a benchmark for evaluating proactive agents. It includes a dataset, a reward model, and evaluation scripts.
Our test set contains events in three categories: coding, writing, and daily life.
Currently, the test set contains 227 events.
The reward model is trained on the dataset and reaches an F1 score of 0.918 on the test set.
We provide all scripts to evaluate the performance of the proactive agent and the reward model.
The reward model is used to evaluate the performance of the Proactive Agent. You can download the reward model from here (Coming soon) and host it with frameworks like VLLM to provide OpenAI style API.
After that, you should change the script reward_model_scoring.py to set the address of your model, and run the script with
python eval/reward_model_scoring.pyAfter the process, you will get the final score for your reward model.
To check your model's performance, you will need to change the ./eval/script.py and load in your model(or use the SDK), and run the script with:
python eval/script.pyThe test data will be send to the model, and all the traces with agent response will be saved under ./eval/traces_new folder.
After the process, you could run
# You should modify the address in judge_agent_prediction.py to your reward model address before run the script.
sh eval/judge_result.shwhich will let the Reward Model to evaluate whether the response from the agent is acceptable or not. The results will be saved under ./eval/judged folder.
After judged by the reward model, you could run
sh calculate.shto finally get a score for your model.