OdysseyBench is a comprehensive benchmark and evaluation suite for task-oriented agent systems, supporting both the OdysseyBench+ and OdysseyBench-Neo tracks. This project provides tools for task generation, execution, validation, and in-depth evaluation of agent performance, with a focus on memory and retrieval-augmented generation (RAG) capabilities.
git clone https://github.com/microsoft/OdysseyBench.git
git clone https://github.com/zlwang-cs/OfficeBench.git /tmp/OfficeBench
find /tmp/OfficeBench/tasks/ -type d -name testbed -exec bash -c 'dest="OdysseyBench/tasks/${1#*/tasks/}"; mkdir -p "$dest"; cp -r "$1" "$dest/../"' _ {} \;
rm -rf /tmp/OfficeBench
conda create -n odysseybench python=3.10
conda activate odysseybench
cd OdysseyBench
pip install -r requirements.txt
export OPENAI_API_KEY=OPENAI_KEY
- /tasks/substasks_plus: Tasks for OdysseyBench+
- /tasks/chat_histories_plus: Dialogues for OdysseyBench+
- /tasks/substasks_neo: Tasks for OdysseyBench-Neo
- /tasks/chat_histories_neo: Dialogues for OdysseyBench-Neo
- /tasks/outputs/: Results of task execution
- /tasks/testbed/: Files required for task execution
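
Before running anything, it can help to confirm that the directories listed above are in place, in particular `tasks/testbed`, which is copied in from OfficeBench during setup. The following is a minimal sketch, not part of OdysseyBench: the helper name `missing_task_dirs` is hypothetical, the directory names are copied from the list above, and it assumes the paths are relative to the repository root. `tasks/outputs` may legitimately be absent before the first run (an assumption).

```python
from pathlib import Path
from typing import List

# Directory names copied from the list above, taken relative to the repo root.
EXPECTED_TASK_DIRS = [
    "tasks/substasks_plus",
    "tasks/chat_histories_plus",
    "tasks/substasks_neo",
    "tasks/chat_histories_neo",
    "tasks/outputs",
    "tasks/testbed",
]


def missing_task_dirs(repo_root: str) -> List[str]:
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_TASK_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_task_dirs(".")
    if missing:
        print(f"missing: {missing}")
    else:
        print("all task directories present")
```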
Edit config/base_config.yaml:
memory:
  mode: "use_rag" # Options: raw_chat, use_rag, clean
  rag_mode: "summarysession" # Options: summarysession, dialogsession, dialogutterance, summarychunk
  top_k: 5 # Used in 'use_rag' mode

- Long-Context Evaluation: Set `mode: raw_chat` to include all dialogues in the prompt (ignores `rag_mode` and `top_k`).
- RAG Evaluation:
  - For raw context: set `rag_mode` to `dialoguesession` or `dialogueutterance`.
  - For summary: set `rag_mode` to `summarysession` or `summarychunk`.
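
How the three `memory` keys interact can be sketched as a small validator. This is an illustrative helper, not OdysseyBench code: the function name `describe_memory_config` is hypothetical, the option names are taken from the comments in the config snippet above, and the treatment of `clean` mode (retrieval settings unused) is an assumption.

```python
# Option names copied from the comments in the config snippet above.
VALID_MODES = {"raw_chat", "use_rag", "clean"}
VALID_RAG_MODES = {"summarysession", "dialogsession", "dialogutterance", "summarychunk"}


def describe_memory_config(memory: dict) -> str:
    """Describe which evaluation setting a `memory` config block selects."""
    mode = memory.get("mode")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    if mode == "raw_chat":
        # Long-context evaluation: rag_mode and top_k are ignored.
        return "long-context: all dialogues are included in the prompt"
    if mode == "clean":
        # Assumption: retrieval settings play no role in this mode.
        return "clean: retrieval settings assumed unused"

    # mode == "use_rag": both rag_mode and top_k matter.
    rag_mode = memory.get("rag_mode")
    if rag_mode not in VALID_RAG_MODES:
        raise ValueError(f"unknown rag_mode: {rag_mode!r}")
    top_k = memory.get("top_k")
    if not isinstance(top_k, int) or top_k < 1:
        raise ValueError(f"top_k must be a positive integer, got {top_k!r}")

    source = "summaries" if rag_mode.startswith("summary") else "raw dialogue"
    return f"RAG over {source} ({rag_mode}), top_k={top_k}"


if __name__ == "__main__":
    print(describe_memory_config(
        {"mode": "use_rag", "rag_mode": "summarysession", "top_k": 5}
    ))
```

Calling the helper with `{"mode": "raw_chat", "rag_mode": "summarychunk", "top_k": 3}` returns the long-context description, which reflects that the retrieval keys are ignored in that mode.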
python run_all.py --tag OdysseyBench_plus
python run_all.py --neo --tag OdysseyBench_neo
python run_homeragents_plus.py --loops 5
python run_homeragents_neo.py

- Cross Validation
Select the intersection of successfully executed tasks by task-description and task-intent + task-instruction:
python run_all.py --neo_clean --tag test-neo-ground-truth-memory
# Set 'mode' to 'clean' in configs/base_config.yaml
python run_all.py --neo_clean --tag test-neo-task_description
# Set 'Memory' to 'False' in configs/base_config.yaml

- Evaluate Execution Performance
sh evaluate_all.sh test-neo-ground-truth-memory o3 True
sh evaluate_all.sh test-neo-task_description o3 True

- Cross-validation selection:
python utils_clean/cross_validation.py
- Uniform task format:
python utils_clean/clean_tasks.py
- Uniform dialogue format:
python utils_clean/clean_dialogue.py
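
The selection step in cross validation amounts to a set intersection: a task is kept only if it executed successfully in both runs. The sketch below illustrates that idea under stated assumptions. It is not the logic of `utils_clean/cross_validation.py`; the function name `select_cross_validated` and the result format (a mapping from task ID to a success flag) are hypothetical.

```python
from typing import Dict, List


def select_cross_validated(run_a: Dict[str, bool], run_b: Dict[str, bool]) -> List[str]:
    """Return the sorted IDs of tasks that succeeded in both runs."""
    passed_a = {task for task, ok in run_a.items() if ok}
    passed_b = {task for task, ok in run_b.items() if ok}
    return sorted(passed_a & passed_b)


if __name__ == "__main__":
    # Made-up results for two runs, e.g. one per evaluation tag.
    ground_truth_run = {"task_1": True, "task_2": True, "task_3": False}
    description_run = {"task_1": True, "task_2": False, "task_3": True}
    print(select_cross_validated(ground_truth_run, description_run))  # ['task_1']
```

Tasks that appear in only one of the two result sets are dropped as well, since they cannot be in the intersection.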
python llm-as-a-judge.py

Contributions are welcome! Please open issues or pull requests for improvements or questions.
If you found this code useful, please cite the following paper:
@article{wang2025odysseybench,
title={OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows},
author={Wang, Weixuan and Han, Dongge and Diaz, Daniel Madrigal and Xu, Jin and R{\"u}hle, Victor and Rajmohan, Saravan},
journal={arXiv preprint arXiv:2508.09124},
year={2025}
}
This project builds on and incorporates material from
[OfficeBench](https://github.com/zlwang-cs/OfficeBench). See NOTICE.txt
for attribution details.