Commit 6cf8d5f

update_tau2_eval
1 parent c79d4e0 commit 6cf8d5f

154 files changed, 707183 additions and 0 deletions


tau2-bench/.env

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
# ============================================
# API Provider Configuration
# ============================================
# Set to "true" to use Azure OpenAI, "false" to use standard OpenAI
USE_AZURE_OPENAI="false"

# ============================================
# Standard OpenAI Configuration
# ============================================
# Used when USE_AZURE_OPENAI=false
OPENAI_API_KEY=""

# ============================================
# Azure OpenAI Configuration
# ============================================
# Used when USE_AZURE_OPENAI=true

# Azure endpoint and deployment (if set, these override the default configs)
AZURE_OPENAI_ENDPOINT=""
AZURE_OPENAI_API_VERSION="2025-01-01-preview"
AZURE_OPENAI_DEPLOYMENT="gpt-4.1"

# Models to route through Azure
AZURE_OPENAI_MODELS="gpt-4.1"

# Optional: LLM request logging
LLM_LOG_ENABLED="true"
LLM_LOG_DIR="logs/llm_requests"

# ============================================
# Other LLM Providers
# ============================================
ANTHROPIC_API_KEY="EMPTY"

tau2-bench/.gitignore

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
*.tmp
.DS_Store
tmp.*
tmp_*
tmp-*
tmp/
dumps/
dump.rdb
*.rdg
logs/
data/simulations/

.ruff_cache

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
.vscode/

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# UV
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
#uv.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# PyPI configuration file
.pypirc
logs/

tau2-bench/.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
3.13

tau2-bench/LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Sierra Research

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

tau2-bench/Makefile

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
# Default target
.PHONY: all
all: help

## Clean up generated files and virtual environment
.PHONY: clean
clean:
	rm -rf .venv
	rm -rf __pycache__
	rm -rf *.egg-info
	rm -rf .pytest_cache
	rm -rf dist
	rm -rf build

## Run all tests
.PHONY: test
test:
	pytest tests/


## Start the Environment CLI for interacting with domain environments
.PHONY: env-cli
env-cli:
	python -m tau2.environment.utils.interface_agent

## Display online help for commonly used targets in this Makefile
.PHONY: help
help:
	@awk '/^[a-zA-Z_\/\.0-9-]+:/ { \
		nb = sub( /^## /, "", helpMsg ); \
		if (nb) \
			print $$1 "\t" helpMsg; \
	} \
	{ helpMsg = $$0 }' $(MAKEFILE_LIST) | \
	column -ts $$'\t' | \
	expand -t 1 | \
	grep --color '^[^ ]*'

tau2-bench/README.md

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# Tau2-Bench

Tau2-Bench is a benchmark for evaluating the tool-calling ability of agent models.

This evaluation uses the [official tau2-bench repository](https://github.com/sierra-research/tau2-bench).

## Installation

```bash
pip install -e .
```

## Configuration

Configure the API in the `.env` file:

- Set `USE_AZURE_OPENAI="true"` to use the Azure OpenAI API
- Set `USE_AZURE_OPENAI="false"` to use the standard OpenAI API

Fill in the corresponding API key and endpoint based on your choice.

## Evaluation

1. Modify the models list in `eval.sh`:
```bash
models=(
"your-model-name"
)
```

2. Run evaluation:
```bash
bash eval.sh
```

Main parameters:
```bash
# --domain: retail or airline
# --agent-llm: agent model
# --user-llm: user simulation model
# --num-trials: number of trials
# --max-concurrency: concurrency
tau2 run \
  --domain retail \
  --agent-llm "openai/$model" \
  --user-llm gpt-4.1 \
  --num-trials 4 \
  --max-concurrency 6
```

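
The contents of `eval.sh` are not shown here, so the following is only a minimal sketch of how the documented flags could be assembled for each model and domain. `build_tau2_command` is a hypothetical helper. It only prints the command lines and does not invoke `tau2`.

```python
import shlex


def build_tau2_command(
    model: str,
    domain: str = "retail",
    user_llm: str = "gpt-4.1",
    num_trials: int = 4,
    max_concurrency: int = 6,
) -> list[str]:
    """Assemble the argument list for one `tau2 run` call (hypothetical helper)."""
    return [
        "tau2", "run",
        "--domain", domain,                      # retail or airline
        "--agent-llm", f"openai/{model}",        # agent model
        "--user-llm", user_llm,                  # user simulation model
        "--num-trials", str(num_trials),         # number of trials
        "--max-concurrency", str(max_concurrency),
    ]


if __name__ == "__main__":
    models = ["your-model-name"]  # placeholder, as in the models list above
    for model in models:
        for domain in ("retail", "airline"):
            # Print a shell-safe command line instead of executing it.
            print(shlex.join(build_tau2_command(model, domain)))
```

Building the command as a list avoids shell quoting problems if a model name contains unusual characters. The list could be handed to `subprocess.run` once the printed commands look right.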
## Notes

⚠️ Tau2-Bench evaluation results have high variance. Run **4 repeated trials and take the average** to obtain stable, converged results.

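
To make the averaging concrete, here is a small sketch that computes a per-trial score and then the mean and spread across trials. `summarize_trials` is a hypothetical helper, and the reward values are made-up placeholders, not real benchmark results. How per-task rewards are actually stored by `tau2` is not assumed here.

```python
from statistics import mean, stdev


def summarize_trials(rewards_per_trial: list[list[float]]) -> dict:
    """Average task rewards within each trial, then across trials.

    rewards_per_trial[i] holds the per-task rewards (0.0 to 1.0) of trial i.
    """
    trial_scores = [mean(rewards) for rewards in rewards_per_trial]
    spread = stdev(trial_scores) if len(trial_scores) > 1 else 0.0
    return {
        "trial_scores": trial_scores,
        "mean": mean(trial_scores),
        "stdev": spread,
    }


if __name__ == "__main__":
    # Four illustrative trials over the same three tasks (placeholder data).
    placeholder_rewards = [
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 1.0],
        [0.0, 0.0, 1.0],
        [1.0, 0.0, 1.0],
    ]
    summary = summarize_trials(placeholder_rewards)
    print(f"mean over trials: {summary['mean']:.3f}")
    print(f"stdev over trials: {summary['stdev']:.3f}")
```

Reporting the standard deviation next to the mean shows how much a single run can differ from the averaged score.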
## Citation

```bibtex
@misc{barres2025tau2,
  title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
  author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
  year={2025},
  eprint={2506.07982},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.07982},
}
```
