Tau2-Bench is a benchmark for evaluating the tool-calling capabilities of agent models.
This evaluation used the [official tau2-bench repository](https://github.com/sierra-research/tau2-bench).
## Installation
```bash
pip install -e .
```
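
If you are starting from a fresh environment, the install is typically run from a local clone. The sketch below assumes you are installing the official tau2-bench repository linked above; adjust the path if you are working from this evaluation setup instead:

```bash
# Clone the official tau2-bench repository (URL from the link above) and
# install it in editable mode; substitute your own checkout if it differs.
git clone https://github.com/sierra-research/tau2-bench.git
cd tau2-bench
pip install -e .
```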
## Configuration
Configure the API in the `.env` file:
- Set `USE_AZURE_OPENAI="true"` to use Azure OpenAI API
- Set `USE_AZURE_OPENAI="false"` to use standard OpenAI API
Fill in the corresponding API key and endpoint for the option you choose.
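
For example, a minimal `.env` might look like the sketch below. `USE_AZURE_OPENAI` is the switch described above; the key and endpoint variable names are illustrative assumptions and may differ in your setup:

```bash
# Backend switch: "true" = Azure OpenAI, "false" = standard OpenAI
USE_AZURE_OPENAI="false"

# Standard OpenAI credentials (variable name assumed)
OPENAI_API_KEY="sk-..."

# Azure OpenAI credentials, used when USE_AZURE_OPENAI="true" (variable names assumed)
AZURE_OPENAI_API_KEY="your-azure-key"
AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
```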
## Evaluation
1. Modify the models list in `eval.sh`:
```bash
models=(
  "your-model-name"
)
```
2. Run evaluation:
```bash
bash eval.sh
```
Main parameters:
```bash
# --domain:          retail or airline
# --agent-llm:       agent model under evaluation
# --user-llm:        user simulation model
# --num-trials:      number of trials
# --max-concurrency: concurrency
tau2 run \
  --domain retail \
  --agent-llm openai/$model \
  --user-llm gpt-4.1 \
  --num-trials 4 \
  --max-concurrency 6
```
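
For orientation, `eval.sh` can be read as a loop over the models list that calls `tau2 run` with the parameters above for each domain. The following is an illustrative sketch of that structure, not the exact script:

```bash
#!/usr/bin/env bash
# Illustrative sketch of eval.sh (assumed structure): iterate over the models
# list and run tau2 for each domain with the documented parameters.
set -euo pipefail

models=(
  "your-model-name"
)

for model in "${models[@]}"; do
  for domain in retail airline; do
    tau2 run \
      --domain "$domain" \
      --agent-llm "openai/$model" \
      --user-llm gpt-4.1 \
      --num-trials 4 \
      --max-concurrency 6
  done
done
```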
## Notes
⚠️ Tau2-Bench evaluation results have high variance. It is recommended to run **4 repeated trials and take the average** to obtain stable, converged results.
## Citation
```bibtex
@misc{barres2025tau2,
title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
year={2025}
}
```