Hi DeepSeek Team,
First of all, thank you for open-sourcing DeepSeek-V3.2 and sharing the impressive benchmark results.
Our team has been working to reproduce the results reported in your official release, but we’ve encountered some discrepancies that we’d like to understand better.
Reproduction Setup
- Inference: SGLang v0.5.6 on 8 x H200
- Sampling parameters: temperature = 1.0, top_p = 0.95
- Deploy:
    python3 -m sglang.launch_server --model-path ./deepseek-ai/DeepSeek-V3___2/ --trust-remote-code --tp-size 8 --tool-call-parser deepseekv31 --reasoning-parser deepseek-v3 --chat-template ./deepseek/tool_chat_template_deepseekv32.jinja
- Run the tau2-bench benchmark:
    tau2 run --domain retail --agent-llm openai/deepseek-v32 --user-llm openai/deepseek-v32 --max-concurrency 1 --agent-llm-args '{"temperature": 1.0,"api_key": "asdf1324","top_p":0.95, "api_base": "http://0.0.0.0:30000/v1"}' --user-llm-args '{"temperature": 1.0,"api_key": "asdf1324", "top_p":0.95,"api_base": "http://0.0.0.0:30000/v1"}'
    tau2 run --domain airline --agent-llm openai/deepseek-v32 --user-llm openai/deepseek-v32 --max-concurrency 1 --agent-llm-args '{"temperature": 1.0,"api_key": "asdf1324","top_p":0.95, "api_base": "http://0.0.0.0:30000/v1"}' --user-llm-args '{"temperature": 1.0,"api_key": "asdf1324", "top_p":0.95,"api_base": "http://0.0.0.0:30000/v1"}'
    tau2 run --domain telecom --agent-llm openai/deepseek-v32 --user-llm openai/deepseek-v32 --max-concurrency 1 --agent-llm-args '{"temperature": 1.0,"api_key": "asdf1324","top_p":0.95, "api_base": "http://0.0.0.0:30000/v1"}' --user-llm-args '{"temperature": 1.0,"api_key": "asdf1324", "top_p":0.95,"api_base": "http://0.0.0.0:30000/v1"}'
Our Reproduction Result:
tau2-bench(telecom): Pass^k Metrics: k=1: 0.000
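To be explicit about how we read this number: as we understand it, Pass^k follows the tau-bench definition, where for each task with c successful trials out of n the estimate is C(c, k) / C(n, k), averaged over tasks. The sketch below is our assumption of that computation, not code taken from tau2-bench, and the function name `pass_hat_k` is hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate Pass^k as the mean over tasks of C(c, k) / C(n, k).

    successes_per_task: number of successful trials c for each task
    num_trials: number of trials n run per task (assumed equal for all tasks)
    k: number of trials that must all succeed
    Hypothetical helper for illustration; not part of tau2-bench.
    """
    if not successes_per_task:
        raise ValueError("need at least one task")
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if any(c < 0 or c > num_trials for c in successes_per_task):
        raise ValueError("each success count must be between 0 and num_trials")

    ways_total = comb(num_trials, k)
    # comb(c, k) is 0 whenever c < k, so tasks with too few successes add 0.
    return sum(comb(c, k) for c in successes_per_task) / (
        ways_total * len(successes_per_task)
    )


# With k=1 the estimate reduces to the plain success rate.
print(pass_hat_k([0, 0, 0], num_trials=1, k=1))
```

Under this reading, 0.000 at k=1 means that no telecom task passed in any trial, which looks more like a configuration problem on our side than ordinary sampling variance.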
Questions
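Could you please share more about the evaluation configuration behind the reported tau2-bench results, for example the sampling parameters, the tool-call and reasoning parsers, the chat template, and the model used as the user simulator?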
Any clarification would be greatly appreciated!
Thanks again for releasing such a high-quality model.