Reproduction Inquiry: DeepSeek-V3.2  tau2-bench Result

**Hi DeepSeek Team,**
    First of all, thank you for open-sourcing DeepSeek-V3.2 and sharing the impressive benchmark results.
Our team has been working to reproduce the results reported in your official release, but we’ve encountered some discrepancies that we’d like to understand better.

**Reproduction Setup**

1. Inference: 

    SGLang v0.5.6 on H200 * 8

2. Sampling parameters:

     temperature = 1.0
     top_p = 0.95

3. deploy：
python3 -m sglang.launch_server   --model-path    ./deepseek-ai/DeepSeek-V3___2/ --trust-remote-code   --tp-size 8   --tool-call-parser deepseekv31   --reasoning-parser deepseek-v3   --chat-template ./deepseek/tool_chat_template_deepseekv32.jinja

4. run tau2-bench mark:
tau2 run --domain retail  --agent-llm openai/deepseek-v32  --user-llm openai/deepseek-v32 --max-concurrency 1 --agent-llm-args '{"temperature": 1.0,"api_key": "asdf1324","top_p":0.95, "api_base": "http://0.0.0.0:30000/v1"}' --user-llm-args '{"temperature": 1.0,"api_key": "asdf1324", "top_p":0.95,"api_base": "http://0.0.0.0:30000/v1"}'
tau2 run --domain airline   --agent-llm openai/deepseek-v32  --user-llm openai/deepseek-v32 --max-concurrency 1 --agent-llm-args '{"temperature": 1.0,"api_key": "asdf1324","top_p":0.95, "api_base": "http://0.0.0.0:30000/v1"}' --user-llm-args '{"temperature": 1.0,"api_key": "asdf1324", "top_p":0.95,"api_base": "http://0.0.0.0:30000/v1"}'
tau2 run --domain telecom   --agent-llm openai/deepseek-v32  --user-llm openai/deepseek-v32 --max-concurrency 1 --agent-llm-args '{"temperature": 1.0,"api_key": "asdf1324","top_p":0.95, "api_base": "http://0.0.0.0:30000/v1"}' --user-llm-args '{"temperature": 1.0,"api_key": "asdf1324", "top_p":0.95,"api_base": "http://0.0.0.0:30000/v1"}'
--------------------------------------------------------------------------------------------------------------------------
**Our Reproduction Result:**
tau2-bench(telecom): Pass^k Metrics: k=1: 0.000

**Questions**
Could you please share a bit more about your evaluation configuration? 

Any clarification would be greatly appreciated!
Thanks again for releasing such a high-quality model.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reproduction Inquiry: DeepSeek-V3.2 tau2-bench Result #55

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Reproduction Inquiry: DeepSeek-V3.2 tau2-bench Result #55

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions