
Reproduction Inquiry: DeepSeek-V3.2 tau2-bench Result #55

@DaiWulan2013

Hi DeepSeek Team,
First of all, thank you for open-sourcing DeepSeek-V3.2 and sharing the impressive benchmark results.
Our team has been working to reproduce the results reported in your official release, but we’ve encountered some discrepancies that we’d like to understand better.

Reproduction Setup

  1. Inference:

    SGLang v0.5.6 on 8 × H200 GPUs

  2. Sampling parameters:

    temperature = 1.0
    top_p = 0.95

  3. Deploy:

    python3 -m sglang.launch_server --model-path ./deepseek-ai/DeepSeek-V3___2/ --trust-remote-code --tp-size 8 --tool-call-parser deepseekv31 --reasoning-parser deepseek-v3 --chat-template ./deepseek/tool_chat_template_deepseekv32.jinja

  4. Run tau2-bench:

    tau2 run --domain retail --agent-llm openai/deepseek-v32 --user-llm openai/deepseek-v32 --max-concurrency 1 --agent-llm-args '{"temperature": 1.0,"api_key": "asdf1324","top_p":0.95, "api_base": "http://0.0.0.0:30000/v1"}' --user-llm-args '{"temperature": 1.0,"api_key": "asdf1324", "top_p":0.95,"api_base": "http://0.0.0.0:30000/v1"}'
    tau2 run --domain airline --agent-llm openai/deepseek-v32 --user-llm openai/deepseek-v32 --max-concurrency 1 --agent-llm-args '{"temperature": 1.0,"api_key": "asdf1324","top_p":0.95, "api_base": "http://0.0.0.0:30000/v1"}' --user-llm-args '{"temperature": 1.0,"api_key": "asdf1324", "top_p":0.95,"api_base": "http://0.0.0.0:30000/v1"}'
    tau2 run --domain telecom --agent-llm openai/deepseek-v32 --user-llm openai/deepseek-v32 --max-concurrency 1 --agent-llm-args '{"temperature": 1.0,"api_key": "asdf1324","top_p":0.95, "api_base": "http://0.0.0.0:30000/v1"}' --user-llm-args '{"temperature": 1.0,"api_key": "asdf1324", "top_p":0.95,"api_base": "http://0.0.0.0:30000/v1"}'
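
The three commands above differ only in the `--domain` value, so for anyone re-running this setup they can be sketched as a single loop. This is a minimal sketch, not part of the original run: the `DRY_RUN` switch and the `run_domain` helper are hypothetical names introduced here, and the JSON arguments are the same values as above with only the whitespace normalized.

```shell
#!/bin/sh
# Sketch: issue the same tau2 command for each domain used above.
set -eu

# Values copied from the commands in this issue.
LLM="openai/deepseek-v32"
LLM_ARGS='{"temperature": 1.0, "api_key": "asdf1324", "top_p": 0.95, "api_base": "http://0.0.0.0:30000/v1"}'

# Hypothetical switch for this sketch: 1 only prints the command, 0 runs it.
DRY_RUN="${DRY_RUN:-1}"

run_domain() {
  # Build the argument list once so agent and user settings stay identical.
  set -- tau2 run --domain "$1" \
    --agent-llm "$LLM" --user-llm "$LLM" \
    --max-concurrency 1 \
    --agent-llm-args "$LLM_ARGS" \
    --user-llm-args "$LLM_ARGS"
  if [ "$DRY_RUN" = "1" ]; then
    # Display only: shell quoting of the JSON arguments is not reproduced.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

for domain in retail airline telecom; do
  run_domain "$domain"
done
```

Running the script as-is prints the three commands; setting `DRY_RUN=0` executes them against the server started in step 3.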


Our Reproduction Result

tau2-bench (telecom): Pass^k metrics: k=1: 0.000

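Questions

Could you please share more details about your evaluation configuration, such as the sampling parameters, the model used for the user simulator, and the chat template and parsers used when serving the model?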

Any clarification would be greatly appreciated!
Thanks again for releasing such a high-quality model.
