Qwen3-Coder is an advanced large language model created by the Qwen team from Alibaba Cloud. vLLM already supports Qwen3-Coder, and tool-call functionality will be available in vLLM v0.10.0 and higher You can install vLLM with tool-call support using the following method:
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend autoBF16 Model
vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
--max-model-len 32000 \
--enable-expert-parallel \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coderFP8 Model
VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--max-model-len 131072 \
--enable-expert-parallel \
--data-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coderWe launched Qwen3-Coder-480B-A35B-Instruct-FP8 using vLLM and evaluated its performance using EvalPlus. The results are displayed below:
| Dataset | Test Type | Pass@1 Score |
|---|---|---|
| HumanEval | Base tests | 0.939 |
| HumanEval+ | Base + extra tests | 0.902 |
| MBPP | Base tests | 0.918 |
| MBPP+ | Base + extra tests | 0.794 |
We used the following script to benchmark Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8
vllm bench serve \
--backend vllm \
--model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
--endpoint /v1/completions \
--dataset-name random \
--random-input 2048 \
--random-output 1024 \
--max-concurrency 10 \
--num-prompt 100 \If successful, you will see the following output.
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 776.49
Total input tokens: 204169
Total generated tokens: 102400
Request throughput (req/s): 0.13
Output token throughput (tok/s): 131.88
Total Token throughput (tok/s): 394.81
---------------Time to First Token----------------
Mean TTFT (ms): 7639.31
Median TTFT (ms): 6935.71
P99 TTFT (ms): 13766.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 68.43
Median TPOT (ms): 67.23
P99 TPOT (ms): 72.14
---------------Inter-token Latency----------------
Mean ITL (ms): 68.43
Median ITL (ms): 66.34
P99 ITL (ms): 69.38
==================================================
- Context Length Limitation: A single H20 node cannot serve the original context length (262144). You can reduce the
max-model-lenor increasegpu-memory-utilizationto work within memory constraints.
- Context Length Limitation: A single H20 node cannot serve the original context length (262144). You can reduce the
max-model-lenor increasegpu-memory-utilizationto work within memory constraints. - DeepGEMM Usage: To use DeepGEMM, set
VLLM_USE_DEEP_GEMM=1. Follow the setup instructions to install it. - Tensor Parallelism Issue: When using
tensor-parallel-size 8, the following failures are expected. Switch to data-parallel mode using--data-parallel-size. - Additional Resources: Refer to the Data Parallel Deployment documentation for more parallelism groups.
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 336, in <lambda>
ERROR [multiproc_executor.py:511] lambda prefix: Qwen3MoeDecoderLayer(config=config,
ERROR [multiproc_executor.py:511] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 278, in __init__
ERROR [multiproc_executor.py:511] self.mlp = Qwen3MoeSparseMoeBlock(config=config,
ERROR [multiproc_executor.py:511] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 113, in __init__
ERROR [multiproc_executor.py:511] self.experts = FusedMoE(num_experts=config.num_experts,
ERROR [multiproc_executor.py:511] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 773, in __init__
ERROR [multiproc_executor.py:511] self.quant_method.create_weights(layer=self, **moe_quant_params)
ERROR [multiproc_executor.py:511] File "/vllm/vllm/model_executor/layers/quantization/fp8.py", line 573, in create_weights
ERROR [multiproc_executor.py:511] raise ValueError(
ERROR [multiproc_executor.py:511] ValueError: The output_size of gate's and up's weight = 320 is not divisible by weight quantization block_n = 128.- Enable Tool Calls: Add
--tool-call-parser qwen3_coderto enable tool call parsing functionality, please refer to: tool_calling
- Add benchmark results