# Qwen3-Next Usage Guide

Qwen3-Next is an advanced large language model developed by the Qwen team at Alibaba Cloud. It features several key improvements:

- A hybrid attention mechanism
- A highly sparse Mixture-of-Experts (MoE) structure
- Training-stability-friendly optimizations
- A multi-token prediction (MTP) mechanism for faster inference

## Installing vLLM

```shell
uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
```

## Launching Qwen3-Next with vLLM

You can launch this model on 4x H200/H20 or 4x A100/A800 GPUs.
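As a rough sanity check on why four of these GPUs suffice (a back-of-the-envelope sketch, not an official sizing guide), the per-GPU weight memory under tensor parallelism can be estimated as:

```python
def weights_per_gpu_gb(num_params_b: float, bytes_per_param: float, tp_size: int) -> float:
    """Rough per-GPU weight memory (GB) for a tensor-parallel deployment."""
    return num_params_b * bytes_per_param / tp_size

# Qwen3-Next-80B-A3B has ~80B total parameters. In BF16 (2 bytes/param)
# the weights alone are ~160 GB, i.e. ~40 GB per GPU with TP=4, which
# leaves headroom for the KV cache on 80+ GB GPUs.
print(weights_per_gpu_gb(80, 2, 4))  # → 40.0
```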

### Basic Multi-GPU Setup

```shell
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --served-model-name qwen3-next
```
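Once the server is up, it exposes an OpenAI-compatible API. A minimal smoke test of the `/v1/completions` endpoint using only the standard library (assuming the default `localhost:8000` address; the prompt is just an example):

```python
import json
from urllib import request

payload = {
    "model": "qwen3-next",  # must match --served-model-name
    "prompt": "Briefly introduce Qwen3-Next.",
    "max_tokens": 64,
    "temperature": 0.7,
}
req = request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```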

If you encounter `torch.AcceleratorError: CUDA error: an illegal memory access was encountered`, add `--compilation_config.cudagraph_mode=PIECEWISE` to the startup parameters to resolve the issue. This illegal-memory-access (IMA) error may occur in Data Parallel (DP) mode.

### FP8 Model

On SM100 machines, performance can be accelerated with the FP8 FlashInfer TRTLLM MoE kernel:

```shell
VLLM_USE_FLASHINFER_MOE_FP8=1 \
VLLM_FLASHINFER_MOE_BACKEND=latency \
VLLM_USE_DEEP_GEMM=0 \
VLLM_USE_TRTLLM_ATTENTION=0 \
VLLM_ATTENTION_BACKEND=FLASH_ATTN \
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --tensor-parallel-size 4
```

On SM90/SM100 machines, you can additionally enable `fi_allreduce_fusion` as follows:

```shell
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --compilation_config.pass_config.enable_fi_allreduce_fusion true \
  --compilation_config.pass_config.enable_noop true
```

### Advanced Configuration with MTP

Qwen3-Next also supports Multi-Token Prediction (MTP). You can launch the model server with the following arguments to enable it:

```shell
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tokenizer-mode auto --gpu-memory-utilization 0.8 \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}' \
  --tensor-parallel-size 4 --no-enable-chunked-prefill
```

The `--speculative-config` argument configures speculative decoding using JSON. The method `qwen3_next_mtp` selects Qwen3-Next's specialized multi-token prediction mechanism, and `"num_speculative_tokens": 2` means the model speculates 2 tokens ahead during generation.
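To build intuition for why speculating ahead speeds up decoding, here is a toy sketch of a generic draft-and-verify step (a simplification for illustration only, not vLLM's MTP implementation): cheap draft proposals are accepted until the target model first disagrees, at which point the target's token is used instead.

```python
def speculative_step(draft, verify, prefix, num_speculative_tokens=2):
    """One draft-and-verify step: propose tokens cheaply, then keep the
    longest run the target model agrees with, plus one corrected token."""
    # Draft phase: propose num_speculative_tokens tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(num_speculative_tokens):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # Verify phase: target model checks each proposal in order.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        target_tok = verify(ctx)
        if target_tok == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_tok)  # target's token replaces the rejected draft
            break
    return accepted

# Toy models over integer "tokens": draft always guesses previous + 1;
# the target agrees only when the previous token is odd.
draft = lambda ctx: ctx[-1] + 1
verify = lambda ctx: ctx[-1] + 1 if ctx[-1] % 2 == 1 else ctx[-1] + 2
print(speculative_step(draft, verify, [1]))  # → [2, 4]
```

Each step emits at least one token, and up to `num_speculative_tokens + 1`, so acceptance rate determines the speedup.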

## Performance Metrics

### Benchmarking

The following command demonstrates how to benchmark `Qwen/Qwen3-Next-80B-A3B-Instruct`:

```shell
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --served-model-name qwen3-next \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --max-concurrency 10 \
  --num-prompts 100
```

## Usage Tips

### Tune the MoE Kernel

When starting the model service, you may see the following warning in the server log (assuming the GPU is an NVIDIA_H20-3e):

```
(VllmWorker TP2 pid=47571) WARNING 09-09 15:47:25 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/vllm_path/vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_H20-3e.json']
```

You can use `benchmark_moe` to tune the MoE Triton kernel for your hardware. Once tuning completes, a JSON file with a name like `E=512,N=128,device_name=NVIDIA_H20-3e.json` is generated. Point the `VLLM_TUNED_CONFIG_FOLDER` environment variable at the directory containing this file for your deployment hardware, for example:

```shell
VLLM_TUNED_CONFIG_FOLDER=your_moe_tuned_dir vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --served-model-name qwen3-next
```

You should then see the following line in the server log, indicating that the tuned MoE configuration has been loaded and will improve serving performance:

```
(VllmWorker TP2 pid=60498) INFO 09-09 16:23:07 [fused_moe.py:720] Using configuration from /your_moe_tuned_dir/E=512,N=128,device_name=NVIDIA_H20-3e.json for MoE layer.
```
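The lookup is essentially a filename match against the pattern shown in the warning. A minimal sketch for checking your tuned directory before launch (`moe_config_filename` and `find_moe_config` are hypothetical helpers written here for illustration, not vLLM APIs; the E=512, N=128 values come from the log above):

```python
import os

def moe_config_filename(num_experts: int, shard_size: int, device_name: str) -> str:
    """Reproduce the tuned-config filename pattern from the warning above."""
    return f"E={num_experts},N={shard_size},device_name={device_name}.json"

def find_moe_config(folder: str, num_experts: int, shard_size: int, device_name: str):
    """Return the tuned config path if present in `folder`, else None."""
    path = os.path.join(folder, moe_config_filename(num_experts, shard_size, device_name))
    return path if os.path.isfile(path) else None

print(moe_config_filename(512, 128, "NVIDIA_H20-3e"))
# → E=512,N=128,device_name=NVIDIA_H20-3e.json
```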

### Data Parallel Deployment

vLLM supports combining multiple parallelism strategies. Refer to the Data Parallel Deployment documentation and try parallel combinations better suited to this model.

### Function Calling

vLLM also supports calling user-defined functions (tools). Make sure to run your Qwen3-Next models with the following arguments:

```shell
vllm serve ... --tool-call-parser hermes --enable-auto-tool-choice
```
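With the parser enabled, clients pass OpenAI-style tool definitions to the `/v1/chat/completions` endpoint. A minimal request body sketch (the `get_weather` tool is a made-up example, not part of the model or vLLM):

```python
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "qwen3-next",
    "messages": [{"role": "user", "content": "What's the weather in Hangzhou?"}],
    "tools": tools,
    "tool_choice": "auto",
}
body = json.dumps(payload)
# POST this body to http://<host>:8000/v1/chat/completions; the hermes
# parser converts the model's tool-call markup into structured
# `tool_calls` entries in the response message.
print(json.loads(body)["tools"][0]["function"]["name"])  # → get_weather
```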

## Known Limitations

- Qwen3-Next currently does not support automatic prefix caching.