Description
使用vllm 0.17.0 A100部署qwen3.6 27B调用v1/messages接口,返回体只有thinking
Reproduction
export CUDA_VISIBLE_DEVICES=4,5
python -m vllm.entrypoints.openai.api_server --host 127.0.0.1--port 8042 --model /data1/Qwen3.6-27B --served-model-name GTSLLM-Standard --data-parallel-size 1 --tensor-parallel-size 2 --max-model-len 162144 --max-num-seqs 8 --gpu-memory-utilization 0.90 --trust-remote-code --compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY","cudagraph_capture_sizes":[1,2,4,8,16,32]}' --additional-config '{"enable_cpu_binding":true}' --async-scheduling --chat_template /data1/Qwen3.6-27B/chat_template.jinja --enable-auto-tool-choice --tool-call-parser "qwen3_coder" --reasoning-parser "qwen3" --no-enable-prefix-caching --no-enable-chunked-prefill --mm-processor-cache-gb 0 --mamba-cache-mode align
Logs
Environment Information
vllm 0.17.0 NVIDIA A100
Known Issue
Description
使用vllm 0.17.0 A100部署qwen3.6 27B调用v1/messages接口,返回体只有thinking
Reproduction
export CUDA_VISIBLE_DEVICES=4,5
python -m vllm.entrypoints.openai.api_server --host 127.0.0.1--port 8042 --model /data1/Qwen3.6-27B --served-model-name GTSLLM-Standard --data-parallel-size 1 --tensor-parallel-size 2 --max-model-len 162144 --max-num-seqs 8 --gpu-memory-utilization 0.90 --trust-remote-code --compilation_config '{"cudagraph_mode":"FULL_DECODE_ONLY","cudagraph_capture_sizes":[1,2,4,8,16,32]}' --additional-config '{"enable_cpu_binding":true}' --async-scheduling --chat_template /data1/Qwen3.6-27B/chat_template.jinja --enable-auto-tool-choice --tool-call-parser "qwen3_coder" --reasoning-parser "qwen3" --no-enable-prefix-caching --no-enable-chunked-prefill --mm-processor-cache-gb 0 --mamba-cache-mode align
Logs
Environment Information
vllm 0.17.0 NVIDIA A100
Known Issue