Problem Description
Input/output token length: 8K/1K, with a KV cache hit rate of 80% or higher.
Launch command:
vllm serve $MODEL_PATH --tensor-parallel-size 8 --kv-cache-dtype fp8 --gpu_memory_utilization 0.9 --async-scheduling --max-num-seqs 48 --tool-call-parser deepseek_v32 --enable-auto-tool-choice --reasoning-parser deepseek_v3 --max-model-len 32K
Are specialized optimizations required for the MI308x?
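For context, the workload figures above can be turned into rough per-request numbers. The following is a minimal sketch, not a measurement: it assumes "8K/1K" means 8192/1024 tokens, treats "80+%" as exactly 0.80, and assumes no KV blocks are shared across sequences. The function name is hypothetical and is not part of vLLM.

```python
# Back-of-the-envelope workload arithmetic for the settings in this report.
# Assumptions (not stated in the report): K = 1024, hit rate is exactly 0.80,
# and no KV blocks are shared between concurrent sequences.

def uncached_prefill_tokens(input_tokens: int, cache_hit_rate: float) -> int:
    """Tokens per request that still need prefill compute after cache hits."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1]")
    return round(input_tokens * (1.0 - cache_hit_rate))

INPUT_TOKENS = 8 * 1024      # "8K" input
OUTPUT_TOKENS = 1 * 1024     # "1K" output
HIT_RATE = 0.80              # lower bound of "80+%"
MAX_NUM_SEQS = 48            # from --max-num-seqs
MAX_MODEL_LEN = 32 * 1024    # from --max-model-len 32K

prefill_per_request = uncached_prefill_tokens(INPUT_TOKENS, HIT_RATE)
context_per_request = INPUT_TOKENS + OUTPUT_TOKENS
kv_tokens_at_full_batch = MAX_NUM_SEQS * context_per_request

print(f"uncached prefill tokens per request: {prefill_per_request}")
print(f"final context length per request:    {context_per_request}")
print(f"KV tokens at max concurrency:        {kv_tokens_at_full_batch}")
print(f"fits within max model length:        {context_per_request <= MAX_MODEL_LEN}")
```

Under these assumptions most of each prompt is served from cache, so the per-request cost is likely dominated by the decode phase rather than by prefill.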
Operating System
CentOS 8
CPU
AMD EPYC 9K84 96-Core Processor
GPU
AMD MI308X x 8
ROCm Version
722
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response