[Issue]: DeepSeek V3.2 on MI308x*8 with Atom-vLLM reaches only 200-300K TPM on a single machine #896

### Problem Description

Input/output token length is 8K/1K, with a KV cache hit rate above 80%.
Launch command:
```
vllm serve "$MODEL_PATH" \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --async-scheduling \
  --max-num-seqs 48 \
  --tool-call-parser deepseek_v32 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v3 \
  --max-model-len 32K
```

Are specialized optimizations required for the MI308x?
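
To make the reported figure easier to compare against other benchmarks, here is a rough conversion of TPM into per-second rates. This is only a sketch under assumptions the report does not state: that TPM counts input plus output tokens, that 8K/1K means 8192/1024 tokens, and that the cache hit rate applies uniformly to input tokens. The function name `tpm_breakdown` is hypothetical.

```python
def tpm_breakdown(tpm, input_len=8192, output_len=1024, cache_hit=0.8):
    """Convert tokens-per-minute into approximate per-second rates.

    Assumptions (not confirmed by the report): TPM counts input + output
    tokens, and cache_hit is the fraction of input tokens served from the
    KV cache.
    """
    tokens_per_request = input_len + output_len
    tokens_per_sec = tpm / 60
    requests_per_sec = tokens_per_sec / tokens_per_request
    return {
        "tokens_per_sec": tokens_per_sec,
        "requests_per_sec": requests_per_sec,
        # Input tokens that still need prefill compute after cache hits.
        "uncached_prefill_tokens_per_sec": requests_per_sec * input_len * (1 - cache_hit),
        # Output tokens generated per second across all running sequences.
        "decode_tokens_per_sec": requests_per_sec * output_len,
    }


if __name__ == "__main__":
    # Low and high ends of the reported 200-300K TPM range.
    for tpm in (200_000, 300_000):
        print(tpm, tpm_breakdown(tpm))
```

Under these assumptions, 200-300K TPM corresponds to roughly 3,300-5,000 total tokens per second for the 8-GPU node. If TPM counts only output tokens, the breakdown would differ substantially.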

### Operating System

CentOS 8

### CPU

AMD EPYC 9K84 96-Core Processor

### GPU

8 x AMD MI308X

### ROCm Version

722

### ROCm Component

_No response_

### Steps to Reproduce

_No response_

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

_No response_

### Additional Information

_No response_
