This guide describes how to run DeepSeek-V3 or DeepSeek-R1 with native FP8 or FP4. It uses DeepSeek-R1 as the example, but the same steps apply to DeepSeek-V3, since both share the same model architecture.
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
There are two ways to parallelize the model over multiple GPUs: (1) tensor-parallel or (2) data-parallel. Tensor-parallel is usually the better choice for low-latency, low-load scenarios, while data-parallel works better for high-volume, heavy-load scenarios.
For Blackwell GPUs, first set the FlashInfer environment variables:
export VLLM_ATTENTION_BACKEND=CUTLASS_MLA
export VLLM_USE_FLASHINFER_MOE_FP8=1
Then, run tensor-parallel like this:
# Start server with FP8 model on 8 GPUs
vllm serve deepseek-ai/DeepSeek-R1-0528 \
--trust-remote-code \
--tensor-parallel-size 8 \
--enable-expert-parallel
Or data-parallel like this:
# Start server with FP8 model on 8 GPUs
vllm serve deepseek-ai/DeepSeek-R1-0528 \
--trust-remote-code \
--data-parallel-size 8 \
--enable-expert-parallel
Note that in both cases we enable expert-parallel as well, so the first run is what we call TP+EP and the second one is DP+EP.
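The choice between the two modes can be sketched as a small launcher script. This is only an illustration: `WORKLOAD`, `NUM_GPUS`, and `RUN` are hypothetical variables made up for this sketch, not vLLM settings, and the mapping from workload to mode simply follows the rule of thumb above.

```shell
#!/bin/sh
# Hypothetical switches for this sketch (not vLLM settings):
#   WORKLOAD = "latency" (low load) or "throughput" (heavy load)
#   NUM_GPUS = number of GPUs to spread the model over
WORKLOAD="${WORKLOAD:-latency}"
NUM_GPUS="${NUM_GPUS:-8}"

case "$WORKLOAD" in
  latency)    PARALLEL_FLAG="--tensor-parallel-size" ;;  # TP+EP
  throughput) PARALLEL_FLAG="--data-parallel-size" ;;    # DP+EP
  *) echo "unknown WORKLOAD: $WORKLOAD" >&2; exit 1 ;;
esac

SERVE_CMD="vllm serve deepseek-ai/DeepSeek-R1-0528 --trust-remote-code $PARALLEL_FLAG $NUM_GPUS --enable-expert-parallel"

# Print the command for review; launch it only when RUN=1.
echo "$SERVE_CMD"
if [ "${RUN:-0}" = "1" ]; then
  # Unquoted on purpose: the string contains only plain words to split.
  $SERVE_CMD
fi
```

Running it with `WORKLOAD=throughput` prints the DP+EP variant instead of the TP+EP one.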
Additional flags:
- For non-FlashInfer runs, you can use VLLM_USE_DEEP_GEMM and VLLM_ALL2ALL_BACKEND. For example:
export VLLM_USE_DEEP_GEMM=1
export VLLM_ALL2ALL_BACKEND="deepep_high_throughput" # or "deepep_low_latency"
- You can set --max-model-len to preserve memory. --max-model-len=65536 is usually good for most scenarios.
- You can set --max-num-batched-tokens to balance throughput and latency: a higher value means higher throughput but also higher latency. --max-num-batched-tokens=32768 is usually good for prompt-heavy workloads, but you can reduce it to 16k or 8k to reduce activation memory usage and decrease latency.
- vLLM conservatively uses 90% of GPU memory by default; you can set --gpu-memory-utilization=0.95 to maximize the KV cache.
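As a rough sketch, the tuning flags above can be combined with the TP+EP server command as follows. The values are the ones suggested in the list and may need adjusting for your hardware; `RUN` is a hypothetical switch added for this example so that the command is printed for review before anything is launched.

```shell
#!/bin/sh
# Tuning values taken from the list above; adjust for your hardware.
MAX_MODEL_LEN=65536
MAX_NUM_BATCHED_TOKENS=32768
GPU_MEMORY_UTILIZATION=0.95

# Build the argument list with `set --` so each flag stays a separate word.
set -- deepseek-ai/DeepSeek-R1-0528 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len "$MAX_MODEL_LEN" \
  --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" \
  --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION"

# Print the command for review; launch it only when RUN=1.
SERVE_CMD="vllm serve $*"
echo "$SERVE_CMD"
if [ "${RUN:-0}" = "1" ]; then
  vllm serve "$@"
fi
```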
To run the FP4 model on Blackwell GPUs, set these environment variables first:
export VLLM_ATTENTION_BACKEND=CUTLASS_MLA
export VLLM_USE_FLASHINFER_MOE_FP4=1
Then run tensor-parallel:
# The model is runnable on 4 or 8 GPUs, here we show usage of 4.
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-R1-FP4 \
--trust-remote-code \
--tensor-parallel-size 4 \
--enable-expert-parallel
Or, data-parallel:
# The model is runnable on 4 or 8 GPUs, here we show usage of 4.
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve nvidia/DeepSeek-R1-FP4 \
--trust-remote-code \
--data-parallel-size 4 \
--enable-expert-parallel
For benchmarking, disable prefix caching by adding --no-enable-prefix-caching to the server command.
# Prompt-heavy benchmark (8k/1k)
vllm bench serve \
--model deepseek-ai/DeepSeek-R1-0528 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
# Prompt-heavy benchmark (8k/1k)
vllm bench serve \
--model nvidia/DeepSeek-R1-FP4 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10000 \
--num-prompts 16 \
--ignore-eos
Test different workloads by adjusting input/output lengths:
- Prompt-heavy: 8000 input / 1000 output
- Decode-heavy: 1000 input / 8000 output
- Balanced: 1000 input / 1000 output
Test different batch sizes by changing --num-prompts:
- Batch sizes: 1, 16, 32, 64, 128, 256, 512
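The workload and batch-size combinations listed above can be swept with a small loop. The following is a minimal sketch, not an official script: `run_bench`, `MODEL`, and `RUN` are names invented for this example, and by default the script only prints the `vllm bench serve` commands so the sweep can be reviewed before it is run against a live server.

```shell
#!/bin/sh
# Model to benchmark; override with MODEL=nvidia/DeepSeek-R1-FP4 for the FP4 server.
MODEL="${MODEL:-deepseek-ai/DeepSeek-R1-0528}"

# Print one benchmark command, or run it when RUN=1 (a hypothetical switch).
# Arguments: input length, output length, number of prompts.
run_bench() {
  set -- vllm bench serve \
    --model "$MODEL" \
    --dataset-name random \
    --random-input-len "$1" \
    --random-output-len "$2" \
    --request-rate 10000 \
    --num-prompts "$3" \
    --ignore-eos
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Workloads as input:output pairs: prompt-heavy, decode-heavy, balanced.
for workload in 8000:1000 1000:8000 1000:1000; do
  input_len=${workload%%:*}
  output_len=${workload##*:}
  for num_prompts in 1 16 32 64 128 256 512; do
    run_bench "$input_len" "$output_len" "$num_prompts"
  done
done
```

Setting `RUN=1` executes the 21 benchmark runs one after another instead of printing them.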
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 16.39
Total input tokens: 7902
Total generated tokens: 1000
Request throughput (req/s): 0.06
Output token throughput (tok/s): 61.00
Total Token throughput (tok/s): 543.06
---------------Time to First Token----------------
Mean TTFT (ms): 560.00
Median TTFT (ms): 560.00
P99 TTFT (ms): 560.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 15.85
Median TPOT (ms): 15.85
P99 TPOT (ms): 15.85
---------------Inter-token Latency----------------
Mean ITL (ms): 15.85
Median ITL (ms): 15.85
P99 ITL (ms): 16.15
==================================================
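The headline numbers in a report like this can be cross-checked with a little arithmetic. The sketch below recomputes the throughput figures and the time per output token from the values printed above. It assumes that, with a single request, the end-to-end latency is close to the benchmark duration; `calc` is a helper defined only for this example. Small differences from the report (for example 61.01 versus 61.00 tok/s) are presumably due to the duration being rounded to two decimals in the printout.

```shell
#!/bin/sh
# Values copied from the sample report above.
DURATION_S=16.39
INPUT_TOKENS=7902
OUTPUT_TOKENS=1000
MEAN_TTFT_MS=560.00

# calc EXPR: evaluate an arithmetic expression with awk, rounded to 2 decimals.
calc() {
  LC_ALL=C awk "BEGIN { printf \"%.2f\", $1 }"
}

# Throughput = tokens / benchmark duration.
OUTPUT_TPS=$(calc "$OUTPUT_TOKENS / $DURATION_S")
TOTAL_TPS=$(calc "($INPUT_TOKENS + $OUTPUT_TOKENS) / $DURATION_S")

# TPOT = (latency - TTFT) / (output tokens - 1).
# Assumption: latency is approximated by the benchmark duration (single request).
TPOT_MS=$(calc "($DURATION_S * 1000 - $MEAN_TTFT_MS) / ($OUTPUT_TOKENS - 1)")

echo "Output token throughput (tok/s): $OUTPUT_TPS"  # 61.01
echo "Total token throughput (tok/s):  $TOTAL_TPS"   # 543.14
echo "Estimated TPOT (ms):             $TPOT_MS"     # 15.85
```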