The feature, motivation and pitch
Hi TensorRT-LLM team, first of all, thanks a lot for the great work on the DeepSeek-V4 sparse attention support!
While checking the V4 (Flash) KV-cache layout against the upstream DeepSeek reference (FlashMLA kernels / SGLang implementation), I noticed that TensorRT-LLM uses a different FP8 quantization scheme for the SWA pool and the COMPRESS pool than the DeepSeek-V4 reference does. Before opening a PR or filing this as a bug, I'd like to ask whether this divergence is intentional, and whether there is a roadmap for, or interest in, aligning with the upstream layout.
What I observe
- TensorRT-LLM (current main): per-tensor FP8 for both nope and rope, with a single global scale.
  Per-token bytes for SWA / COMPRESS = 512 B = 448 (nope FP8) + 64 (rope FP8), no in-cache scale.
- DeepSeek-V4 reference (FlashMLA / SGLang): mixed precision with a per-tile (MX-FP8) scale.
  Per-token bytes for SWA / COMPRESS = 584 B = 448 (nope FP8 E4M3) + 128 (rope BF16) + 7 (UE8M0 per-tile scale) + 1 (pad).
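For reference, the byte arithmetic behind the two layouts above can be sketched as follows. The dimension split (448 nope + 64 rope elements) is read off the byte counts quoted in this issue; the 64-element tile width is my own inference from 7 scale bytes covering 448 nope values, and all names are illustrative rather than taken from either codebase.

```python
# Per-token KV-cache footprint for the two layouts discussed in this issue.
# Assumption: 448 nope + 64 rope elements per token, as implied by the byte counts.
NOPE_DIM = 448
ROPE_DIM = 64
TILE = 64  # assumed tile width: 448 nope values / 7 scale bytes


def per_tensor_fp8_bytes() -> int:
    # nope and rope are both 1-byte FP8; the single scale lives outside the cache
    return NOPE_DIM * 1 + ROPE_DIM * 1


def mxfp8_v4_bytes() -> int:
    nope = NOPE_DIM * 1        # FP8 E4M3, 1 byte per element
    rope = ROPE_DIM * 2        # BF16, 2 bytes per element
    scales = NOPE_DIM // TILE  # one UE8M0 byte per tile
    pad = 1                    # padding byte
    return nope + rope + scales + pad


print(per_tensor_fp8_bytes(), mxfp8_v4_bytes())  # 512 584
```

Under these assumptions the upstream layout costs 72 extra bytes per token, roughly 14% more cache than the per-tensor layout.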
Why this is concerning to me
- Numerical accuracy of rope.
  Quantizing the rotary segment to FP8 with a single global scale shared across nope and rope is, to my understanding, about the coarsest setting available, and much coarser than DeepSeek's design, which keeps rope at BF16. The rotary dimensions tend to have a different value distribution than nope, so folding both under one scalar (quant_scale_kv[0]) very likely costs end-to-end quality, especially when the calibration scale is left at its default of 1.0 (torch.ones(1) at init).
- Per-tensor vs per-tile (MX-FP8).
  Even on the nope path, the upstream design uses MX-FP8 (block-32 / block-64 with UE8M0 per-tile scales), which is the format that DeepSeek-V4 weights and activations are co-designed around. A single global scalar discards the per-tile dynamic range that the reference relies on.
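To illustrate the dynamic-range concern, here is a small pure-Python sketch that compares a per-tensor scale of 1.0 against per-tile power-of-two scales. It is not the TensorRT-LLM or FlashMLA code path: `round_e4m3` is my own emulation of FP8 E4M3 rounding, the tile scale rule (smallest power of two that maps the tile maximum into range) is an assumption about how UE8M0 scales are typically chosen, and the input is synthetic data whose magnitude shrinks from tile to tile.

```python
import math
import random

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
TILE = 64         # assumed tile width (448 nope values / 7 scale bytes)


def round_e4m3(x: float) -> float:
    """Emulate round-to-nearest-even into FP8 E4M3, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, ex = math.frexp(a)        # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)          # clamp to the minimum normal exponent
    step = 2.0 ** (e - 3)        # 3 mantissa bits
    return sign * min(round(a / step) * step, E4M3_MAX)


def quant_per_tensor(xs, scale=1.0):
    # One scalar for the whole vector (1.0 mimics an uncalibrated default).
    return [round_e4m3(v / scale) * scale for v in xs]


def quant_per_tile(xs, tile=TILE):
    # One power-of-two scale per tile, in the spirit of UE8M0.
    out = []
    for i in range(0, len(xs), tile):
        chunk = xs[i:i + tile]
        amax = max(abs(v) for v in chunk)
        scale = 2.0 ** math.ceil(math.log2(amax / E4M3_MAX)) if amax > 0 else 1.0
        out.extend(round_e4m3(v / scale) * scale for v in chunk)
    return out


def mean_rel_err(xs, ys):
    errs = [abs(x - y) / abs(x) for x, y in zip(xs, ys) if x != 0.0]
    return sum(errs) / len(errs)


rng = random.Random(0)
# Synthetic nope vector: 7 tiles whose standard deviation drops by 4x per tile.
nope = [rng.gauss(0.0, 2.0 ** (-2 * t)) for t in range(7) for _ in range(TILE)]

pt_err = mean_rel_err(nope, quant_per_tensor(nope))
mx_err = mean_rel_err(nope, quant_per_tile(nope))
print(f"per-tensor (scale=1.0): {pt_err:.4f}  per-tile: {mx_err:.4f}")
```

With a global scale of 1.0, the low-magnitude tiles fall into the E4M3 subnormal range or flush to zero, while per-tile scales keep every tile in the normal range. Real activations will not look like this synthetic input, so this only shows the mechanism, not the size of the effect on V4.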
Questions
- Was the current per-tensor FP8 (with rope also quantized) layout an intentional design choice, or is it a placeholder inherited from earlier MLA / V3 paths?
- Is there a roadmap to support the upstream fp8 nope + bf16 rope + UE8M0 per-tile scale layout (i.e. MX-FP8 SWA/COMPRESS pools)?
  - If yes, is there a tracking issue / PR I can follow?
  - If not, would the team be open to a community contribution adding this as a selectable mode (e.g. kv_cache_layout: ["per_tensor_fp8" | "mxfp8_v4"]) on DeepSeekV4SparseAttentionConfig?
- Are there published accuracy numbers comparing the two layouts on representative V4 evaluation sets (MMLU / GSM8K / long-context)? I couldn't find any in the repo and would love to see them if they exist.
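To make the selectable-mode idea from the questions above concrete, here is a rough sketch of what such a switch could look like. Everything in it is hypothetical: `SparseAttentionConfigSketch` is a stand-in I made up for illustration and does not reflect the actual fields or validation of DeepSeekV4SparseAttentionConfig, and the byte counts are the ones quoted earlier in this issue.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Hypothetical layout names, matching the ones proposed in this issue.
KVCacheLayout = Literal["per_tensor_fp8", "mxfp8_v4"]


@dataclass
class SparseAttentionConfigSketch:
    """Illustrative stand-in, not the real TensorRT-LLM config class."""

    kv_cache_layout: str = "per_tensor_fp8"  # keep current behaviour as default

    def __post_init__(self) -> None:
        allowed = get_args(KVCacheLayout)
        if self.kv_cache_layout not in allowed:
            raise ValueError(
                f"kv_cache_layout must be one of {allowed}, "
                f"got {self.kv_cache_layout!r}"
            )

    def bytes_per_token(self) -> int:
        # Per-token SWA / COMPRESS footprint, as quoted in this issue.
        if self.kv_cache_layout == "per_tensor_fp8":
            return 448 + 64            # nope FP8 + rope FP8
        return 448 + 128 + 7 + 1       # nope FP8 + rope BF16 + scales + pad


cfg = SparseAttentionConfigSketch(kv_cache_layout="mxfp8_v4")
print(cfg.kv_cache_layout, cfg.bytes_per_token())  # mxfp8_v4 584
```

Defaulting to the existing layout would keep current deployments unchanged, with the upstream-compatible layout as an opt-in.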
Thanks again for the work and for taking the time to read through this. I'm happy to provide additional traces, accuracy logs, or a minimal repro if useful, and to help on the implementation side if a community PR would be welcome.
Alternatives
No response
Additional context
No response
Before submitting a new issue...