[Feature]:[Question / DeepSeek-V4] KV cache FP8 layout for SWA / COMPRESS pools differs from the DeepSeek-V4 reference implementation (FlashMLA / SGLang) — any plan to align?

### 🚀 The feature, motivation and pitch

Hi TensorRT-LLM team, first of all, thanks a lot for the great work on the DeepSeek-V4 sparse attention support! 🙏

While checking the V4 (Flash) KV-cache layout against the upstream DeepSeek reference (FlashMLA kernels / SGLang implementation), I noticed that TensorRT-LLM uses a different FP8 quantization scheme for the SWA pool and the COMPRESS pool than the DeepSeek-V4 reference does. Before opening a PR or filing this as a bug, I'd like to ask whether the divergence is intentional, and whether there is a roadmap for, or interest in, aligning with the upstream layout.

#### What I observe

1. TensorRT-LLM (current main): per-tensor FP8 for both nope and rope, with a single global scale.
   Per-token bytes for SWA / COMPRESS = 512 B = 448 (nope FP8) + 64 (rope FP8), with no in-cache scale.
2. DeepSeek-V4 reference (FlashMLA / SGLang): mixed precision with per-tile (MX-FP8) scales.
   Per-token bytes for SWA / COMPRESS = 584 B = 448 (nope FP8 E4M3) + 128 (rope BF16) + 7 (UE8M0 per-tile scales) + 1 (pad).
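
To make the byte counts above easy to check, here is a small Python sketch of the arithmetic. The element counts (448 nope, 64 rope) are read off the byte counts; the 64-element tile width is my own inference from 7 scale bytes covering 448 nope elements, not something I have confirmed in either codebase. All names are illustrative.

```python
# Per-token byte budget for the two SWA / COMPRESS layouts described above.
NOPE_ELEMS = 448   # non-rotary elements per token (448 B at 1 B each)
ROPE_ELEMS = 64    # rotary elements per token (64 B as FP8, 128 B as BF16)
TILE_ELEMS = 64    # ASSUMPTION: 448 nope elements / 7 scale bytes = 64 per tile
FP8_BYTES, BF16_BYTES, UE8M0_BYTES = 1, 2, 1


def per_tensor_fp8_bytes() -> int:
    """nope and rope both stored as FP8; the single scale lives outside the cache."""
    return NOPE_ELEMS * FP8_BYTES + ROPE_ELEMS * FP8_BYTES


def mxfp8_v4_bytes(pad: int = 1) -> int:
    """FP8 nope + BF16 rope + one UE8M0 scale per nope tile + padding."""
    n_scales = NOPE_ELEMS // TILE_ELEMS
    return (NOPE_ELEMS * FP8_BYTES
            + ROPE_ELEMS * BF16_BYTES
            + n_scales * UE8M0_BYTES
            + pad)


if __name__ == "__main__":
    a, b = per_tensor_fp8_bytes(), mxfp8_v4_bytes()
    print(f"per-tensor FP8: {a} B, reference layout: {b} B, delta: {b - a} B")
```

On these numbers the reference layout costs 72 more bytes per token, roughly 14% more cache than the per-tensor layout.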

#### Why this is concerning to me

1. Numerical accuracy of rope.
   Quantizing the rotary segment to FP8 under a single global scale shared across nope and rope is, to my understanding, about the coarsest setting available, and much coarser than DeepSeek's design, which keeps rope in BF16. The rotary dimensions tend to have a different value distribution from nope, and folding both under one scalar (`quant_scale_kv[0]`) very likely costs end-to-end quality, especially when the calibration scale defaults to 1.0 (`torch.ones(1)` at init).
2. Per-tensor vs. per-tile (MX-FP8).
   Even on the nope path, the upstream design uses MX-FP8 (block-32 / block-64 with UE8M0 per-tile scales), which is the format that DeepSeek-V4 weights and activations are co-designed around. A single global scalar discards the per-tile dynamic range that the reference relies on.
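
To illustrate the dynamic-range concern, here is a NumPy-only sketch that simulates FP8 E4M3 rounding and compares a single scale of 1.0 against one power-of-two (UE8M0-style) scale per 64-element tile. Everything in it is an assumption for illustration: the data is synthetic Gaussian noise with tile magnitudes chosen to span several octaves, the rounding helper is my own emulation rather than either library's kernel, and the function names are made up.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite FP8 E4M3 (fn variant) value
TILE = 64          # assumed tile width (448 nope elements / 7 scales)


def round_to_e4m3(x):
    """Emulate round-to-nearest FP8 E4M3 with saturation (3 mantissa bits)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    mag = np.abs(x)
    # Clamp the exponent at -6 (smallest normal) so subnormals use step 2**-9.
    exp = np.floor(np.log2(np.maximum(mag, 2.0 ** -6)))
    step = 2.0 ** (exp - 3)
    return np.sign(x) * np.round(mag / step) * step


def quant_per_tensor(x, scale=1.0):
    """One scalar scale for the whole tensor (1.0 mirrors an uncalibrated default)."""
    return round_to_e4m3(x / scale) * scale


def quant_per_tile(x, tile=TILE):
    """One power-of-two scale per tile, chosen so the tile's amax fits in E4M3."""
    tiles = x.reshape(-1, tile)
    amax = np.maximum(np.abs(tiles).max(axis=1, keepdims=True), 2.0 ** -126)
    scale = 2.0 ** np.ceil(np.log2(amax / E4M3_MAX))
    return (round_to_e4m3(tiles / scale) * scale).reshape(x.shape)


def per_tile_rel_error(x, q, tile=TILE):
    """Relative RMS error for each tile position, pooled over tokens."""
    xt = x.reshape(x.shape[0], -1, tile)
    qt = q.reshape(x.shape[0], -1, tile)
    num = np.sqrt(((xt - qt) ** 2).sum(axis=(0, 2)))
    den = np.sqrt((xt ** 2).sum(axis=(0, 2)))
    return num / den


rng = np.random.default_rng(0)
# Synthetic stand-in for 1024 tokens x 448 nope elements: 7 tiles whose
# magnitudes range from 2**-12 to 2**0 (NOT measured from a real model).
tile_mags = 2.0 ** np.arange(-12, 2, 2)
x = (rng.standard_normal((1024, 7, TILE)) * tile_mags[None, :, None]).reshape(1024, 448)

err_tensor = per_tile_rel_error(x, quant_per_tensor(x, scale=1.0))
err_tile = per_tile_rel_error(x, quant_per_tile(x))

if __name__ == "__main__":
    for i, (a, b) in enumerate(zip(err_tensor, err_tile)):
        print(f"tile {i}: per-tensor {a:.3f}  per-tile {b:.3f}")
```

On this synthetic input the small-magnitude tiles mostly fall below the E4M3 subnormal step under the global scale, while the per-tile scales keep the relative error at roughly the same level for every tile. Real activations will behave differently, so this only shows the mechanism, not the size of the effect on V4.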

#### Questions

1. Was the current per-tensor FP8 layout (with rope also quantized) an intentional design choice, or is it a placeholder inherited from earlier MLA / V3 paths?
2. Is there a roadmap to support the upstream layout of FP8 nope + BF16 rope + UE8M0 per-tile scales (i.e. MX-FP8 SWA / COMPRESS pools)?
   - If yes, is there a tracking issue / PR I can follow?
   - If not, would the team be open to a community contribution adding this as a selectable mode (e.g. `kv_cache_layout: ["per_tensor_fp8" | "mxfp8_v4"]`) on `DeepSeekV4SparseAttentionConfig`?
3. Are there published accuracy numbers comparing the two layouts on representative V4 evaluation sets (MMLU / GSM8K / long-context)? I couldn't find any in the repo and would love to see them if they exist.
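
To make the selectable-mode idea in question 2 concrete, here is a rough sketch of what such a field could look like. `SparseAttentionConfigSketch` is a hypothetical stand-in I wrote for this issue, not the real `DeepSeekV4SparseAttentionConfig`, and the field name, accepted values, and default are only a proposal.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Proposed layout names (hypothetical, taken from the suggestion in this issue).
KVCacheLayout = Literal["per_tensor_fp8", "mxfp8_v4"]

# Per-token byte counts for SWA / COMPRESS as listed earlier in this issue.
_BYTES_PER_TOKEN = {"per_tensor_fp8": 512, "mxfp8_v4": 584}


@dataclass
class SparseAttentionConfigSketch:
    """Illustrative stand-in; NOT the actual TensorRT-LLM config class."""

    # Defaulting to the current behavior keeps existing setups unchanged.
    kv_cache_layout: KVCacheLayout = "per_tensor_fp8"

    def __post_init__(self) -> None:
        allowed = get_args(KVCacheLayout)
        if self.kv_cache_layout not in allowed:
            raise ValueError(
                f"kv_cache_layout must be one of {allowed}, "
                f"got {self.kv_cache_layout!r}"
            )

    @property
    def bytes_per_token(self) -> int:
        """Cache bytes per token implied by the selected layout."""
        return _BYTES_PER_TOKEN[self.kv_cache_layout]


if __name__ == "__main__":
    cfg = SparseAttentionConfigSketch(kv_cache_layout="mxfp8_v4")
    print(cfg.kv_cache_layout, cfg.bytes_per_token)
```

Keeping the current layout as the default would make the new mode opt-in, so pool sizing for existing deployments would not change unless the user asks for it.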

Thanks again for the work and for taking the time to read through this. I'm happy to provide additional traces, accuracy logs, or a minimal repro if useful, and to help on the implementation side if a community PR would be welcome. 🙏

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.

[Feature]:[Question / DeepSeek-V4] KV cache FP8 layout for SWA / COMPRESS pools differs from the DeepSeek-V4 reference implementation (FlashMLA / SGLang) — any plan to align? #14327
