[Feature]:[Question / DeepSeek-V4] KV cache FP8 layout for SWA / COMPRESS pools differs from the DeepSeek-V4 reference implementation (FlashMLA / SGLang): any plan to align? #14327

@inference666


🚀 The feature, motivation and pitch

Hi TensorRT-LLM team, first of all, thanks a lot for the great work on the DeepSeek-V4 sparse attention support! 🙏 While checking the V4 (Flash) KV-cache layout against the upstream DeepSeek reference (FlashMLA kernels / SGLang implementation), I noticed that TensorRT-LLM uses a different FP8 quantization scheme for the SWA pool and the COMPRESS pool than the DeepSeek-V4 reference does. Before opening a PR or filing this as a bug, I'd like to ask whether this divergence is intentional, and whether there is a roadmap or interest to align with the upstream layout.
What I observe

  1. TensorRT-LLM (current main): per-tensor FP8 for both nope and rope, single global scale
    Per-token bytes for SWA / COMPRESS = 512 B = 448 (nope FP8) + 64 (rope FP8), no in-cache scale.

  2. DeepSeek-V4 reference (FlashMLA / SGLang): mixed-precision with per-tile (MX-FP8) scale
    Per-token bytes for SWA / COMPRESS = 584 B = 448 (nope FP8 E4M3) + 128 (rope BF16) + 7 (UE8M0 per-tile scale) + 1 (pad).
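
To make the byte counts above easy to check, here is a small sketch of the arithmetic. The segment sizes (448 nope elements, 64 rope elements, 7 scale bytes, 1 pad byte) come from the two layouts described above; the dictionary names are only illustrative and are not identifiers from TensorRT-LLM, FlashMLA, or SGLang.

```python
# Per-token byte budget for one SWA / COMPRESS cache entry.
# Sizes are taken from the two layouts described in this issue;
# the names below are illustrative only.

FP8_BYTES = 1   # FP8 E4M3 element
BF16_BYTES = 2  # BF16 element

NOPE_DIM = 448  # non-rotary elements per token
ROPE_DIM = 64   # rotary elements per token

per_tensor_fp8 = {
    "nope_fp8": NOPE_DIM * FP8_BYTES,  # 448
    "rope_fp8": ROPE_DIM * FP8_BYTES,  # 64
    # no in-cache scale: a single global scalar lives outside the cache
}

mixed_precision = {
    "nope_fp8_e4m3": NOPE_DIM * FP8_BYTES,  # 448
    "rope_bf16": ROPE_DIM * BF16_BYTES,     # 128
    "ue8m0_scales": 7,                      # one byte per tile of the nope segment
    "pad": 1,
}

per_tensor_total = sum(per_tensor_fp8.values())
mixed_total = sum(mixed_precision.values())

print(f"per-tensor FP8 layout: {per_tensor_total} B/token")
print(f"mixed-precision layout: {mixed_total} B/token")
print(f"extra bytes per token: {mixed_total - per_tensor_total}")
```

Seven scale bytes over 448 nope elements works out to one scale per 64 elements, which is my reading of the tile size for this segment.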

Why this is concerning to me

  1. Numerical accuracy of rope.
    Quantizing the rotary segment to FP8 with a single global scale shared across nope+rope is, to my understanding, the most aggressive setting possible, and much coarser than DeepSeek's design, which keeps rope at BF16. The rotary directions tend to have a different value distribution than nope, and folding them under one scalar (quant_scale_kv[0]) very likely costs end-to-end quality, especially when the calibration scale defaults to 1.0 (torch.ones(1) at init).
  2. Per-tensor vs per-tile (MX-FP8).
    Even on the nope path, the upstream design uses MX-FP8 (block-32 / block-64 with UE8M0 per-tile scales), which is the format that DeepSeek-V4 weights and activations are co-designed around. A single global scalar discards the per-tile dynamic range that the reference relies on.
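
To illustrate the dynamic-range concern, here is a rough pure-Python emulation of my own, not code from TensorRT-LLM or FlashMLA. It rounds one synthetic tile of small-magnitude values to FP8 E4M3 twice: once with a global scale of 1.0, and once with a power-of-two (UE8M0-style) scale chosen for that tile. The tile size of 64, the helper names, and the synthetic data are all assumptions made for the example, and it says nothing about real KV-cache value distributions.

```python
import math

E4M3_MAX = 448.0    # largest finite FP8 E4M3 value
E4M3_MIN_EXP = -6   # exponent of the smallest normal E4M3 value
TILE = 64           # assumed tile size (448 nope values / 7 scale bytes)


def round_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, ex = math.frexp(a)              # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, E4M3_MIN_EXP)      # below 2**-6 we are in the subnormal range
    step = 2.0 ** (e - 3)              # 3 mantissa bits -> 8 steps per binade
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def ue8m0_scale(amax: float) -> float:
    """Smallest power-of-two scale that maps amax into the E4M3 range."""
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))


def quant_dequant(values, scale):
    """Quantize with the given scale, then dequantize back to float."""
    return [round_e4m3(v / scale) * scale for v in values]


def mean_abs_error(values, approx):
    return sum(abs(v - a) for v, a in zip(values, approx)) / len(values)


# One synthetic tile of small-magnitude values (not real KV-cache data).
tile = [(i - 31.5) / 32 * 0.004 for i in range(TILE)]

global_err = mean_abs_error(tile, quant_dequant(tile, 1.0))
tile_scale = ue8m0_scale(max(abs(v) for v in tile))
tile_err = mean_abs_error(tile, quant_dequant(tile, tile_scale))

print(f"global scale 1.0 : mean abs error {global_err:.2e}")
print(f"per-tile scale   : mean abs error {tile_err:.2e} (scale {tile_scale})")
```

With a scale of 1.0 these values land in the E4M3 subnormal range, where the step size is fixed, while the per-tile scale moves them into the normal range. Whether real SWA / COMPRESS entries behave this way depends on the actual calibration, which is exactly what I am hoping the accuracy numbers can show.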

Questions

  1. Was the current per-tensor FP8 (with rope also quantized) layout an intentional design choice, or is it a placeholder inherited from earlier MLA / V3 paths?
  2. Is there a roadmap to support the upstream fp8 nope + bf16 rope + UE8M0 per-tile scale layout (i.e. MX-FP8 SWA/COMPRESS pools)?
    If yes, is there a tracking issue / PR I can follow?
    If not, would the team be open to a community contribution adding this as a selectable mode (e.g. kv_cache_layout: ["per_tensor_fp8" | "mxfp8_v4"]) on DeepSeekV4SparseAttentionConfig?
  3. Are there published accuracy numbers comparing the two layouts on representative V4 evaluation sets (MMLU / GSM8K / long-context)? I couldn't find any in the repo and would love to see them if they exist.
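
For the selectable-mode idea above, this is roughly the shape I have in mind. It is a purely hypothetical sketch: `SparseAttentionConfigSketch` is a stand-in I made up for illustration and is not the real `DeepSeekV4SparseAttentionConfig`, and the `kv_cache_layout` field does not exist in TensorRT-LLM today as far as I know.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Proposed option values from this issue; neither is an existing TensorRT-LLM flag.
KVCacheLayout = Literal["per_tensor_fp8", "mxfp8_v4"]


@dataclass
class SparseAttentionConfigSketch:
    """Hypothetical stand-in for the real config class, for illustration only."""

    # Default keeps the current behaviour so existing setups are unaffected.
    kv_cache_layout: KVCacheLayout = "per_tensor_fp8"

    def __post_init__(self) -> None:
        allowed = get_args(KVCacheLayout)
        if self.kv_cache_layout not in allowed:
            raise ValueError(
                f"kv_cache_layout must be one of {allowed}, "
                f"got {self.kv_cache_layout!r}"
            )


default_cfg = SparseAttentionConfigSketch()
upstream_cfg = SparseAttentionConfigSketch(kv_cache_layout="mxfp8_v4")
print(default_cfg.kv_cache_layout, upstream_cfg.kv_cache_layout)
```

Keeping the current layout as the default would make the new mode strictly opt-in.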

Thanks again for the work and for taking the time to read through this. I'm happy to provide additional traces, accuracy logs, or a minimal repro if useful, and very happy to help on the implementation side if a community PR would be welcome. 🙏

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels

KV-Cache Management (kv-cache management for efficient LLM inference), feature request (New feature or request. This includes new model, dtype, functionality support)
