
Conversation

@tianhaox (Contributor) commented Nov 23, 2025

Overview:

For the regular-path DeepSeek model, Hopper defaults to calling FA3 for MLA prefill and decode. Blackwell calls the trtllm-gen kernels instead, but that path is not verified yet; all code is verified on Hopper.
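A minimal sketch of the architecture-based dispatch described above. The helper and backend names here are illustrative assumptions, not the actual Dynamo/TRT-LLM API:

```python
import torch

# Sketch only: function and backend names are hypothetical.
def select_mla_attention_backend() -> str:
    """Pick the MLA attention backend from the GPU compute capability."""
    major, _minor = torch.cuda.get_device_capability()
    if major == 9:       # Hopper (SM90): FA3 handles MLA prefill and decode
        return "fa3"
    if major == 10:      # Blackwell (SM100): trtllm-gen kernels (not yet verified)
        return "trtllm-gen"
    raise NotImplementedError(f"No MLA backend for compute capability {major}.x")
```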

Limitations:

  1. FA3 efficiency for context (prefill) MLA is not great: roughly 30-50% of GQA performance.
  2. Generation (decode) attention with an FP8 KV cache performs very poorly; see https://docs.sglang.io/advanced_features/attention_backend.html.
  3. The current MLA KV-cache pool op has a bug: it uses int32 for indexing, which limits bs*seq_len to roughly 4M tokens. Modify the collector a bit to remove those test cases (see the sketch after this list).
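A minimal sketch of why int32 indexing caps the pool at roughly 4M tokens, assuming the pool computes a flat per-token element offset of token_index * 576 (512 kv_lora_rank + 64 rope elements per DeepSeek MLA token); this is illustrative, not the actual pool code:

```python
import numpy as np

# Sketch only: illustrates the int32 overflow behind the ~4M-token limit.
ELEMS_PER_TOKEN = 576                        # assumed per-token KV entry size
INT32_MAX = np.iinfo(np.int32).max           # 2_147_483_647

# Largest token index whose flat element offset still fits in int32.
max_tokens_int32 = INT32_MAX // ELEMS_PER_TOKEN
print(max_tokens_int32)                      # ~3.7M tokens, i.e. the bs * seq_len cap

# With int64 offsets the cap disappears for any realistic pool size.
token_idx = np.int64(10_000_000)
flat_offset = token_idx * ELEMS_PER_TOKEN    # safe: no int32 wraparound
```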

Signed-off-by: Tianhao Xu <tianhaox@nvidia.com>
@copy-pr-bot (bot) commented Nov 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@davilu-nvidia (Contributor) left a comment

LGTM

@tianhaox merged commit 3733cdd into ai-dynamo:main on Nov 26, 2025
7 of 8 checks passed