
Conversation

@tianhaox (Contributor) commented Nov 23, 2025

Overview:

For the regular-path DeepSeek model, Hopper defaults to calling FA3 for MLA prefill and decode. Blackwell calls the trtllm-gen kernels instead, but that path is not verified yet; all code is verified on Hopper.
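A minimal sketch of the architecture-based dispatch described above. The helper and backend names here are illustrative assumptions, not the actual Dynamo/TRT-LLM API:

```python
import torch

# Sketch only: function and backend names are hypothetical.
def select_mla_attention_backend() -> str:
    """Pick the MLA attention backend from the GPU compute capability."""
    major, _minor = torch.cuda.get_device_capability()
    if major == 9:       # Hopper (SM90): FA3 handles MLA prefill and decode
        return "fa3"
    if major == 10:      # Blackwell (SM100): trtllm-gen kernels (not yet verified)
        return "trtllm-gen"
    raise NotImplementedError(f"No MLA backend for compute capability {major}.x")
```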

Limitations:

  1. FA3 efficiency for context (prefill) MLA is not great: roughly 30-50% of GQA performance.
  2. Generation (decode) attention with an FP8 KV cache performs very poorly; see https://docs.sglang.io/advanced_features/attention_backend.html.
  3. The current MLA KV-cache pool op has a bug: it uses int32 for indexing, which limits bs*seq_len to roughly 4M tokens. Modify the collector a bit to remove those test cases (see the sketch after this list).
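A minimal sketch of why int32 indexing caps the pool at roughly 4M tokens, assuming the pool computes a flat per-token element offset of token_index * 576 (512 kv_lora_rank + 64 rope elements per DeepSeek MLA token); this is illustrative, not the actual pool code:

```python
import numpy as np

# Sketch only: illustrates the int32 overflow behind the ~4M-token limit.
ELEMS_PER_TOKEN = 576                        # assumed per-token KV entry size
INT32_MAX = np.iinfo(np.int32).max           # 2_147_483_647

# Largest token index whose flat element offset still fits in int32.
max_tokens_int32 = INT32_MAX // ELEMS_PER_TOKEN
print(max_tokens_int32)                      # ~3.7M tokens, i.e. the bs * seq_len cap

# With int64 offsets the cap disappears for any realistic pool size.
token_idx = np.int64(10_000_000)
flat_offset = token_idx * ELEMS_PER_TOKEN    # safe: no int32 wraparound
```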

Signed-off-by: Tianhao Xu <tianhaox@nvidia.com>
@copy-pr-bot (bot) commented Nov 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@davilu-nvidia (Contributor) left a comment

LGTM

@tianhaox merged commit 3733cdd into ai-dynamo:main on Nov 26, 2025
7 of 8 checks passed