Fix NSA FP8 KV cache path for both-trtllm MHA one-shot #18931
mmangkad wants to merge 5 commits into sgl-project:main from
Conversation
Support compact (576) and packed (656) KV layouts in paged dequant, gate the dequant path by backend storage mode, and pass explicit MLA dims.
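As a rough illustration, here is a minimal sketch of the layout dispatch described above (the signature is simplified, and the dimension breakdown — 656 bytes as 512 FP8 values + 16 scale bytes + 128 bytes of BF16 RoPE, versus 576 as the plain 512 + 64 MLA dims — is an assumption, not lifted from the actual kernel):

```python
import torch

KV_LORA_RANK = 512        # MLA latent ("nope") dims, stored as FP8
QK_ROPE_HEAD_DIM = 64     # RoPE dims, kept in BF16 (128 bytes)
COMPACT_DIM = 576         # KV_LORA_RANK + QK_ROPE_HEAD_DIM (trtllm storage)
PACKED_DIM = 656          # assumed: 512 FP8 bytes + 16 scale bytes + 128 RoPE bytes

def dequantize_k_cache_paged(kv_raw: torch.Tensor) -> None:
    """Dispatch on the last-dim size instead of asserting a single layout."""
    dim = kv_raw.shape[-1]
    if dim == PACKED_DIM:
        # Packed path: split FP8 payload, per-tile scales, and BF16 RoPE.
        ...
    elif dim == COMPACT_DIM:
        # Compact path: scales live outside the entry; split 512 + 64 directly.
        ...
    else:
        # The pre-fix code effectively asserted dim == PACKED_DIM, which is
        # the "dim_quant: 576 != 656" crash reported below.
        raise AssertionError(f"dim_quant: {dim} != {PACKED_DIM}")
```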
Summary of Changes

Hello @mmangkad, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a regression where a mismatch in FP8 KV cache layouts caused assertion errors during MHA one-shot operations, particularly when prefix sharing was involved. The primary fix involves updating the paged dequantization logic to correctly handle both 'compact' and 'packed' FP8 KV cache formats. This ensures compatibility across different backend storage modes and prevents crashes in high-concurrency scenarios by dynamically adapting to the KV cache's structure.
Code Review
This pull request effectively addresses a regression causing a KV layout mismatch in MHA one-shot mode. The changes introduce support for both 'compact' and 'packed' FP8 layouts in dequantize_k_cache_paged by making it aware of model dimensions and adding gating logic to select the correct dequantization path. The fix appears robust and the call sites have been updated correctly. My review includes a minor suggestion to replace a magic number with a named constant for improved code clarity and maintainability.
/tag-and-rerun-ci
/rerun-failed-ci
@mmangkad From my understanding, the MHA_ONE_SHOT supported attn backends are "fa3", "flashinfer", and "flashmla" per the code at https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/deepseek_common/attention_backend_handler.py#L64-L69. Since the trtllm_mla attn backend is not in that list, if we find an MHA_ONE_SHOT seq in the forward_batch, there are probably some other bugs involved. Update: I checked the dependencies calling
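A sketch of the allowlist check being referenced (the set and helper names are hypothetical; the backend values come from the comment above):

```python
# Attention backends for which MHA_ONE_SHOT is enabled, per the linked
# attention_backend_handler.py lines.
MHA_ONE_SHOT_SUPPORTED_BACKENDS = {"fa3", "flashinfer", "flashmla"}

def supports_mha_one_shot(attn_backend: str) -> bool:
    # A backend outside this set should never produce MHA_ONE_SHOT sequences.
    return attn_backend in MHA_ONE_SHOT_SUPPORTED_BACKENDS
```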
@rainj-me my read is that this goes through the NSA-specific dispatcher:
Then the fix should be to only support MHA instead of MHA_ONE_SHOT with
Thanks - I thought NSA and
Fixing the KV cache dequant to use the 576 KV cache dim may need to be reverted in the future if trtllm supports sparse KV cache. Based on this, I believe changing the NSA routing would be a simple and sufficient fix. The only difference between MHA_ONE_SHOT and MHA is the forward_batch.mha_one_shot flag.
I may be misunderstanding the “dequant to 576” concern, but this change does not hardcode 576; it handles both compact and packed KV layouts using runtime dimensions, and
Also, I don’t think switching NSA from
I just tested; with the following change, instead of dequant, it should work.

```diff
diff --git a/python/sglang/srt/models/deepseek_common/attention_forward_methods/forward_mha.py b/python/sglang/srt/models/deepseek_common/attention_forward_methods/forward_mha.py
index 6eff360c4..db803c6c1 100644
--- a/python/sglang/srt/models/deepseek_common/attention_forward_methods/forward_mha.py
+++ b/python/sglang/srt/models/deepseek_common/attention_forward_methods/forward_mha.py
@@ -215,7 +215,10 @@ class DeepseekMHAForwardMixin:
             forward_batch.mha_one_shot
             and sum(forward_batch.extend_prefix_lens_cpu) != 0
         ):
-            if self.use_nsa and self.kv_cache_dtype == "fp8_e4m3":
+            if self.use_nsa and self.kv_cache_dtype == "fp8_e4m3" and (
+                not get_global_server_args().nsa_decode_backend == 'trtllm'
+                or not get_global_server_args().nsa_prefill_backend == 'trtllm'
+            ):
                 # FP8 path: dequantize NSA-specific FP8 format to BF16
                 kv_a, k_pe = self._get_mla_kv_buffer_from_fp8_for_nsa(forward_batch)
             else:
```

For the testing with radix cache enabled
For the testing without radix cache enabled
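Equivalently, the gate can be stated positively via De Morgan's law: dequantize only when NOT (both NSA backends are trtllm). A self-contained sketch (the function name is hypothetical; the argument values mirror the server args in the diff above):

```python
def should_dequantize_nsa_fp8(
    use_nsa: bool,
    kv_cache_dtype: str,
    nsa_prefill_backend: str,
    nsa_decode_backend: str,
) -> bool:
    """True when the NSA FP8 KV cache must be dequantized to BF16.

    Skip only when BOTH backends are trtllm, since that configuration
    consumes the compact FP8 layout directly and the dequant step is
    unnecessary (and crashes on the layout mismatch).
    """
    both_trtllm = (
        nsa_prefill_backend == "trtllm" and nsa_decode_backend == "trtllm"
    )
    return use_nsa and kv_cache_dtype == "fp8_e4m3" and not both_trtllm
```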
@rainj-me thanks, nice find - I’ll revert my changes and apply this gate for the both-trtllm FP8 path. |
Co-authored-by: rainj-me <96632942+rainj-me@users.noreply.github.com>
/rerun-failed-ci |
Motivation
This is a follow-up to #18389 for the NSA FP8 KV cache path when both the prefill and decode backends are trtllm.
In this setup, the MHA one-shot prefix path could still enter paged dequant and hit a layout mismatch assertion (compact trtllm KV layout vs packed dequant expectation). For the both-trtllm configuration, this dequant step is unnecessary.
cc @rainj-me @Fridge003
Modifications
Skip the NSA FP8 dequant in the MHA one-shot prefix path when both nsa_prefill_backend and nsa_decode_backend are trtllm.

Accuracy Tests
This is an example of a case that previously crashed with AssertionError: dim_quant: 576 != 656.

```bash
SGLANG_ENABLE_SPEC_V2=1 python -m sglang.launch_server \
  --model-path nvidia/DeepSeek-V3.2-NVFP4 \
  --tensor-parallel-size 4 \
  --attention-backend nsa \
  --nsa-prefill-backend trtllm \
  --nsa-decode-backend trtllm \
  --moe-runner-backend flashinfer_trtllm \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --reasoning-parser deepseek-v3 \
  --tool-call-parser deepseekv32 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 96}'
```

Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci