feat: Support flashinfer_cutedsl MoE runner with flashinfer alltoall backend #22669
Open
samuellees wants to merge 1 commit into sgl-project:main from
Conversation
…backend

Enable CuteDSL FP4 MoE runner to work with FlashInfer one-sided alltoall (NVLink) dispatch for DP attention + EP configurations.

Changes:
- server_args.py: Add flashinfer_cutedsl to the flashinfer a2a whitelist
- qwen2_moe.py: M=0 guard for idle DP ranks (skip shared_expert/gate); skip TP allreduce in a2a mode (combine already aggregates); shared_expert tp_size=1 for a2a (matching deepep behavior)
- flashinfer.py: Remove the dummy-token mechanism and pass 0-size tensors directly to the alltoall kernel (matching TRT-LLM); add invalid_token_expert_id for padding sanitization; increase the default max dispatch tokens per rank
- flashinfer_cutedsl.py: Register the fused func for flashinfer a2a; scale wrapper max_num_tokens by ep_size for the a2a layout

Tested: Qwen3.5-397B-A17B-NVFP4, B200x4, EP=4 DP=4
- Output verified correct ("Paris.")
- GPQA accuracy: 86.2% (8 repeats, 198 examples)
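The M=0 guard described above can be sketched as follows. This is a minimal, framework-free sketch: `moe_forward`, `shared_expert`, `gate`, and `experts` are hypothetical stand-ins for the real modules in `qwen2_moe.py`, which operate on CUDA tensors.

```python
# Minimal sketch of the idle-rank (M=0) guard, assuming hypothetical
# callables in place of the real sglang modules. Idle DP ranks skip
# shared_expert/gate (FP4 GEMM cannot take empty inputs) but still call
# experts() so every rank joins the alltoall collective.

def moe_forward(hidden_states, shared_expert, gate, experts):
    num_tokens = len(hidden_states)  # M

    shared_out = None
    if num_tokens > 0:
        # Active rank: run the shared expert and the router as usual.
        shared_out = shared_expert(hidden_states)
        router_logits = gate(hidden_states)
    else:
        # Idle rank: skip shared_expert/gate entirely.
        router_logits = []

    # Every rank reaches this call; the alltoall dispatch/combine inside
    # handles local_num_tokens == 0 natively, so no dummy token is needed.
    out = experts(hidden_states, router_logits)

    if shared_out is not None:
        out = [o + s for o, s in zip(out, shared_out)]
    # In a2a mode the combine step already aggregates across ranks,
    # so no TP allreduce is applied here.
    return out
```

The key design point is that the collective call sits outside the M>0 branch, so active and idle ranks always make matching collective calls.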
YAMY1234 reviewed Apr 13, 2026
```python
# Unlike the old dummy-token approach, we pass 0-size tensors directly
# to the alltoall kernel, which handles local_num_tokens=0 natively
# (same as TRT-LLM). The kernel keeps 1 thread alive for sync.
self.has_dummy_token = x.shape[0] == 0
```
Maybe we could rename it, something like is_idle_rank for better clarity?
YAMY1234 reviewed Apr 13, 2026
Comment on lines 151 to 169
```python
self.dummy_x = torch.empty(
    (1, hidden_size),
    dtype=torch.bfloat16,
    device="cuda",
)
# -1 will be ignored by flashinfer cutlass moe
self.dummy_topk_ids = torch.full(
    (1, self.router_topk), -1, dtype=torch.int32, device="cuda"
)
# Hack for dispatch with dummy token - will route the dummy token to
# this rank so it doesn't require any transfer.
self.dummy_topk_ids_current_rank = torch.full(
    (1, self.router_topk),
    self.ep_rank * self.num_local_experts,
    dtype=torch.int32,
    device="cuda",
)
self.dummy_topk_weights = torch.zeros(
    (1, self.router_topk), dtype=torch.float32, device="cuda"
)
```
Will these still be needed in other places, since we are removing the usage of these variables in this file?
YAMY1234 reviewed Apr 13, 2026
```python
# cutedsl uses input_scale and non-interleaved x_sf.
# These may differ. For now pass through; if kernel crashes,
# need to de-interleave x_sf or disable NVFP4 dispatch.
x_fp4 = hidden_states
```
Optional: Are we able to verify this path? If it is not supported, could we consider forcibly disabling NVFP4_DISPATCH in server_args when cutedsl + flashinfer a2a is detected, and emit a warning?
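A sketch of that suggestion. Note this is illustrative only: `maybe_disable_nvfp4_dispatch` is a hypothetical helper, and the flag name `enable_nvfp4_dispatch` is assumed, not the actual `server_args` attribute.

```python
import logging

logger = logging.getLogger(__name__)

def maybe_disable_nvfp4_dispatch(args):
    """Force-disable NVFP4 dispatch for cutedsl + flashinfer a2a.

    `enable_nvfp4_dispatch` is a hypothetical flag name used for
    illustration; the real server_args attribute may differ.
    """
    if (
        getattr(args, "moe_runner_backend", None) == "flashinfer_cutedsl"
        and getattr(args, "moe_a2a_backend", None) == "flashinfer"
        and getattr(args, "enable_nvfp4_dispatch", False)
    ):
        logger.warning(
            "NVFP4 dispatch is unverified with flashinfer_cutedsl + "
            "flashinfer a2a; disabling it."
        )
        args.enable_nvfp4_dispatch = False
```

Running such a check during server-args validation would fail soft (warn and degrade) rather than crash in the kernel.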
Depends on flashinfer-ai/flashinfer#3021 for topk=10.
Summary

Enable the CuteDSL FP4 MoE runner (`--moe-runner-backend flashinfer_cutedsl`) to work with FlashInfer one-sided alltoall dispatch (`--moe-a2a-backend flashinfer`) for DP attention + EP configurations.

Changes

- `server_args.py`: Add `"flashinfer"` to cutedsl's `moe_a2a_backend` whitelist and `"flashinfer_cutedsl"` to flashinfer a2a's `moe_runner_backend` whitelist.
- `qwen2_moe.py`:
  - M=0 guard for idle DP ranks: skip `shared_expert` and `gate` (FP4 GEMM cannot handle empty tensors), but still call `self.experts()` to participate in the alltoall collective.
  - Skip TP allreduce in a2a mode: `tensor_model_parallel_all_reduce` causes an NCCL size mismatch between active and idle ranks.
  - shared_expert `tp_size=1`: in a2a mode, shared_expert must not be TP-sharded (same as deepep), because TP allreduce on the shared_expert output fails when idle DP ranks have 0 tokens.
- `flashinfer.py` (token dispatcher):
  - Remove the dummy-token mechanism; pass 0-size tensors directly to the alltoall kernel, which handles `local_num_tokens=0` natively.
  - Add `invalid_token_expert_id`: mark padding slots with an invalid expert ID so MoE kernels skip them.
- `flashinfer_cutedsl.py` (MoE runner):
  - Register the fused func for the `("flashinfer", "flashinfer_cutedsl")` dispatcher/runner pair.
  - Scale wrapper `max_num_tokens` by `ep_size` in a2a mode, since the wrapper receives `ep_size * runtime_max` tokens after alltoall dispatch.

Reproduce
Hardware: B200 × 4
Baseline (cutedsl + DP, no a2a)
This PR (cutedsl + DP + flashinfer a2a)
GSM8K eval
GPQA eval
Accuracy
Accuracy identical between baseline and a2a.
Test plan
- `test/registered/moe/test_cutedsl_a2a.py`: GSM8K > 90%
- `The capital of France is` → `Paris.`
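The `invalid_token_expert_id` padding sanitization from the dispatcher changes can be sketched like this. This is a plain-Python illustration with an assumed sentinel value; the real dispatcher works on CUDA tensors and the actual invalid-id value may differ.

```python
# Sketch of padding sanitization: after alltoall dispatch the receive
# buffer is padded up to a fixed row count; rows past the valid count
# get an invalid expert id so MoE kernels skip them.

INVALID_TOKEN_EXPERT_ID = -1  # illustrative sentinel

def sanitize_padding(topk_ids, num_valid_tokens):
    """topk_ids: list of rows, one list of expert ids per token slot."""
    for row in range(num_valid_tokens, len(topk_ids)):
        topk_ids[row] = [INVALID_TOKEN_EXPERT_ID] * len(topk_ids[row])
    return topk_ids
```

Sanitizing ids rather than shrinking the buffer keeps the dispatch layout fixed-size, which is what lets the kernel run without per-rank shape negotiation.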