Motivation
Based on the latest MoE developments in both trtllm and sglang, here is a proposal for a faster wide-EP all2all + MoE design: one-sided NVLink all2all + trtllm CuteDSLMoE + FC2/combine overlap. Since a version of the CuTe DSL Blackwell masked GEMM kernel that works best with DeepEP has already been integrated into SGLang, I'll call that one CuteDSLMoE_v1 and call the trtllm CuteDSLMoE CuteDSLMoE_v2.
One-sided NVLink all2all (quantization-agnostic)
One-sided NVLink all2all, developed recently by trtllm, eliminates both the intermediate copies and the extra communication step of the trtllm two-sided all2all design, as well as the duplicated tokens of the DeepEP low-latency design, making it the most efficient design from an algorithmic perspective. Perf data suggests it is superior to two-sided in almost all regimes and only 1-2% slower for very small batch sizes. It requires NVLink connectivity between all the MoE ranks. An optional feature in the combine stage applies FP8 static per-tensor quantization to reduce communication volume.
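A minimal numpy sketch of the one-sided dispatch semantics (not the trtllm kernel itself; buffer names and sizes here are illustrative assumptions): each sender writes its routed tokens directly into the destination rank's NVLink-visible receive buffer at a reserved slot, so no receiver-side staging copy or second communication step is needed.

```python
import numpy as np

WORLD, TOKENS_PER_RANK, HIDDEN = 4, 3, 8
rng = np.random.default_rng(0)

# Each rank holds a few tokens, each routed to one destination rank.
tokens = [rng.standard_normal((TOKENS_PER_RANK, HIDDEN)) for _ in range(WORLD)]
dest = [rng.integers(0, WORLD, size=TOKENS_PER_RANK) for _ in range(WORLD)]

# recv models each rank's symmetric (NVLink-mapped) receive buffer.
recv = [np.zeros((WORLD * TOKENS_PER_RANK, HIDDEN)) for _ in range(WORLD)]
fill = [0] * WORLD  # write cursor per destination (an atomic counter in practice)

for src in range(WORLD):
    for i, d in enumerate(dest[src]):
        # One-sided put: the sender writes straight into the remote buffer.
        slot = fill[d]
        fill[d] += 1
        recv[d][slot] = tokens[src][i]
```

The key property this models is that the only data movement is the single remote write per token; a two-sided design would add a local pack, a send/recv pair, and an unpack on the receiver.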
trtllm NVFP4 CuteDSLMoE
One-sided NVLink all2all works best with the trtllm CuteDSLMoE kernels because the token-major data layout that the all2all kernels produce can be consumed directly by the trtllm CuteDSLMoE kernels without extra permutations. The trtllm CuteDSLMoE kernels are also the best-optimized option for all scenarios except low latency, where flashinfer_trtllm is still the best-performing one; the reason is that trtllm CuteDSLMoE does not implement the swap-AB algorithm necessary for the best low-latency perf.
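To illustrate why no permutation kernel is needed, here is a hedged numpy sketch of a token-major activation buffer being consumed by a masked grouped GEMM: received tokens arrive already grouped by local expert, so each expert's GEMM just indexes its contiguous row range (the shapes and the simple loop are illustrative assumptions, not the actual kernel).

```python
import numpy as np

E, HIDDEN, OUT = 3, 4, 5
counts = np.array([2, 0, 3])  # tokens per local expert; expert 1 is masked (empty)
offsets = np.concatenate([[0], np.cumsum(counts)])

# Token-major buffer as produced by dispatch: rows already grouped by expert.
acts = np.random.default_rng(1).standard_normal((counts.sum(), HIDDEN))
weights = np.random.default_rng(2).standard_normal((E, HIDDEN, OUT))

out = np.empty((counts.sum(), OUT))
for e in range(E):
    lo, hi = offsets[e], offsets[e + 1]
    if lo == hi:
        continue  # masked-out expert: its tile is skipped entirely
    out[lo:hi] = acts[lo:hi] @ weights[e]
```

With a different (e.g. duplicated or scattered) layout, a gather/permute pass would have to run between the all2all and the grouped GEMM; token-major output makes that pass unnecessary.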
Kernel level DownGemm(FC2)/Combine overlap
Kernel-level FC2/combine overlap can be implemented by enabling the GEMM epilogue to perform a weighted scatter-reduce (combine) directly into multi-rank output buffers using cp.reduce.async.bulk PTX instructions (bf16 and f32 variants), as proposed by @nvcastet. This is currently implemented for DeepEP + CuteDSLMoE_v1, but the idea can be applied to CuteDSLMoE_v2 as well for a similar speedup.
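A numpy model of what the fused epilogue computes (the row-to-token mapping and gate weights below are made up for illustration): instead of writing FC2 output to a scratch buffer and launching a separate combine kernel, each expert's output tile is scaled by its routing weight and reduce-accumulated into the destination token rows, which is the operation cp.reduce.async.bulk would perform against remote buffers.

```python
import numpy as np

T, HIDDEN = 4, 6
rng = np.random.default_rng(3)

# Two experts, each having processed two rows; per expert we track which
# original token each row belongs to, and that token's routing weight.
expert_rows = {0: (np.array([0, 2]), np.array([0.7, 0.4])),
               1: (np.array([1, 2]), np.array([1.0, 0.6]))}
fc2_out = {e: rng.standard_normal((2, HIDDEN)) for e in expert_rows}

combined = np.zeros((T, HIDDEN))  # models the (possibly remote) output buffer
for e, (tok_ids, gate_w) in expert_rows.items():
    # Weighted scatter-reduce done in the GEMM epilogue: scale, then atomically
    # accumulate into each destination token's row.
    np.add.at(combined, tok_ids, gate_w[:, None] * fc2_out[e])
```

Fusing this into the epilogue overlaps the combine traffic with the tail of the FC2 GEMM rather than serializing the two stages.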
Goal:
With one-sided NVLink all2all and trtllm CuteDSLMoE properly integrated, SGLang DSR1 NVFP4 should match trtllm decode-side wide-EP perf in the MoE stage on GB200/GB300. Kernel-level FC2/combine overlap should enable further speedup.
Relevant PRs:
#14668
#21339
#21877
Follow-ups
- Make sure EPLB (static and dynamic) works well with the new design
Related resources
No response