[Feature] Large-EP MoE Redesign #22829

@hlu1

Motivation

Based on the latest MoE developments in both trtllm and sglang, here is a proposal for a faster wide-ep all2all + MoE design: one-sided nvlink all2all + trtllm CuteDSLMoE + FC2/Combine overlap. SGLang already integrates a CuTe DSL Blackwell masked GEMM kernel that works best with DeepEP, so I'll call that one CuteDSLMoE_v1 and call the trtllm CuteDSLMoE CuteDSLMoE_v2.

One-sided nvlink all2all (quantization-agnostic)

One-sided nvlink all2all, developed recently by trtllm, eliminates the intermediate copies and the extra communication step of the trtllm two-sided all2all design, as well as the duplicated tokens of the DeepEP low-latency design, which makes it the most efficient design from an algorithmic perspective. Perf data suggests that it's superior to two-sided in almost all regions and only 1-2% slower for very small batch sizes. It requires nvlink between all the MoE ranks. The combine stage has an optional feature that uses FP8 static per-tensor quantization to reduce communication volume.
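To make the "no duplicated tokens" point concrete, here is a minimal pure-Python model of a one-sided dispatch. It is only a sketch of the idea, not the trtllm kernel: the function name `one_sided_dispatch`, the contiguous expert-to-rank mapping, and the tuple layout of the receive buffer are all assumptions made for illustration. Each source rank writes a token once per destination rank, directly into that rank's receive buffer, even when several of the token's top-k experts live on the same rank.

```python
def one_sided_dispatch(tokens, topk_experts, num_ranks, experts_per_rank):
    """Conceptual model of a one-sided dispatch (hypothetical helper).

    tokens[src_rank][i]        -> payload of token i on src_rank
    topk_experts[src_rank][i]  -> list of global expert ids chosen for that token
    Assumes experts are mapped to ranks contiguously: rank = expert_id // experts_per_rank.
    """
    # Stand-in for peer memory that every rank can write to over nvlink.
    recv_buffers = [[] for _ in range(num_ranks)]
    for src_rank, rank_tokens in enumerate(tokens):
        for tok_idx, payload in enumerate(rank_tokens):
            experts = topk_experts[src_rank][tok_idx]
            # One write per destination rank, not one per (token, expert) pair.
            dest_ranks = sorted({e // experts_per_rank for e in experts})
            for dst in dest_ranks:
                local_experts = [e % experts_per_rank
                                 for e in experts
                                 if e // experts_per_rank == dst]
                # The "remote write": no staging copy on the sender or receiver.
                recv_buffers[dst].append((src_rank, tok_idx, payload, local_experts))
    return recv_buffers


# Two ranks, two experts per rank (experts 0-1 on rank 0, experts 2-3 on rank 1).
tokens = [["a", "b"], ["c"]]
topk_experts = [[[0, 1], [1, 2]], [[3, 0]]]
buffers = one_sided_dispatch(tokens, topk_experts, num_ranks=2, experts_per_rank=2)
copies_sent = sum(len(b) for b in buffers)
```

In this toy input there are six (token, expert) pairs but only five token copies are written, because token "a" goes to two experts on the same rank and is sent once.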

trtllm NVFP4 CuteDSLMoE

One-sided nvlink all2all works best with the trtllm CuteDSLMoE kernels because the token-major data layout that the all2all kernels produce can be consumed directly by the trtllm CuteDSLMoE kernels without extra permutations. The trtllm CuteDSLMoE kernels are also the best optimized for all scenarios except low-latency, where flashinfer_trtllm is still the best performing one. The reason is that trtllm CuteDSLMoE has not implemented the swap-AB algorithm that is necessary for the best low-latency perf.

Kernel level DownGemm(FC2)/Combine overlap

Kernel level FC2/Combine overlap can be implemented by enabling the GEMM epilogue to perform the weighted scatter-reduce (combine) directly into multi-rank output buffers using cp.reduce.async.bulk PTX instructions (bf16 and f32 variants), as proposed by @nvcastet. This is currently implemented in DeepEP + CuteDSLMoE_v1, but the idea can be applied to CuteDSLMoE_v2 as well and should give a similar speedup.
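As a reference for what the fused epilogue has to compute, the weighted scatter-reduce can be written in a few lines of plain Python. This is a sketch of the semantics only, under assumed names (`fc2_combine_reference`, the `row_meta` tuple layout); in the real kernel the accumulation would be issued from the GEMM epilogue into remote buffers rather than run as a separate loop.

```python
def fc2_combine_reference(fc2_rows, row_meta, num_ranks, tokens_per_rank, hidden):
    """Reference semantics of the weighted scatter-reduce (hypothetical helper).

    fc2_rows[r]  -> FC2 output row r (list of `hidden` floats)
    row_meta[r]  -> (src_rank, tok_idx, weight): where the row's token came from
                    and the router weight for this (token, expert) pair
    Returns out[rank][token][h], the combined per-token output on each rank.
    """
    out = [[[0.0] * hidden for _ in range(tokens_per_rank)]
           for _ in range(num_ranks)]
    for row, (src_rank, tok_idx, weight) in zip(fc2_rows, row_meta):
        dst = out[src_rank][tok_idx]
        for h in range(hidden):
            # In the fused kernel this add would be a reduce into the owning
            # rank's output buffer, issued as soon as the FC2 tile is ready.
            dst[h] += weight * row[h]
    return out


# Token 0 of rank 0 was routed to two experts (weights 0.5 each);
# token 0 of rank 1 was routed to one expert (weight 1.0).
fc2_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
row_meta = [(0, 0, 0.5), (0, 0, 0.5), (1, 0, 1.0)]
combined = fc2_combine_reference(fc2_rows, row_meta,
                                 num_ranks=2, tokens_per_rank=1, hidden=2)
```

Because each FC2 row is reduced into its destination as soon as it is produced, there is no separate combine pass to wait for, which is where the overlap comes from.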

Goal:
With one-sided nvlink all2all and trtllm CuteDSLMoE properly integrated, SGLang DSR1 nvfp4 should match trtllm decode-side wide-ep perf in the MoE stage on GB200/GB300. Kernel level FC2/Combine overlap should enable further speedup.

Relevant PRs:
#14668
#21339
#21877

Follow-ups

  • Make sure EPLB (static and dynamic) works well with the new design

Related resources

No response
