Summary
The dynamic shapes in MoE (for example, the host needs HIP/CUDA kernel results in order to allocate device memory) lead to D2H synchronization and significant CPU overhead. This increases device idle time and imposes a considerable performance penalty on training, which becomes particularly noticeable when context parallelism is enabled.
We have eliminated CPU synchronization throughout the MoE pipeline, from the Router to the Dispatcher, Permutation, and GroupMLP, reducing device idle time.
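To make the problem concrete, here is a minimal, hypothetical PyTorch sketch (not Primus code) of the pattern that triggers the D2H synchronization: the host needs a GPU-computed token count before it can allocate the next buffer.

```python
import torch

# Hypothetical illustration of the sync pattern (requires a CUDA/ROCm device).
# tokens_per_expert is produced on the GPU by the router/dispatch kernels.
tokens_per_expert = torch.randint(0, 64, (256,), device="cuda")

# Sizing the next buffer needs a host-side integer, so .item() blocks the CPU
# until all queued GPU work finishes -- this is the D2H sync described above.
total_tokens = int(tokens_per_expert.sum().item())
permuted = torch.empty(total_tokens, 7168, device="cuda", dtype=torch.bfloat16)
```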
The Sync-Free MoE implementation involves multiple parameters and options, each with trade-offs. Optimal performance requires careful tuning for the specific training scenario, which makes it complex to use. This document proposes configurable options that expose multiple levels of Sync-Free MoE to Primus users, allowing them to select the appropriate level for their actual needs.
We hope that this design will:
- Provide simple and clear options with comprehensive documentation.
- Provide reasonable multi-level Sync-Free MoE support that covers most Primus users' needs.
Sync-Free MoE Workflow
As the workflow diagram above shows, almost all components of Megatron MoE require CPU synchronization:
- Router
- Dispatcher
- Permutation
- GroupMLP
TODO
Proposed Options for Primus-Megatron
We provide Primus users with the `turbo_sync_free_moe_stage` option, divided into 4 levels (0-3); a sketch of how each level maps to the underlying options appears after this list:
- 0: Disable Sync-Free MoE (default)
- 1: Remove synchronization for Router and Permutation
- 2: Remove synchronization for Router, DeepEP, and GroupMLP
- 3: Remove all MoE synchronization
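The sketch below is a hypothetical illustration of how the stage level could gate the individual options described in the next section; the actual wiring inside Primus may differ.

```python
# Hypothetical mapping from turbo_sync_free_moe_stage to the individual
# options described in the next section; not the real Primus implementation.
def resolve_sync_free_options(stage: int) -> dict:
    return {
        # stage >= 1: sync-free Router and Permutation
        "fused_group_topk_routing_with_aux_score": stage >= 1,
        "moe_permutation_fusion": stage >= 1,
        # stage >= 2: sync-free DeepEP dispatch and GroupMLP
        "use_cuda_num_token_per_expert": stage >= 2,
        "num_worst_token": stage >= 2,   # in reality a token-count bound, not a bool
        "use_turbo_groupmlp": stage >= 2,
        # stage >= 3: sync-free activation as well
        "use_turbo_groupmlp_act": stage >= 3,
    }

print(resolve_sync_free_options(2))
```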
The Sync-Free-related function parameters and options in Primus-Megatron MoE are as follows (a short data-flow sketch follows this list):
- Router
  - `fused_group_topk_routing_with_aux_score`: Primus-Megatron option to enable router fusion.
- DeepEP
  - `use_cuda_num_token_per_expert`: Turbo-DeepEP dispatch parameter to return `num_tokens_per_experts` as a CUDA tensor.
  - `num_worst_token`: Turbo-DeepEP dispatch parameter to eliminate the notify-dispatch CPU busy-wait.
- Permutation
  - `moe_permutation_fusion`: Primus-Megatron option to enable permutation fusion.
- GroupMLP
  - `use_turbo_groupmlp`: Primus-Megatron option to use Turbo's GroupMLP, which accepts a CUDA `num_token_per_experts` tensor as a parameter.
  - `use_turbo_groupmlp_act`: Primus-Megatron option to use Turbo's activation, which accepts a CUDA `num_token_per_experts` tensor as a parameter.
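As a concrete illustration of the stage 2-3 data flow, here is a hedged sketch in plain PyTorch (not the Turbo-DeepEP or Turbo GroupMLP APIs) contrasting a dynamically sized buffer, which forces a host sync, with a buffer sized by the `num_worst_token` upper bound while the per-expert counts stay on the GPU.

```python
import torch

# Illustrative only (requires a CUDA/ROCm device); sizes are kept small.
num_tokens, hidden, num_experts, topk = 4096, 1024, 64, 8
logits = torch.randn(num_tokens, num_experts, device="cuda")
probs, indices = torch.topk(torch.softmax(logits, dim=-1), k=topk, dim=-1)
num_tokens_per_expert = torch.bincount(indices.flatten(), minlength=num_experts)

# Synchronizing path: the buffer size depends on a GPU value, so the host blocks.
n = int(num_tokens_per_expert.sum().item())                       # D2H sync
buf_dynamic = torch.empty(n, hidden, device="cuda", dtype=torch.bfloat16)

# Sync-free path: size the buffer with the worst-case bound (num_worst_token)
# and keep num_tokens_per_expert on the GPU so GroupMLP can consume it directly
# (use_cuda_num_token_per_expert); no host copy is needed.
num_worst_token = num_tokens * topk                                # static upper bound
buf_worst_case = torch.empty(num_worst_token, hidden, device="cuda",
                             dtype=torch.bfloat16)
```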
| syncfree_moe_stage | Involved options or parameters | CPU Sync | Single Rank Memory Usage |
|---|---|---|---|
| 0 | / | / | / |
| 1 | `fused_group_topk_routing_with_aux_score`, `moe_permutation_fusion` | Host sync at both DeepEP and GroupMLP | / |
| 2 | `fused_group_topk_routing_with_aux_score`, `use_cuda_num_token_per_expert`, `num_worst_token`, `moe_permutation_fusion`, `use_turbo_groupmlp` | Host sync once before permutation; the CPU bottleneck is mainly the CPU overhead of permutation and GroupMLP | T * P (temporary) |
| 3 | `fused_group_topk_routing_with_aux_score`, `use_cuda_num_token_per_expert`, `num_worst_token`, `moe_permutation_fusion`, `use_turbo_groupmlp`, `use_turbo_groupmlp_act` | No | 3 * K * T * P (activation) |
- Assuming T = token bytes = batch_size * seqlen * hidden_size * element_size
- Assuming P = EP world size
- Assuming K = topk
Note:
Taking DeepSeek-V3 with EP16 as an example, each rank at stage 2 only incurs 4096*7168*16*sizeof(bfloat16) = 896 MB of temporary memory overhead. At stage 3, however, the per-rank activation overhead is 4096*7168*16*sizeof(bfloat16) * 8 * 61 * 3 = 1281 GB (with topk = 8 across 61 layers).
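For reference, a quick recomputation of these figures from the definitions above (assumed values: batch_size * seqlen = 4096, hidden_size = 7168, bf16, EP = 16, topk = 8, 61 layers):

```python
# Recompute the memory figures above from the stated assumptions.
tokens, hidden, elem = 4096, 7168, 2        # batch_size * seqlen, hidden_size, bf16 bytes
P, K, layers = 16, 8, 61                    # EP world size, topk, number of layers

T = tokens * hidden * elem                  # token bytes
stage2 = T * P                              # temporary buffer, freed after use
stage3 = 3 * K * T * P * layers             # activations kept for backward, all layers

print(f"stage 2: {stage2 / 2**20:.0f} MB")  # -> 896 MB
print(f"stage 3: {stage3 / 2**30:.0f} GB")  # -> 1281 GB
```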