
RFC: Primus-Megatron SyncFree MoE #203

@zhenhuang12

Description


Summary

The dynamic shapes in MoE (for example, HIP/CUDA kernel results whose sizes must be known before device memory can be allocated) can lead to device-to-host (D2H) synchronization and significant CPU overhead. This increases device idle time and has a considerable performance impact on training, which becomes particularly noticeable when context parallelism is enabled.
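As a minimal illustration (generic PyTorch, not Primus/Megatron code), any operation whose output size is data-dependent forces the host to read a device-side value before it can allocate buffers, which is where the D2H synchronization comes from:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Data-dependent expert assignment: the number of selected tokens is unknown on the host.
routing_mask = torch.rand(4096, 8, device=device) > 0.75

# The host must know the count before it can size the output buffer, so reading it
# blocks the CPU until the device finishes (implicit D2H synchronization).
num_selected = int(routing_mask.sum())
buffer = torch.empty(num_selected, 7168, device=device)
```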

We have eliminated all CPU synchronization throughout the MoE pipeline, from the Router to the Dispatcher, Permutation, and GroupMLP, reducing this idle time.

The Sync-Free MoE implementation involves multiple parameters and options, each with trade-offs. Optimal performance requires careful tuning for specific training scenarios, which makes it complex to use. This document proposes configurable, multi-level Sync-Free MoE options for Primus users, allowing them to select the appropriate level based on their actual needs.

We hope that this design will:

  • Provide simple and clear options with comprehensive documentation.
  • Provide reasonable multi-level Sync-Free MoE support that covers most Primus users' needs.

SyncFree-MoE Workflow

(Diagram: SyncFree-MoE workflow, illustrating the CPU synchronization points in each MoE component.)

As the diagram above shows, almost every component of Megatron MoE requires CPU synchronization:

  • Router
  • Dispatcher
  • Permutation
  • GroupMLP
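Until the diagram is finalized, the sketch below (illustrative only; the function and variable names are assumptions, not Megatron internals) marks where each of these components typically blocks on the host:

```python
import torch

def moe_block_sync_points(hidden, router_w, num_experts=8, topk=2):
    """Illustrative walkthrough of the host-sync points in a vanilla MoE block."""
    # Router: top-k routing runs on device, but unfused aux-loss/capacity
    # bookkeeping often reads scalars back on the host.
    probs = torch.softmax(hidden @ router_w, dim=-1)
    topk_p, topk_idx = probs.topk(topk, dim=-1)

    # Dispatcher: the all-to-all needs send/recv counts on the host unless the
    # dispatcher can consume them as a CUDA tensor.
    tokens_per_expert = torch.bincount(topk_idx.flatten(), minlength=num_experts)
    send_counts = tokens_per_expert.tolist()  # <-- host sync (D2H copy + wait)

    # Permutation: index construction and split sizes are derived from the
    # host-side counts (another sync when not fused).
    # GroupMLP: grouped GEMM is traditionally launched with host-side group sizes.
    return topk_p, topk_idx, send_counts
```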

TODO

Proposed Options for Primus-Megatron

We provide Primus users with a turbo_sync_free_moe_stage option, divided into four levels (0-3):

  • 0: Disable the Sync-Free MoE (default)
  • 1: Remove synchronization for Router and Permutation
  • 2: Remove synchronization for Router, DeepEP, and GroupMLP
  • 3: Remove all MoE synchronization

The Sync-Free-related function parameters and options in Primus-Megatron MoE are as follows:

  • Router:

    • fused_group_topk_routing_with_aux_score: Primus-Megatron option to enable router fusion.
  • DeepEP:

    • use_cuda_num_token_per_expert: Turbo-DeepEP dispatch parameter to return num_tokens_per_experts as a CUDA tensor.
    • num_worst_token: Turbo-DeepEP dispatch parameter to eliminate the notify-dispatch CPU busy-wait.
  • Permutation:

    • moe_permutation_fusion: Primus-Megatron option to enable permutation fusion.
  • GroupMLP:

    • use_turbo_groupmlp: Primus-Megatron option to use Turbo's GroupMLP, which accepts the CUDA num_token_per_experts tensor as a parameter.
    • use_turbo_groupmlp_act: Primus-Megatron option to use Turbo's activation, which accepts the CUDA num_token_per_experts tensor as a parameter.
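As a concrete example of the GroupMLP-related options, the sketch below (hypothetical names, not Turbo's actual API) shows why a GroupMLP that accepts the CUDA num_token_per_experts tensor directly avoids a host synchronization:

```python
import torch

def grouped_mlp_host_sizes(x, expert_weights, tokens_per_expert_gpu):
    """Conventional path: group sizes must live on the host, forcing a D2H sync."""
    sizes = tokens_per_expert_gpu.tolist()  # <-- synchronization point
    outputs, start = [], 0
    for w, n in zip(expert_weights, sizes):
        outputs.append(x[start:start + n] @ w)
        start += n
    return torch.cat(outputs)

# A sync-free GroupMLP instead passes tokens_per_expert_gpu to a grouped GEMM
# kernel that reads the group sizes on the device, so the CPU never blocks.
```

The table below summarizes the options involved, the remaining CPU synchronization, and the per-rank memory usage at each stage.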
| turbo_sync_free_moe_stage | Involved options or parameters | CPU sync | Single-rank memory usage |
| --- | --- | --- | --- |
| 0 | / | / | / |
| 1 | fused_group_topk_routing_with_aux_score, moe_permutation_fusion | Host sync at both DeepEP and GroupMLP | / |
| 2 | fused_group_topk_routing_with_aux_score, use_cuda_num_token_per_expert, num_worst_token, moe_permutation_fusion, use_turbo_groupmlp | Host sync once before permutation; the CPU bottleneck is mainly the CPU overhead of permutation and GroupMLP | T * P (temporary) |
| 3 | fused_group_topk_routing_with_aux_score, use_cuda_num_token_per_expert, num_worst_token, moe_permutation_fusion, use_turbo_groupmlp, use_turbo_groupmlp_act | No | 3 * K * T * P (activation) |
  • T = token bytes = batch_size * seqlen * hidden_size * element_size
  • P = expert-parallel (EP) world size
  • K = topk
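Mirroring the table above, a minimal sketch (the helper and dictionary names are hypothetical) of how a single turbo_sync_free_moe_stage value could expand into the individual options and parameters:

```python
# Hypothetical mapping; it simply restates the table above in code form.
SYNC_FREE_MOE_STAGE_FLAGS = {
    0: set(),
    1: {"fused_group_topk_routing_with_aux_score", "moe_permutation_fusion"},
    2: {"fused_group_topk_routing_with_aux_score", "moe_permutation_fusion",
        "use_cuda_num_token_per_expert", "num_worst_token", "use_turbo_groupmlp"},
    3: {"fused_group_topk_routing_with_aux_score", "moe_permutation_fusion",
        "use_cuda_num_token_per_expert", "num_worst_token", "use_turbo_groupmlp",
        "use_turbo_groupmlp_act"},
}

def expand_sync_free_moe_stage(stage: int) -> set:
    """Return the set of options/parameters implied by a stage level (0-3)."""
    return SYNC_FREE_MOE_STAGE_FLAGS[stage]
```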

Note:
Taking DeepSeek-V3 with EP16 as an example, each rank at stage 2 has only 4096 * 7168 * 16 * sizeof(bfloat16) = 896 MB of memory overhead. However, each rank at stage 3 requires an activation overhead of 4096 * 7168 * 16 * sizeof(bfloat16) * 8 * 61 * 3 ≈ 1281 GB.
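A short worked calculation reproducing the note's numbers; interpreting 8 as top-k (K) and 61 as the number of MoE layers is an assumption on our part:

```python
# DeepSeek-V3-like configuration with EP = 16, per the note above.
tokens  = 4096    # batch_size * seqlen per rank
hidden  = 7168    # hidden_size
ep      = 16      # expert-parallel world size P
elem    = 2       # sizeof(bfloat16)
topk    = 8       # K (assumed)
layers  = 61      # number of MoE layers (assumed)

T = tokens * hidden * elem                 # token bytes
stage2 = T * ep                            # temporary buffer (T * P)
stage3 = 3 * topk * T * ep * layers        # activations (3 * K * T * P per layer)

print(f"stage 2: {stage2 / 2**20:.0f} MB per rank")   # -> 896 MB
print(f"stage 3: {stage3 / 2**30:.0f} GB per rank")   # -> 1281 GB
```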
