Summary
The dynamic shapes in MoE (for example, the host needs HIP/CUDA kernel results in order to allocate device memory) lead to D2H synchronization and significant CPU overhead. This increases device idle time and imposes a considerable performance penalty on training, which becomes particularly noticeable when context parallelism is enabled.
We have eliminated CPU synchronization throughout the MoE pipeline, from the Router to the Dispatcher, Permutation, and GroupMLP, reducing device idle time.
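To make the problem concrete, here is a minimal, hypothetical PyTorch sketch (not Primus code) of the pattern that triggers the D2H synchronization: the host needs a GPU-computed token count before it can allocate the next buffer.

```python
import torch

# Hypothetical illustration of the sync pattern (requires a CUDA/ROCm device).
# tokens_per_expert is produced on the GPU by the router/dispatch kernels.
tokens_per_expert = torch.randint(0, 64, (256,), device="cuda")

# Sizing the next buffer needs a host-side integer, so .item() blocks the CPU
# until all queued GPU work finishes -- this is the D2H sync described above.
total_tokens = int(tokens_per_expert.sum().item())
permuted = torch.empty(total_tokens, 7168, device="cuda", dtype=torch.bfloat16)
```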
The Sync-Free MoE implementation involves multiple parameters and options, each with trade-offs. Optimal performance requires careful tuning for the specific training scenario, which makes it complex to use. This document proposes configurable options that expose multiple levels of Sync-Free MoE to Primus users, allowing them to select the appropriate level for their actual needs.
We hope that this design will:
- Provide simple and clear options with comprehensive documentation.
- Provide reasonable multi-level Sync-Free MoE support that covers most Primus users' needs.
Sync-Free MoE Workflow
As the workflow diagram above shows, almost all components of Megatron MoE require CPU synchronization:
- Router
- Dispatcher
- Permutation
- GroupMLP
TODO
Proposed Options for Primus-Megatron
We provide Primus users with the `turbo_sync_free_moe_stage` option, divided into 4 levels (0-3); a sketch of how each level maps to the underlying options appears after this list:
- 0: Disable Sync-Free MoE (default)
- 1: Remove synchronization for Router and Permutation
- 2: Remove synchronization for Router, DeepEP, and GroupMLP
- 3: Remove all MoE synchronization
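The sketch below is a hypothetical illustration of how the stage level could gate the individual options described in the next section; the actual wiring inside Primus may differ.

```python
# Hypothetical mapping from turbo_sync_free_moe_stage to the individual
# options described in the next section; not the real Primus implementation.
def resolve_sync_free_options(stage: int) -> dict:
    return {
        # stage >= 1: sync-free Router and Permutation
        "fused_group_topk_routing_with_aux_score": stage >= 1,
        "moe_permutation_fusion": stage >= 1,
        # stage >= 2: sync-free DeepEP dispatch and GroupMLP
        "use_cuda_num_token_per_expert": stage >= 2,
        "num_worst_token": stage >= 2,   # in reality a token-count bound, not a bool
        "use_turbo_groupmlp": stage >= 2,
        # stage >= 3: sync-free activation as well
        "use_turbo_groupmlp_act": stage >= 3,
    }

print(resolve_sync_free_options(2))
```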
The Sync-Free-related function parameters and options in Primus-Megatron MoE are as follows (a short data-flow sketch follows this list):
- Router
  - `fused_group_topk_routing_with_aux_score`: Primus-Megatron option to enable router fusion.
- DeepEP
  - `use_cuda_num_token_per_expert`: Turbo-DeepEP dispatch parameter to return `num_tokens_per_experts` as a CUDA tensor.
  - `num_worst_token`: Turbo-DeepEP dispatch parameter to eliminate the notify-dispatch CPU busy-wait.
- Permutation
  - `moe_permutation_fusion`: Primus-Megatron option to enable permutation fusion.
- GroupMLP
  - `use_turbo_groupmlp`: Primus-Megatron option to use Turbo's GroupMLP, which accepts a CUDA `num_token_per_experts` tensor as a parameter.
  - `use_turbo_groupmlp_act`: Primus-Megatron option to use Turbo's activation, which accepts a CUDA `num_token_per_experts` tensor as a parameter.
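As a concrete illustration of the stage 2-3 data flow, here is a hedged sketch in plain PyTorch (not the Turbo-DeepEP or Turbo GroupMLP APIs) contrasting a dynamically sized buffer, which forces a host sync, with a buffer sized by the `num_worst_token` upper bound while the per-expert counts stay on the GPU.

```python
import torch

# Illustrative only (requires a CUDA/ROCm device); sizes are kept small.
num_tokens, hidden, num_experts, topk = 4096, 1024, 64, 8
logits = torch.randn(num_tokens, num_experts, device="cuda")
probs, indices = torch.topk(torch.softmax(logits, dim=-1), k=topk, dim=-1)
num_tokens_per_expert = torch.bincount(indices.flatten(), minlength=num_experts)

# Synchronizing path: the buffer size depends on a GPU value, so the host blocks.
n = int(num_tokens_per_expert.sum().item())                       # D2H sync
buf_dynamic = torch.empty(n, hidden, device="cuda", dtype=torch.bfloat16)

# Sync-free path: size the buffer with the worst-case bound (num_worst_token)
# and keep num_tokens_per_expert on the GPU so GroupMLP can consume it directly
# (use_cuda_num_token_per_expert); no host copy is needed.
num_worst_token = num_tokens * topk                                # static upper bound
buf_worst_case = torch.empty(num_worst_token, hidden, device="cuda",
                             dtype=torch.bfloat16)
```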
| syncfree_moe_stage | Involved options or parameters | CPU Sync | Single Rank Memory Usage |
|---|---|---|---|
| 0 | / | / | / |
| 1 | `fused_group_topk_routing_with_aux_score`, `moe_permutation_fusion` | Host sync at both DeepEP and GroupMLP | / |
| 2 | `fused_group_topk_routing_with_aux_score`, `use_cuda_num_token_per_expert`, `num_worst_token`, `moe_permutation_fusion`, `use_turbo_groupmlp` | Host sync once before permutation; the CPU bottleneck is mainly the CPU overhead of permutation and GroupMLP | T * P (temporary) |
| 3 | `fused_group_topk_routing_with_aux_score`, `use_cuda_num_token_per_expert`, `num_worst_token`, `moe_permutation_fusion`, `use_turbo_groupmlp`, `use_turbo_groupmlp_act` | No | 3 * K * T * P (activation) |
- Assuming T = token bytes = batch_size * seqlen * hidden_size * element_size
- Assuming P = EP world size
- Assuming K = topk
Note:
Taking DeepSeek-V3 with EP16 as an example, each rank at stage 2 only incurs 4096*7168*16*sizeof(bfloat16) = 896 MB of temporary memory overhead. At stage 3, however, the per-rank activation overhead is 4096*7168*16*sizeof(bfloat16) * 8 * 61 * 3 = 1281 GB (with topk = 8 across 61 layers).
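For reference, a quick recomputation of these figures from the definitions above (assumed values: batch_size * seqlen = 4096, hidden_size = 7168, bf16, EP = 16, topk = 8, 61 layers):

```python
# Recompute the memory figures above from the stated assumptions.
tokens, hidden, elem = 4096, 7168, 2        # batch_size * seqlen, hidden_size, bf16 bytes
P, K, layers = 16, 8, 61                    # EP world size, topk, number of layers

T = tokens * hidden * elem                  # token bytes
stage2 = T * P                              # temporary buffer, freed after use
stage3 = 3 * K * T * P * layers             # activations kept for backward, all layers

print(f"stage 2: {stage2 / 2**20:.0f} MB")  # -> 896 MB
print(f"stage 3: {stage3 / 2**30:.0f} GB")  # -> 1281 GB
```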