[Feature]Use DispatchGmmCombineDecode operator to replace MC2(Optional) #5040
Code Review
This pull request introduces the DispatchGmmCombineDecode operator as a new fused MoE communication method (FUSED_MC2) to replace the existing MC2 implementation for w8a8_dynamic quantization on Ascend hardware. The changes include adding the necessary enum, environment variable, and implementation logic. My review focuses on improving code clarity and fixing a potential bug related to handling shared experts. I've identified a misleading docstring, hardcoded values that should be parameterized, and a complex conditional that could be simplified for better maintainability.
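For orientation, here is a minimal sketch of the two additions the summary mentions: the new MoECommType member and the opt-in environment flag. The member names come from this PR's diff, but the value strings and module layout below are assumptions rather than the actual vllm-ascend source.

```python
import enum
import os


class MoECommType(enum.Enum):
    # Existing communication methods referenced in the diff below.
    ALLTOALL = "alltoall"
    FUSED_ALLTOALL = "fused_alltoall"
    MC2 = "mc2"
    # New member introduced by this PR for the fused DispatchGmmCombineDecode path.
    FUSED_MC2 = "fused_mc2"


# Mirrors how envs_ascend.VLLM_ASCEND_ENABLE_FUSED_MC2 is consumed in the selection
# logic below: a boolean read from the environment, disabled by default.
VLLM_ASCEND_ENABLE_FUSED_MC2: bool = os.environ.get(
    "VLLM_ASCEND_ENABLE_FUSED_MC2", "0") == "1"
```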
    shared_expert_num=1,
    shared_expert_rank_num=0,
The parameters shared_expert_num and shared_expert_rank_num are hardcoded. This implementation ignores the shared_experts parameter passed to the function, which likely contains the necessary information for handling shared experts. This will lead to incorrect behavior when shared experts are used. The values should be derived from the function arguments to correctly support shared experts.
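A minimal sketch of the kind of derivation the comment asks for; the helper name and signature here are hypothetical, not the project's actual API.

```python
from typing import Any, Dict, Optional


def resolve_shared_expert_args(shared_experts: Optional[Any],
                               shared_expert_rank_num: int = 0) -> Dict[str, int]:
    """Build shared-expert kwargs from the caller's arguments instead of hardcoding them."""
    return {
        # One shared expert when a shared-expert module is actually passed in, else none.
        "shared_expert_num": 1 if shared_experts is not None else 0,
        # Forward the configured rank count rather than assuming it is 0.
        "shared_expert_rank_num": shared_expert_rank_num,
    }
```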
Done
| """This implementation is for the scenarios listed below: | ||
| 1. `enable_expert_parallel=True`. | ||
| 2. `npu_moe_distribute_dispatch` and `npu_moe_distribute_combine` are available. | ||
| 3. `enable_expert_parallel=False` is not supported. | ||
| This implementation uses the FusedMC2 communication method, which is optimized for | ||
| Communication and Computation parallelism on Ascend devices. | ||
| """ |
The docstring for FusedMC2CommImpl appears to be copied from MC2CommImpl and is misleading. It mentions npu_moe_distribute_dispatch and npu_moe_distribute_combine, but this implementation uses the dispatch_gmm_combine_decode operator. The docstring should be updated to reflect the actual implementation and its requirements, such as being specific to w8a8_dynamic quantization, to improve maintainability.
| """This implementation is for the scenarios listed below: | |
| 1. `enable_expert_parallel=True`. | |
| 2. `npu_moe_distribute_dispatch` and `npu_moe_distribute_combine` are available. | |
| 3. `enable_expert_parallel=False` is not supported. | |
| This implementation uses the FusedMC2 communication method, which is optimized for | |
| Communication and Computation parallelism on Ascend devices. | |
| """ | |
| """This implementation is for the scenarios listed below: | |
| 1. `enable_expert_parallel=True`. | |
| 2. `VLLM_ASCEND_ENABLE_FUSED_MC2` is enabled. | |
| 3. `w8a8_dynamic` quantization is used. | |
| This implementation uses the `dispatch_gmm_combine_decode` operator, which is a fused | |
| operator for MoE decoding that combines communication and computation for optimization | |
| on Ascend devices. | |
| """ |
Done
Before:

    moe_comm_type = (
        MoECommType.MC2 if num_tokens <= mc2_tokens_capacity else
        MoECommType.FUSED_ALLTOALL if quant_type == "w8a8_dynamic"
        and get_ep_group().world_size <= 16 else MoECommType.ALLTOALL)

After:

    moe_comm_type = (
        (MoECommType.FUSED_MC2 if envs_ascend.VLLM_ASCEND_ENABLE_FUSED_MC2 and quant_type == "w8a8_dynamic"
         else MoECommType.MC2) if num_tokens <= mc2_tokens_capacity
        else MoECommType.FUSED_ALLTOALL if quant_type == "w8a8_dynamic" and get_ep_group().world_size <= 16
        else MoECommType.ALLTOALL)
This nested ternary expression is difficult to read and maintain. Consider refactoring it into a more explicit if/else structure to improve clarity.
Suggested change:

    if num_tokens <= mc2_tokens_capacity:
        if envs_ascend.VLLM_ASCEND_ENABLE_FUSED_MC2 and quant_type == "w8a8_dynamic":
            moe_comm_type = MoECommType.FUSED_MC2
        else:
            moe_comm_type = MoECommType.MC2
    elif quant_type == "w8a8_dynamic" and get_ep_group().world_size <= 16:
        moe_comm_type = MoECommType.FUSED_ALLTOALL
    else:
        moe_comm_type = MoECommType.ALLTOALL
Done
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
…bineDecode.

This commit adds model-side integration for the previously introduced experimental AscendC fused operator DispatchGmmCombineDecode, used in MoE decoding. The operator implementation itself was added in a prior PR vllm-project#4139. This change only adapts the model execution path to optionally use the fused operator.

When the environment variable VLLM_ASCEND_ENABLE_FUSED_MC2=1 is set, the original MC2 path composed of multiple operators (A8W8 dispatch → GMM → SwiGLU → GMM → combine) is replaced by the single fused operator DispatchGmmCombineDecode. By default, the existing multi-operator MC2 implementation is preserved.

Signed-off-by: wangqiankun <[email protected]>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
This PR adds model-side integration for the previously introduced experimental AscendC fused operator DispatchGmmCombineDecode, used in MoE decoding.
The operator implementation itself was added in a prior PR #4139.
This change only adapts the model execution path to optionally use the fused operator.
When the environment variable VLLM_ASCEND_ENABLE_FUSED_MC2=1 is set, the original MC2 path composed of multiple operators (A8W8 dispatch → GMM → SwiGLU → GMM → combine) is replaced by the single fused operator DispatchGmmCombineDecode.
By default, the existing multi-operator MC2 implementation is preserved.
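As a rough illustration of the opt-in behavior described above, the helper below restates the gating condition in isolation; it is hypothetical, and the real decision is made in the moe_comm_type selection shown earlier in this review.

```python
import os


def should_use_fused_mc2(quant_type: str, num_tokens: int,
                         mc2_tokens_capacity: int) -> bool:
    """Return True when the fused DispatchGmmCombineDecode path replaces the multi-op MC2 path."""
    fused_enabled = os.environ.get("VLLM_ASCEND_ENABLE_FUSED_MC2", "0") == "1"
    # The fused operator only applies to w8a8_dynamic decode batches within the MC2
    # token capacity; otherwise the existing multi-operator implementation is kept.
    return (fused_enabled and quant_type == "w8a8_dynamic"
            and num_tokens <= mc2_tokens_capacity)
```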
Does this PR introduce any user-facing change?
How was this patch tested?