enable flashinfer moe kernel for DP + EP #36838
czhu-cohere wants to merge 2 commits into vllm-project:main
Conversation
Signed-off-by: root <conway.zhu@cohere.com>
Force-pushed from 48b7786 to ff3ff57
Code Review
This pull request enables the FlashInfer CUTLASS MoE kernel for configurations using both Data Parallelism (DP) and Expert Parallelism (EP). The changes involve removing the restriction that prevented this kernel from being selected when DP is active. While the logic change appears correct, there is a lack of corresponding test updates to validate this new capability, which is a significant concern for ensuring correctness.
```python
flashinfer_cutlass_available = (
    has_flashinfer_cutlass_fused_moe()
    and use_ep
    and (not use_dp)
    and current_platform.has_device_capability(90)
)
```
This change enables the FlashInfer CUTLASS MoE kernel for configurations with Data Parallelism (use_dp=True). However, the corresponding tests in tests/kernels/moe/test_unquantized_backend_selection.py have not been updated to reflect this. The existing test test_select_cuda_flashinfer_cutlass_backend explicitly sets use_dp=False and includes a comment stating that CUTLASS does not support DP. To ensure the correctness of this feature and prevent future regressions, please add a new test case that validates the behavior when use_dp=True.
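A test along the lines the review asks for could look like the sketch below. The predicate here is a hypothetical stand-in for vLLM's real backend-selection entry point, so only the shape of the check (iterating over both DP settings) is meant to carry over:

```python
def flashinfer_cutlass_available(has_kernel, use_ep, use_dp, capability):
    # Hypothetical stand-in for vLLM's selection logic after this PR:
    # use_dp no longer appears in the condition.
    return has_kernel and use_ep and capability >= 90

def test_select_flashinfer_cutlass_with_dp():
    # After this PR, DP should no longer disqualify the backend,
    # so both use_dp values must select the kernel.
    for use_dp in (False, True):
        assert flashinfer_cutlass_available(
            True, use_ep=True, use_dp=use_dp, capability=90)
```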
Purpose
Previously the BF16 flashinfer moe kernel was disabled when dp > 1. The kernel itself should be able to support it; we just need to enable it on the vLLM side.
Test Plan
- `pytest tests/kernels/moe/test_unquantized_backend_selection.py`
- run gsm8k with bf16 Qwen3-30B-A3B on 2xB200 (DP2 EP2) and compare the results across different moe backends
server command
test command
Test Result
`pytest tests/kernels/moe/test_unquantized_backend_selection.py` — pass