Fix LongCat MLP tensor parallelism #29515
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Code Review
This pull request integrates support for fully data-parallel (DP) dense MoE execution in the LongcatFlash model. It updates LongcatFlashMLP to accept tp_rank and tp_size parameters, and configures them based on the status of enable_moe_dense_fully_dp(). When fully DP is enabled, tensor model parallel all-reduce operations are bypassed during the MLP forward pass. There are no review comments, so I have no additional feedback to provide.
Hi @Fridge003 @ishandhanani @Qiaolin-Yu, this PR fixes LongCat-Flash dense MLP tensor parallelism for TP size = 1. Could you please take a look or help route it to the right reviewer when available?
Motivation
The LongCat-Flash 2P4D dense path could not run with TP size = 1 because the MLP's tensor-parallel code assumed the weights were always sharded across TP ranks.
This PR fixes the dense MLP path so it can run correctly without tensor parallelism.
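
To illustrate the idea, here is a minimal pure-Python sketch. It is not the code from this PR. Only the names `tp_rank` and `tp_size` come from the PR; `resolve_dense_tp`, `ToyDenseMLP`, and `run_ranks` are hypothetical helpers, the `fully_dp` argument stands in for whatever `enable_moe_dense_fully_dp()` returns, and the "all-reduce" is simulated with a plain sum. The sketch assumes that in the fully DP case each rank holds the full weights, so it must use `tp_rank = 0`, `tp_size = 1` and skip the all-reduce.

```python
from dataclasses import dataclass
from typing import List, Tuple


def resolve_dense_tp(fully_dp: bool, world_tp_rank: int, world_tp_size: int) -> Tuple[int, int]:
    """Hypothetical helper: pick the (tp_rank, tp_size) the dense MLP should use."""
    if fully_dp:
        # Every rank keeps the full weights, so it behaves like a single-rank TP group.
        return 0, 1
    return world_tp_rank, world_tp_size


@dataclass
class ToyDenseMLP:
    """Toy stand-in for a row-parallel down projection (vector -> scalar)."""
    full_weight: List[float]
    tp_rank: int
    tp_size: int

    def forward_partial(self, hidden: List[float]) -> float:
        # Each rank only multiplies its own contiguous shard of the weights.
        shard = len(self.full_weight) // self.tp_size
        lo, hi = self.tp_rank * shard, (self.tp_rank + 1) * shard
        return sum(w * h for w, h in zip(self.full_weight[lo:hi], hidden[lo:hi]))

    @property
    def needs_all_reduce(self) -> bool:
        # With tp_size == 1 the partial result is already the full result.
        return self.tp_size > 1


def run_ranks(fully_dp: bool, world_tp_size: int,
              weight: List[float], hidden: List[float]) -> List[float]:
    """Simulate every rank's output for the same input (hypothetical driver)."""
    mlps = []
    for rank in range(world_tp_size):
        tp_rank, tp_size = resolve_dense_tp(fully_dp, rank, world_tp_size)
        mlps.append(ToyDenseMLP(weight, tp_rank, tp_size))
    partials = [mlp.forward_partial(hidden) for mlp in mlps]
    if mlps[0].needs_all_reduce:
        total = sum(partials)  # simulated all-reduce (sum) across ranks
        return [total] * world_tp_size
    return partials  # fully DP: no communication, each rank is already complete


weight = [1.0, 2.0, 3.0, 4.0]
hidden = [1.0, 1.0, 1.0, 1.0]
sharded = run_ranks(fully_dp=False, world_tp_size=2, weight=weight, hidden=hidden)
fully_dp = run_ranks(fully_dp=True, world_tp_size=2, weight=weight, hidden=hidden)
```

In this toy setup both modes give every rank the same output. The failure mode the PR describes corresponds to mixing the two: using sharded weights without the all-reduce, or running the all-reduce when each rank already holds a complete result.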
Tests
Successfully ran LongCat-Flash 2P4D with moe_dense_tp_size = 1.
CI States
Latest PR Test (Base): ❌ Run #28291065388
Latest PR Test (Extra): ❌ Run #28291065332