
Commit 474a328

gyou2021 and loadams authored
Enabled Qwen2-MoE Tensor Parallelism (TP) inference (#6551)
Modified _replace in auto_tp.py: The modification keeps the 'shared_expert_gate' and 'gate' layers in qwen2-moe as their original type, torch.nn.Linear, instead of converting them into LinearLayer. This way their weights are not split across multiple HPU/GPU cards, so qwen2-moe can run on multiple HPU/GPU cards. And since the 'gate' weights are not split across cards, no all-gather operations are needed, which may improve performance. --------- Co-authored-by: Logan Adams <[email protected]>
1 parent 1062a0c commit 474a328

File tree

2 files changed: +3 −1 lines


deepspeed/module_inject/auto_tp.py

mode changed 100644 → 100755 · +2 −1
```diff
@@ -333,7 +333,8 @@ def _replace(self, child, name, conv_linear_layer):
         weight_shape = child.weight.shape
         mp_replace = ReplaceWithTensorSlicing(mp_group=self.mp_group)
         # For mixtral-7x8b, need to skip MoE gate linear replace.
-        if name == "block_sparse_moe.gate":
+        if name == "block_sparse_moe.gate" or (('mlp.shared_expert_gate' == name or 'mlp.gate' == name)
+                                               and 'qwen2_moe' in str(type(self.module))):
             return child
         # For Yuan model
         if 'Yuan' in str(self.module):
```
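
For clarity, the new skip condition can be read as a standalone predicate. This is a minimal sketch with a hypothetical helper name; the real logic lives inline in `_replace` in deepspeed/module_inject/auto_tp.py:

```python
# Hypothetical helper distilled from the change above, not DeepSpeed API.
def should_keep_unsharded(name, module):
    # Mixtral-7x8b: the MoE router gate stays a plain torch.nn.Linear.
    if name == "block_sparse_moe.gate":
        return True
    # Qwen2-MoE: 'mlp.gate' and 'mlp.shared_expert_gate' likewise keep their
    # original torch.nn.Linear type, so their weights are replicated on every
    # HPU/GPU card and no all-gather is needed after the routing matmul.
    if name in ("mlp.shared_expert_gate", "mlp.gate") and "qwen2_moe" in str(type(module)):
        return True
    return False
```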

docs/_tutorials/automatic-tensor-parallelism.md

mode changed 100644 → 100755 · +1 −0
```diff
@@ -158,6 +158,7 @@ The following model families have been successfully tested with automatic tensor
 - plbart
 - qwen
 - qwen2
+- qwen2-moe
 - reformer
 - roberta
 - roformer
```
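
With this change in place, a Qwen2-MoE model can be served through DeepSpeed's automatic tensor parallelism. Below is a hedged usage sketch following the pattern in the automatic-tensor-parallelism tutorial; the checkpoint name and device setup are assumptions, not part of this commit:

```python
# Sketch based on the automatic-tensor-parallelism tutorial pattern.
# "Qwen/Qwen1.5-MoE-A2.7B" is an assumed Qwen2-MoE checkpoint.
# Launch with, e.g.: deepspeed --num_gpus 2 run_qwen2_moe.py
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-MoE-A2.7B"
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# replace_with_kernel_inject=False selects automatic tensor parallelism;
# expert weights are sharded across ranks, while the gate layers from the
# diff above stay unsharded torch.nn.Linear modules on every rank.
model = deepspeed.init_inference(model,
                                 tensor_parallel={"tp_size": world_size},
                                 dtype=torch.bfloat16,
                                 replace_with_kernel_inject=False)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = model.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```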
