[prototype] Expert Parallel #714
Conversation
[ghstack-poisoned]
ghstack-source-id: b4d3f46f9519f4a478fca22b5665bf72bfe01409 Pull Request resolved: #714
The expert-choice MoE implementation is mostly from torchtune: pytorch/torchtune#1902

The PR requires (ad hoc) changes to pytorch: pytorch/pytorch#141937

Issue tracking:
- [ ] [dp2ep] how to apply FSDP only to the non-MoE modules?
- [ ] [dp2ep] `shard_dim_alltoall` not robust (especially during backward with more than 1D)
- [ ] [tp2ep] backward efficiency may not be optimized (e.g. right now `aten.scatter.src` only supports [replicate](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_tensor_ops.py#L368) sharding prop)
- [ ] [tp2ep] when using DTensor (e.g. in "tp2ep"), the backward `aten.scatter.src` requires `_allow_implicit_replication` (maybe because in backward some tensor is not generated as DTensor)
- [ ] some other issues tracked in pytorch/pytorch#141937
- [ ] `torch.compile` fails on `torch.topk`

Haven't worked on:
- softmax scoring instead of sigmoid (can be done similarly, would incur extra communications)
- part of DP (e.g. CP) to EP

Not considering:
- shared expert overlapping
- token-choice MoE

[ghstack-poisoned]
ghstack-source-id: d03719eb6b659c319631bed9b276d6bac6e7df8d Pull Request resolved: #714
if self.use_sigmoid:
    scores = torch.sigmoid(scores.to(torch.float32)).to(x.dtype)
else:
    scores = F.softmax(scores.to(torch.float32), dim=0).to(x.dtype)
One comment here: it's not clear, imo, whether the optimal order is softmax then topk, or topk then softmax.
It does not change the routing, but it does change the weights.
My nanogpt MoE has an option to toggle the ordering.
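A quick illustrative sketch of the two orderings discussed above (hypothetical shapes and names, not code from this PR): because softmax is monotonic along the dimension it normalizes, topk selects the same tokens either way, but the combine weights differ, since the post-topk softmax normalizes over only the k selected scores.

```python
import torch
import torch.nn.functional as F

scores = torch.randn(8, 4)  # (num_tokens, num_experts); hypothetical sizes
k = 2

# Option A: softmax over the token dim first, then each expert takes its top-k tokens.
probs_a = F.softmax(scores.to(torch.float32), dim=0)
w_a, idx_a = torch.topk(probs_a.transpose(0, 1), k, dim=1)

# Option B: top-k on raw scores first, then softmax over only the k selected scores.
raw_b, idx_b = torch.topk(scores.transpose(0, 1), k, dim=1)
w_b = F.softmax(raw_b.to(torch.float32), dim=1)

# Same routing decision (softmax preserves per-column ordering), different weights.
assert torch.equal(idx_a, idx_b)
print(w_a[0], w_b[0])  # w_b rows sum to 1; w_a rows generally do not
```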
somehow I lost access to this ghstack, moving to #725 instead
Stack from ghstack (oldest at bottom):
The expert-choice MoE implementation is mostly from torchtune: pytorch/torchtune#1902
The PR requires (ad hoc) changes to pytorch: pytorch/pytorch#141937
Issue tracking:
- [ ] [dp2ep] how to apply FSDP (`fully_shard`) only to the non-MoE modules?
- [ ] [dp2ep] `shard_dim_alltoall` not robust (especially during backward with more than 1D)
- [ ] [tp2ep] backward efficiency may not be optimized (e.g. right now `aten.scatter.src` only supports replicate sharding prop)
- [ ] [tp2ep] when using DTensor (e.g. in "tp2ep"), the backward `aten.scatter.src` requires `_allow_implicit_replication` (maybe because in backward some tensor is not generated as DTensor)
- [ ] some other issues tracked in pytorch/pytorch#141937
- [ ] `torch.compile` fails on `torch.topk`

Haven't worked on:
- softmax scoring instead of sigmoid (can be done similarly, would incur extra communications)
- part of DP (e.g. CP) to EP

Not considering:
- shared expert overlapping
- token-choice MoE
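For readers unfamiliar with the expert-choice formulation referenced in this description, here is a minimal, self-contained sketch of expert-choice routing. The names and shapes (`expert_choice_route`, `router_weight`, `capacity`) are made up for illustration; the scoring toggle mirrors the sigmoid/softmax snippet quoted above, but this is not the torchtune or torchtitan implementation.

```python
import torch
import torch.nn.functional as F

def expert_choice_route(x, router_weight, capacity, use_sigmoid=True):
    """Illustrative expert-choice routing: each expert picks its own top-`capacity` tokens."""
    # x: (num_tokens, dim); router_weight: (dim, num_experts)
    scores = x @ router_weight                        # (num_tokens, num_experts)
    if use_sigmoid:
        scores = torch.sigmoid(scores.to(torch.float32)).to(x.dtype)
    else:
        # softmax over the token dim: each expert's scores compete across tokens
        scores = F.softmax(scores.to(torch.float32), dim=0).to(x.dtype)
    # Per-expert top-k over tokens gives balanced expert load by construction.
    weights, token_idx = torch.topk(scores.transpose(0, 1), capacity, dim=1)
    routed = x[token_idx]                             # (num_experts, capacity, dim)
    # Scale routed tokens by their router scores before the expert MLPs.
    return routed * weights.unsqueeze(-1), token_idx

# Usage with toy sizes
x = torch.randn(16, 32)         # 16 tokens, hidden dim 32
w = torch.randn(32, 4)          # 4 experts
routed, idx = expert_choice_route(x, w, capacity=8)
print(routed.shape, idx.shape)  # torch.Size([4, 8, 32]) torch.Size([4, 8])
```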