[prototype] Expert Parallel #714

Closed · wants to merge 2 commits

Conversation

tianyu-l (Contributor) commented Dec 3, 2024

Stack from ghstack (oldest at bottom):

The expert-choice MoE implementation is mostly from torchtune: pytorch/torchtune#1902
The PR requires (ad hoc) changes to pytorch: pytorch/pytorch#141937
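
For context, a minimal illustrative sketch of expert-choice routing follows (hypothetical shapes and names; not the torchtune or PR implementation): instead of each token picking its top-k experts, each expert picks its top-`capacity` tokens, so per-expert load is balanced by construction.

```python
# Illustrative sketch only -- hypothetical shapes/names, not the PR's code.
import torch

def expert_choice_route(x, gate_weight, capacity):
    # x: (num_tokens, dim); gate_weight: (dim, num_experts)
    logits = x @ gate_weight                      # (num_tokens, num_experts)
    scores = torch.sigmoid(logits.to(torch.float32)).to(x.dtype)
    # Each expert (row after the transpose) selects its top-`capacity` tokens.
    top_scores, top_token_idx = torch.topk(scores.transpose(0, 1), k=capacity, dim=1)
    # top_token_idx[e] are the tokens routed to expert e; top_scores[e] are the
    # weights used to scale expert e's outputs before they are combined.
    return top_scores, top_token_idx
```

The sigmoid scoring mirrors the `use_sigmoid` branch quoted later in this thread; softmax scoring is listed under "Haven't worked on".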

Issue tracking:

  • [dp2ep] how to apply FSDP only to the non-MoE modules? (for now `fully_shard` needs to be commented out; see the sketch after this list)
  • [dp2ep] `shard_dim_alltoall` not robust (especially during backward with more than 1D)
  • [tp2ep] backward efficiency may not be optimized (e.g. right now `aten.scatter.src` only supports [replicate](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_tensor_ops.py#L368) sharding prop)
  • [tp2ep] when using DTensor (e.g. in "tp2ep"), the backward `aten.scatter.src` requires `_allow_implicit_replication` (maybe because some tensor in backward is not generated as a DTensor)
  • some other issues tracked in the ad hoc changes to unblock the Expert Parallel prototype, pytorch/pytorch#141937
  • `torch.compile` fails on `torch.topk`
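
On the first item above: `fully_shard` here refers to FSDP2's per-module API. As a rough sketch only (not the PR's solution), wrapping just the non-MoE submodules might look like the following; the module/attribute names and the import path are assumptions, and the sketch deliberately leaves the MoE parameters unwrapped, which is exactly the open interaction with dp2ep that the item tracks.

```python
# Rough sketch with hypothetical module names -- not the PR's actual wiring.
# Idea: apply FSDP2's fully_shard only to non-MoE submodules, leaving the
# experts to be sharded by expert parallelism instead of the DP mesh.
from torch.distributed._composable.fsdp import fully_shard  # path may differ by version

def apply_fsdp_to_non_moe(model, dp_mesh):
    for block in model.layers:
        fully_shard(block.attention, mesh=dp_mesh)
        if hasattr(block, "feed_forward"):   # dense FFN blocks only
            fully_shard(block.feed_forward, mesh=dp_mesh)
        # block.moe (the experts) is intentionally not wrapped; how its
        # parameters should interact with the DP mesh is the open question.
```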

Haven't worked on

  • softmax scoring instead of sigmoid (can be done similarly, would incur extra communications)
  • part of DP (e.g. CP) to EP

Not considering

  • shared expert overlapping
  • token-choice MoE

[ghstack-poisoned]
tianyu-l added a commit that referenced this pull request Dec 3, 2024
ghstack-source-id: b4d3f46f9519f4a478fca22b5665bf72bfe01409
Pull Request resolved: #714
facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Dec 3, 2024
tianyu-l marked this pull request as draft December 3, 2024 04:10
tianyu-l added a commit that referenced this pull request Dec 3, 2024
ghstack-source-id: d03719eb6b659c319631bed9b276d6bac6e7df8d
Pull Request resolved: #714
if self.use_sigmoid:
    scores = torch.sigmoid(scores.to(torch.float32)).to(x.dtype)
else:
    scores = F.softmax(scores.to(torch.float32), dim=0).to(x.dtype)

A reviewer (Contributor) commented on these lines:

One comment here: it's not clear (IMO) whether the optimal order is softmax then topk, or topk then softmax.
It does not change the routing, but it does change the weights.
My nanogpt MoE has an option to toggle the ordering.
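
For illustration, a small standalone sketch (not from this PR or nanogpt) of the two orderings being contrasted: softmax is monotonic along the dimension it normalizes, so both orders select the same indices, but topk-then-softmax renormalizes the weights over only the selected entries.

```python
# Standalone illustration of softmax->topk vs. topk->softmax; shapes are arbitrary.
import torch
import torch.nn.functional as F

logits = torch.randn(8, 4)   # e.g. (tokens, experts)
k = 2

# Order A: softmax over all scores, then topk
w_a, idx_a = torch.topk(F.softmax(logits, dim=-1), k, dim=-1)

# Order B: topk on raw logits, then softmax over only the selected entries
top_logits, idx_b = torch.topk(logits, k, dim=-1)
w_b = F.softmax(top_logits, dim=-1)

assert torch.equal(idx_a, idx_b)   # same routing decision
print((w_a - w_b).abs().max())     # different weights: only w_b sums to 1 over the top-k
```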

tianyu-l (Contributor, Author) commented Dec 9, 2024

somehow I lost access to this ghstack, moving to #725 instead

tianyu-l closed this Dec 9, 2024
tianyu-l deleted the gh/tianyu-l/21/head branch December 18, 2024 00:43