[prototype] Expert Parallel #714
Conversation
[ghstack-poisoned]
ghstack-source-id: b4d3f46f9519f4a478fca22b5665bf72bfe01409 Pull Request resolved: #714
The expert-choice MoE implementation is mostly from torchtune: pytorch/torchtune#1902

The PR requires (ad hoc) changes to pytorch: pytorch/pytorch#141937

Issue tracking:
- [ ] [dp2ep] how to apply FSDP only to the non-MoE modules?
- [ ] [dp2ep] `shard_dim_alltoall` not robust (especially during backward with more than 1D)
- [ ] [tp2ep] backward efficiency may not be optimized (e.g. right now `aten.scatter.src` only supports [replicate](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/_tensor_ops.py#L368) sharding prop)
- [ ] [tp2ep] when using DTensor (e.g. in "tp2ep"), the backward `aten.scatter.src` requires `_allow_implicit_replication` (maybe because in backward some tensor is not generated as DTensor)
- [ ] some other issues tracked in pytorch/pytorch#141937
- [ ] `torch.compile` fails on `torch.topk`

Haven't worked on:
- softmax scoring instead of sigmoid (can be done similarly, would incur extra communications)
- part of DP (e.g. CP) to EP

Not considering:
- shared expert overlapping
- token-choice MoE

[ghstack-poisoned]
ghstack-source-id: d03719eb6b659c319631bed9b276d6bac6e7df8d Pull Request resolved: #714
if self.use_sigmoid:
    scores = torch.sigmoid(scores.to(torch.float32)).to(x.dtype)
else:
    scores = F.softmax(scores.to(torch.float32), dim=0).to(x.dtype)
One comment here: it's not clear, imo, whether the optimal order is softmax then topk, or topk then softmax.
It does not change the routing, but it does change the weights.
My nanogpt MoE has an option to toggle the ordering.
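A quick illustrative sketch of the two orderings discussed above (hypothetical shapes and names, not code from this PR): because softmax is monotonic along the dimension it normalizes, topk selects the same tokens either way, but the combine weights differ, since the post-topk softmax normalizes over only the k selected scores.

```python
import torch
import torch.nn.functional as F

scores = torch.randn(8, 4)  # (num_tokens, num_experts); hypothetical sizes
k = 2

# Option A: softmax over the token dim first, then each expert takes its top-k tokens.
probs_a = F.softmax(scores.to(torch.float32), dim=0)
w_a, idx_a = torch.topk(probs_a.transpose(0, 1), k, dim=1)

# Option B: top-k on raw scores first, then softmax over only the k selected scores.
raw_b, idx_b = torch.topk(scores.transpose(0, 1), k, dim=1)
w_b = F.softmax(raw_b.to(torch.float32), dim=1)

# Same routing decision (softmax preserves per-column ordering), different weights.
assert torch.equal(idx_a, idx_b)
print(w_a[0], w_b[0])  # w_b rows sum to 1; w_a rows generally do not
```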
somehow I lost access to this ghstack, moving to #725 instead
Stack from ghstack (oldest at bottom):
The expert-choice MoE implementation is mostly from torchtune: pytorch/torchtune#1902
The PR requires (ad hoc) changes to pytorch: pytorch/pytorch#141937
Issue tracking:
- [ ] [dp2ep] how to apply FSDP (`fully_shard`) only to the non-MoE modules?
- [ ] [dp2ep] `shard_dim_alltoall` not robust (especially during backward with more than 1D)
- [ ] [tp2ep] backward efficiency may not be optimized (e.g. right now `aten.scatter.src` only supports replicate sharding prop)
- [ ] [tp2ep] when using DTensor (e.g. in "tp2ep"), the backward `aten.scatter.src` requires `_allow_implicit_replication` (maybe because in backward some tensor is not generated as DTensor)
- [ ] some other issues tracked in pytorch/pytorch#141937
- [ ] `torch.compile` fails on `torch.topk`

Haven't worked on:
- softmax scoring instead of sigmoid (can be done similarly, would incur extra communications)
- part of DP (e.g. CP) to EP

Not considering:
- shared expert overlapping
- token-choice MoE
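For readers unfamiliar with the expert-choice formulation referenced in this description, here is a minimal, self-contained sketch of expert-choice routing. The names and shapes (`expert_choice_route`, `router_weight`, `capacity`) are made up for illustration; the scoring toggle mirrors the sigmoid/softmax snippet quoted above, but this is not the torchtune or torchtitan implementation.

```python
import torch
import torch.nn.functional as F

def expert_choice_route(x, router_weight, capacity, use_sigmoid=True):
    """Illustrative expert-choice routing: each expert picks its own top-`capacity` tokens."""
    # x: (num_tokens, dim); router_weight: (dim, num_experts)
    scores = x @ router_weight                        # (num_tokens, num_experts)
    if use_sigmoid:
        scores = torch.sigmoid(scores.to(torch.float32)).to(x.dtype)
    else:
        # softmax over the token dim: each expert's scores compete across tokens
        scores = F.softmax(scores.to(torch.float32), dim=0).to(x.dtype)
    # Per-expert top-k over tokens gives balanced expert load by construction.
    weights, token_idx = torch.topk(scores.transpose(0, 1), capacity, dim=1)
    routed = x[token_idx]                             # (num_experts, capacity, dim)
    # Scale routed tokens by their router scores before the expert MLPs.
    return routed * weights.unsqueeze(-1), token_idx

# Usage with toy sizes
x = torch.randn(16, 32)         # 16 tokens, hidden dim 32
w = torch.randn(32, 4)          # 4 experts
routed, idx = expert_choice_route(x, w, capacity=8)
print(routed.shape, idx.shape)  # torch.Size([4, 8, 32]) torch.Size([4, 8])
```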