[prototype] Expert Parallel #714
Closed
Stack from ghstack (oldest at bottom):
The expert-choice MoE implementation is mostly from torchtune: pytorch/torchtune#1902
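For context, expert-choice routing inverts the usual token-choice routing: each expert selects its top `capacity` tokens rather than each token selecting its top-k experts. Below is a minimal sketch of that routing step with hypothetical names and shapes; it is not the torchtune or torchtitan code, just an illustration of the technique.

```python
import torch
import torch.nn.functional as F

def expert_choice_route(x: torch.Tensor, router_weight: torch.Tensor, capacity: int):
    # x: (num_tokens, dim); router_weight: (dim, num_experts)
    scores = F.softmax(x @ router_weight, dim=-1)                # (num_tokens, num_experts)
    # Each expert picks its `capacity` highest-scoring tokens: top-k over the token dim.
    top_scores, top_idx = torch.topk(scores, k=capacity, dim=0)  # (capacity, num_experts)
    top_scores, top_idx = top_scores.t(), top_idx.t()            # (num_experts, capacity)
    routed = x[top_idx]                                          # (num_experts, capacity, dim)
    return routed, top_scores, top_idx
```

The experts then run on `routed` as a batched matmul, and the outputs are combined back into their token positions weighted by `top_scores`; that combine step is presumably where the `aten.scatter.src` items in the issue list below come from.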
The PR requires (ad hoc) changes to pytorch: pytorch/pytorch#141937
Issue tracking:
- `fully_shard`
- `shard_dim_alltoall` not robust (especially during backward with more than 1D); see the all-to-all sketch below
- `aten.scatter.src` only supports replicate sharding prop
- `aten.scatter.src` requires `_allow_implicit_replication` (maybe because in backward some tensor is not generated as DTensor)
- `torch.compile` fails on `torch.topk`
Haven't worked on
Not considering
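On the `shard_dim_alltoall` point: with expert parallelism the routed tokens must be exchanged across EP ranks so that each token reaches the rank hosting its expert, and that exchange is an all-to-all along the expert dim. Here is a minimal sketch of that dispatch using plain `torch.distributed` collectives rather than DTensor, assuming a fixed expert-choice capacity so the splits are even; the names and shapes are hypothetical, not this PR's code.

```python
import torch
import torch.distributed as dist

def ep_dispatch(routed: torch.Tensor, ep_group: dist.ProcessGroup) -> torch.Tensor:
    # routed: (num_experts, capacity, dim) on every rank, with rows grouped
    # contiguously by the EP rank that owns each expert. Because the capacity is
    # fixed, the first dim splits evenly across ranks and a single even
    # all-to-all moves each token group to the rank hosting its expert.
    out = torch.empty_like(routed)
    dist.all_to_all_single(out, routed, group=ep_group)
    # out has the same shape, but now holds, for every source rank, the tokens
    # that rank routed to this rank's local experts.
    return out
```

The backward of this step is the mirror all-to-all (gradients flow back to the source ranks), which is presumably where the backward robustness concern above comes in.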