Hi team,
I tried to reproduce the EP implementation in my model, but I find it runs much more slowly with EP enabled.
I see there is a blocking CPU-GPU synchronization at the beginning of the all-to-all in token dispatch, needed to obtain input_split and output_split, which is kind of a bottleneck. Is it possible to avoid it without symmetric-memory all-to-all?
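For context, here is a minimal sketch of where that synchronization comes from (function and variable names are my own illustration, not from the actual codebase): `torch.distributed.all_to_all_single` takes its split sizes as Python ints, so per-rank token counts computed on the GPU must be copied back to the host before the collective can be posted.

```python
import torch

def make_splits(tokens_per_rank: torch.Tensor) -> list:
    # On a CUDA tensor, .tolist() forces a device-to-host copy and a
    # stream synchronization: the CPU blocks until the counts are ready,
    # because all_to_all_single needs plain Python ints for split sizes.
    return tokens_per_rank.tolist()

# In practice these counts come from routing logic running on the GPU.
counts = torch.tensor([3, 1, 2, 2])
splits = make_splits(counts)
print(splits)  # [3, 1, 2, 2]

# The splits then feed the collective, e.g. (not executed here):
# torch.distributed.all_to_all_single(
#     output, input,
#     output_split_sizes=output_splits, input_split_sizes=splits)
```

Symmetric-memory all-to-all avoids this because each rank can read the peer's counts directly from device memory without a host round trip.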
Also, could you share which parts of the EP workflow need torch.compile? I noticed that the uses of torch.gather and torch.scatter_add may not be optimal; I guess they may need torch.compile to be fused or otherwise optimized.
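To make the question concrete, here is a hedged sketch of the gather/scatter_add pattern I mean, assuming a typical MoE dispatch/combine (the helper names are mine, not the repo's): tokens are permuted into expert order with indexing/gather, and expert outputs are routed back with scatter_add, which torch.compile could potentially fuse with surrounding ops.

```python
import torch

def dispatch(tokens: torch.Tensor, perm: torch.Tensor) -> torch.Tensor:
    # Gather step: reorder tokens into contiguous per-expert groups.
    return tokens[perm]

def combine(expert_out: torch.Tensor, perm: torch.Tensor,
            num_tokens: int) -> torch.Tensor:
    # Scatter step: route expert outputs back to their original slots.
    # scatter_add_ sums contributions when a token visits multiple
    # experts (top-k routing).
    out = torch.zeros(num_tokens, expert_out.shape[1],
                      dtype=expert_out.dtype)
    idx = perm.unsqueeze(1).expand_as(expert_out)
    out.scatter_add_(0, idx, expert_out)
    return out

tokens = torch.arange(8.0).reshape(4, 2)
perm = torch.tensor([2, 0, 3, 1])          # routing permutation
restored = combine(dispatch(tokens, perm), perm, 4)
print(torch.equal(restored, tokens))       # True: combine inverts dispatch
```

Eagerly, each of these launches its own kernel; my guess is that torch.compile is used to fuse the index/scatter kernels with neighboring elementwise work, but I would appreciate confirmation.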
Thanks!