I noticed that in the code implementation, the expert network INRNet is just an FC layer with positional embedding, and no separate layers are added for each expert sub-network. However, the paper states: "To downsize the whole MoE layer, we share the positional embedding and the first 4 layers among all expert networks. Then we append two independent layers for each expert. We note this design can make two experts share the early-stage features and adjust their coherence."
How can this be explained? Code-wise, it's hardly an MoE; it looks more like an MLP layer with sparse coding.
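Based on the quoted passage, I would have expected something along these lines. This is only a rough sketch of what the paper seems to describe, not the repository's actual INRNet code; the class name, hidden width, expert count, and sinusoidal embedding below are placeholders I made up for illustration.

```python
import torch
import torch.nn as nn

class SharedTrunkMoE(nn.Module):
    """Sketch of the paper's description: shared positional embedding and
    first 4 FC layers, plus 2 independent layers appended per expert.
    Dimensions and the embedding scheme are assumptions, not the repo's code."""

    def __init__(self, in_dim=2, hidden_dim=256, out_dim=3, num_experts=4, num_freqs=10):
        super().__init__()
        self.num_freqs = num_freqs
        embed_dim = in_dim * 2 * num_freqs  # sin/cos per frequency (assumed embedding)
        # Shared among all experts: positional embedding + first 4 layers
        self.shared = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Two independent layers appended for each expert
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, out_dim),
            )
            for _ in range(num_experts)
        ])

    def positional_embedding(self, x):
        # Standard sinusoidal embedding; the actual embedding used may differ.
        freqs = 2.0 ** torch.arange(self.num_freqs, dtype=torch.float32, device=x.device) * torch.pi
        angles = x.unsqueeze(-1) * freqs               # (..., in_dim, num_freqs)
        emb = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return emb.flatten(start_dim=-2)               # (..., in_dim * 2 * num_freqs)

    def forward(self, coords, expert_idx):
        # Early-stage features are shared; only the selected expert head is independent.
        h = self.shared(self.positional_embedding(coords))
        return self.experts[expert_idx](h)
```

In the current code I don't see anything corresponding to the per-expert heads (`self.experts` above), which is what prompted the question.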