I noticed that in the code implementation, the expert network INRNet is just an FC layer with positional embedding, and no separate layers are added for each expert sub-network. However, the paper states: "To downsize the whole MoE layer, we share the positional embedding and the first 4 layers among all expert networks. Then we append two independent layers for each expert. We note this design can make two experts share the early-stage features and adjust their coherence."
How can this be explained? Code-wise, it's hardly an MoE; it looks more like an MLP layer with sparse coding.
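Based on the quoted passage, I would have expected something along these lines. This is only a rough sketch of what the paper seems to describe, not the repository's actual INRNet code; the class name, hidden width, expert count, and sinusoidal embedding below are placeholders I made up for illustration.

```python
import torch
import torch.nn as nn

class SharedTrunkMoE(nn.Module):
    """Sketch of the paper's description: shared positional embedding and
    first 4 FC layers, plus 2 independent layers appended per expert.
    Dimensions and the embedding scheme are assumptions, not the repo's code."""

    def __init__(self, in_dim=2, hidden_dim=256, out_dim=3, num_experts=4, num_freqs=10):
        super().__init__()
        self.num_freqs = num_freqs
        embed_dim = in_dim * 2 * num_freqs  # sin/cos per frequency (assumed embedding)
        # Shared among all experts: positional embedding + first 4 layers
        self.shared = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Two independent layers appended for each expert
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, out_dim),
            )
            for _ in range(num_experts)
        ])

    def positional_embedding(self, x):
        # Standard sinusoidal embedding; the actual embedding used may differ.
        freqs = 2.0 ** torch.arange(self.num_freqs, dtype=torch.float32, device=x.device) * torch.pi
        angles = x.unsqueeze(-1) * freqs               # (..., in_dim, num_freqs)
        emb = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return emb.flatten(start_dim=-2)               # (..., in_dim * 2 * num_freqs)

    def forward(self, coords, expert_idx):
        # Early-stage features are shared; only the selected expert head is independent.
        h = self.shared(self.positional_embedding(coords))
        return self.experts[expert_idx](h)
```

In the current code I don't see anything corresponding to the per-expert heads (`self.experts` above), which is what prompted the question.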