User prompt
follow agent.md and try leading 2 dense layers. That is, make the first two layers have a shared expert of width 3x hidden_dim, and no routed experts. make sure that all the QB weights and logging stuff still works.
TL;DR
Make the first two layers dense (a 3x hidden_dim MLP with no router and no separate shared expert; with no routed experts, this is equivalent to the lone 3x shared expert the prompt asks for). MoE routing starts at layer 2 (zero-indexed). This tests whether early layers benefit more from full-width dense computation than from sparse routing.
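A minimal sketch of how the layer stack could branch on the new knob. Only num_leading_dense_layers, hidden_dim, and the 3x expansion come from this note; DenseMLP, build_ffn_stack, make_moe_block, and the GELU activation are illustrative assumptions, not the repo's actual code:

```python
import torch.nn as nn

class DenseMLP(nn.Module):
    """Full-width dense FFN: 3x hidden_dim expansion, no router, no experts."""
    def __init__(self, hidden_dim: int, mult: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, mult * hidden_dim),
            nn.GELU(),
            nn.Linear(mult * hidden_dim, hidden_dim),
        )

    def forward(self, x):
        return self.net(x)

def build_ffn_stack(num_layers, num_leading_dense_layers, hidden_dim, make_moe_block):
    """Dense MLP for layers [0, num_leading_dense_layers); routed MoE after."""
    return nn.ModuleList(
        DenseMLP(hidden_dim) if i < num_leading_dense_layers else make_moe_block()
        for i in range(num_layers)
    )
```

Passing the MoE constructor in as a factory keeps the sketch self-contained without guessing at the real MoE block's signature.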
Scope
Experiment: moe_leading_dense
Sweep file: experiments/grug/moe/leading_dense_sweep.py
Config: GrugModelConfig.num_leading_dense_layers=2
Layer layout (d=512, L=6): layers 0-1 dense (3x hidden_dim MLP), layers 2-5 MoE
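A hedged sketch of the config knob and the two Gate 1 runs. GrugModelConfig, num_leading_dense_layers, and the d=512/L=6 shape come from the scope above; the default field values and the choice of an all-MoE baseline as the second run are assumptions about what leading_dense_sweep.py contains:

```python
from dataclasses import dataclass

@dataclass
class GrugModelConfig:
    # Only the fields this experiment touches; the real config has more.
    hidden_dim: int = 512
    num_layers: int = 6
    num_leading_dense_layers: int = 0  # 0 keeps the all-MoE baseline

# Two Gate 1 runs: assumed all-MoE baseline vs. two leading dense layers.
SWEEP = [
    GrugModelConfig(num_leading_dense_layers=0),  # baseline (assumption)
    GrugModelConfig(num_leading_dense_layers=2),  # leading-dense variant
]
```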
Gate 1 runs (2 total)
Decision log
empty
Conclusion
pending