Agent MoE Experiment: Leading 2 dense layers (no MoE on first 2) #5233

@ClassicLarry

Description

TL;DR

Make the first two layers dense (a 3x hidden_dim MLP, no router, no shared expert), so MoE routing starts at layer 2. This tests whether early layers benefit more from full-width dense computation.
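As a rough illustration only (not the repo's actual code), a stack-construction sketch in the spirit of the change, assuming a config carrying the num_leading_dense_layers field from the Scope section and hypothetical factory callables for the two block types:

```python
import torch.nn as nn

def build_mlp_stack(cfg, make_dense, make_moe):
    """Assemble per-layer MLP blocks: leading dense layers, then MoE.

    make_dense/make_moe are hypothetical factory callables; only the
    3x hidden_dim width and cfg.num_leading_dense_layers come from
    this issue.
    """
    blocks = []
    for layer_idx in range(cfg.num_layers):
        if layer_idx < cfg.num_leading_dense_layers:
            # Leading dense layers: one MLP at 3x hidden_dim,
            # no router, no shared expert.
            blocks.append(make_dense(cfg.hidden_dim, 3 * cfg.hidden_dim))
        else:
            # Remaining layers: MoE with routed experts plus a shared expert.
            blocks.append(make_moe(cfg))
    return nn.ModuleList(blocks)
```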

User prompt

follow agent.md and try leading 2 dense layers. That is, make the first two layers have a shared expert of width 3x hidden_dim, and no routed experts. make sure that all the QB weights and logging stuff still works.

Scope

  • Parent: MoE Scaling up to April goal #4281
  • Branch: moe_leading_dense
  • Sweep file: experiments/grug/moe/leading_dense_sweep.py
  • Config: GrugModelConfig.num_leading_dense_layers=2
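A hedged sketch of where the flag could sit on the config; only num_leading_dense_layers=2 and the d512 / L=6 / E=64 / K=4 values quoted elsewhere in this issue come from the source, while the other field names and the default value are placeholders:

```python
from dataclasses import dataclass

@dataclass
class GrugModelConfig:
    # Fields other than num_leading_dense_layers are placeholders;
    # a default of 0 (pure MoE) is an assumption.
    hidden_dim: int = 512
    num_layers: int = 6
    num_experts: int = 64
    top_k: int = 4
    num_leading_dense_layers: int = 0

# The sweep would set the new field explicitly, e.g.:
cfg = GrugModelConfig(num_leading_dense_layers=2)
```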

Layer layout (d512, L=6)

  • Layers 0-1: dense MLP (3x hidden_dim, no router, no shared expert)
  • Layers 2-5: MoE (E=64 routed experts, K=4 + shared expert)
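For a sanity check on what the dense layers cost per token, a back-of-the-envelope sketch; the per-expert FFN width is not stated in this issue, so it is an explicit assumption here, as is the use of simple two-matrix MLPs (up and down projections only):

```python
def mlp_active_params(hidden_dim, expert_ffn_dim, top_k, dense_mult=3):
    """Per-token active MLP parameters per layer (biases/norms ignored).

    expert_ffn_dim is an assumption; the issue only fixes the 3x dense
    width, E=64, K=4, and one shared expert.
    """
    dense_layer = 2 * hidden_dim * (dense_mult * hidden_dim)    # leading dense MLP
    moe_layer = (top_k + 1) * 2 * hidden_dim * expert_ffn_dim   # K routed + 1 shared
    return dense_layer, moe_layer

# Example at d=512, K=4, with an assumed per-expert width of 512:
dense, moe = mlp_active_params(512, 512, 4)
print(dense, moe)  # 1572864 (dense) vs 2621440 (MoE active) per layer
```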

Gate 1 runs (2 total)

Dim   Budget     Status
512   2.19e17    pending
768   1.70e18    pending

Decision log

empty

Conclusion

pending
