Agent MoE Experiment: Leading 2 dense layers (no MoE on first 2) #5233

@ClassicLarry

Description

TL;DR

Make the first two layers dense (a 3x hidden_dim MLP, no router, no shared expert), so MoE routing starts at layer 2. This tests whether early layers benefit more from full-width dense computation.
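As a rough illustration only (not the repo's actual code), a stack-construction sketch in the spirit of the change, assuming a config carrying the num_leading_dense_layers field from the Scope section and hypothetical factory callables for the two block types:

```python
import torch.nn as nn

def build_mlp_stack(cfg, make_dense, make_moe):
    """Assemble per-layer MLP blocks: leading dense layers, then MoE.

    make_dense/make_moe are hypothetical factory callables; only the
    3x hidden_dim width and cfg.num_leading_dense_layers come from
    this issue.
    """
    blocks = []
    for layer_idx in range(cfg.num_layers):
        if layer_idx < cfg.num_leading_dense_layers:
            # Leading dense layers: one MLP at 3x hidden_dim,
            # no router, no shared expert.
            blocks.append(make_dense(cfg.hidden_dim, 3 * cfg.hidden_dim))
        else:
            # Remaining layers: MoE with routed experts plus a shared expert.
            blocks.append(make_moe(cfg))
    return nn.ModuleList(blocks)
```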

User prompt

follow agent.md and try leading 2 dense layers. That is, make the first two layers have a shared expert of width 3x hidden_dim, and no routed experts. make sure that all the QB weights and logging stuff still works.

Scope

  • Parent: MoE Scaling up to April goal #4281
  • Branch: moe_leading_dense
  • Sweep file: experiments/grug/moe/leading_dense_sweep.py
  • Config: GrugModelConfig.num_leading_dense_layers=2
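A hedged sketch of where the flag could sit on the config; only num_leading_dense_layers=2 and the d512 / L=6 / E=64 / K=4 values quoted elsewhere in this issue come from the source, while the other field names and the default value are placeholders:

```python
from dataclasses import dataclass

@dataclass
class GrugModelConfig:
    # Fields other than num_leading_dense_layers are placeholders;
    # a default of 0 (pure MoE) is an assumption.
    hidden_dim: int = 512
    num_layers: int = 6
    num_experts: int = 64
    top_k: int = 4
    num_leading_dense_layers: int = 0

# The sweep would set the new field explicitly, e.g.:
cfg = GrugModelConfig(num_leading_dense_layers=2)
```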

Layer layout (d512, L=6)

  • Layers 0-1: dense MLP (3x hidden_dim, no router, no shared expert)
  • Layers 2-5: MoE (E=64 routed experts, K=4 + shared expert)
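For a sanity check on what the dense layers cost per token, a back-of-the-envelope sketch; the per-expert FFN width is not stated in this issue, so it is an explicit assumption here, as is the use of simple two-matrix MLPs (up and down projections only):

```python
def mlp_active_params(hidden_dim, expert_ffn_dim, top_k, dense_mult=3):
    """Per-token active MLP parameters per layer (biases/norms ignored).

    expert_ffn_dim is an assumption; the issue only fixes the 3x dense
    width, E=64, K=4, and one shared expert.
    """
    dense_layer = 2 * hidden_dim * (dense_mult * hidden_dim)    # leading dense MLP
    moe_layer = (top_k + 1) * 2 * hidden_dim * expert_ffn_dim   # K routed + 1 shared
    return dense_layer, moe_layer

# Example at d=512, K=4, with an assumed per-expert width of 512:
dense, moe = mlp_active_params(512, 512, 4)
print(dense, moe)  # 1572864 (dense) vs 2621440 (MoE active) per layer
```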

Gate 1 runs (2 total)

Dim   Budget     Status
512   2.19e17    pending
768   1.70e18    pending

Decision log

empty

Conclusion

pending
