TL;DR
Test initializing router_bias with nonzero values instead of zeros. Currently the QB router bias starts at zero and is updated via stop-gradient. This tests whether a nonzero init (using truncated normal scaled by initializer_std) helps early training by breaking symmetry in the router.
Parent tracking issue: #4281
User prompt: "implement router bias nonzero init. instead of initializing router_bias to zero, lets try using the cfg.initializer_std. lets try 4 variations with 1*cfg.initializer_std, 4*, 16*, and 64*. So 8 total runs for gate 1."
Scope
- Baseline: router_bias initialized to zeros
- Variant 1x: router_bias ~ truncated_normal(0, 1 * initializer_std)
- Variant 4x: router_bias ~ truncated_normal(0, 4 * initializer_std)
- Variant 16x: router_bias ~ truncated_normal(0, 16 * initializer_std)
- Variant 64x: router_bias ~ truncated_normal(0, 64 * initializer_std)
Experiment grid
| Config | d512 (2.19e17) | d768 (1.70e18) |
|---|---|---|
| bias=0 (baseline) | from compute-optimal sweep | from compute-optimal sweep |
| bias init 1x | gate 1 | gate 1 |
| bias init 4x | gate 1 | gate 1 |
| bias init 16x | gate 1 | gate 1 |
| bias init 64x | gate 1 | gate 1 |
8 new runs total.
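The grid is the cross product of the two model sizes and the four init scales. A quick enumeration sketch (the dict keys and flops values mirror the table; the structure is illustrative, not the actual launcher config):

```python
# Enumerate the 8 gate-1 runs: 2 sizes x 4 bias-init scales.
scales = [1, 4, 16, 64]
sizes = {"d512": 2.19e17, "d768": 1.70e18}  # size -> training FLOPs budget

runs = [
    {"size": size, "flops": flops, "router_bias_init_scale": s}
    for size, flops in sizes.items()
    for s in scales
]
assert len(runs) == 8  # baseline points come from the existing sweep
```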
Metrics
- eval/paloma/macro_loss (final)
- throughput/tokens_per_second (avg last 100 steps)
- throughput/total_tokens (final)
Success criteria
Effective speedup > 1 at both gate 1 scales (per experiments/grug/moe/agent.md).
Decision log
(to be updated)
Negative results index
(to be updated)
Current baseline
router_bias=zeros, compute-optimal sweep (wandb group compute-optimal-sweep).
Conclusion
(pending)