TL;DR
Test initializing router_bias with nonzero values instead of zeros. Currently the QB router bias starts at zero and is updated via stop-gradient. This tests whether a nonzero init (using truncated normal scaled by initializer_std) helps early training by breaking symmetry in the router.
Parent tracking issue: #4281
User prompt: "implement router bias nonzero init. instead of initializing router_bias to zero, lets try using the cfg.initializer_std. lets try 4 variations with 1*cfg.initializer_std, 4*, 16*, and 64*. So 8 total runs for gate 1."
Scope
- Baseline: router_bias initialized to zeros
- Variant 1x: router_bias ~ truncated_normal(0, 1 * initializer_std)
- Variant 4x: router_bias ~ truncated_normal(0, 4 * initializer_std)
- Variant 16x: router_bias ~ truncated_normal(0, 16 * initializer_std)
- Variant 64x: router_bias ~ truncated_normal(0, 64 * initializer_std)
Experiment grid
| Config | d512 (2.19e17) | d768 (1.70e18) |
|---|---|---|
| bias=0 (baseline) | from compute-optimal sweep | from compute-optimal sweep |
| bias init 1x | gate 1 | gate 1 |
| bias init 4x | gate 1 | gate 1 |
| bias init 16x | gate 1 | gate 1 |
| bias init 64x | gate 1 | gate 1 |
8 new runs total.
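The grid is the cross product of the two model sizes and the four init scales. A quick enumeration sketch (the dict keys and flops values mirror the table; the structure is illustrative, not the actual launcher config):

```python
# Enumerate the 8 gate-1 runs: 2 sizes x 4 bias-init scales.
scales = [1, 4, 16, 64]
sizes = {"d512": 2.19e17, "d768": 1.70e18}  # size -> training FLOPs budget

runs = [
    {"size": size, "flops": flops, "router_bias_init_scale": s}
    for size, flops in sizes.items()
    for s in scales
]
assert len(runs) == 8  # baseline points come from the existing sweep
```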
Metrics
- eval/paloma/macro_loss (final)
- throughput/tokens_per_second (avg last 100 steps)
- throughput/total_tokens (final)
Success criteria
Effective speedup > 1 at both gate 1 scales (per experiments/grug/moe/agent.md).
Decision log
(to be updated)
Negative results index
(to be updated)
Current baseline
router_bias=zeros, compute-optimal sweep (wandb group compute-optimal-sweep).
Conclusion
(pending)