Agent MoE Experiment: Router bias nonzero init #4849

@ClassicLarry

Description

TL;DR

Test initializing router_bias with nonzero values instead of zeros. Currently the QB router bias is initialized to zero and updated via a stop-gradient rule. This experiment tests whether a nonzero init (truncated normal scaled by cfg.initializer_std) helps early training by breaking symmetry in the router.
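For context, the stop-gradient bias update can be sketched as follows. This is a hedged illustration, not the repo's actual code: it assumes a DeepSeek-style auxiliary-loss-free balancing rule where each expert's bias is nudged up when the expert is underloaded and down when overloaded, with no gradient flowing through the update. The function name and `update_rate` hyperparameter are hypothetical.

```python
import numpy as np

def update_router_bias(bias, expert_load, update_rate=1e-3):
    # Hypothetical sketch of a stop-gradient bias update: compare each
    # expert's observed load to the mean load and nudge the bias toward
    # balance. Gradients never flow through this path.
    mean_load = expert_load.mean()
    # sign is +1 for underloaded experts, -1 for overloaded ones, 0 at parity
    return bias + update_rate * np.sign(mean_load - expert_load)

# Example: expert 0 overloaded, expert 1 underloaded, experts 2-3 balanced
bias = np.zeros(4)
load = np.array([2.0, 0.0, 1.0, 1.0])
new_bias = update_router_bias(bias, load)
```

The nonzero-init variants below change only the starting point of `bias`; the update rule itself is unchanged.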

Parent tracking issue: #4281

User prompt: "implement router bias nonzero init. instead of initializing router_bias to zero, lets try using the cfg.initializer_std. lets try 4 variations with 1*cfg.initializer_std, 4*, 16*, and 64*. So 8 total runs for gate 1."

Scope

  • Baseline: router_bias initialized to zeros
  • Variant 1x: router_bias ~ truncated_normal(0, 1 * initializer_std)
  • Variant 4x: router_bias ~ truncated_normal(0, 4 * initializer_std)
  • Variant 16x: router_bias ~ truncated_normal(0, 16 * initializer_std)
  • Variant 64x: router_bias ~ truncated_normal(0, 64 * initializer_std)
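The four variants can be sketched as below. This is a minimal numpy illustration, not the training stack's initializer: the resample-at-±2σ truncation, `initializer_std=0.02`, and `num_experts=8` are all assumptions for the example.

```python
import numpy as np

def truncated_normal(rng, std, size, bound=2.0):
    # Draw N(0, std) and resample anything outside ±bound*std, so the
    # distribution's tails are truncated (sketch; the real init may differ).
    x = rng.normal(0.0, std, size)
    mask = np.abs(x) > bound * std
    while mask.any():
        x[mask] = rng.normal(0.0, std, mask.sum())
        mask = np.abs(x) > bound * std
    return x

rng = np.random.default_rng(0)
initializer_std = 0.02   # assumed config value for illustration
num_experts = 8          # assumed

# One bias vector per variant: std = {1, 4, 16, 64} * initializer_std
biases = {f"{m}x": truncated_normal(rng, m * initializer_std, num_experts)
          for m in (1, 4, 16, 64)}
```

The baseline corresponds to `np.zeros(num_experts)` in place of the truncated-normal draw.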

Experiment grid

| Config | d512 (2.19e17) | d768 (1.70e18) |
| --- | --- | --- |
| bias=0 (baseline) | from compute-optimal sweep | from compute-optimal sweep |
| bias init 1x | gate 1 | gate 1 |
| bias init 4x | gate 1 | gate 1 |
| bias init 16x | gate 1 | gate 1 |
| bias init 64x | gate 1 | gate 1 |

8 new runs total.

Metrics

  • eval/paloma/macro_loss (final)
  • throughput/tokens_per_second (avg last 100 steps)
  • throughput/total_tokens (final)
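Aggregating the three metrics from a run's logged history could look like the sketch below. The helper name is hypothetical and `history` is assumed to be a dict of metric key to list of logged values (e.g. as pulled from the wandb API); the metric keys match the list above.

```python
def summarize(history, n_last=100):
    # Hypothetical helper: final loss and token count, plus throughput
    # averaged over the last `n_last` logged steps.
    tail = history["throughput/tokens_per_second"][-n_last:]
    return {
        "eval/paloma/macro_loss": history["eval/paloma/macro_loss"][-1],
        "throughput/tokens_per_second": sum(tail) / len(tail),
        "throughput/total_tokens": history["throughput/total_tokens"][-1],
    }
```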

Success criteria

Effective speedup > 1 at both gate 1 scales (per experiments/grug/moe/agent.md).

Decision log

(to be updated)

Negative results index

(to be updated)

Current baseline

router_bias=zeros, compute-optimal sweep (wandb group compute-optimal-sweep).

Conclusion

(pending)
