Agent MoE Experiment: may_arch + 1pct-noclip + col-norm on GN vs KV (d512/d768) #5767

@ClassicLarry

🤖 ## Original prompt

Let's take the best d512 result and try two variations at both d512 and d768: col-norm on gated_norm, and col-norm on the KV projections.

Recipe

Base: 1pct-noclip (muonh-may-arch-gn-muonh-1pct-noclip-v1-d512 = 3.6427, current d512 best):

  • may_arch architecture (256 experts, PKO every 4th, partial rope, last_layer_pko, no router/logit z-loss)
  • 1% warmup
  • max_grad_norm = None (no clip)
  • token_embed → adamh_embed, output_proj → adamh
  • GN → muonh, KV → muonh (matrices)
  • LR scales 1.0× heuristic

Two variations:

| trial | what's routed through col-norm (NS + col-norm + hyperball) |
|---|---|
| gn-colnorm | all 4 GatedNorm matrices (.w_up, .w_down) |
| kv-colnorm | attn.w_k and attn.w_v only |

Everything else is identical to the 1pct-noclip baseline. 4 runs total (2 variants × 2 widths: d512 and d768).
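A rough sketch of how the 2 × 2 grid might be enumerated. The dictionary keys (`warmup_fraction`, `max_grad_norm`, `variant`, `d_model`) and the `run_name` format are hypothetical and are not taken from the actual launcher; only the variant names, the widths, the 1% warmup, and the no-clip setting come from the recipe above.

```python
from itertools import product

# Settings shared by every run, per the recipe above.
# Key names are hypothetical, chosen for illustration only.
BASE = {"warmup_fraction": 0.01, "max_grad_norm": None}

VARIANTS = ("gn-colnorm", "kv-colnorm")
WIDTHS = (512, 768)


def sweep_runs():
    """Enumerate one config per (variant, width) pair."""
    runs = []
    for variant, d_model in product(VARIANTS, WIDTHS):
        config = dict(BASE, variant=variant, d_model=d_model)
        # Hypothetical naming scheme, not the launcher's real run names.
        config["run_name"] = f"{variant}-d{d_model}"
        runs.append(config)
    return runs


runs = sweep_runs()
for run in runs:
    print(run["run_name"])
```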

Why col-norm is interesting for these

GatedNorm at d512 is (512, 128) — 4:1 aspect; NS produces 128 orthonormal cols (length 512) but row norms spread ±15% around the mean. Col-norm equalizes the rows.
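The row-norm effect can be illustrated with a small NumPy sketch. It uses a QR factorization of a random matrix as a stand-in for the NS output (both give orthonormal columns, but this is not the optimizer's actual update), and `equalize_rows` is a hypothetical reading of the equalization step, not the repository's col-norm implementation.

```python
import numpy as np


def orthonormal_columns(rows, cols, seed=0):
    """Random (rows, cols) matrix with orthonormal columns, via reduced QR."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q


def equalize_rows(m):
    """Hypothetical equalization: rescale every row to the RMS row norm."""
    row_norms = np.linalg.norm(m, axis=1, keepdims=True)
    target = np.sqrt(np.mean(row_norms ** 2))
    return m * (target / row_norms)


q = orthonormal_columns(512, 128)
row_norms = np.linalg.norm(q, axis=1)

# Columns are unit length, so the squared entries sum to 128 and the
# mean squared row norm is 128 / 512 = 0.25. Individual rows still vary.
print("mean squared row norm:", np.mean(row_norms ** 2))
print("min/max row norm relative to mean:",
      row_norms.min() / row_norms.mean(),
      row_norms.max() / row_norms.mean())

balanced = equalize_rows(q)
balanced_norms = np.linalg.norm(balanced, axis=1)
print("row norm after equalization:", balanced_norms.min(), balanced_norms.max())
```

Note that rescaling rows this way makes every row norm equal, but the columns are then only approximately orthonormal.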

K/V at d512 with GQA 4:1 is (512, 128) — exactly the same shape as GatedNorm. So K/V might benefit from the same col-norm treatment that we suspect (per the prior colnorm experiment vs may_arch-default) is mildly helpful.

Branch / launcher

  • Branch: moe_muonh_may_arch_1pct_colnorm_variants (off main)
  • Launcher: experiments/grug/moe/muonh_may_arch_1pct_colnorm_variants_sweep.py
  • Submission target zone: us-central1-a (no --reserve).
.venv/bin/iris --config lib/iris/examples/marin.yaml job run \
  --no-wait --zone us-central1-a \
  -e WANDB_API_KEY "$WANDB_API_KEY" \
  -- python -m experiments.grug.moe.muonh_may_arch_1pct_colnorm_variants_sweep

Mask verified for both variants (smoke test before submission):

gn-colnorm:  GN→muonh_col_norm,  K/V→muonh
kv-colnorm:  GN→muonh,           K/V→muonh_col_norm,  rest unchanged
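A minimal sketch of what such a mask check could look like. The function `optimizer_label`, the example parameter paths, the `gated_norm` path component, and the `"default"` label are all hypothetical; only the group names (`muonh`, `muonh_col_norm`, `adamh`, `adamh_embed`) and the suffixes `.w_up`, `.w_down`, `attn.w_k`, `attn.w_v` come from this issue.

```python
def optimizer_label(param_path, variant):
    """Map a parameter path to an optimizer group for a given variant.

    variant is "baseline", "gn-colnorm", or "kv-colnorm".
    Path conventions here are assumed for illustration.
    """
    is_gn = "gated_norm" in param_path and param_path.endswith((".w_up", ".w_down"))
    is_kv = param_path.endswith(("attn.w_k", "attn.w_v"))
    if is_gn:
        return "muonh_col_norm" if variant == "gn-colnorm" else "muonh"
    if is_kv:
        return "muonh_col_norm" if variant == "kv-colnorm" else "muonh"
    if "token_embed" in param_path:
        return "adamh_embed"
    if "output_proj" in param_path:
        return "adamh"
    # Everything else keeps whatever the baseline assigns.
    return "default"


# Hypothetical parameter paths used only for this smoke test.
paths = [
    "blocks.0.gated_norm.w_up",
    "blocks.0.gated_norm.w_down",
    "blocks.0.attn.w_k",
    "blocks.0.attn.w_v",
    "blocks.0.attn.w_q",
    "token_embed",
    "output_proj",
]

for variant in ("gn-colnorm", "kv-colnorm"):
    labels = {p: optimizer_label(p, variant) for p in paths}
    print(variant, labels)
```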

Comparison anchor at d512: 1pct-noclip baseline = 3.6427.
