🤖 ## Original prompt
Let's take the best d512 result and try two variations at both d512 and d768: col-norm on gated_norm, and col-norm on the KV projections.
Recipe
Base: 1pct-noclip (muonh-may-arch-gn-muonh-1pct-noclip-v1-d512 = 3.6427, current d512 best):
- may_arch architecture (256 experts, PKO every 4th, partial rope, last_layer_pko, no router/logit z-loss)
- 1% warmup
- max_grad_norm = None (no clip)
- token_embed → adamh_embed, output_proj → adamh
- GN → muonh, KV → muonh (matrices)
- LR scales 1.0× heuristic
Two variations:
| trial | what's routed through col-norm (NS + col-norm + hyperball) |
|---|---|
| gn-colnorm | all 4 GatedNorm matrices (.w_up, .w_down) |
| kv-colnorm | attn.w_k and attn.w_v only |
Everything else identical to the 1pct-noclip baseline. 4 runs total (2 variants × d512, d768).
Why col-norm is interesting for these
GatedNorm at d512 is (512, 128) — 4:1 aspect; NS produces 128 orthonormal cols (length 512) but row norms spread ±15% around the mean. Col-norm equalizes the rows.
K/V at d512 with GQA 4:1 is (512, 128) — exactly the same shape as GatedNorm. So K/V might benefit from the same col-norm treatment that we suspect (per the prior colnorm experiment vs may_arch-default) is mildly helpful.
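The row-norm argument above can be checked numerically. The sketch below is illustrative only and is not the optimizer code from this branch: it uses an exact SVD-based orthogonalization as a stand-in for the Newton-Schulz (NS) iteration, and it assumes that "col-norm" amounts to rescaling every row of the (512, 128) update to a common norm, as the text describes. The helper names `orthogonalize` and `equalize_row_norms` are hypothetical.

```python
import numpy as np


def orthogonalize(g):
    """Return the orthonormal-column polar factor of g via SVD.

    Stand-in for Newton-Schulz, which only approximates this factor.
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def equalize_row_norms(w, eps=1e-12):
    """Rescale each row to the RMS row norm (assumed reading of 'col-norm')."""
    row_norms = np.linalg.norm(w, axis=1, keepdims=True)
    target = np.sqrt(np.mean(row_norms ** 2))
    return w * (target / np.maximum(row_norms, eps))


rng = np.random.default_rng(0)

# Same shape as GatedNorm and K/V at d512: 512 rows, 128 orthonormal columns.
update = orthogonalize(rng.standard_normal((512, 128)))

# Columns are orthonormal, but individual row norms still vary.
row_norms = np.linalg.norm(update, axis=1)
rel_spread = (row_norms.max() - row_norms.min()) / row_norms.mean()

# After equalization every row has the same norm.
equalized = equalize_row_norms(update)
eq_row_norms = np.linalg.norm(equalized, axis=1)

print(f"relative row-norm spread before: {rel_spread:.3f}")
print(f"row-norm std after: {eq_row_norms.std():.2e}")
```

Because the squared Frobenius norm of a matrix with 128 orthonormal columns is 128, the mean squared row norm over 512 rows is 128/512 = 0.25, so the common row norm after equalization is 0.5. Note that equalizing the rows means the columns are no longer exactly orthonormal.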
Branch / launcher
- Branch: moe_muonh_may_arch_1pct_colnorm_variants (off main)
- Launcher: experiments/grug/moe/muonh_may_arch_1pct_colnorm_variants_sweep.py
- Submission target zone: us-central1-a (no --reserve).
.venv/bin/iris --config lib/iris/examples/marin.yaml job run \
--no-wait --zone us-central1-a \
-e WANDB_API_KEY "$WANDB_API_KEY" \
-- python -m experiments.grug.moe.muonh_may_arch_1pct_colnorm_variants_sweep
Mask verified for both variants (smoke test before submission):
gn-colnorm: GN→muonh_col_norm, K/V→muonh
kv-colnorm: GN→muonh, K/V→muonh_col_norm, rest unchanged
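A mask check of this kind could look roughly like the sketch below. It is not the launcher's actual code: the `route` function, the dotted parameter paths, and the `"gated_norm"` substring match are hypothetical, and `"baseline"` is a placeholder for whatever group the 1pct-noclip recipe assigns to the remaining parameters. Only the group names (`muonh`, `muonh_col_norm`, `adamh`, `adamh_embed`) and the suffixes (`.w_up`, `.w_down`, `attn.w_k`, `attn.w_v`) come from the description above.

```python
def route(path: str, variant: str) -> str:
    """Map a parameter path to an optimizer group for one variant.

    Hypothetical sketch: the path format and 'gated_norm' match are assumptions.
    """
    if "token_embed" in path:
        return "adamh_embed"
    if "output_proj" in path:
        return "adamh"

    is_gn = "gated_norm" in path and path.endswith((".w_up", ".w_down"))
    is_kv = path.endswith(("attn.w_k", "attn.w_v"))

    if is_gn:
        return "muonh_col_norm" if variant == "gn-colnorm" else "muonh"
    if is_kv:
        return "muonh_col_norm" if variant == "kv-colnorm" else "muonh"

    # Placeholder: everything else keeps its baseline assignment.
    return "baseline"


# Hypothetical parameter paths used only for the smoke test.
paths = [
    "token_embed",
    "blocks.0.gated_norm.w_up",
    "blocks.0.gated_norm.w_down",
    "blocks.0.attn.w_k",
    "blocks.0.attn.w_v",
    "blocks.0.attn.w_q",
    "output_proj",
]

for variant in ("gn-colnorm", "kv-colnorm"):
    print(variant, {p: route(p, variant) for p in paths})
```

Keeping the routing as a pure function of the path and variant name makes it straightforward to assert on the mask before submitting any runs.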
Comparison anchor at d512: 1pct-noclip baseline = 3.6427.