Summary
This issue tracks the smaller good-enough 10T gated-norm ablation that unblocked later great-10T follow-up work. PR #4057 added GatedNorm, a gated_norm_rank config field, model wiring, tests, and a launch script for a roughly 1e19-FLOP baseline-vs-gated-norm comparison. The implementation is complete and validated, but the thread does not yet report experiment results. The current conclusion is therefore procedural rather than scientific: gated norms are now a configurable knob in the MoE stack, and this issue’s remaining job is to run the ablation and judge the outcome.
Helpful links
Description
TL;DR: Run a single roughly 1e19-FLOP comparison for gated norms in the good-enough 10T gate.
Hypothesis or Goal
We want to know whether gated norms belong in the baseline recipe once the rest of the stack is fixed.
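For readers unfamiliar with the idea, here is a minimal sketch of what a low-rank gated norm could look like. It is not the code from PR #4057: the function names, the choice of RMSNorm as the base norm, the sigmoid gate, and the down/up projection shapes are all assumptions made for illustration. The only name taken from this issue is the notion of a rank parameter (gated_norm_rank), represented here by the hypothetical `rank` variable.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Standard RMSNorm over the last axis (assumed base norm).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def gated_norm(x, weight, w_down, w_up, eps=1e-6):
    # Hypothetical gated norm: the normalized output is scaled elementwise
    # by a sigmoid gate computed from a low-rank projection of the input.
    normed = rms_norm(x, weight, eps)
    gate_logits = (x @ w_down) @ w_up          # d_model -> rank -> d_model
    gate = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid
    return normed * gate

rng = np.random.default_rng(0)
d_model, rank = 16, 4  # rank stands in for a gated_norm_rank-style setting
x = rng.normal(size=(2, 3, d_model))
weight = np.ones(d_model)
w_down = rng.normal(size=(d_model, rank)) * 0.02
w_up = np.zeros((rank, d_model))  # zero init: every gate starts at 0.5

baseline = rms_norm(x, weight)
out = gated_norm(x, weight, w_down, w_up)
```

In this sketch the gate adds roughly 2 * d_model * rank parameters per norm, so a small rank keeps the overhead low relative to the baseline arm of the comparison.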
Links
Results