TL;DR
Test replacing the standard residual stream with a weighted sum of per-layer contributions. Instead of the final hidden state, the output before the final norm is λ0 * x0 + Σ(λi * hidden_diff_i), where x0 is the initial post-embedding state and hidden_diff_i is the residual contribution of layer i. Each λi is a unique learnable scalar (so it is visible in wandb for analysis).
Parent tracking issue: #4281
User prompt: "try a feature called layer_grain_prediction. That is, lets save a lambda parameter for each layer, and compute hidden_new = block(hidden); hidden_diff=hidden_new-hidden;hidden=hidden_new;hidden_diffs.append(hidden_diff). then our output before the final norm is sum over [lambda_ihidden_diffs[i]]. I need lambda_i to be a unique parameter for each layer, so I can see the norms/values in wandb automatically. also let the final prediction include x0. so x0_lambdax0 + (everything else)"
Scope
- Baseline: standard residual stream (final hidden state = x0 + sum of all layer residuals)
- Variant: output = λ0 * x0 + Σ(λi * diff_i), with learnable per-layer λi and λ0 (each init 1.0)
All other settings unchanged (E=64, K=4, full gate, same optimizer/schedule).
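A minimal PyTorch sketch of the variant (module and block names here are illustrative, not taken from the actual codebase). Each λ is registered as an individually named `nn.Parameter`, which is what makes the per-layer values show up separately in wandb's parameter logging:

```python
import torch
import torch.nn as nn

class LayerGrainStream(nn.Module):
    """Sketch of layer_grain_prediction: collect each layer's residual
    contribution (hidden_new - hidden) and recombine with learnable scalars."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        # One named scalar per layer plus one for x0, all initialized to 1.0,
        # so each appears as its own entry in model.named_parameters().
        self.lambda_x0 = nn.Parameter(torch.ones(()))
        self.lambdas = nn.ParameterList(
            [nn.Parameter(torch.ones(())) for _ in blocks]
        )

    def forward(self, x0):
        hidden = x0
        diffs = []
        for block in self.blocks:
            hidden_new = block(hidden)         # block includes its own residual add
            diffs.append(hidden_new - hidden)  # this layer's contribution
            hidden = hidden_new
        # Output before the final norm: weighted recombination of contributions.
        return self.lambda_x0 * x0 + sum(
            lam * diff for lam, diff in zip(self.lambdas, diffs)
        )
```

At initialization (all λ = 1) the sum telescopes, x0 + Σ(h_i − h_{i−1}) = h_L, so the variant starts out exactly equivalent to the standard residual stream.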
Experiment grid
| Config | d512 (2.19e17) | d768 (1.70e18) |
| --- | --- | --- |
| standard residual (baseline) | from compute-optimal sweep | from compute-optimal sweep |
| layer-grain prediction | gate 1 | gate 1 |
2 new runs total.
Metrics
- eval/paloma/macro_loss (final)
- throughput/tokens_per_second (avg over last 100 steps)
- throughput/total_tokens (final)
- Per-layer λi values (visible as param norms in wandb)
Success criteria
Effective speedup > 1 at both gate 1 scales (per experiments/grug/moe/agent.md).
Decision log
(to be updated)
Negative results index
(to be updated)
Current baseline
Standard residual stream, compute-optimal sweep (wandb group compute-optimal-sweep).
Conclusion
(pending)