TL;DR
Test replacing the standard residual stream with a weighted sum of per-layer contributions. Instead of the final hidden state, the output before the final norm is λ0 * x0 + Σ(λi * hidden_diff_i), where x0 is the initial post-embedding state and hidden_diff_i is the residual contribution of layer i. Each λi is a unique learnable scalar (so it is visible in wandb for analysis).
Parent tracking issue: #4281
User prompt: "try a feature called layer_grain_prediction. That is, lets save a lambda parameter for each layer, and compute hidden_new = block(hidden); hidden_diff=hidden_new-hidden;hidden=hidden_new;hidden_diffs.append(hidden_diff). then our output before the final norm is sum over [lambda_ihidden_diffs[i]]. I need lambda_i to be a unique parameter for each layer, so I can see the norms/values in wandb automatically. also let the final prediction include x0. so x0_lambdax0 + (everything else)"
Scope
- Baseline: standard residual stream (final hidden state = x0 + sum of all layer residuals)
- Variant: output = λ0 * x0 + Σ(λi * diff_i), with learnable per-layer λi and λ0 (each init 1.0)
All other settings unchanged (E=64, K=4, full gate, same optimizer/schedule).
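A minimal PyTorch sketch of the variant (module and block names here are illustrative, not taken from the actual codebase). Each λ is registered as an individually named `nn.Parameter`, which is what makes the per-layer values show up separately in wandb's parameter logging:

```python
import torch
import torch.nn as nn

class LayerGrainStream(nn.Module):
    """Sketch of layer_grain_prediction: collect each layer's residual
    contribution (hidden_new - hidden) and recombine with learnable scalars."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        # One named scalar per layer plus one for x0, all initialized to 1.0,
        # so each appears as its own entry in model.named_parameters().
        self.lambda_x0 = nn.Parameter(torch.ones(()))
        self.lambdas = nn.ParameterList(
            [nn.Parameter(torch.ones(())) for _ in blocks]
        )

    def forward(self, x0):
        hidden = x0
        diffs = []
        for block in self.blocks:
            hidden_new = block(hidden)         # block includes its own residual add
            diffs.append(hidden_new - hidden)  # this layer's contribution
            hidden = hidden_new
        # Output before the final norm: weighted recombination of contributions.
        return self.lambda_x0 * x0 + sum(
            lam * diff for lam, diff in zip(self.lambdas, diffs)
        )
```

At initialization (all λ = 1) the sum telescopes, x0 + Σ(h_i − h_{i−1}) = h_L, so the variant starts out exactly equivalent to the standard residual stream.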
Experiment grid
| Config | d512 (2.19e17) | d768 (1.70e18) |
| --- | --- | --- |
| standard residual (baseline) | from compute-optimal sweep | from compute-optimal sweep |
| layer-grain prediction | gate 1 | gate 1 |
2 new runs total.
Metrics
- eval/paloma/macro_loss (final)
- throughput/tokens_per_second (avg over last 100 steps)
- throughput/total_tokens (final)
- Per-layer λi values (visible as param norms in wandb)
Success criteria
Effective speedup > 1 at both gate 1 scales (per experiments/grug/moe/agent.md).
Decision log
(to be updated)
Negative results index
(to be updated)
Current baseline
Standard residual stream, compute-optimal sweep (wandb group compute-optimal-sweep).
Conclusion
(pending)