Agent MoE Experiment: Layer-grain prediction (weighted sum of per-layer diffs) #4807

@ClassicLarry

TL;DR

Test replacing the standard residual-stream readout with a weighted sum of per-layer contributions. Instead of the final hidden state, the output before the final norm is λ0 * x0 + Σ(λi * hidden_diff_i), where x0 is the initial post-embedding state and hidden_diff_i is the residual contribution of layer i. Each λi is a unique learnable scalar, so its value can be inspected per layer in wandb.
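
For reference, the relation to the baseline can be written out explicitly. The symbols h_i (hidden state after layer i) and L (number of layers) are notation introduced here for illustration; they do not appear in the original prompt.

$$
h_0 = x_0, \qquad \mathrm{hidden\_diff}_i = h_i - h_{i-1}, \qquad h_L = x_0 + \sum_{i=1}^{L} \mathrm{hidden\_diff}_i
$$

$$
\mathrm{out} = \lambda_0 \, x_0 + \sum_{i=1}^{L} \lambda_i \, \mathrm{hidden\_diff}_i
$$

With every λ equal to 1 the sum telescopes and out = h_L, so the variant matches the baseline's pre-norm output at initialization and only departs from it as the λ values move during training.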

Parent tracking issue: #4281

User prompt: "try a feature called layer_grain_prediction. That is, lets save a lambda parameter for each layer, and compute hidden_new = block(hidden); hidden_diff=hidden_new-hidden;hidden=hidden_new;hidden_diffs.append(hidden_diff). then our output before the final norm is sum over [lambda_i*hidden_diffs[i]]. I need lambda_i to be a unique parameter for each layer, so I can see the norms/values in wandb automatically. also let the final prediction include x0. so x0_lambda*x0 + (everything else)"

Scope

  • Baseline: standard residual stream (final hidden state = x0 + sum of all layer residuals)
  • Variant: output = λ0 * x0 + Σ(λi * diff_i) with learnable per-layer λi (init 1.0) and λ0 (init 1.0)
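
The variant can be sketched in a few lines. This is a framework-free illustration in plain Python, not the code used in the runs: lists of floats stand in for hidden-state tensors, plain floats stand in for learnable scalars, and the names `layer_grain_output`, `init_lambdas`, `make_block`, and the `lambda_{i}` keys are hypothetical. The loop follows the user prompt: the blocks still see the ordinary residual stream (`hidden = hidden_new`), and only the readout is reweighted.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def init_lambdas(num_layers: int) -> Dict[str, float]:
    """One separately named scalar per layer, plus lambda_0 for x0, all set to 1.0."""
    return {f"lambda_{i}": 1.0 for i in range(num_layers + 1)}


def layer_grain_output(
    x0: Vector, blocks: Sequence[Block], lambdas: Dict[str, float]
) -> Vector:
    """Return lambda_0 * x0 + sum_i lambda_i * hidden_diff_i (the pre-final-norm output)."""
    hidden = x0
    # Start from the scaled post-embedding state.
    out = [lambdas["lambda_0"] * v for v in x0]
    for i, block in enumerate(blocks, start=1):
        hidden_new = block(hidden)
        # Residual contribution of layer i.
        hidden_diff = [n - h for n, h in zip(hidden_new, hidden)]
        # Later layers still consume the unweighted residual stream.
        hidden = hidden_new
        lam = lambdas[f"lambda_{i}"]
        out = [o + lam * d for o, d in zip(out, hidden_diff)]
    return out


def make_block(scale: float) -> Block:
    """Toy residual block (hypothetical stand-in for a transformer layer)."""
    return lambda h: [v + scale * v * v for v in h]


blocks = [make_block(0.1), make_block(-0.2), make_block(0.05)]
x0 = [1.0, -2.0, 0.5]

# Baseline: final hidden state of the plain residual stream.
baseline = x0
for block in blocks:
    baseline = block(baseline)

# Variant with every lambda at its initial value of 1.0.
variant_at_init = layer_grain_output(x0, blocks, init_lambdas(len(blocks)))
```

At initialization `variant_at_init` agrees with `baseline` up to floating-point rounding, because the per-layer differences telescope. Setting a single entry such as `lambda_3` to 0.0 removes only that layer's contribution from the readout, which is what makes the learned values interpretable per layer.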

All other settings unchanged (E=64, K=4, full gate, same optimizer/schedule).

Experiment grid

| Config | d512 (2.19e17) | d768 (1.70e18) |
|---|---|---|
| standard residual (baseline) | from compute-optimal sweep | from compute-optimal sweep |
| layer-grain prediction | gate 1 | gate 1 |

2 new runs total.

Metrics

  • eval/paloma/macro_loss (final)
  • throughput/tokens_per_second (avg last 100 steps)
  • throughput/total_tokens (final)
  • Per-layer λi values (visible as param norms in wandb)
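
Two of these metrics are simple enough to sketch. The helper names and metric keys below are hypothetical and no logging library is called; the point is only to show the "average over the last 100 steps" computation and the fact that, for a scalar parameter, the norm equals the absolute value, so the sign of λi is visible only if the raw value is recorded as well.

```python
from typing import Dict, Sequence


def mean_of_last(values: Sequence[float], window: int = 100) -> float:
    """Average of the last `window` entries (or of all entries if fewer exist)."""
    tail = list(values)[-window:]
    if not tail:
        raise ValueError("no values to average")
    return sum(tail) / len(tail)


def lambda_metrics(lambdas: Dict[str, float]) -> Dict[str, float]:
    """Flatten per-layer scalars into separately named metrics.

    The norm of a scalar is |lambda|, so the value is recorded alongside it
    to keep the sign.
    """
    metrics: Dict[str, float] = {}
    for name, value in lambdas.items():
        metrics[f"lambda/{name}/value"] = value
        metrics[f"lambda/{name}/norm"] = abs(value)
    return metrics
```

If only norms are logged, a layer whose λi has turned negative would look the same as one with a positive weight of equal magnitude.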

Success criteria

Effective speedup > 1 at both gate 1 scales (per experiments/grug/moe/agent.md).

Decision log

(to be updated)

Negative results index

(to be updated)

Current baseline

Standard residual stream, compute-optimal sweep (wandb group compute-optimal-sweep).

Conclusion

(pending)
