
Conversation


Copilot AI commented Feb 5, 2026

  • Add claude variant to InitMethod enum with appropriate docstring
  • Implement init_embeddings() for claude: std = 1.0, normal distribution
  • Implement init_attention() for claude: Q/K/V with 1/√d_model, w_out with 1/√d_model
  • Implement init_feed_forward() for claude: w1/w3 with 1/√d_model, w2 with 1/√hidden_size
  • Implement init_final_w_out() for claude: std = 1/√d_model
  • Implement init_feed_forward_moe() for claude: apply fan-in principle to MoE weights
  • Create tests to validate the claude init method
  • Run tests to ensure correctness
  • Run linters and type checks
  • Address code review comments
  • Use actual weight matrix shapes instead of assuming d_model for attention layers
Original prompt

Background

The current OLMo 3 models use a flat init_std=0.02 for all weight matrices (the normal InitMethod). There's also an experimental "Dirk" init style in src/scripts/train/ladder/2026Q1/init_style_ladder.py that sets init_std = sqrt(1 / d_model) and embedding_init_std = 1.0 on the TransformerConfig.

The "Dirk" init is a flat approximation to variance-preserving initialization: it uses the same std for every layer. We want to add a new init method called claude that takes this further by using a per-layer std of 1/√d_in for each weight matrix, where d_in is the fan-in of that specific layer.
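For reference, a minimal sketch of what the "Dirk" style amounts to, using the field names from the description above (illustrative only; the actual ladder script may wire this up differently):

```python
import math

d_model = 4096  # example model width

# Illustrative "Dirk" settings: one global std plus std=1.0 embeddings.
init_std = math.sqrt(1 / d_model)   # ≈ 0.0156, instead of the flat 0.02
embedding_init_std = 1.0
```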

What to implement

Add a new InitMethod variant called claude in src/olmo_core/nn/transformer/init.py, modeled after the existing init methods (normal, llama, llama_depth, normalized). The claude init method should:

  1. Embeddings: Initialize with std = 1.0 (similar to the Dirk init's embedding_init_std = 1.0). Use normal distribution (not truncated).

  2. Linear layers (_init_linear): Use std = 1/√d_in where d_in is the fan-in (number of input features) of the linear layer. Use truncated normal with bounds [-3*std, 3*std], same as the existing _init_linear method.

  3. Attention Q/K/V projections: std = 1/√d_model (since d_in = d_model for these projections).

  4. Attention output projection (w_out): the fan-in per head is d_head, but the fused weight matrix concatenates all heads, so its actual fan-in is n_heads * d_head (equal to d_model in the standard case). Use std = 1/√d_model for the fused weight matrix, i.e. 1/√(fan-in of the weight as stored).

  5. Feed-forward w1 and w3: std = 1/√d_model (d_in = d_model).

  6. Feed-forward w2: std = 1/√hidden_size where hidden_size is the FFN intermediate dimension. This can be inferred from m.w2.in_features or the weight shape.

  7. Final LM head (w_out): std = 1/√d_model.

  8. MoE layers: Apply the same principle — use 1/√d_in based on each weight's fan-in dimension.

The key difference from the existing normal init (flat 0.02) and from the "Dirk" init (flat 1/√d_model) is that claude should use the actual fan-in of each specific weight matrix rather than a single global std. For example, the FFN down-projection w2 has d_in = hidden_size which is much larger than d_model, so it should get a smaller std than the up-projections.
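In other words, the rule can be expressed as a small helper over each linear layer. A minimal sketch (helper name and sizes below are illustrative, not from the repo):

```python
import torch.nn as nn

def claude_std(w: nn.Linear) -> float:
    """Per-layer std under the proposed claude init: 1/sqrt(fan-in) of this specific weight."""
    return w.in_features ** -0.5

# Example with d_model = 4096 and FFN hidden_size = 11008 (illustrative sizes):
w1 = nn.Linear(4096, 11008, bias=False)   # up-projection:   std = 4096**-0.5  ≈ 0.0156
w2 = nn.Linear(11008, 4096, bias=False)   # down-projection: std = 11008**-0.5 ≈ 0.0095 (smaller)
print(claude_std(w1), claude_std(w2))
```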

Implementation approach

Looking at the existing InitMethod enum methods:

  • init_embeddings(): When self == InitMethod.claude, use nn.init.normal_ with std=1.0 (not dependent on the std parameter passed in).
  • init_attention(): When self == InitMethod.claude, compute the appropriate std from d_model for Q/K/V, and from the w_out weight's in_features for the output projection.
  • init_feed_forward(): When self == InitMethod.claude, use 1/√(m.w1.in_features) for w1, 1/√(m.w3.in_features) for w3, and 1/√(m.w2.in_features) for w2.
  • init_final_w_out(): When self == InitMethod.claude, use std = 1/√d_model (same as d_model**-0.5).
  • init_feed_forward_moe(): Apply the same fan-in principle to MoE weight matrices.

Make sure to add an appropriate docstring for the new claude variant in the InitMethod enum.
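As a rough sketch, the claude branch inside init_feed_forward() could look like the following. The method signature and the _init_linear helper are assumed from the description of the existing code, so details may differ:

```python
def init_feed_forward(self, m, *, d_model, std, generator=None):
    for w in (m.w1, m.w2, m.w3):
        if self == InitMethod.claude:
            # Per-layer fan-in: w1/w3 see d_model inputs, w2 sees the FFN hidden size.
            std = w.in_features ** -0.5
        # Truncated normal in [-3*std, 3*std], as in the existing _init_linear.
        self._init_linear(w, std=std, generator=generator)
```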

This pull request was created from Copilot chat.



Copilot AI and others added 3 commits February 5, 2026 21:34
Copilot AI changed the title from "[WIP] Add new claude initialization method for OLMo models" to "Add claude init method with per-layer fan-in initialization" on Feb 5, 2026
Copilot AI requested a review from dirkgr February 5, 2026 21:44
dirkgr (Member) left a comment:

This only looks at the input dimension, but Xavier/Kaiming also takes into account the output dimension to keep the gradients in the window. Should we do this?

Comment on lines 131 to 132
# For claude, Q/K/V use 1/√d_model (ignores base std parameter)
std = d_model**-0.5
dirkgr (Member):

Don't assume that Q/K/V projections are square. They don't have to be. Input always has to be d_model, but output does not.

Actually, maybe that means it's fine then.

dirkgr (Member):

Though it seems like this would be more robust if it just looked at the shape of the matrix. Is that available here?

Copilot AI (Author):

Good point! I've updated the implementation to use the actual weight matrix shapes via w.in_features instead of assuming d_model. This is now consistent with how we handle feed-forward layers and properly handles cases where projections have different dimensions (e.g., GQA with different numbers of Q vs KV heads, or models with explicit head_dim that differs from d_model // n_heads). Commit: b76fa1a
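Concretely, reading the fan-in off the weight shapes means something like this (the attribute names w_q/w_k/w_v/w_out and the GQA sizes below are illustrative):

```python
import torch.nn as nn

d_model, n_heads, n_kv_heads, head_dim = 4096, 32, 8, 128       # example GQA shapes
w_q   = nn.Linear(d_model, n_heads * head_dim, bias=False)      # 4096 -> 4096
w_k   = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)   # 4096 -> 1024
w_v   = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)   # 4096 -> 1024
w_out = nn.Linear(n_heads * head_dim, d_model, bias=False)      # 4096 -> 4096

for w in (w_q, w_k, w_v, w_out):
    std = w.in_features ** -0.5  # no assumption that the projection is square
    # std would then be fed into the truncated-normal init as before
```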


Copilot AI commented Feb 5, 2026

This only looks at the input dimension, but Xavier/Kaiming also takes into account the output dimension to keep the gradients in the window. Should we do this?

The original spec calls for std = 1/√d_in (fan-in only), which matches Kaiming initialization for linear activations. Xavier uses sqrt(2 / (fan_in + fan_out)) which considers both dimensions.

For transformers with residual connections, fan-in-only initialization (like Kaiming) is typically preferred because:

  1. It preserves variance in the forward pass
  2. The residual connections help with gradient flow, reducing the need for symmetric initialization
  3. It's simpler and matches the "Dirk" init approach mentioned in the spec

That said, if you'd prefer Xavier-style initialization that considers both fan-in and fan-out, I can update it to use std = sqrt(2 / (d_in + d_out)) or std = sqrt(1 / (d_in + d_out)) for linear activations. Let me know!
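For a concrete comparison, with an illustrative down-projection of d_in = 11008 and d_out = 4096:

```python
d_in, d_out = 11008, 4096                    # illustrative FFN down-projection shape
kaiming_fan_in = d_in ** -0.5                # ≈ 0.0095, what the spec asks for
xavier_glorot = (2 / (d_in + d_out)) ** 0.5  # ≈ 0.0115, considers both dimensions
print(kaiming_fan_in, xavier_glorot)
```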

Copilot AI requested a review from dirkgr February 5, 2026 22:22
dirkgr (Member) commented Feb 10, 2026

  • does this init the gate projection?
  • I think there is something wrong in conjunction with embedding_scaling

# Compute std for w3 initialization
if self == InitMethod.claude:
    # For claude, w3 uses 1/√d_in where d_in = d_model
    std = m.w3.in_features**-0.5
dirkgr (Member):

@YashasSamaga, gate is initialized here.
