
Conversation


Copilot AI commented Feb 5, 2026

  • Add claude variant to InitMethod enum with appropriate docstring
  • Implement init_embeddings() for claude: std = 1.0, normal distribution
  • Implement init_attention() for claude: Q/K/V with 1/√d_model, w_out with 1/√d_model
  • Implement init_feed_forward() for claude: w1/w3 with 1/√d_model, w2 with 1/√hidden_size
  • Implement init_final_w_out() for claude: std = 1/√d_model
  • Implement init_feed_forward_moe() for claude: apply fan-in principle to MoE weights
  • Create tests to validate the claude init method
  • Run tests to ensure correctness
  • Run linters and type checks
  • Address code review comments
  • Use actual weight matrix shapes instead of assuming d_model for attention layers
Original prompt

Background

The current OLMo 3 models use a flat init_std=0.02 for all weight matrices (the normal InitMethod). There's also an experimental "Dirk" init style in src/scripts/train/ladder/2026Q1/init_style_ladder.py that sets init_std = sqrt(1 / d_model) and embedding_init_std = 1.0 on the TransformerConfig.

The "Dirk" init is a flat approximation to variance-preserving initialization: it uses the same std for every layer. We want to add a new init method called claude that takes this further by using a per-layer std of 1/√d_in for each weight matrix, where d_in is the fan-in of that specific layer.
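For reference, a minimal sketch of what the "Dirk" style amounts to, using the field names from the description above (illustrative only; the actual ladder script may wire this up differently):

```python
import math

d_model = 4096  # example model width

# Illustrative "Dirk" settings: one global std plus std=1.0 embeddings.
init_std = math.sqrt(1 / d_model)   # ≈ 0.0156, instead of the flat 0.02
embedding_init_std = 1.0
```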

What to implement

Add a new InitMethod variant called claude in src/olmo_core/nn/transformer/init.py, modeled after the existing init methods (normal, llama, llama_depth, normalized). The claude init method should:

  1. Embeddings: Initialize with std = 1.0 (similar to the Dirk init's embedding_init_std = 1.0). Use normal distribution (not truncated).

  2. Linear layers (_init_linear): Use std = 1/√d_in where d_in is the fan-in (number of input features) of the linear layer. Use truncated normal with bounds [-3*std, 3*std], same as the existing _init_linear method.

  3. Attention Q/K/V projections: std = 1/√d_model (since d_in = d_model for these projections).

  4. Attention output projection (w_out): the fan-in per head is d_head, but the fused weight matrix concatenates all heads, so its actual fan-in is n_heads * d_head (equal to d_model in the standard case). Use std = 1/√d_model for the fused weight matrix, i.e. 1/√(fan-in of the weight as stored).

  5. Feed-forward w1 and w3: std = 1/√d_model (d_in = d_model).

  6. Feed-forward w2: std = 1/√hidden_size where hidden_size is the FFN intermediate dimension. This can be inferred from m.w2.in_features or the weight shape.

  7. Final LM head (w_out): std = 1/√d_model.

  8. MoE layers: Apply the same principle — use 1/√d_in based on each weight's fan-in dimension.

The key difference from the existing normal init (flat 0.02) and from the "Dirk" init (flat 1/√d_model) is that claude should use the actual fan-in of each specific weight matrix rather than a single global std. For example, the FFN down-projection w2 has d_in = hidden_size which is much larger than d_model, so it should get a smaller std than the up-projections.
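In other words, the rule can be expressed as a small helper over each linear layer. A minimal sketch (helper name and sizes below are illustrative, not from the repo):

```python
import torch.nn as nn

def claude_std(w: nn.Linear) -> float:
    """Per-layer std under the proposed claude init: 1/sqrt(fan-in) of this specific weight."""
    return w.in_features ** -0.5

# Example with d_model = 4096 and FFN hidden_size = 11008 (illustrative sizes):
w1 = nn.Linear(4096, 11008, bias=False)   # up-projection:   std = 4096**-0.5  ≈ 0.0156
w2 = nn.Linear(11008, 4096, bias=False)   # down-projection: std = 11008**-0.5 ≈ 0.0095 (smaller)
print(claude_std(w1), claude_std(w2))
```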

Implementation approach

Looking at the existing InitMethod enum methods:

  • init_embeddings(): When self == InitMethod.claude, use nn.init.normal_ with std=1.0 (not dependent on the std parameter passed in).
  • init_attention(): When self == InitMethod.claude, compute the appropriate std from d_model for Q/K/V, and from the w_out weight's in_features for the output projection.
  • init_feed_forward(): When self == InitMethod.claude, use 1/√(m.w1.in_features) for w1, 1/√(m.w3.in_features) for w3, and 1/√(m.w2.in_features) for w2.
  • init_final_w_out(): When self == InitMethod.claude, use std = 1/√d_model (same as d_model**-0.5).
  • init_feed_forward_moe(): Apply the same fan-in principle to MoE weight matrices.

Make sure to add an appropriate docstring for the new claude variant in the InitMethod enum.
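As a rough sketch, the claude branch inside init_feed_forward() could look like the following. The method signature and the _init_linear helper are assumed from the description of the existing code, so details may differ:

```python
def init_feed_forward(self, m, *, d_model, std, generator=None):
    for w in (m.w1, m.w2, m.w3):
        if self == InitMethod.claude:
            # Per-layer fan-in: w1/w3 see d_model inputs, w2 sees the FFN hidden size.
            std = w.in_features ** -0.5
        # Truncated normal in [-3*std, 3*std], as in the existing _init_linear.
        self._init_linear(w, std=std, generator=generator)
```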

This pull request was created from Copilot chat.



Copilot AI and others added 3 commits February 5, 2026 21:34
Copilot AI changed the title from "[WIP] Add new claude initialization method for OLMo models" to "Add claude init method with per-layer fan-in initialization" on Feb 5, 2026
Copilot AI requested a review from dirkgr February 5, 2026 21:44
dirkgr (Member) left a comment:

This only looks at the input dimension, but Xavier/Kaiming also takes into account the output dimension to keep the gradients in the window. Should we do this?

Comment on lines 131 to 132
# For claude, Q/K/V use 1/√d_model (ignores base std parameter)
std = d_model**-0.5
dirkgr (Member):

Don't assume that Q/K/V projections are square. They don't have to be. Input always has to be d_model, but output does not.

Actually, maybe that means it's fine then.

dirkgr (Member):

Though it seems like this would be more robust if it just looked at the shape of the matrix. Is that available here?

Copilot AI (Author):

Good point! I've updated the implementation to use the actual weight matrix shapes via w.in_features instead of assuming d_model. This is now consistent with how we handle feed-forward layers and properly handles cases where projections have different dimensions (e.g., GQA with different numbers of Q vs KV heads, or models with explicit head_dim that differs from d_model // n_heads). Commit: b76fa1a
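Concretely, reading the fan-in off the weight shapes means something like this (the attribute names w_q/w_k/w_v/w_out and the GQA sizes below are illustrative):

```python
import torch.nn as nn

d_model, n_heads, n_kv_heads, head_dim = 4096, 32, 8, 128       # example GQA shapes
w_q   = nn.Linear(d_model, n_heads * head_dim, bias=False)      # 4096 -> 4096
w_k   = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)   # 4096 -> 1024
w_v   = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)   # 4096 -> 1024
w_out = nn.Linear(n_heads * head_dim, d_model, bias=False)      # 4096 -> 4096

for w in (w_q, w_k, w_v, w_out):
    std = w.in_features ** -0.5  # no assumption that the projection is square
    # std would then be fed into the truncated-normal init as before
```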


Copilot AI commented Feb 5, 2026

This only looks at the input dimension, but Xavier/Kaiming also takes into account the output dimension to keep the gradients in the window. Should we do this?

The original spec calls for std = 1/√d_in (fan-in only), which matches Kaiming initialization for linear activations. Xavier uses sqrt(2 / (fan_in + fan_out)) which considers both dimensions.

For transformers with residual connections, fan-in-only initialization (like Kaiming) is typically preferred because:

  1. It preserves variance in the forward pass
  2. The residual connections help with gradient flow, reducing the need for symmetric initialization
  3. It's simpler and matches the "Dirk" init approach mentioned in the spec

That said, if you'd prefer Xavier-style initialization that considers both fan-in and fan-out, I can update it to use std = sqrt(2 / (d_in + d_out)) or std = sqrt(1 / (d_in + d_out)) for linear activations. Let me know!
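For a concrete comparison, with an illustrative down-projection of d_in = 11008 and d_out = 4096:

```python
d_in, d_out = 11008, 4096                    # illustrative FFN down-projection shape
kaiming_fan_in = d_in ** -0.5                # ≈ 0.0095, what the spec asks for
xavier_glorot = (2 / (d_in + d_out)) ** 0.5  # ≈ 0.0115, considers both dimensions
print(kaiming_fan_in, xavier_glorot)
```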

Copilot AI requested a review from dirkgr February 5, 2026 22:22
dirkgr (Member) commented Feb 10, 2026

  • does this init the gate projection?
  • I think there is something wrong in conjunction with embedding_scaling

# Compute std for w3 initialization
if self == InitMethod.claude:
    # For claude, w3 uses 1/√d_in where d_in = d_model
    std = m.w3.in_features**-0.5
dirkgr (Member):

@YashasSamaga, gate is initialized here.
