Add claude init method with per-layer fan-in initialization #598
base: main
Conversation
dirkgr left a comment
This only looks at the input dimension, but Xavier/Kaiming also takes into account the output dimension to keep the gradients in the window. Should we do this?
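For reference, a rough sketch (not from the PR; dimensions are illustrative) of how the two scaling rules the comment contrasts differ on a non-square projection:

import math

def fan_in_std(fan_in: int) -> float:
    # Fan-in-only rule: keeps forward activation variance roughly constant.
    return fan_in ** -0.5

def xavier_std(fan_in: int, fan_out: int) -> float:
    # Xavier/Glorot rule: balances forward and backward variance.
    return math.sqrt(2.0 / (fan_in + fan_out))

# e.g. a GQA key/value projection: d_model=4096 in, 1024 out
print(fan_in_std(4096))        # ≈ 0.0156
print(xavier_std(4096, 1024))  # ≈ 0.0198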
src/olmo_core/nn/transformer/init.py
Outdated
# For claude, Q/K/V use 1/√d_model (ignores base std parameter)
std = d_model**-0.5
Don't assume that Q/K/V projections are square. They don't have to be. Input always has to be d_model, but output does not.
Actually, maybe that means it's fine then.
Though it seems like this would be more robust if it just looked at the shape of the matrix. Is that available here?
Good point! I've updated the implementation to use the actual weight matrix shapes via w.in_features instead of assuming d_model. This is now consistent with how we handle feed-forward layers and properly handles cases where projections have different dimensions (e.g., GQA with different numbers of Q vs KV heads, or models with explicit head_dim that differs from d_model // n_heads). Commit: b76fa1a
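A minimal sketch of the shape-based approach described above, assuming standard PyTorch nn.Linear projections (the helper name and the attribute names in the usage comment are illustrative, not the actual olmo_core API):

import torch.nn as nn

def init_fan_in_(w: nn.Linear) -> None:
    # Read fan-in from the weight itself, so non-square projections
    # (e.g. GQA K/V with fewer heads) still get std = 1/sqrt(in_features).
    std = w.in_features ** -0.5
    nn.init.normal_(w.weight, mean=0.0, std=std)
    if w.bias is not None:
        nn.init.zeros_(w.bias)

# e.g. for the attention projections, whatever their output widths:
# for proj in (attention.w_q, attention.w_k, attention.w_v):
#     init_fan_in_(proj)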
The original spec calls for fan-in-only initialization, which (like Kaiming) is typically preferred for transformers with residual connections.
That said, if you'd prefer Xavier-style initialization that considers both fan-in and fan-out, I can update it to use that instead.
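As a quick illustration of the fan-in argument (a toy check, not part of the PR): with std = 1/sqrt(fan_in), the output variance of a linear layer stays near the input variance regardless of the output width.

import torch

torch.manual_seed(0)
fan_in, fan_out = 4096, 1024          # deliberately non-square
x = torch.randn(8192, fan_in)         # unit-variance inputs
w = torch.randn(fan_out, fan_in) * fan_in ** -0.5
y = x @ w.t()
print(round(x.var().item(), 3), round(y.var().item(), 3))  # both ≈ 1.0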
# Compute std for w3 initialization
if self == InitMethod.claude:
    # For claude, w3 uses 1/√d_in where d_in = d_model
    std = m.w3.in_features**-0.5
@YashasSamaga, the gate is initialized here.
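Building on the diff above, a hypothetical sketch of the claude branch for a SwiGLU-style feed-forward block (w1/w2/w3 follow the snippet's naming; the real init_feed_forward() in the PR may differ):

import torch.nn as nn

def init_feed_forward_claude_(m) -> None:
    # Every projection uses 1/sqrt(fan_in) of its own input dimension:
    # w1/w3 see d_model, w2 sees the hidden size.
    for w in (m.w1, m.w3, m.w2):
        std = w.in_features ** -0.5
        nn.init.normal_(w.weight, mean=0.0, std=std)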
- Add claude variant to InitMethod enum with appropriate docstring
- init_embeddings() for claude: std = 1.0, normal distribution
- init_attention() for claude: Q/K/V with 1/√d_model, w_out with 1/√d_model
- init_feed_forward() for claude: w1/w3 with 1/√d_model, w2 with 1/√hidden_size
- init_final_w_out() for claude: std = 1/√d_model
- init_feed_forward_moe() for claude: apply fan-in principle to MoE weights
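Putting the items above together, a compressed sketch of how the variant could dispatch (names and structure are assumed, not the actual olmo_core code):

from enum import Enum
import torch.nn as nn

class InitMethod(str, Enum):
    normal = "normal"
    claude = "claude"  # per-layer fan-in initialization

def init_linear_(method: InitMethod, w: nn.Linear, base_std: float) -> None:
    if method == InitMethod.claude:
        std = w.in_features ** -0.5  # fan-in rule; ignores base_std
    else:
        std = base_std
    nn.init.normal_(w.weight, mean=0.0, std=std)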
Original prompt
This pull request was created from Copilot chat.