[QST] Clarification on `mup_width_multiplier`, `base width`, and `init_std` Values #8

Hi, I have a couple of questions regarding the variable `mup_width_multiplier` and related initialization choices in the code:

Regarding `mup_width_multiplier`:
Based on the paper and code comments, it seems `mup_width_multiplier` should always equal the model width (`n_embd`) divided by the base width, where the base width is 256. However, this multiplier is only set in the `.sh` training scripts rather than being computed in the code as `n_embd / 256`. Is there a specific reason for this approach? Are there corner cases or other considerations that require setting this value outside the main code?
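To make the question concrete, here is a minimal sketch of what I had in mind. The names `BASE_WIDTH` and `compute_width_multiplier` are hypothetical and do not come from the repository; the sketch only assumes that the multiplier is meant to be the ratio of `n_embd` to a base width of 256.

```python
BASE_WIDTH = 256  # base width as I understand it from the paper and code comments


def compute_width_multiplier(n_embd: int, base_width: int = BASE_WIDTH) -> float:
    """Hypothetical helper: derive the multiplier from the model width
    instead of passing it in from the .sh training script."""
    if n_embd <= 0 or base_width <= 0:
        raise ValueError("n_embd and base_width must be positive")
    return n_embd / base_width


# Example widths (illustrative only): the multiplier is 1 at the base width
# and grows linearly with n_embd.
for width in (256, 512, 1024):
    print(width, compute_width_multiplier(width))
```

If the value is deliberately left as a free hyperparameter (for example, to allow a different base width per experiment), a helper like this would hide that choice, which may be the reason it lives in the scripts.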

On the selection of `base` width and `init_std`:
The values for the base width and the base standard deviation `init_std` (which is always set to 0.02) seem somewhat arbitrary. From my understanding, to keep a layer's output at roughly unit scale for unit-scale inputs, `init_std` should be `1/sqrt(base_width)`, which is `1/sqrt(256) = 0.0625` in this case. Could you clarify the reasoning behind choosing 0.02 for `init_std`, and behind the specific value of the base width?
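The arithmetic behind this question can be sketched as follows. It assumes a plain linear layer with `fan_in = base_width = 256`, zero-mean i.i.d. weights, and unit-variance inputs, so that the output standard deviation is `weight_std * sqrt(fan_in)`. The function name `output_std` is hypothetical and only used for illustration.

```python
import math


def output_std(fan_in: int, weight_std: float) -> float:
    """Std of one output unit of a linear layer, assuming i.i.d. zero-mean
    weights with std `weight_std` and independent unit-variance inputs."""
    return weight_std * math.sqrt(fan_in)


base_width = 256
unit_scale_std = 1 / math.sqrt(base_width)  # 1/16 = 0.0625
init_std = 0.02                             # value used in the code

print(unit_scale_std)                        # 0.0625
print(output_std(base_width, unit_scale_std))  # 1.0
# With init_std = 0.02 the output scale at the base width is about 0.32,
# i.e. roughly 3x smaller than the unit-scale choice.
print(output_std(base_width, init_std))
```

So under these assumptions 0.02 does not give unit-scale outputs at width 256, which is why I am asking whether it was chosen for another reason.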

Any clarification on these points would be greatly appreciated!
