[QST] Clarification on mup_width_multiplier, base width, and init_std Values #8

@Yanksi

Hi, I have a couple of questions regarding the variable mup_width_multiplier and related initialization choices in the code:

Regarding mup_width_multiplier:
Based on the paper and the code comments, mup_width_multiplier should always equal the width divided by the base width, where the base width is 256. However, the multiplier is only set in the .sh training scripts rather than being computed directly as n_embd / 256 in the code. Is there a specific reason for this approach? Are there corner cases that require setting this value outside the main code?
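To make the question concrete, here is a minimal sketch of the in-code computation I had in mind. The function name and the base_width default are my own illustration (assumptions), not taken from the repository:

```python
def compute_mup_width_multiplier(n_embd: int, base_width: int = 256) -> float:
    """Return width / base_width, the ratio I would expect the code to derive.

    Hypothetical helper: the name and the default of 256 are assumptions
    made for this question, not the repository's actual API.
    """
    if base_width <= 0:
        raise ValueError("base_width must be positive")
    return n_embd / base_width


# A model at the base width gets a multiplier of 1; a 4x wider model gets 4.
print(compute_mup_width_multiplier(256))   # 1.0
print(compute_mup_width_multiplier(1024))  # 4.0
```

If something like this were computed from n_embd, the training scripts would not need to pass the multiplier separately, which is why I am asking whether the external flag is intentional.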

On the selection of base width and init_std:
The base width and the base standard deviation init_std (always set to 0.02) seem somewhat arbitrary. As I understand it, to keep a layer's output at roughly unit scale, init_std should be 1/sqrt(base_width), which is 0.0625 for a base width of 256. Could you clarify the reasoning behind choosing 0.02 for init_std and 256 for the base width?
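For reference, this is the arithmetic behind my 0.0625 figure, as a small sketch. It assumes a plain linear layer with independent, zero-mean weights and independent unit-variance inputs, so that the output standard deviation is init_std * sqrt(fan_in). The helper names are hypothetical:

```python
import math


def unit_scale_init_std(fan_in: int) -> float:
    # Std that keeps the output variance at 1 under the assumptions above:
    # Var(y) = fan_in * init_std**2 = 1  =>  init_std = 1 / sqrt(fan_in)
    return 1.0 / math.sqrt(fan_in)


def output_std(init_std: float, fan_in: int) -> float:
    # Output std of y = sum_i w_i * x_i for i.i.d. zero-mean weights
    # with std init_std and independent unit-variance inputs.
    return init_std * math.sqrt(fan_in)


base_width = 256
print(unit_scale_init_std(base_width))  # 0.0625
print(output_std(0.02, base_width))     # about 0.32, i.e. below unit scale
```

Under these assumptions, init_std = 0.02 at width 256 gives outputs with a standard deviation of about 0.32 rather than 1, which is the gap I am asking about.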

Any clarification on these points would be greatly appreciated!
