Hi, I have a couple of questions regarding the variable mup_width_multiplier and related initialization choices in the code:
Regarding mup_width_multiplier:
Based on the paper and code comments, it seems mup_width_multiplier should always equal the model width divided by the base width, where the base width is 256. However, this multiplier is only set in the .sh training scripts rather than being computed directly as n_embd / 256 in the code. Is there a specific reason for this approach? Are there corner cases that require setting this value outside the main code?
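To make the question concrete, here is a minimal sketch of what I mean by computing the multiplier in code. The `ModelConfig` class and its `base_width` field are hypothetical names used only for illustration, not the repository's actual config; only `n_embd`, `mup_width_multiplier`, and the base width of 256 come from the discussion above.

```python
from dataclasses import dataclass
from typing import Optional

BASE_WIDTH = 256  # base width discussed in this issue


@dataclass
class ModelConfig:
    """Hypothetical config, for illustration only (not the repo's class)."""

    n_embd: int
    base_width: int = BASE_WIDTH
    # None means "derive from n_embd"; an explicit value is kept as given,
    # so a training script could still override it if needed.
    mup_width_multiplier: Optional[float] = None

    def __post_init__(self) -> None:
        if self.mup_width_multiplier is None:
            self.mup_width_multiplier = self.n_embd / self.base_width


derived = ModelConfig(n_embd=1024)
overridden = ModelConfig(n_embd=1024, mup_width_multiplier=1.0)
print(derived.mup_width_multiplier, overridden.mup_width_multiplier)
```

With a default like this, the .sh scripts would only need to pass the multiplier when they deliberately want something other than n_embd / 256.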
On the selection of base width and init_std:
The values for the base width and the base standard deviation init_std (which is always set to 0.02) seem somewhat arbitrary. From my understanding, to keep the output of a layer at unit variance (assuming unit-variance inputs), init_std should be 1/sqrt(base_width), which is 1/sqrt(256) = 0.0625 in this case. Could you clarify the reasoning behind choosing 0.02 for init_std and the specific value for the base width?
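The arithmetic behind that claim, as a small sketch. It assumes a plain linear layer y = W x with i.i.d. zero-mean weights of standard deviation init_std and independent unit-variance inputs, so each output has standard deviation sqrt(fan_in) * init_std; the helper name `output_std` is mine, not something from the code base.

```python
import math


def output_std(fan_in: int, init_std: float) -> float:
    """Std of one output of y = W x, assuming i.i.d. zero-mean weights
    with std `init_std` and independent unit-variance inputs."""
    return math.sqrt(fan_in) * init_std


base_width = 256
unit_variance_std = 1.0 / math.sqrt(base_width)  # 1/16

print(unit_variance_std)                      # init_std that gives unit output std
print(output_std(base_width, unit_variance_std))
print(output_std(base_width, 0.02))           # output std with init_std = 0.02
```

Under these assumptions, init_std = 0.02 at width 256 gives an output standard deviation of about 0.32 rather than 1, which is the gap I am asking about.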
Any clarification on these points would be greatly appreciated!