@tyler-romero commented Jan 16, 2026

The scaling-law fitting closely follows the original Chinchilla methodology, plus a few known helpful improvements.
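For reference, this is the parametric form and fitting objective as I understand them from the original Chinchilla paper (Hoffmann et al., 2022). It is a summary of the paper's formulation, not necessarily the exact objective used in this PR:

$$
\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

$$
\min_{a,\, b,\, e,\, \alpha,\, \beta} \; \sum_{i} \mathrm{Huber}_{\delta}\Big( \mathrm{LSE}\big(a - \alpha \log N_i,\; b - \beta \log D_i,\; e\big) - \log L_i \Big)
$$

Here $A = e^{a}$, $B = e^{b}$, $E = e^{e}$, $\mathrm{LSE}$ is log-sum-exp, and the paper uses $\delta = 10^{-3}$ with L-BFGS started from a grid of initializations.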

Important points:

  1. Many scaling-law papers seem not to do a full grid search over initialization parameters, which can lead to suboptimal fits.
  2. Chinchilla scaling laws are intended to fit the best achievable loss for a given N, D, or C, so setting `overestimate_penalty=10` can help compensate for undertuned ladder rungs. Empirically, I have found that this leads to better predictions of held-out points when doing a rollout over increasing N.
  3. Bootstrapping is useful for getting a feel for the plausible variance in predictions caused by minor differences in the underlying data, but I would not consider it robust yet.
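The three points above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, the coarse initialization grid, and the reading of `overestimate_penalty` as a multiplier on the Huber loss wherever the fit predicts a higher loss than was observed are all assumptions made for the example.

```python
import itertools

import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp


def huber(r, delta=1e-3):
    """Elementwise Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))


def predict_log_loss(params, N, D):
    """log(E + A / N^alpha + B / D^beta) with A = e^a, B = e^b, E = e^e."""
    a, b, e, alpha, beta = params
    N, D = np.asarray(N, dtype=float), np.asarray(D, dtype=float)
    terms = np.stack([a - alpha * np.log(N), b - beta * np.log(D), np.full(N.shape, e)])
    return logsumexp(terms, axis=0)


def objective(params, N, D, L, overestimate_penalty=1.0):
    """Huber loss on log-space residuals, upweighted where the fit overestimates.

    Assumption: "overestimate" means predicted loss > observed loss, so a
    penalty > 1 pushes the curve toward the lower envelope of the data.
    """
    r = predict_log_loss(params, N, D) - np.log(L)
    w = np.where(r > 0, overestimate_penalty, 1.0)
    return float(np.sum(w * huber(r)))


def default_inits():
    """Illustrative coarse grid over (a, b, e, alpha, beta) starting points."""
    return itertools.product(
        (0.0, 10.0, 20.0), (0.0, 10.0, 20.0), (-1.0, 1.0), (0.0, 0.5, 1.0), (0.0, 0.5, 1.0)
    )


def fit(N, D, L, overestimate_penalty=1.0, inits=None):
    """Run L-BFGS-B from every initialization and keep the lowest objective."""
    best = None
    for x0 in (default_inits() if inits is None else inits):
        res = minimize(
            objective,
            np.asarray(x0, dtype=float),
            args=(N, D, L, overestimate_penalty),
            method="L-BFGS-B",
        )
        if best is None or res.fun < best.fun:
            best = res
    return best


def bootstrap_predictions(N, D, L, N_new, D_new, n_boot=20, seed=0, **fit_kwargs):
    """Refit on resampled (with replacement) points; return predicted losses."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(L), size=len(L))
        res = fit(N[idx], D[idx], L[idx], **fit_kwargs)
        preds.append(float(np.exp(predict_log_loss(res.x, N_new, D_new))))
    return np.array(preds)
```

Calling `fit(N, D, L, overestimate_penalty=10.0)` on arrays of parameter counts, token counts, and final losses returns the best `scipy.optimize.OptimizeResult` across all starting points. The spread of `bootstrap_predictions(...)` at a held-out `(N_new, D_new)` gives a rough sense of how sensitive the extrapolation is to which ladder rungs are included.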
