@adamkarvonen
Collaborator

3 changes here:

  • First, add an optional k-annealing schedule for TopK trainers. With this, k starts at d_model and is annealed down to its target value over the first 10% of training. Multiple groups have reported that this significantly reduces dead features, for example Llama Scope: https://arxiv.org/abs/2410.20526
  • Add Qwen3 submodule
  • Handle the remove_bos argument when the tokenizer has no BOS token (such as Qwen3). In this case, we instead remove the first non-pad token of each sequence, since its activation still has a very high norm. For Qwen3-8B, the norm at the first non-pad token is typically about 100x the average.
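
For readers who want a concrete picture of the k-annealing schedule, here is a minimal sketch. It assumes a linear decay, which the description above does not specify, and the function name `annealed_k` and its parameters are hypothetical rather than the trainer's actual API.

```python
def annealed_k(step: int, total_steps: int, d_model: int, target_k: int,
               anneal_frac: float = 0.1) -> int:
    """Return the k to use at `step`.

    Assumption: k decays linearly from d_model to target_k over the first
    `anneal_frac` of training, then stays at target_k.
    """
    anneal_steps = max(1, int(total_steps * anneal_frac))
    if step >= anneal_steps:
        return target_k
    progress = step / anneal_steps  # 0.0 at step 0, approaching 1.0
    k = d_model - progress * (d_model - target_k)
    # Never go below the target, even after rounding.
    return max(target_k, round(k))


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    for step in (0, 25, 50, 75, 100, 500):
        print(step, annealed_k(step, total_steps=1000, d_model=4096, target_k=64))
```

Early in training nearly every latent is active, so every latent receives gradient, which is one plausible reason the schedule reduces dead features.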
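
The fallback for tokenizers without a BOS token can be sketched in a similar way. This is an illustration in plain Python lists, not the code from this PR: the function name `keep_mask_without_first_token` is hypothetical, and a real implementation would presumably work on tensors and the attention mask instead.

```python
from typing import List


def keep_mask_without_first_token(input_ids: List[List[int]],
                                  pad_token_id: int) -> List[List[bool]]:
    """Build a per-token keep mask for each sequence.

    Pad tokens are dropped, and so is the first non-pad token, which stands
    in for the missing BOS token. Works for left or right padding.
    Assumption: pad_token_id never appears as a real token in a sequence.
    """
    masks = []
    for seq in input_ids:
        mask = [tok != pad_token_id for tok in seq]
        for i, keep in enumerate(mask):
            if keep:
                mask[i] = False  # drop the first non-pad position only
                break
        masks.append(mask)
    return masks


if __name__ == "__main__":
    # Hypothetical token ids with pad id 0; first row is left-padded.
    batch = [[0, 0, 5, 6, 7], [8, 9, 10, 11, 12]]
    for row in keep_mask_without_first_token(batch, pad_token_id=0):
        print(row)
```

Activations at positions where the mask is False would then be excluded from the training buffer, the same way BOS activations are when remove_bos is set.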

@adamkarvonen adamkarvonen merged commit f4abe32 into main Aug 18, 2025
3 checks passed
2 participants