Topk annealing #50

adamkarvonen · 2025-08-18T22:03:07Z

This PR makes three changes:

First, add an optional k-annealing schedule for TopK trainers. When enabled, k starts at d_model and is annealed down to its target value over the first 10% of training. Several groups have reported that this significantly reduces dead features, for example Llama Scope: https://arxiv.org/abs/2410.20526
Second, add a Qwen3 submodule.
Third, handle the remove_bos argument when the tokenizer has no BOS token (as with Qwen3). In this case, we instead remove the first non-pad token of each sequence, since it still has a high norm. For Qwen3-8B, the first non-pad token typically has a norm roughly 100x the average.
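
For readers who want a concrete picture of the k-annealing schedule, here is a minimal sketch. It assumes a linear decay, which the description above does not specify; the function name `annealed_k` and its parameters are hypothetical and are not taken from this PR's code.

```python
def annealed_k(step, total_steps, target_k, d_model, anneal_frac=0.1):
    """Return the k to use at a given training step.

    Assumption: k decays linearly from d_model to target_k over the first
    `anneal_frac` of training, then stays at target_k. The PR only states
    the start value, the end value, and the 10% window.
    """
    if target_k > d_model:
        raise ValueError("target_k must not exceed d_model")
    # Number of steps over which k is annealed (at least one step).
    anneal_steps = max(1, int(total_steps * anneal_frac))
    if step >= anneal_steps:
        return target_k
    # Fraction of the annealing window completed so far, in [0, 1).
    progress = step / anneal_steps
    k = d_model + (target_k - d_model) * progress
    # Round to an integer and never drop below the target.
    return max(target_k, round(k))
```

With `total_steps=1000`, `target_k=64`, and `d_model=4096`, this sketch uses k = 4096 at step 0 and k = 64 from step 100 onward.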
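
The fallback for tokenizers without a BOS token can be illustrated with a small sketch. It is written in plain Python over nested lists for clarity; the actual change presumably operates on tensors, and the helper name `first_non_pad_keep_mask` is hypothetical. It assumes the attention mask uses 1 for real tokens and 0 for padding.

```python
def first_non_pad_keep_mask(attention_mask):
    """Build a per-token keep mask that drops padding and the first real token.

    attention_mask: list of rows, each a list of 0/1 values
    (assumed convention: 1 = real token, 0 = padding).
    Works for both left-padded and right-padded sequences.
    """
    keep = []
    for row in attention_mask:
        # Start by keeping every non-pad position.
        row_keep = [bool(m) for m in row]
        # Drop the first non-pad position, which stands in for the BOS token.
        for i, m in enumerate(row):
            if m:
                row_keep[i] = False
                break
        keep.append(row_keep)
    return keep
```

Activations at positions where the mask is False would then be excluded, so the high-norm first token does not enter the training data.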

adamkarvonen added 3 commits August 18, 2025 21:46

Add optional k-annealing

ee08651

Add qwen3

c3e4bf4

Handle remove_bos if tokenizer does not contain a bos token

8f87957

adamkarvonen merged commit f4abe32 into main Aug 18, 2025
3 checks passed
