TL;DR
Increase the number of attention heads by 1.5x while keeping head_dim at 128. Tests whether more attention capacity improves quality.
User prompt
follow agent.md and implement increased attn width. that is, instead of having number of attention heads be hidden_dim/128, make it 1.5*hidden_dim/128. Keep the head_dim at 128. So d768 will have 9 heads instead of 6. make sure GQA still works.
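A minimal sketch of the head-count rule described in the prompt, assuming a standalone helper rather than the repo's actual config code; `HEAD_DIM`, `n_heads_current`, and `n_heads_wide` are illustrative names.

```python
# Illustrative head-count rule; these names are hypothetical, not repo code.
HEAD_DIM = 128

def n_heads_current(hidden_dim: int) -> int:
    # Baseline: one query head per 128 dims of model width.
    return hidden_dim // HEAD_DIM

def n_heads_wide(hidden_dim: int) -> int:
    # Wide-attention variant: 1.5x as many query heads, head_dim unchanged.
    return (3 * hidden_dim) // (2 * HEAD_DIM)

assert n_heads_current(768) == 6
assert n_heads_wide(768) == 9  # d768: 6 -> 9 heads, as in the prompt
```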
Scope
Experiment: moe_wide_attn
Sweep script: experiments/grug/moe/wide_attn_sweep.py
Head counts
| d | Current heads/kv | 1.5x heads/kv |
|---|---|---|
| 512 | 4/1 | 6/1 |
| 768 | 6/1 | 9/1 |
| 1024 | 8/2 | 12/3 |
| 1280 | 10/2 | 15/3 |
GQA still works at all sizes: query heads remain an integer multiple of KV heads. The existing ratios are preserved at d1024 and d1280 (4:1 and 5:1); d512 and d768 keep a single KV head, so their ratios grow to 6:1 and 9:1.
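To make the "GQA still works" check concrete, here is a sketch that runs the table's head counts through the usual KV-head repetition step; the tensor shapes and helper logic are assumptions, not taken from the sweep script.

```python
import torch

# Head counts from the table above: d -> (query heads, KV heads).
configs = {512: (6, 1), 768: (9, 1), 1024: (12, 3), 1280: (15, 3)}
HEAD_DIM = 128

for d, (q_heads, kv_heads) in configs.items():
    # GQA only works if the query heads split evenly into KV groups.
    assert q_heads % kv_heads == 0, f"d={d}: q heads must be a multiple of kv heads"
    group = q_heads // kv_heads
    # K/V are repeated `group` times along the head axis before attention.
    k = torch.zeros(2, kv_heads, 16, HEAD_DIM)      # (batch, kv_heads, seq, head_dim)
    k_expanded = k.repeat_interleave(group, dim=1)  # (batch, q_heads, seq, head_dim)
    assert k_expanded.shape[1] == q_heads
```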
Gate 1 runs (2 total)
| Dim | Budget | Status |
|---|---|---|
| 512 | 2.19e17 | pending |
| 768 | 1.70e18 | pending |
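As rough context for how much extra attention capacity these runs buy, a back-of-envelope per-layer attention parameter count; this is my own arithmetic from the head counts above, not output from wide_attn_sweep.py, and it ignores biases and norms.

```python
HEAD_DIM = 128

def attn_params(d: int, q_heads: int, kv_heads: int) -> int:
    q_proj = d * q_heads * HEAD_DIM        # W_q: d -> q_heads * head_dim
    kv_proj = 2 * d * kv_heads * HEAD_DIM  # W_k, W_v: d -> kv_heads * head_dim
    o_proj = q_heads * HEAD_DIM * d        # W_o: q_heads * head_dim -> d
    return q_proj + kv_proj + o_proj

# d768: baseline 6/1 heads vs. wide 9/1 heads.
base, wide = attn_params(768, 6, 1), attn_params(768, 9, 1)
print(base, wide, wide / base)  # Q and O projections grow 1.5x; total attention params ~1.43x
```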
Decision log
empty
Conclusion
pending