TL;DR
Test full multi-head attention (MHA, num_kv_heads = num_heads) against the baseline 4:1 grouped-query attention (GQA). This measures the quality cost of GQA at current compute budgets with the current recipe.
User prompt
follow agent.md and implement multi-head-attention, (no gqa)
Scope
Head counts
| d | GQA (baseline) heads/kv | MHA heads/kv |
| --- | --- | --- |
| 512 | 4/1 | 4/4 |
| 768 | 6/1 | 6/6 |
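As a rough illustration of what the head-count change means, the sketch below derives the MHA variant from each baseline row by setting num_kv_heads = num_heads and compares the size of the K/V projections. The `AttnConfig` class, its field names, and `mha_variant` are hypothetical and are not taken from the sweep script. It also assumes head_dim = d / num_heads (128 for both rows) and bias-free K and V projections.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AttnConfig:
    """Hypothetical config holder; names are illustrative, not the repo's."""
    d_model: int
    num_heads: int
    num_kv_heads: int

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = d_model / num_heads.
        assert self.d_model % self.num_heads == 0
        return self.d_model // self.num_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Number of query heads that share one K/V head.
        assert self.num_heads % self.num_kv_heads == 0
        return self.num_heads // self.num_kv_heads

    @property
    def kv_proj_params(self) -> int:
        # K and V projections, each d_model x (num_kv_heads * head_dim), no bias.
        return 2 * self.d_model * self.num_kv_heads * self.head_dim


def mha_variant(cfg: AttnConfig) -> AttnConfig:
    """Full multi-head attention: one K/V head per query head."""
    return replace(cfg, num_kv_heads=cfg.num_heads)


# Baseline rows from the head-count table (heads/kv).
BASELINES = [AttnConfig(512, 4, 1), AttnConfig(768, 6, 1)]

for gqa in BASELINES:
    mha = mha_variant(gqa)
    print(
        f"d={gqa.d_model}: GQA {gqa.num_heads}/{gqa.num_kv_heads} "
        f"({gqa.kv_proj_params} K/V params) -> "
        f"MHA {mha.num_heads}/{mha.num_kv_heads} "
        f"({mha.kv_proj_params} K/V params)"
    )
```

Under these assumptions the only thing that changes per layer is the K/V projection width (and the KV cache size at inference), which grows by a factor of num_heads / num_kv_heads.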
Gate 1 runs (2 total)
| Dim | Budget | Status |
| --- | --- | --- |
| 512 | 2.19e17 | pending |
| 768 | 1.70e18 | pending |
Decision log
empty
Conclusion
pending
Scope: moe_mha, experiments/grug/moe/mha_sweep.py