Agent MoE Experiment: MHA (no GQA) #5151

## TL;DR
Compare full multi-head attention (`num_kv_heads = num_heads`) against the baseline 4:1 GQA. The goal is to measure the quality cost of GQA at current compute budgets under the current recipe.

## User prompt
> follow agent.md and implement multi-head-attention, (no gqa)

## Scope
- **Parent**: #4281
- **Branch**: `moe_mha`
- **Sweep file**: `experiments/grug/moe/mha_sweep.py`
- **Prior work**: #4371 found ~+0.004 BPB per halving of KV heads on an older recipe
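
As a rough sanity check on what the prior-work trend would predict here, the sketch below extrapolates the ~0.004 BPB per halving figure to the head counts listed in the Head counts table. It assumes the effect is log-linear in the number of KV heads (including for the non-power-of-two case of 6 heads) and that it carries over from the older recipe; neither is established, and the function name `expected_bpb_gain` is hypothetical.

```python
import math

# From #4371 (older recipe): roughly +0.004 BPB for each halving of KV heads.
BPB_PER_HALVING = 0.004


def expected_bpb_gain(num_heads: int, baseline_kv_heads: int) -> float:
    """BPB improvement predicted when going from baseline_kv_heads to num_heads KV heads.

    Assumption: the cost is log-linear in the KV head count, so undoing
    log2(num_heads / baseline_kv_heads) halvings recovers that many steps.
    """
    halvings_undone = math.log2(num_heads / baseline_kv_heads)
    return BPB_PER_HALVING * halvings_undone


# (dim, num_heads, baseline num_kv_heads) taken from the Head counts table.
for dim, heads, kv in [(512, 4, 1), (768, 6, 1)]:
    gain = expected_bpb_gain(heads, kv)
    print(f"d={dim}: predicted MHA gain ~{gain:.4f} BPB under the old trend")
```

Under these assumptions the predicted gain is about 0.008 BPB at d=512 and about 0.010 BPB at d=768, which gives a scale for judging whether the Gate 1 runs can resolve the difference.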

## Head counts
| Dim | GQA baseline (heads/KV heads) | MHA (heads/KV heads) |
|---|------------------------|-------------|
| 512 | 4/1 | 4/4 |
| 768 | 6/1 | 6/6 |
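
To make the table concrete, here is a minimal sketch of how the two variants differ in configuration and in K/V projection size. `AttnConfig` and `as_mha` are hypothetical names for illustration, not the classes used in `mha_sweep.py`, and the sketch assumes `head_dim = hidden_dim / num_heads` with bias-free K and V projections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    hidden_dim: int
    num_heads: int
    num_kv_heads: int

    @property
    def head_dim(self) -> int:
        # Assumption: heads evenly split the hidden dimension.
        return self.hidden_dim // self.num_heads

    @property
    def kv_proj_params(self) -> int:
        # K and V each project hidden_dim -> num_kv_heads * head_dim (no bias).
        return 2 * self.hidden_dim * self.num_kv_heads * self.head_dim


def as_mha(cfg: AttnConfig) -> AttnConfig:
    """Return the MHA variant: one KV head per query head."""
    return AttnConfig(cfg.hidden_dim, cfg.num_heads, cfg.num_heads)


# Baseline rows from the table above.
baselines = [AttnConfig(512, 4, 1), AttnConfig(768, 6, 1)]

for gqa in baselines:
    mha = as_mha(gqa)
    ratio = mha.kv_proj_params // gqa.kv_proj_params
    print(
        f"d={gqa.hidden_dim}: head_dim={gqa.head_dim}, "
        f"KV proj params per layer {gqa.kv_proj_params} -> {mha.kv_proj_params} ({ratio}x)"
    )
```

Under these assumptions both widths use a head dimension of 128, and switching to MHA grows the per-layer K/V projections (and the per-token KV cache) by the factor `num_heads / num_kv_heads`: 4x at d=512 and 6x at d=768.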

## Gate 1 runs (2 total)
| Dim | Budget | Status |
|-----|--------|--------|
| 512 | 2.19e17 | pending |
| 768 | 1.70e18 | pending |

## Decision log
_empty_

## Conclusion
_pending_
