TL;DR
Increase the number of attention heads by 1.5x while keeping head_dim at 128. Tests whether more attention capacity improves quality.
User prompt
follow agent.md and implement increased attn width. that is, instead of having number of attention heads be hidden_dim/128, make it 1.5*hidden_dim/128. Keep the head_dim at 128. So d768 will have 9 heads instead of 6. make sure GQA still works.
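A minimal sketch of the head-count rule described in the prompt, assuming a standalone helper rather than the repo's actual config code; `HEAD_DIM`, `n_heads_current`, and `n_heads_wide` are illustrative names.

```python
# Illustrative head-count rule; these names are hypothetical, not repo code.
HEAD_DIM = 128

def n_heads_current(hidden_dim: int) -> int:
    # Baseline: one query head per 128 dims of model width.
    return hidden_dim // HEAD_DIM

def n_heads_wide(hidden_dim: int) -> int:
    # Wide-attention variant: 1.5x as many query heads, head_dim unchanged.
    return (3 * hidden_dim) // (2 * HEAD_DIM)

assert n_heads_current(768) == 6
assert n_heads_wide(768) == 9  # d768: 6 -> 9 heads, as in the prompt
```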
Scope
Experiment: moe_wide_attn
Sweep script: experiments/grug/moe/wide_attn_sweep.py
Head counts
| d | Current heads/kv | 1.5x heads/kv |
|---|---|---|
| 512 | 4/1 | 6/1 |
| 768 | 6/1 | 9/1 |
| 1024 | 8/2 | 12/3 |
| 1280 | 10/2 | 15/3 |
GQA still works at all sizes: query heads remain an integer multiple of KV heads. The existing ratios are preserved at d1024 and d1280 (4:1 and 5:1); d512 and d768 keep a single KV head, so their ratios grow to 6:1 and 9:1.
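To make the "GQA still works" check concrete, here is a sketch that runs the table's head counts through the usual KV-head repetition step; the tensor shapes and helper logic are assumptions, not taken from the sweep script.

```python
import torch

# Head counts from the table above: d -> (query heads, KV heads).
configs = {512: (6, 1), 768: (9, 1), 1024: (12, 3), 1280: (15, 3)}
HEAD_DIM = 128

for d, (q_heads, kv_heads) in configs.items():
    # GQA only works if the query heads split evenly into KV groups.
    assert q_heads % kv_heads == 0, f"d={d}: q heads must be a multiple of kv heads"
    group = q_heads // kv_heads
    # K/V are repeated `group` times along the head axis before attention.
    k = torch.zeros(2, kv_heads, 16, HEAD_DIM)      # (batch, kv_heads, seq, head_dim)
    k_expanded = k.repeat_interleave(group, dim=1)  # (batch, q_heads, seq, head_dim)
    assert k_expanded.shape[1] == q_heads
```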
Gate 1 runs (2 total)
| Dim | Budget | Status |
|---|---|---|
| 512 | 2.19e17 | pending |
| 768 | 1.70e18 | pending |
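As rough context for how much extra attention capacity these runs buy, a back-of-envelope per-layer attention parameter count; this is my own arithmetic from the head counts above, not output from wide_attn_sweep.py, and it ignores biases and norms.

```python
HEAD_DIM = 128

def attn_params(d: int, q_heads: int, kv_heads: int) -> int:
    q_proj = d * q_heads * HEAD_DIM        # W_q: d -> q_heads * head_dim
    kv_proj = 2 * d * kv_heads * HEAD_DIM  # W_k, W_v: d -> kv_heads * head_dim
    o_proj = q_heads * HEAD_DIM * d        # W_o: q_heads * head_dim -> d
    return q_proj + kv_proj + o_proj

# d768: baseline 6/1 heads vs. wide 9/1 heads.
base, wide = attn_params(768, 6, 1), attn_params(768, 9, 1)
print(base, wide, wide / base)  # Q and O projections grow 1.5x; total attention params ~1.43x
```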
Decision log
empty
Conclusion
pending