Agent MoE Experiment: Wide attention (1.5x heads, head_dim=128) #5047

@ClassicLarry

Description

TL;DR

Increase the number of attention heads by 1.5x while keeping head_dim at 128. Tests whether more attention capacity improves quality.

User prompt

follow agent.md and implement increased attn width. that is, instead of having number of attention heads be hidden_dim/128, make it 1.5*hidden_dim/128. Keep the head_dim at 128. So d768 will have 9 heads instead of 6. make sure GQA still works.
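
For reference, a minimal sketch of the grouped-query attention pattern the change has to keep working, assuming a PyTorch implementation. Shapes and names here are illustrative (taken from the d1024 row of the table below), not the repo's actual module:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes for d1024 widened: 12 query heads grouped over 3 KV heads.
B, T, n_head, n_kv_head, head_dim = 2, 16, 12, 3, 128
q = torch.randn(B, n_head, T, head_dim)
k = torch.randn(B, n_kv_head, T, head_dim)
v = torch.randn(B, n_kv_head, T, head_dim)

# GQA: each KV head serves n_head // n_kv_head query heads, so the KV tensors
# are expanded along the head dimension before standard attention.
k = k.repeat_interleave(n_head // n_kv_head, dim=1)
v = v.repeat_interleave(n_head // n_kv_head, dim=1)
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (B, n_head, T, head_dim)
```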

Scope

Head counts

| d    | Current heads/KV | 1.5x heads/KV |
|------|------------------|---------------|
| 512  | 4/1              | 6/1           |
| 768  | 6/1              | 9/1           |
| 1024 | 8/2              | 12/3          |
| 1280 | 10/2             | 15/3          |

GQA is preserved at every size (n_head stays divisible by n_kv_head). The query:KV ratio is unchanged at d1024 (4:1) and d1280 (5:1); d512 and d768 keep their single KV head, so their ratios widen to 6:1 and 9:1.
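
A minimal sketch of the head-count rule. The 1.5x head formula is from the user prompt; the KV counts are taken from the table above (the helper name and lookup are illustrative, not the repo's actual config code):

```python
HEAD_DIM = 128

# n_kv_head per hidden dim for the 1.5x variant, taken from the table above.
KV_HEADS_15X = {512: 1, 768: 1, 1024: 3, 1280: 3}

def wide_attn_heads(hidden_dim: int) -> tuple[int, int]:
    """Return (n_head, n_kv_head) for the widened-attention variant."""
    n_head = int(1.5 * hidden_dim) // HEAD_DIM  # d768 -> 9 heads instead of 6
    n_kv_head = KV_HEADS_15X[hidden_dim]
    assert n_head % n_kv_head == 0, "GQA needs n_head divisible by n_kv_head"
    return n_head, n_kv_head

for d in (512, 768, 1024, 1280):
    print(d, wide_attn_heads(d))  # 6/1, 9/1, 12/3, 15/3 -- matches the table
```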

Gate 1 runs (2 total)

| Dim | Budget  | Status  |
|-----|---------|---------|
| 512 | 2.19e17 | pending |
| 768 | 1.70e18 | pending |

Decision log

empty

Conclusion

pending
