
Attention to better MLPerf and beyond #108


Description

@raikonenfnu
  1. General Attention Health
  • Modifying kWidth to maximize reads from shared memory
  • Modifying kWidth such that FP8 does not need a trip through shared memory.
  • Enable attention transposeV when possible (in progress)
  • Dot slicing for better instruction scheduling
  • Buffer loads for free masking and to move K/V directly from global to shared memory
  • Instruction scheduling / software pipelining to overlap MMA and softmax
  • Prefetch/MultiBuffering
  • Try dot3d / single-kernel split-K to get faster attention in the decode phase (see the sketch after this list)
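
To make the split-K decode item concrete, here is a minimal NumPy sketch of the underlying math: each split computes a partial softmax-weighted sum over its chunk of the KV cache, and the partials are merged with a log-sum-exp rescaling, which is the reduction a single-kernel split-K (flash-decoding style) attention would perform across workgroups instead of in a second kernel launch. The function name `attention_decode_splitk`, the chunking, and the shapes are assumptions for illustration only, not code from this project.

```python
# Hypothetical sketch (not code from this repo): split-K attention for the
# decode phase, where a single query attends to a long KV cache.
import numpy as np

def attention_decode_splitk(q, k, v, num_splits=4):
    """q: (d,), k/v: (seq, d). Returns softmax(q @ k^T / sqrt(d)) @ v."""
    d = q.shape[0]
    scale = 1.0 / np.sqrt(d)
    k_chunks = np.array_split(k, num_splits, axis=0)
    v_chunks = np.array_split(v, num_splits, axis=0)

    # Per-split partials: local max m_i, local sum of exponentials l_i, and
    # the unnormalized output acc_i over that split's KV chunk.
    partials = []
    for k_i, v_i in zip(k_chunks, v_chunks):
        s = (k_i @ q) * scale              # (chunk,) attention scores
        m_i = s.max()
        p = np.exp(s - m_i)                # locally stabilized softmax numerator
        l_i = p.sum()
        acc_i = p @ v_i                    # (d,) unnormalized partial output
        partials.append((m_i, l_i, acc_i))

    # Merge splits with log-sum-exp rescaling: this is the reduction that a
    # single-kernel split-K fuses rather than running as a separate kernel.
    m = max(p[0] for p in partials)
    l = 0.0
    acc = np.zeros_like(q, dtype=np.float64)
    for m_i, l_i, acc_i in partials:
        alpha = np.exp(m_i - m)
        l += alpha * l_i
        acc += alpha * acc_i
    return acc / l

# Reference check against plain softmax attention.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=64), rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
s = (k @ q) / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ v
assert np.allclose(attention_decode_splitk(q, k, v), ref)
```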
