[attention] Investigate overlapping matmul and softmax #91

@antiagainst

Description

Flash Attention 3 introduces a technique that overlaps the matmul and softmax phases of different waves to maximize MFMA utilization. We should evaluate how to apply it to the current attention implementation. This requires understanding the hardware scheduler and how to work with or around it, e.g., via s_setprio instructions.
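As a rough illustration of why the overlap helps, here is a small, hypothetical Python simulation of the FA3-style "ping-pong" schedule. It is not GPU code: the phase names, slot durations, and two-wave offset are all made-up modeling assumptions, chosen only to show that staggering two waves by one phase keeps the MFMA unit busy in every slot, whereas a single wave leaves it idle during softmax.

```python
# Hypothetical model of the ping-pong schedule: two waves alternate between
# a matmul phase (occupies the MFMA unit) and a softmax phase (occupies the
# VALU). Equal-length phases are an assumption for illustration only; on
# real hardware, something like s_setprio would be used to nudge the
# scheduler toward this interleaving.

MATMUL, SOFTMAX = "matmul", "softmax"

def one_wave(tiles):
    # A single wave serializes matmul and softmax per tile.
    timeline = []
    for _ in range(tiles):
        timeline += [MATMUL, SOFTMAX]
    return timeline

def ping_pong(tiles):
    # Two waves offset by one phase: while wave 0 runs softmax,
    # wave 1 occupies the MFMA unit, and vice versa.
    w0 = one_wave(tiles)
    w1 = [SOFTMAX] + one_wave(tiles)[:-1]  # shift wave 1 by one slot
    return list(zip(w0, w1))

def mfma_busy_fraction(schedule):
    # schedule: list of tuples, one phase name per wave per time slot.
    busy = sum(1 for slot in schedule if MATMUL in slot)
    return busy / len(schedule)

single = [(p,) for p in one_wave(4)]
print(mfma_busy_fraction(single))        # 0.5 -- MFMA idle during softmax
print(mfma_busy_fraction(ping_pong(4)))  # 1.0 -- MFMA busy every slot
```

Under these toy assumptions the two-wave schedule doubles MFMA occupancy; the real win on hardware depends on the actual matmul/softmax latency ratio and on how the scheduler arbitrates between waves.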
