Attention to better MLPerf and beyond #108
raikonenfnu opened on Nov 19, 2024
- General attention health
- Modify kWidth to maximize reads from shared memory
- Modify kWidth so that FP8 does not need a round trip through shared memory
- Enable attention transposeV when possible (in progress)
- Dot slicing for better instruction scheduling
- Buffer loads for free masking, and move K and V directly from global to shared memory
- Instruction scheduling / software pipelining to overlap MMA and softmax
- Prefetching / multi-buffering
- Try dot3d / single-kernel split-K for faster attention in the decode phase
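For context on the last item: in the decode phase there is a single query token, so the only parallelism along the sequence comes from splitting K and V into chunks, computing partial attention per chunk, and combining the partials with a running-max rescale. A minimal numpy sketch of that split-K combine (function names are illustrative, not from the repo; a real kernel would do this on-GPU with the partial reduction fused):

```python
import numpy as np

def attention_ref(q, K, V):
    """Reference single-token (decode) attention: softmax(K q / sqrt(d)) @ V."""
    s = K @ q / np.sqrt(q.shape[0])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

def attention_splitk(q, K, V, splits=4):
    """Split-K decode attention: per-split partial softmax, then a
    log-sum-exp style combine so each split can run independently."""
    d = q.shape[0]
    outs, maxes, sums = [], [], []
    for Ks, Vs in zip(np.array_split(K, splits), np.array_split(V, splits)):
        s = Ks @ q / np.sqrt(d)
        m = s.max()                 # per-split running max
        p = np.exp(s - m)
        outs.append(p @ Vs)         # unnormalized partial output
        maxes.append(m)
        sums.append(p.sum())        # unnormalized partial denominator
    g = max(maxes)                  # global max across splits
    scale = np.exp(np.array(maxes) - g)   # rescale partials to the global max
    denom = (np.array(sums) * scale).sum()
    return sum(o * c for o, c in zip(outs, scale)) / denom
```

The combine step is the same rescaling trick flash-attention uses across tiles; here it runs once over the per-split partials, which is what lets a single decode query use the whole GPU.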