Fast (cuBLAS-level) GEMM is table stakes for high performance.
Several optimizations are needed:
- Bring tensor core support to CUDA (TMA is needed here); a WMMA-level sketch follows this list
- Allow arrays of tensor-core registers to be loaded and matmulled
- Warp-specialization in the IR
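
As a rough illustration of what the first two bullets target, here is a minimal sketch of a single-warp tensor-core tile matmul using the CUDA WMMA API. The shapes, types, and kernel name are illustrative assumptions, not the proposed IR design, and a full implementation would use TMA/async copies on Hopper instead of plain global loads:

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes a single 16x16x16 tile: C = A * B + C.
__global__ void wmma_tile_gemm(const half *a, const half *b, float *c,
                               int lda, int ldb, int ldc) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Load operand tiles from global memory into tensor-core register fragments.
    wmma::load_matrix_sync(a_frag, a, lda);
    wmma::load_matrix_sync(b_frag, b, ldb);
    wmma::load_matrix_sync(c_frag, c, ldc, wmma::mem_row_major);

    // Issue the tensor-core MMA on the register fragments.
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Write the accumulator tile back to global memory.
    wmma::store_matrix_sync(c, c_frag, ldc, wmma::mem_row_major);
}
```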
Another short-term fix is to rewrite cuBLAS patterns into a specialized cuBLAS IR op that immediately invokes those kernels; this unlocks good GEMM performance right away while the steps above are implemented into the search space.
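
For reference, a sketch of what such a cuBLAS IR op could lower to: a direct `cublasSgemm` call on device pointers already managed by the framework. The wrapper name and the row-major-to-column-major trick are assumptions for illustration, not the actual op design:

```cuda
#include <cublas_v2.h>

// Computes C = A * B for row-major A (m x k), B (k x n), C (m x n) by
// asking column-major cuBLAS for C^T = B^T * A^T.
void cublas_gemm_op(cublasHandle_t handle, const float *d_a, const float *d_b,
                    float *d_c, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, m, k,
                &alpha,
                d_b, n,   // B^T is n x k with leading dimension n
                d_a, k,   // A^T is k x m with leading dimension k
                &beta,
                d_c, n);  // C^T is n x m with leading dimension n
}
```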