Fast (cuBLAS-level) GEMM is table stakes for high performance.
Several optimizations are needed:
- Bring tensor core support to CUDA (TMA is needed here); a WMMA-level sketch follows this list
- Allow arrays of tensor-core registers to be loaded and matmulled
- Warp-specialization in the IR
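
As a rough illustration of what the first two bullets target, here is a minimal sketch of a single-warp tensor-core tile matmul using the CUDA WMMA API. The shapes, types, and kernel name are illustrative assumptions, not the proposed IR design, and a full implementation would use TMA/async copies on Hopper instead of plain global loads:

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes a single 16x16x16 tile: C = A * B + C.
__global__ void wmma_tile_gemm(const half *a, const half *b, float *c,
                               int lda, int ldb, int ldc) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Load operand tiles from global memory into tensor-core register fragments.
    wmma::load_matrix_sync(a_frag, a, lda);
    wmma::load_matrix_sync(b_frag, b, ldb);
    wmma::load_matrix_sync(c_frag, c, ldc, wmma::mem_row_major);

    // Issue the tensor-core MMA on the register fragments.
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Write the accumulator tile back to global memory.
    wmma::store_matrix_sync(c, c_frag, ldc, wmma::mem_row_major);
}
```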
Another short-term fix is to rewrite cuBLAS patterns into a specialized cuBLAS IR op that immediately invokes those kernels; this unlocks good GEMM performance right away while the steps above are implemented into the search space.
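
For reference, a sketch of what such a cuBLAS IR op could lower to: a direct `cublasSgemm` call on device pointers already managed by the framework. The wrapper name and the row-major-to-column-major trick are assumptions for illustration, not the actual op design:

```cuda
#include <cublas_v2.h>

// Computes C = A * B for row-major A (m x k), B (k x n), C (m x n) by
// asking column-major cuBLAS for C^T = B^T * A^T.
void cublas_gemm_op(cublasHandle_t handle, const float *d_a, const float *d_b,
                    float *d_c, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, m, k,
                &alpha,
                d_b, n,   // B^T is n x k with leading dimension n
                d_a, k,   // A^T is k x m with leading dimension k
                &beta,
                d_c, n);  // C^T is n x m with leading dimension n
}
```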