Fast GEMM #145

@jafioti

Description

Fast (cuBLAS-level) GEMM is table stakes for high performance.

Several optimizations are needed:

  • Bring tensor core support to the CUDA backend (TMA is needed here)
  • Allow arrays of tensor-core register fragments to be loaded and matmulled
  • Warp specialization in the IR
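
As a rough illustration of the first two bullets, here is a minimal tensor-core tile using the CUDA `wmma` API (one possible route; the issue does not prescribe an API, and a production path would likely use `mma`/TMA intrinsics on newer architectures). The kernel name and the fixed 16x16x16 shape are illustrative assumptions:

```cuda
#include <mma.h>
using namespace nvcuda;

// Illustrative sketch: one warp computes a single 16x16x16 half-precision
// tile on the tensor cores, accumulating in fp32.
__global__ void wmma_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

"Arrays of tensor-core registers" would correspond to holding several such fragments per warp and looping `mma_sync` over K tiles, which is where the search space needs to be able to express register-resident fragment arrays.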

A shorter-term fix is to rewrite cuBLAS-shaped patterns into a specialized cuBLAS IR op that invokes those kernels directly. This unlocks good GEMM performance immediately, while the steps above are implemented in the search space.
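
A minimal sketch of what such a specialized cuBLAS IR op might lower to on the host side (the function name and column-major layout are assumptions; the issue only specifies that the op should invoke cuBLAS directly):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical lowering target for the specialized cuBLAS IR op:
// C = alpha * A * B + beta * C, with A (m x k), B (k x n), C (m x n),
// all stored column-major as cuBLAS expects.
void cublas_gemm_op(cublasHandle_t handle,
                    const float *A, const float *B, float *C,
                    int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, A, m,   // lda = m
                B, k,           // ldb = k
                &beta, C, m);   // ldc = m
}
```

The pattern-rewrite side would then match GEMM-shaped subgraphs in the IR and replace them with this op, bypassing the kernel search entirely for those subgraphs.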
