HGEMM for the Ampere architecture and above (sm >= 80). The basic idea is the same as in my previous kernel, but instead of the WMMA API I use CuTe templates, which makes it possible to XOR-swizzle shared memory and reduce bank conflicts. I have also written an epilogue for this kernel that first permutes the output in shared memory, so that the transfer back to global memory can be vectorized.
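For reference, here is a minimal sketch of what such a swizzled shared-memory layout can look like in CuTe. The tile shape and swizzle parameters below are illustrative assumptions, not necessarily the ones this kernel uses:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// Swizzle<3,3,3> XORs 3 bits of the row index into the column index in
// units of 2^3 = 8 half elements (16 bytes), so the 8 rows of the atom
// map to distinct bank groups and row/column accesses stay conflict-free.
using SmemAtom = decltype(composition(
    Swizzle<3, 3, 3>{},
    Layout<Shape<_8, _64>, Stride<_64, _1>>{}));

// Tile the 8x64 atom over a (hypothetical) 128x64 block tile.
using SmemLayout = decltype(tile_to_shape(SmemAtom{}, Shape<_128, _64>{}));
```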
The benchmark was done on an NVIDIA A6000 (`void hgemm<3,2>` is our kernel). In some cases it outperforms the CUTLASS / cuBLASLt kernels.
| Case: M=N=K | Case: large M with N=K=256 |
|---|---|
| ![]() | ![]() |
To reproduce the benchmark, first compile `hgemm.cu` by running `make`, then run `python3 benchmark.py`. Times are measured with Nsight Compute.
The performance can likely be improved further by tuning the tile sizes and the thread-block swizzle pattern.
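As one concrete possibility, the grouped remapping below sketches what such a thread-block swizzle can look like (a common pattern, used e.g. by CUTLASS and the Triton tutorials). The function name and the `group_m` parameter are assumptions for illustration, not this kernel's actual implementation:

```cpp
// Hypothetical sketch of a grouped thread-block swizzle. Blocks are
// launched as a 1-D grid of grid_m * grid_n tiles; consecutive block
// indices are remapped so that a group of group_m tile rows is filled
// before advancing to the next tile column, which keeps the B-operand
// tiles resident in L2 for longer.
__host__ __device__ inline void swizzle_block(
    int bid, int grid_m, int grid_n, int group_m,
    int& bm, int& bn) {
  int width = group_m * grid_n;         // tiles in one row group
  int first = (bid / width) * group_m;  // first tile row of this group
  int rows  = grid_m - first < group_m  // last group may be shorter
                  ? grid_m - first : group_m;
  int local = bid % width;
  bm = first + local % rows;            // tile row
  bn = local / rows;                    // tile column
}
```

A larger `group_m` trades locality along the columns for locality along the rows; the best value depends on the tile sizes and the L2 capacity.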
I have also attached my short note on layout algebra.

