Fast CUDA Kernels

High-performance matrix multiplication kernels, written to push CUDA to its limits and to learn in depth how modern GPU optimizations work.

Focus: GEMM implementations that exploit Tensor Cores, plus heavily optimized classical GEMM, aiming for maximum TFLOPS on NVIDIA GPUs.

Performance has been tested on an NVIDIA Ampere RTX 3070 GPU, reaching up to 0.7x cuBLAS throughput while maintaining full accuracy.
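For reference, the TFLOPS figures quoted below use the standard effective-FLOP count for GEMM: multiplying an M×K matrix by a K×N matrix performs 2·M·N·K floating-point operations (one multiply and one add per inner-product term), as in this hypothetical helper (not from this repo):

// Effective GEMM throughput: 2*M*N*K FLOPs over the measured time.
static double gemm_tflops(long long M, long long N, long long K, double elapsed_ms) {
    return (2.0 * M * N * K) / (elapsed_ms * 1e-3) / 1e12;
}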

Key Implementations

tc_matmul.cu — Tensor Core GEMM

High-throughput matrix multiplication using WMMA / Tensor Core instructions.

Highlights

  • Direct use of Tensor Cores (fp16/bf16/tf32) via WMMA (sketched below)
  • Tiled shared-memory loads optimized with 128-bit vectorized accesses
  • Double buffering to hide memory latency
  • Asynchronous CUDA pipelining to overlap compute with global-to-shared memory transfers (sketched at the end of this subsection)
  • Very high TFLOPS utilization, reaching 28 TFLOPS on Ampere
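
As a concrete illustration of the Tensor Core bullet, here is a minimal WMMA sketch, not this repo's kernel: one warp computes one 16×16 output tile, loading fp16 fragments directly from global memory and issuing one Tensor Core mma_sync per 16-wide K-step. It assumes row-major inputs with dimensions divisible by 16 and omits the shared-memory staging, vectorized loads, and double buffering listed above.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp per block: launch with grid(N/16, M/16), block(32).
__global__ void wmma_tile_gemm(const half* A, const half* B, float* C,
                               int M, int N, int K) {
    int tile_m = blockIdx.y * 16;  // top row of this warp's C tile
    int tile_n = blockIdx.x * 16;  // left column of this warp's C tile

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Warp-cooperative fragment loads, then one Tensor Core MMA
        wmma::load_matrix_sync(a_frag, A + tile_m * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_n, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + tile_m * N + tile_n, c_frag, N,
                            wmma::mem_row_major);
}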

Currently, the fastest single-file GEMM kernel in this repo.
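
The double-buffering and pipelining bullets combine into the pattern below: while the block computes on shared-memory stage k, asynchronous copies fetch stage k+1 into the other buffer, hiding global-memory latency behind the math. This is an illustrative fragment using CUDA's pipeline primitives with assumed tile and block sizes, not this repo's code.

#include <cuda_pipeline.h>

#define TILE_FLOATS 1024  // floats per stage: 4 per thread x 256 threads (assumed)

// Illustrative double-buffered consumer; launch as one block of 256 threads.
__global__ void double_buffered_sum(const float* __restrict__ g, int num_tiles,
                                    float* out) {
    __shared__ __align__(16) float smem[2][TILE_FLOATS];
    int tid = threadIdx.x;

    // Prefetch stage 0: one 16-byte (float4-sized) async copy per thread
    __pipeline_memcpy_async(&smem[0][tid * 4], &g[tid * 4], sizeof(float4));
    __pipeline_commit();

    float acc = 0.0f;
    for (int k = 0; k < num_tiles; ++k) {
        if (k + 1 < num_tiles) {
            // Kick off the copy for the NEXT stage into the other buffer
            __pipeline_memcpy_async(&smem[(k + 1) & 1][tid * 4],
                                    &g[(k + 1) * TILE_FLOATS + tid * 4],
                                    sizeof(float4));
            __pipeline_commit();
        }
        // Block until stage k has landed; at most the stage-(k+1) copy
        // stays in flight, overlapping with the compute below.
        __pipeline_wait_prior(k + 1 < num_tiles ? 1 : 0);
        __syncthreads();
        for (int i = 0; i < 4; ++i)
            acc += smem[k & 1][tid * 4 + i];  // stand-in for the real compute
        __syncthreads();  // all reads done before this buffer is reused
    }
    out[tid] = acc;
}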

flash_attention.cu — Flash Attention

A CUDA implementation of Flash Attention, the high-performance attention algorithm that minimizes memory traffic without sacrificing accuracy.

Accuracy has been verified against a naive attention kernel, over which it achieves a speedup of more than 7x.

Highlights

  • Implements the Flash Attention algorithm using tiled computation and online softmax
  • Uses shared memory to load Q, K, V tiles (size: BLOCK_SIZE × HEAD_DIM)
  • Processes attention in blocks to reduce memory I/O
  • Maintains running statistics (max m_i and sum l_i) for a numerically stable softmax
  • Rescales outputs incrementally as new blocks are processed (see the sketch after this list)
  • Work in progress: incorporating Tensor Cores to squeeze maximum TFLOPS out of the matrix multiplications
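
To make the running-statistics bullets concrete, here is a scalar sketch of the online-softmax update (illustrative only, not this repo's kernel): when a new block of scores arrives, the running max m is raised, the old denominator l and the accumulated unnormalized output are rescaled by exp(m_old - m_new), and the block's stabilized exponentials are folded in. The flat v_tile layout and the block_size/head_dim parameters are assumptions for the example.

// Online-softmax update for one output row (hypothetical sketch).
// m: running max; l: running denominator; acc[head_dim]: unnormalized output.
__device__ void online_softmax_update(float& m, float& l, float* acc,
                                      const float* score,   // [block_size]
                                      const float* v_tile,  // [block_size * head_dim]
                                      int block_size, int head_dim) {
    // 1. Raise the running max over this block's scores
    float m_new = m;
    for (int j = 0; j < block_size; ++j) m_new = fmaxf(m_new, score[j]);

    // 2. Rescale previously accumulated state to the new max
    float scale = __expf(m - m_new);
    l *= scale;
    for (int d = 0; d < head_dim; ++d) acc[d] *= scale;

    // 3. Fold in this block with numerically stable exponents
    for (int j = 0; j < block_size; ++j) {
        float p = __expf(score[j] - m_new);
        l += p;
        for (int d = 0; d < head_dim; ++d) acc[d] += p * v_tile[j * head_dim + d];
    }
    m = m_new;
    // After the final block, the caller emits out[d] = acc[d] / l.
}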

blas_gemm.cu — Classical CUDA Core GEMM

Heavily tuned non-Tensor-Core matrix multiply.

Inspired by a famous blog post on optimizing CUDA matmul kernels.

Highlights

  • Multi-level tiling (block → warp → thread)
  • Vectorized global → shared memory loads (float4 / float2)
  • Bank-conflict-free shared memory layouts
  • Aggressive unrolling + register blocking (see the sketch after this list)
  • FP32 & FP16 variants
  • Designed to approach peak non-Tensor-Core performance on Ampere
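
Two of these techniques in miniature, an illustrative sketch with assumed tile sizes (TM, TN), not this repo's code: a float4 load moves 128 bits per instruction (both pointers must be 16-byte aligned), and register blocking keeps a TM×TN accumulator tile per thread so every value fetched from shared memory feeds many FMAs.

#define TM 8  // per-thread tile rows (assumed)
#define TN 8  // per-thread tile cols (assumed)

// 128-bit vectorized copy: 4 consecutive floats in one load/store.
// Requires 16-byte-aligned source and destination pointers.
__device__ __forceinline__ void copy_vec4(float* dst, const float* src) {
    *reinterpret_cast<float4*>(dst) = *reinterpret_cast<const float4*>(src);
}

// Register blocking: TM values of A and TN values of B are read into
// registers once per k-step, then reused across TM*TN fused multiply-adds,
// raising arithmetic intensity relative to shared-memory traffic.
__device__ __forceinline__ void outer_product_fma(float acc[TM][TN],
                                                  const float a_reg[TM],
                                                  const float b_reg[TN]) {
    #pragma unroll
    for (int i = 0; i < TM; ++i)
        #pragma unroll
        for (int j = 0; j < TN; ++j)
            acc[i][j] += a_reg[i] * b_reg[j];
}

With TM = TN = 8, each k-step reads 16 floats from shared memory but performs 64 FMAs, which is what lets a CUDA-core GEMM approach peak FP32 throughput.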

Getting Started

To compile tc_matmul.cu for NVIDIA Ampere GPUs, run:

nvcc -O3 -lcublas -use_fast_math -arch=sm_86 tc_matmul.cu -o tc_matmul

To compile flash_attention.cu for NVIDIA Ampere GPUs, run:

nvcc -o flash_attn flash_attention.cu -std=c++11 -arch=sm_86

To compile blas_gemm.cu, a plain nvcc invocation suffices, for example:

nvcc -O3 -arch=sm_86 blas_gemm.cu -o blas_gemm

Other Files

sparse_gemm.cu holds my personal experimentation with cuSPARSELt, NVIDIA's library for sparse matrix multiplication.
