Back to Series Overview | Previous: Introduction | Next: RMSNorm and Softmax
Matrix multiplication is the single most frequent operation in a transformer - Bielik executes hundreds of matmuls per forward pass, accounting for roughly 80% of compute time. This episode builds a Triton matmul kernel from scratch and progressively optimizes it to match PyTorch/cuBLAS performance.
- Why matmul matters - the dominant operation in every transformer layer
- Basic Triton matmul - block decomposition, pointer arithmetic with broadcasting, K-loop, boundary masking
- GPU memory hierarchy - registers, shared memory (SRAM), L2 cache, global memory (HBM); speed differences up to 100x
- Optimization 1: Grouped block ordering - processing blocks in super-groups of 8 for better L2 cache reuse
- Optimization 2: Auto-tuning - `@triton.autotune` to automatically search optimal block sizes for different hardware
- Optimization 3: Tensor Cores - switching from FP32 to BF16 to engage hardware matrix units; 16x throughput gain; FP32 accumulator for numerical stability
- Optimization 4: Pipeline and occupancy - 5-stage pipeline to overlap loads with compute, 8 warps for better GPU occupancy
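
The block decomposition, K-loop, and boundary masking listed above can be illustrated without a GPU. The following is a minimal plain-Python emulation, not the kernel from this repository: the function names `cdiv` and `tiled_matmul` and the tiny block sizes are assumptions chosen for readability. Each `(pid_m, pid_n)` pair stands in for one kernel program instance, and the conditional loads stand in for masked loads.

```python
def cdiv(a, b):
    """Ceiling division: number of blocks needed to cover `a` elements."""
    return (a + b - 1) // b


def tiled_matmul(A, B, block_m=2, block_n=2, block_k=2):
    """Compute C = A @ B one output tile at a time (hypothetical helper)."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # Each (pid_m, pid_n) pair plays the role of one kernel program instance.
    for pid_m in range(cdiv(M, block_m)):
        for pid_n in range(cdiv(N, block_n)):
            acc = [[0.0] * block_n for _ in range(block_m)]
            # K-loop: walk the shared dimension in chunks of block_k.
            for k0 in range(0, K, block_k):
                for i in range(block_m):
                    for j in range(block_n):
                        m, n = pid_m * block_m + i, pid_n * block_n + j
                        for k in range(k0, k0 + block_k):
                            # Masked loads: out-of-range elements read as 0.0.
                            a = A[m][k] if (m < M and k < K) else 0.0
                            b = B[k][n] if (k < K and n < N) else 0.0
                            acc[i][j] += a * b
            # Masked store: only write elements that lie inside C.
            for i in range(block_m):
                for j in range(block_n):
                    m, n = pid_m * block_m + i, pid_n * block_n + j
                    if m < M and n < N:
                        C[m][n] = acc[i][j]
    return C


# Shapes that are not multiples of the block size exercise the masks.
A = [[float(3 * r + c) for c in range(5)] for r in range(3)]  # 3x5
B = [[float(r - 2 * c) for c in range(4)] for r in range(5)]  # 5x4
C = tiled_matmul(A, B)
```

In a real Triton kernel the inner loops over `i`, `j`, and `k` collapse into block-level loads and a single block dot product per K step; the emulation only shows which elements each program touches and why the masks are needed at the edges.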
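
Grouped block ordering changes only how a linear program id maps to an output tile. The sketch below follows the index arithmetic commonly used for this technique and compares it with plain row-major ordering. The function names, the 16x16 tile grid, and the "wave" of 64 concurrently running programs are illustrative assumptions, not measurements from this repository.

```python
def row_major_tile(pid, num_pid_m, num_pid_n):
    """Naive ordering: consecutive programs sweep along one row of tiles."""
    return pid // num_pid_n, pid % num_pid_n


def grouped_tile(pid, num_pid_m, num_pid_n, group_size_m=8):
    """Grouped ordering: programs fill a band of `group_size_m` tile rows
    column by column before moving on to the next band."""
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    # The last band may be shorter when num_pid_m is not a multiple of 8.
    rows_in_group = min(num_pid_m - first_pid_m, group_size_m)
    pid_in_group = pid % num_pid_in_group
    pid_m = first_pid_m + pid_in_group % rows_in_group
    pid_n = pid_in_group // rows_in_group
    return pid_m, pid_n


def blocks_touched(tiles):
    """Distinct row-blocks of A plus column-blocks of B needed by the tiles."""
    return len({m for m, _ in tiles}) + len({n for _, n in tiles})


num_pid_m = num_pid_n = 16   # hypothetical 16x16 grid of output tiles
wave = range(64)             # hypothetical set of programs running together
row_major = [row_major_tile(p, num_pid_m, num_pid_n) for p in wave]
grouped = [grouped_tile(p, num_pid_m, num_pid_n) for p in wave]
print(blocks_touched(row_major))  # 20 (4 row-blocks of A, 16 column-blocks of B)
print(blocks_touched(grouped))    # 16 (8 row-blocks of A, 8 column-blocks of B)
```

Fewer distinct input blocks per wave means more of them can be served from L2 instead of being fetched again from global memory. The size of the benefit depends on the grid shape and on how many programs actually run concurrently.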
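
The reason for keeping an FP32 accumulator with BF16 inputs can be shown with a small standard-library emulation. bfloat16 keeps 8 bits of significand precision, so a running sum stored in BF16 stops growing once the addend falls below half the spacing between representable values. The helpers `to_bf16`, `to_f32`, and `dot` are hypothetical names for this sketch; they emulate rounding on the CPU for finite values and do not use Tensor Cores.

```python
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a finite float to bfloat16 (round-to-nearest-even).

    bfloat16 is the upper 16 bits of a float32, so rounding is done on the
    float32 bit pattern and the lower 16 bits are cleared.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(a, b, round_acc):
    """Dot product with BF16 inputs and an accumulator rounded by `round_acc`."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_bf16(x) * to_bf16(y)
        acc = round_acc(acc + prod)
    return acc


a = [1.0] * 4096
b = [1.0] * 4096
bf16_result = dot(a, b, to_bf16)  # accumulator kept in BF16
fp32_result = dot(a, b, to_f32)   # accumulator kept in FP32
print(bf16_result)  # 256.0: 256 + 1 rounds back to 256 in BF16
print(fp32_result)  # 4096.0
```

With a BF16 accumulator the sum stalls at 256, while the FP32 accumulator returns the exact result. The inputs here are deliberately simple; real activations and weights show the same effect less dramatically.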

Files:
- `kernels/matmul/matmul_basic.py` - naive implementation
- `kernels/matmul/matmul.py` - Tensor Core optimized with auto-tuning
- `benchmarks/matmul/benchmark_matmul.py` - performance comparison script
- `benchmarks/matmul/verify_correctness.py` - numerical correctness tests

To run benchmarks use:
    make benchmark-matmul

