Episode 2: Matmul - Heart of the Transformer

Back to Series Overview | Previous: Introduction | Next: RMSNorm and Softmax

Overview

Matrix multiplication is the most frequent operation in a transformer. Bielik executes hundreds of matmuls per forward pass, and together they account for roughly 80% of compute time. This episode builds a Triton matmul kernel from scratch and progressively optimizes it to match PyTorch/cuBLAS performance.

Topics Covered

Why matmul matters - the dominant operation in every transformer layer
Basic Triton matmul - block decomposition, pointer arithmetic with broadcasting, K-loop, boundary masking
GPU memory hierarchy - registers, shared memory (SRAM), L2 cache, global memory (HBM); speed differences up to 100x
Optimization 1: Grouped block ordering - processing blocks in super-groups of 8 for better L2 cache reuse
Optimization 2: Auto-tuning - @triton.autotune to automatically search optimal block sizes for different hardware
Optimization 3: Tensor Cores - switching from FP32 to BF16 to engage hardware matrix units; 16x throughput gain; FP32 accumulator for numerical stability
Optimization 4: Pipeline and occupancy - 5-stage pipeline to overlap loads with compute, 8 warps for better GPU occupancy
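
The block decomposition, K-loop, boundary masking, and grouped block ordering listed above can be sketched without a GPU. The snippet below is a plain-Python illustration, not the code from kernels/matmul/: the names `grouped_tile` and `tiled_matmul` are made up for this example, each loop iteration over `pid` stands in for one Triton program, and list filtering stands in for load/store masks. The tile sizes are deliberately tiny so the boundary cases are exercised.

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def grouped_tile(pid: int, num_pid_m: int, num_pid_n: int, group_size_m: int):
    """Map a linear program id to (tile_row, tile_col).

    Tiles are visited column by column inside super-groups of
    `group_size_m` tile rows, so consecutive programs touch the same
    few rows of A and the same columns of B.
    """
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    # The last group may contain fewer than group_size_m tile rows.
    rows_in_group = min(num_pid_m - first_pid_m, group_size_m)
    local = pid % num_pid_in_group
    return first_pid_m + local % rows_in_group, local // rows_in_group


def tiled_matmul(a, b, block_m=2, block_n=2, block_k=2, group_size_m=8):
    """Compute C = A @ B tile by tile (A is M x K, B is K x N, as nested lists)."""
    M, K, N = len(a), len(a[0]), len(b[0])
    c = [[0] * N for _ in range(M)]
    num_pid_m, num_pid_n = ceil_div(M, block_m), ceil_div(N, block_n)
    for pid in range(num_pid_m * num_pid_n):  # one "program" per output tile
        pid_m, pid_n = grouped_tile(pid, num_pid_m, num_pid_n, group_size_m)
        # Boundary masking: drop row/column indices that fall outside C.
        rows = [r for r in range(pid_m * block_m, (pid_m + 1) * block_m) if r < M]
        cols = [n for n in range(pid_n * block_n, (pid_n + 1) * block_n) if n < N]
        acc = {(r, n): 0 for r in rows for n in cols}  # per-tile accumulator
        for k0 in range(0, K, block_k):  # K-loop over slices of the shared dimension
            ks = [k for k in range(k0, k0 + block_k) if k < K]  # mask the K tail
            for r in rows:
                for n in cols:
                    acc[r, n] += sum(a[r][k] * b[k][n] for k in ks)
        for (r, n), value in acc.items():  # write the finished tile back
            c[r][n] = value
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 0], [0, 1], [1, 1]]
print(tiled_matmul(a, b))  # [[4, 5], [10, 11], [16, 17]]
```

With a 4 x 4 grid of tiles and a group size of 2, `grouped_tile` visits (0, 0), (1, 0), (0, 1), (1, 1) first, instead of sweeping the whole first tile row. That locality is what the grouped ordering relies on for L2 reuse.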

Relevant Code

Kernels

kernels/matmul/matmul_basic.py - naive implementation
kernels/matmul/matmul.py - Tensor Core optimized with auto-tuning

Benchmarks

benchmarks/matmul/benchmark_matmul.py - performance comparison script
benchmarks/matmul/verify_correctness.py - numerical correctness tests
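
As background for the numerical correctness tests, here is a small sketch of why a BF16 kernel keeps an FP32 accumulator. It is not taken from verify_correctness.py. It emulates BF16 and FP32 rounding with the standard library only, the helper names `to_bf16`, `to_fp32`, and `dot` are hypothetical, and it assumes finite inputs.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (ties to even); finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def dot(a, b, round_acc):
    """Dot product with BF16 inputs; `round_acc` sets the accumulator precision."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + to_bf16(x) * to_bf16(y))
    return acc


K = 4096
a = [1.0] * K
b = [0.01] * K

exact = K * to_bf16(1.0) * to_bf16(0.01)  # reference computed in double precision
fp32_acc = dot(a, b, to_fp32)             # BF16 inputs, FP32 accumulator
bf16_acc = dot(a, b, to_bf16)             # BF16 inputs, BF16 accumulator

print("exact:", exact)
print("fp32 accumulator:", fp32_acc)
print("bf16 accumulator:", bf16_acc)
```

The BF16 accumulator stops growing once each new product is smaller than half the spacing between adjacent BF16 values, so it ends far below the reference. The FP32 accumulator tracks the reference closely. A correctness check for a BF16 kernel still needs a tolerance, because the inputs themselves are rounded.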

Benchmarks

benchmarks/matmul/benchmark_matmul_guide.py

To run the benchmarks, use:

make benchmark-matmul
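
When comparing kernels, timings are often converted to throughput. A minimal sketch, assuming the usual convention of 2·M·N·K floating-point operations for an (M x K) by (K x N) matmul: the function name `matmul_tflops` and the example timing are illustrative and are not output from the benchmark scripts.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOP/s for one (m x k) @ (k x n) matmul."""
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    return flops / seconds / 1e12


# Hypothetical timing, not a measured result: 4096 x 4096 x 4096 in 10 ms.
print(f"{matmul_tflops(4096, 4096, 4096, 10e-3):.2f} TFLOP/s")  # 13.74 TFLOP/s
```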

Results on my RTX 4060 Ti
