Episode 2: Matmul - Heart of the Transformer

Back to Series Overview | Previous: Introduction | Next: RMSNorm and Softmax



Overview

Matrix multiplication is the single most frequent operation in a transformer - Bielik executes hundreds of matmuls per forward pass, accounting for roughly 80% of compute time. This episode builds a Triton matmul kernel from scratch and progressively optimizes it to match PyTorch/cuBLAS performance.

Topics Covered

  • Why matmul matters - the dominant operation in every transformer layer
  • Basic Triton matmul - block decomposition, pointer arithmetic with broadcasting, K-loop, boundary masking
  • GPU memory hierarchy - registers, shared memory (SRAM), L2 cache, global memory (HBM); access speeds differ by up to 100x between the fastest and slowest levels
  • Optimization 1: Grouped block ordering - processing blocks in super-groups of 8 for better L2 cache reuse
  • Optimization 2: Auto-tuning - @triton.autotune to automatically search optimal block sizes for different hardware
  • Optimization 3: Tensor Cores - switching inputs from FP32 to BF16 to engage the hardware matrix units (16x throughput gain), while keeping an FP32 accumulator for numerical stability
  • Optimization 4: Pipeline and occupancy - 5-stage pipeline to overlap loads with compute, 8 warps for better GPU occupancy
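The block decomposition, K-loop, boundary handling, and grouped block ordering listed above can be modeled on the CPU in plain Python. The sketch below is not the repository's Triton kernel. The names `cdiv`, `grouped_tile`, and `tiled_matmul` are hypothetical, the index arithmetic follows the grouped ordering commonly used in Triton matmul examples, and clamping the ranges with `min` stands in for the boundary masks a real kernel would apply.

```python
def cdiv(a, b):
    """Ceiling division: number of blocks of size b needed to cover a."""
    return (a + b - 1) // b


def grouped_tile(pid, num_pid_m, num_pid_n, group_size_m=8):
    """Map a linear program id to an output tile (pid_m, pid_n).

    Tiles are visited in groups of `group_size_m` tile rows, walking
    down the rows of a group before moving to the next tile column.
    """
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    # The last group may contain fewer rows than group_size_m.
    rows_in_group = min(num_pid_m - first_pid_m, group_size_m)
    pid_in_group = pid % num_pid_in_group
    pid_m = first_pid_m + pid_in_group % rows_in_group
    pid_n = pid_in_group // rows_in_group
    return pid_m, pid_n


def tiled_matmul(a, b, block_m=2, block_n=2, block_k=2, group_size_m=2):
    """Compute a @ b for lists of lists, one output tile per 'program'."""
    m, k, n = len(a), len(a[0]), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    num_pid_m, num_pid_n = cdiv(m, block_m), cdiv(n, block_n)
    for pid in range(num_pid_m * num_pid_n):
        pid_m, pid_n = grouped_tile(pid, num_pid_m, num_pid_n, group_size_m)
        # Clamp tile extents to the matrix edge (the role of masking).
        rows = range(pid_m * block_m, min((pid_m + 1) * block_m, m))
        cols = range(pid_n * block_n, min((pid_n + 1) * block_n, n))
        acc = {(i, j): 0.0 for i in rows for j in cols}
        # K-loop: accumulate partial products one K-block at a time.
        for k0 in range(0, k, block_k):
            ks = range(k0, min(k0 + block_k, k))
            for i in rows:
                for j in cols:
                    acc[i, j] += sum(a[i][kk] * b[kk][j] for kk in ks)
        # Write the finished tile back to the output matrix.
        for (i, j), value in acc.items():
            c[i][j] = value
    return c


# First four program ids on a 4x4 tile grid with groups of 2 tile rows.
order = [grouped_tile(pid, 4, 4, group_size_m=2) for pid in range(4)]
# order == [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Consecutive program ids stay within a narrow band of tile rows, so they reuse the same rows of the left operand, which is the intuition behind the improved L2 cache reuse.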

Relevant Code

Kernels

Benchmarks

Benchmarks

To run benchmarks use:

make benchmark-matmul

Results on my RTX 4060 Ti

Figure: TFLOPS vs. matrix size

Figure: TFLOPS vs. matrix size (summary)
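For reading the TFLOPS axis, here is a minimal sketch of how a throughput figure is usually derived from a measured runtime. It assumes the standard count of 2·M·N·K floating-point operations for an (M×K) by (K×N) matmul (one multiply and one add per term). The helper name `matmul_tflops` is hypothetical and may differ from what the repository's benchmark script does.

```python
def matmul_tflops(m, n, k, seconds):
    """Achieved TFLOPS for an (m x k) @ (k x n) matmul that took `seconds`."""
    flops = 2 * m * n * k  # one multiply + one add per (i, j, kk) term
    return flops / seconds / 1e12


# Example: a 4096 x 4096 x 4096 matmul measured at 10 ms.
example = matmul_tflops(4096, 4096, 4096, 0.010)
print(f"{example:.2f} TFLOPS")
```

With these example numbers the result is roughly 13.7 TFLOPS. When timing a GPU kernel, the measured interval needs to include device synchronization; otherwise only the launch overhead is captured.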
