Episode 3: Just Fuse It — RMSNorm and Softmax

Overview

RMSNorm runs 65 times per Bielik forward pass; softmax runs 32 times. Individually they look cheap - but they share a bottleneck that has nothing to do with compute: the memory wall. This episode introduces kernel fusion as the primary optimization technique for memory-bound operations, builds fused Triton kernels for RMSNorm and softmax (with causal mask), and compares them against PyTorch's own fused implementations.

Key insight: Kernel fusion doesn't reduce computation - it reduces data movement.
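
A rough way to see this is to count global-memory transfers rather than FLOPs. The sketch below assumes an unfused RMSNorm is split into four separate kernels, each reading its inputs from and writing its outputs to global memory. That decomposition, the 2-byte element size, and the 512 x 4096 shape are illustrative assumptions, not measurements from Bielik or PyTorch.

```python
BYTES_PER_ELEMENT = 2  # assumption: fp16/bf16 activations

# (kernel name, full-tensor reads, full-tensor writes) for a hypothetical
# unfused RMSNorm. Per-row scalars and the weight vector are ignored
# because they are tiny next to the activation tensor.
UNFUSED_STEPS = [
    ("square",          1, 1),  # read x, write x*x
    ("mean over row",   1, 0),  # read x*x, write one scalar per row
    ("scale by 1/rms",  1, 1),  # read x, write x_hat
    ("multiply weight", 1, 1),  # read x_hat, write y
]
FUSED_STEPS = [("fused rmsnorm", 1, 1)]  # read x once, write y once


def traffic_bytes(steps, rows, hidden):
    """Bytes moved through global memory for the given kernel sequence."""
    n = rows * hidden
    return sum(reads + writes for _, reads, writes in steps) * n * BYTES_PER_ELEMENT


rows, hidden = 512, 4096  # hypothetical activation shape
unfused = traffic_bytes(UNFUSED_STEPS, rows, hidden)
fused = traffic_bytes(FUSED_STEPS, rows, hidden)
print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB, "
      f"ratio: {unfused / fused:.1f}x")
```

Under these assumptions the fused version moves 3.5x less data (8 MiB instead of 28 MiB) while performing exactly the same arithmetic.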

Topics Covered

The Memory Wall - why GPUs sit idle 98% of the time on element-wise ops: data starvation, not lack of cores
Kernel Fusion - combining sequential kernels into one, so intermediate data stays in fast registers/SRAM instead of bouncing through slow global memory
Fused Single-Pass RMSNorm Kernel
Softmax: Single-Pass with Causal Mask
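
For reference, the per-row math that the two fused kernels need to reproduce can be written in plain Python. This is a minimal sketch for checking outputs, not the Triton kernels themselves. The function names, the `eps` default, and the convention that row `i` may attend to columns `0..i` are assumptions made for illustration.

```python
import math


def rmsnorm_row(x, weight, eps=1e-6):
    """RMSNorm for one row: y_i = x_i / sqrt(mean(x^2) + eps) * w_i."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def causal_softmax_row(scores, row_idx):
    """Softmax over one row of attention scores with a causal mask.

    Assumed convention: row `row_idx` sees columns 0..row_idx; later
    columns are masked and receive probability 0.
    """
    visible = scores[: row_idx + 1]
    row_max = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - row_max) for s in visible]
    total = sum(exps)
    masked = [0.0] * (len(scores) - len(visible))
    return [e / total for e in exps] + masked


y = rmsnorm_row([3.0, 4.0], [1.0, 1.0], eps=0.0)
p = causal_softmax_row([1.0, 2.0, 3.0], row_idx=1)
print(y)  # the output row has a mean square of about 1
print(p)  # the last entry is masked to 0.0
```

A fused kernel does the same work per row, but keeps the row in registers/SRAM between the reduction and the final scaling instead of writing intermediates back to global memory.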