Back to Series Overview | Previous: Matmul | Next: RoPE and Attention
RMSNorm runs 65 times per Bielik forward pass; softmax runs 32 times. Individually they look cheap - but they share a bottleneck that has nothing to do with compute: the memory wall. This episode introduces kernel fusion as the primary optimization technique for memory-bound operations, builds fused Triton kernels for RMSNorm and softmax (with causal mask), and compares them against PyTorch's own fused implementations.
Key insight: Kernel fusion doesn't reduce computation - it reduces data movement.
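As a rough illustration of that point, the sketch below counts global-memory bytes for RMSNorm under a simplified model. The five-kernel unfused decomposition, the function name `rmsnorm_traffic_bytes`, and the default 2-byte element size are assumptions made for illustration, not measurements from this series' benchmarks.

```python
def rmsnorm_traffic_bytes(rows, hidden, bytes_per_elem=2):
    """Estimate global-memory traffic for unfused vs fused RMSNorm.

    Illustrative model only: assumes every unfused step is its own kernel
    that reads its inputs from, and writes its output to, global memory.
    """
    n = rows * hidden * bytes_per_elem  # one full activation tensor
    r = rows * bytes_per_elem           # one per-row statistic
    w = hidden * bytes_per_elem         # the weight vector

    unfused = (
        (n + n)        # square: read x, write x*x
        + (n + r)      # mean: read x*x, write per-row mean
        + (r + r)      # rsqrt(mean + eps): read mean, write inv_rms
        + (n + r + n)  # normalize: read x and inv_rms, write x * inv_rms
        + (n + w + n)  # scale: read normalized x and weight, write output
    )
    # Fused: read x once, read weight, write output once.
    # Intermediates stay in registers/SRAM and never touch global memory.
    fused = n + w + n
    return unfused, fused


if __name__ == "__main__":
    unfused, fused = rmsnorm_traffic_bytes(rows=1, hidden=4)
    print(f"unfused: {unfused} bytes, fused: {fused} bytes")
```

Under this model the arithmetic is identical in both cases; only the number of round trips to global memory changes, which is where a memory-bound operation spends its time.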
- The Memory Wall - why GPUs sit idle 98% of the time on element-wise ops: data starvation, not lack of cores
- Kernel Fusion - what fusion means: combine sequential kernels -> data stays in fast registers/SRAM instead of bouncing through slow global memory
- Fused Single-Pass RMSNorm Kernel
- Softmax: Single-Pass with Causal Mask
- kernels/normalization/rms_norm_simple.py - single-pass fused RMSNorm
- kernels/attention/softmax_causal_simple.py - single-pass fused softmax + causal mask
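For checking a fused kernel's output, a plain-Python reference of the per-row math can help. The sketch below is not taken from the files listed above: the function names, the `eps` default, and the convention that row `i` may attend only to columns `0..i` are assumptions, and it uses the standard library so it runs without a GPU.

```python
import math


def rms_norm_row(x, weight, eps=1e-6):
    """Reference RMSNorm for one row: x / sqrt(mean(x^2) + eps) * weight.

    eps=1e-6 is an assumed default, not a value taken from the series.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def causal_softmax_row(scores, row_idx):
    """Reference softmax for one query row under a causal mask.

    Columns j > row_idx are masked (treated as -inf), so they get
    probability 0. The row maximum is subtracted before exponentiating
    for numerical stability.
    """
    visible = scores[: row_idx + 1]
    row_max = max(visible)
    exps = [math.exp(s - row_max) for s in visible]
    total = sum(exps)
    masked_tail = [0.0] * (len(scores) - len(visible))
    return [e / total for e in exps] + masked_tail


if __name__ == "__main__":
    print(rms_norm_row([3.0, 4.0], [1.0, 1.0]))
    print(causal_softmax_row([0.0, 0.0, 5.0], row_idx=1))
```

A fused GPU kernel would perform the same steps for each row while keeping the row in registers/SRAM; comparing its output against a reference like this within a small tolerance is one way to validate it.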
To run the benchmarks, use:
make benchmark-rms-norm
make benchmark-softmax





