Episode 3: Just Fuse It — RMSNorm and Softmax

Back to Series Overview | Previous: Matmul | Next: RoPE and Attention


Episode 3: Fused RMSNorm & Softmax

Overview

RMSNorm runs 65 times per Bielik forward pass; softmax runs 32 times. Individually they look cheap, but they share a bottleneck that has nothing to do with compute: the memory wall. This episode introduces kernel fusion as the primary optimization technique for memory-bound operations, builds fused Triton kernels for RMSNorm and for softmax with a causal mask, and compares them against PyTorch's own fused implementations.

Key insight: Kernel fusion doesn't reduce computation - it reduces data movement.
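
To make the key insight concrete, here is a rough back-of-the-envelope sketch of global-memory traffic for RMSNorm. The four-step unfused decomposition, the function name `rmsnorm_traffic_bytes`, and the example shape are assumptions chosen for illustration; they are not measurements and not the Bielik configuration.

```python
def rmsnorm_traffic_bytes(rows, hidden, dtype_bytes=2):
    """Rough global-memory traffic for RMSNorm, unfused vs. fused.

    Assumed unfused decomposition (illustrative only):
      1. sq  = x * x          -> read x,  write sq
      2. ms  = mean(sq)       -> read sq  (per-row scalar output ignored)
      3. y   = x * rsqrt(ms)  -> read x,  write y
      4. out = y * weight     -> read y,  write out
    A fused kernel reads x once and writes out once.
    The weight vector and per-row scalars are small and ignored here.
    """
    n = rows * hidden * dtype_bytes  # bytes in one full-size tensor
    unfused = 7 * n                  # 4 full reads + 3 full writes
    fused = 2 * n                    # 1 full read + 1 full write
    return unfused, fused


# Hypothetical shape: 2048 rows, hidden size 4096, 2-byte elements.
unfused, fused = rmsnorm_traffic_bytes(rows=2048, hidden=4096)
print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB")
```

Under these assumptions the arithmetic is identical in both cases; only the number of round trips to global memory changes.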


Topics Covered

  • The Memory Wall - why GPUs sit idle 98% of the time on element-wise ops: data starvation, not lack of cores

  • Kernel Fusion - what fusion means: combine sequential kernels -> data stays in fast registers/SRAM instead of bouncing through slow global memory

  • Fused Single-Pass RMSNorm Kernel

  • Softmax: Single-Pass with Causal Mask
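
As a reference for what the two fused kernels compute per row, here is a minimal pure-Python sketch. It is not the Triton code from this episode: the function names, the `eps` default, and the convention that row `i` may see columns `0..i` are assumptions for illustration. Each function touches its input row once and keeps all intermediates local, which is the property a fused kernel preserves in registers/SRAM.

```python
import math


def rmsnorm_row(x, weight, eps=1e-6):
    """RMSNorm for one row: x / sqrt(mean(x^2) + eps) * weight."""
    # The row is "loaded" once; every intermediate stays local,
    # standing in for registers/SRAM in a fused kernel.
    mean_sq = sum(v * v for v in x) / len(x)
    rstd = 1.0 / math.sqrt(mean_sq + eps)
    return [v * rstd * w for v, w in zip(x, weight)]


def causal_softmax_row(scores, row_idx):
    """Softmax over one row of scores with a causal mask applied.

    Assumes position row_idx may attend only to columns 0..row_idx.
    """
    visible = scores[: row_idx + 1]
    row_max = max(visible)                      # subtract max for stability
    exps = [math.exp(s - row_max) for s in visible]
    total = sum(exps)
    masked = len(scores) - len(visible)         # future positions
    return [e / total for e in exps] + [0.0] * masked


normed = rmsnorm_row([3.0, 4.0], [1.0, 1.0])
probs = causal_softmax_row([1.0, 2.0, 3.0], row_idx=1)
print(normed, probs)
```

Functions like these are also handy as a correctness oracle when checking a GPU kernel on small inputs.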


Relevant Code

Kernels


Benchmarks

To run the RMSNorm benchmarks, use:

make benchmark-rms-norm

To run the softmax benchmarks, use:

make benchmark-softmax

Results on my RTX 4060 Ti

RMSNorm

Figure: RMSNorm bandwidth vs. hidden size

Figure: RMSNorm bandwidth vs. number of rows

Figure: RMSNorm summary for the Bielik config (TFLOPS)

Softmax

Figure: Causal softmax bandwidth vs. number of heads

Figure: Causal softmax bandwidth vs. sequence length

Figure: Causal softmax summary for the Bielik config (TFLOPS)

