Back to Series Overview | Previous: RoPE | Next: Feed-Forward Network
Standard attention materializes an O(N^2) score matrix in HBM, which becomes a memory bottleneck and causes out-of-memory errors on long sequences. Flash Attention instead processes the inputs in tiles small enough to fit in fast on-chip SRAM and uses online softmax to combine the partial results, so the full attention matrix is never written to slow memory.
kernels/attention/flash_attention_simple.py - Flash Attention kernel
benchmarks/attention/benchmark_flash_attention.py - Triton vs PyTorch naive vs PyTorch+compile; two sweeps (seq_len, num_heads)
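
To make the tiling and online softmax idea concrete, here is a minimal NumPy sketch for a single attention head. It is not the kernel from the repository: the function names `naive_attention` and `tiled_attention` and the `block` parameter are hypothetical, and for brevity it tiles only over keys and values, whereas a real kernel would also tile over queries. It keeps a running row maximum and a running softmax denominator, rescaling the accumulated output whenever the maximum changes, so only an N x block slice of scores exists at any time.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference: builds the full (N, N) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(scores)
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=16):
    """Illustrative sketch: same result, one key/value tile at a time."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))  # unnormalized output accumulator
    m = np.full(n, -np.inf)          # running row-wise max of the scores
    l = np.zeros(n)                  # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = (q @ kb.T) * scale               # (n, block) tile of scores
        m_new = np.maximum(m, s.max(axis=1))
        alpha = np.exp(m - m_new)            # rescales the old accumulators
        p = np.exp(s - m_new[:, None])       # tile probabilities, unnormalized
        l = alpha * l + p.sum(axis=1)
        out = alpha[:, None] * out + p @ vb
        m = m_new
    return out / l[:, None]                  # normalize once at the end
```

Comparing the two functions with `np.allclose` on random inputs is a quick way to check that the rescaling step is right, including when the sequence length is not a multiple of `block`.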

