Episode 5: Flash Attention v2

Back to Series Overview | Previous: RoPE | Next: Feed-Forward Network


Overview

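Standard attention materializes the full N x N score matrix in HBM, so its memory footprint grows as O(N^2) with sequence length N. This makes the kernel memory-bound and can cause out-of-memory errors on long sequences. Flash Attention instead loads Q, K, and V in tiles that fit in fast on-chip SRAM and combines the per-tile results with an online softmax, so the full attention matrix is never written to slow memory.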
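To make the tiling and online softmax idea concrete, here is a minimal pure-Python sketch for a single query vector. It is not the kernel from this repository: the function names `attention_reference` and `attention_tiled` and the `block_size` parameter are hypothetical, chosen only for illustration, and the loop over tiles stands in for what a GPU kernel would do with tiles held in SRAM.

```python
import math

def attention_reference(q, keys, values):
    """Standard attention for one query: all N scores exist at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def attention_tiled(q, keys, values, block_size=2):
    """Same result, visiting keys/values one tile at a time (online softmax).

    Hypothetical illustration: only one tile of scores exists at any moment.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf  # largest score seen so far
    running_sum = 0.0        # softmax denominator, relative to running_max
    acc = [0.0] * dim        # unnormalized output, relative to running_max

    for start in range(0, len(keys), block_size):
        k_tile = keys[start:start + block_size]
        v_tile = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]

        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max

    # Normalize once at the end.
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any `block_size`. The rescaling by `correction` is what allows the softmax to be computed incrementally without ever holding all N scores.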

Relevant Code

Kernels

Benchmarks

Results on my RTX 4060 Ti

[Figure: Flash Attention TFLOPS vs. number of heads]

[Figure: Flash Attention TFLOPS vs. sequence length]
