Episode 5: Flash Attention v2

Back to Series Overview | Previous: RoPE | Next: Feed-Forward Network

Overview

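Standard attention materializes the full N x N score matrix in HBM, so memory use grows as O(N^2) in the sequence length N. On long sequences this becomes a memory-bandwidth bottleneck and eventually causes out-of-memory errors. Flash Attention instead processes Q, K, and V in tiles small enough to fit in fast on-chip SRAM and uses an online softmax to accumulate the output block by block, so the full attention matrix is never written to slow memory.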
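To make the tiling and online softmax idea concrete, here is a minimal NumPy sketch for a single head. It is not the Triton kernel from this episode; the function names, block sizes, and the absence of causal masking are assumptions made for illustration. It keeps a running row maximum `m`, a running softmax denominator `l`, and an unnormalized accumulator `acc`, and it divides by `l` only once at the end.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference version: materializes the full (N, N) score matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    s = (q @ k.T) * scale
    s = s - s.max(axis=-1, keepdims=True)  # subtract row max for stability
    p = np.exp(s)
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block_q=16, block_k=16):
    """Tiled attention with online softmax (illustrative, single head, no mask).

    Only a (block_q, block_k) tile of scores exists at any time.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty((n, v.shape[-1]))
    for qs in range(0, n, block_q):
        q_blk = q[qs:qs + block_q]
        rows = q_blk.shape[0]
        m = np.full(rows, -np.inf)           # running row max
        l = np.zeros(rows)                   # running softmax denominator
        acc = np.zeros((rows, v.shape[-1]))  # unnormalized output accumulator
        for ks in range(0, k.shape[0], block_k):
            s = (q_blk @ k[ks:ks + block_k].T) * scale
            m_new = np.maximum(m, s.max(axis=-1))
            alpha = np.exp(m - m_new)        # rescales the old partial sums
            p = np.exp(s - m_new[:, None])
            l = alpha * l + p.sum(axis=-1)
            acc = alpha[:, None] * acc + p @ v[ks:ks + block_k]
            m = m_new
        out[qs:qs + block_q] = acc / l[:, None]  # normalize once per query block
    return out
```

Up to floating-point rounding, `tiled_attention(q, k, v)` should agree with `naive_attention(q, k, v)` for any block sizes, including ones that do not divide N evenly. A real kernel would map the outer loop to parallel program instances and keep `m`, `l`, and `acc` in registers or SRAM.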

Relevant Code

Kernels

kernels/attention/flash_attention_simple.py - Flash Attention kernel

Benchmarks

benchmarks/attention/benchmark_flash_attention.py - compares the Triton kernel against naive PyTorch and PyTorch with torch.compile; two sweeps (over seq_len and over num_heads)

Results on my RTX 4060 Ti

