Episode 4: RoPE — Rotary Position Embeddings

Back to Series Overview | Previous: RMSNorm and Softmax | Next: Flash Attention



Overview

Without position encoding, a Transformer treats its input as an unordered set: "Dog bites man" and "Man bites dog" produce identical attention scores for the same token pairs. This episode covers Rotary Position Embeddings (RoPE) in depth: the math behind rotating Q and K vectors, where and how RoPE is applied in Bielik's architecture, and how to build an optimized single-pass Triton kernel that removes all transcendental operations from the hot path by reading from a precomputed cos/sin cache.
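To make the rotation and the cache idea concrete, here is a minimal pure-Python sketch. It is not the Triton kernel from this episode. The function names, the base of 10000, and the "rotate-half" pairing of element `i` with element `i + head_dim // 2` are assumptions chosen for illustration; the kernel linked in this repository may lay out pairs differently.

```python
import math

def build_cos_sin_cache(max_seq_len, head_dim, base=10000.0):
    """Precompute cos/sin tables, each of shape [max_seq_len][head_dim // 2].

    All transcendental calls (pow, cos, sin) happen here, once, up front.
    base=10000.0 is an assumed default, not a value taken from the episode.
    """
    half = head_dim // 2
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(pos * f) for f in inv_freq] for pos in range(max_seq_len)]
    sin = [[math.sin(pos * f) for f in inv_freq] for pos in range(max_seq_len)]
    return cos, sin

def apply_rope(vec, pos, cos, sin):
    """Rotate one head vector at position `pos`.

    The hot path only reads the cache and does multiplies and adds.
    Pairing (i, i + half) is an assumed convention for this sketch.
    """
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        c, s = cos[pos][i], sin[pos][i]
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s         # 2D rotation, first component
        out[i + half] = x1 * s + x2 * c  # 2D rotation, second component
    return out

def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))
```

Because each pair of elements is rotated by an angle proportional to its position, the score `dot(apply_rope(q, m, ...), apply_rope(k, n, ...))` depends only on the offset between `m` and `n`, not on their absolute values. That is the property that lets attention tell "Dog bites man" from "Man bites dog".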

Relevant Code

Kernels

Benchmarks

Results on my RTX 4060 Ti

Figure: RoPE (cached) bandwidth vs. number of heads

Figure: RoPE (cached) bandwidth vs. sequence length

