Episode 4: RoPE — Rotary Position Embeddings

Back to Series Overview | Previous: RMSNorm and Softmax | Next: Flash Attention

Overview

Without position encoding, a Transformer treats its input as an unordered set: "Dog bites man" and "Man bites dog" produce identical attention scores for the same token pairs. This episode covers Rotary Position Embeddings (RoPE): the math behind rotating the Q and K vectors, where and how RoPE is applied in Bielik's architecture, and how to build an optimized single-pass Triton kernel that removes all transcendental operations (sin, cos) from the hot path by reading a precomputed cos/sin cache.
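To make the rotation idea concrete, here is a minimal pure-Python sketch, not the Triton kernel from this repository. It assumes the adjacent-pair convention (elements `2i` and `2i+1` rotate together) and a frequency base of 10000; the actual kernel may pair elements differently. The names `build_cache_sketch` and `rotate_sketch` are hypothetical and chosen only for this illustration.

```python
import math

def build_cache_sketch(max_pos, head_dim, base=10000.0):
    """Precompute cos/sin tables: one row per position, one column per pair.

    Assumption: theta_i = base ** (-2i / head_dim), the common RoPE choice.
    """
    half = head_dim // 2
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_pos)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_pos)]
    return cos, sin

def rotate_sketch(vec, pos, cos, sin):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by the angle pos * theta_i.

    Only table lookups, multiplies, and adds happen here; sin/cos were
    evaluated once when the cache was built.
    """
    out = []
    for i in range(len(vec) // 2):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = cos[pos][i], sin[pos][i]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

cos, sin = build_cache_sketch(max_pos=32, head_dim=4)
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (2) at two different absolute positions.
score_near = dot(rotate_sketch(q, 3, cos, sin), rotate_sketch(k, 1, cos, sin))
score_far = dot(rotate_sketch(q, 12, cos, sin), rotate_sketch(k, 10, cos, sin))
```

The two scores agree up to floating-point error, because the dot product of a rotated query and a rotated key depends only on the difference between their positions. That is the property that lets attention see relative order, and since the rotation step touches only cached values, it also suggests why a cache can keep sin and cos out of the per-token path.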

Relevant Code

Kernels

kernels/attention/rope_cached.py - optimized single-pass cached RoPE kernel (rope_cached_kernel, build_rope_cache, apply_rope_cached_)

Benchmarks

benchmarks/attention/benchmark_rope_cached.py - Triton vs PyTorch naive vs PyTorch+compile; two sweeps (seq_len, num_heads)

Results on my RTX 4060 Ti
