Back to Series Overview | Previous: RMSNorm and Softmax | Next: Flash Attention
Without position encoding, a Transformer treats its input as an unordered set - "Dog bites man" and "Man bites dog" produce identical attention scores for the same token pairs. This episode dives deep into Rotary Position Embeddings (RoPE): the math behind rotating Q and K vectors, where and how RoPE is applied in Bielik's architecture, and how to build an optimized single-pass Triton kernel that eliminates all transcendental operations from the hot path using a precomputed cos/sin cache.
kernels/attention/rope_cached.py- optimised single-pass cached RoPE kernel (rope_cached_kernel,build_rope_cache,apply_rope_cached_)
benchmarks/attention/benchmark_rope_cached.py- Triton vs PyTorch naive vs PyTorch+compile; two sweeps (seq_len, num_heads)
Back to Series Overview | Previous: RMSNorm and Softmax | Next: Flash Attention

