Context
The Llama 3.1 self-attention builder takes RoPE embeddings as an argument, allowing us to build RoPE a single time across all layers. However, the corresponding components for Llama2 and Llama3 do not do this -- they instead construct RoPE for every single layer.
Should we use a single global RoPE shared across layers, or build one RoPE per layer? Either way, we should standardize this across the model builders.
Originally posted by @ebsmothers in #2282 (comment)
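For illustration, here is a minimal sketch of the two construction patterns being compared. The `RoPE`, `SelfAttention`, and builder names below are hypothetical placeholders, not the actual torchtune modules; the point is only the difference between building RoPE once and passing it into every layer versus constructing a fresh RoPE inside each layer.

```python
import torch
from torch import nn


class RoPE(nn.Module):
    """Placeholder rotary embedding module (stands in for the real RoPE implementation)."""

    def __init__(self, head_dim: int, base: int = 10_000):
        super().__init__()
        theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("theta", theta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A real implementation would rotate q/k here; this placeholder is a no-op.
        return x


class SelfAttention(nn.Module):
    """Toy self-attention block that receives an already-built RoPE module."""

    def __init__(self, embed_dim: int, num_heads: int, pos_embeddings: nn.Module):
        super().__init__()
        self.pos_embeddings = pos_embeddings
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pos_embeddings(x)
        out, _ = self.attn(x, x, x)
        return out


def build_layers_shared_rope(num_layers: int, embed_dim: int, num_heads: int) -> nn.ModuleList:
    # Pattern used by the Llama 3.1 builder: build RoPE once, share it across all layers.
    rope = RoPE(head_dim=embed_dim // num_heads)
    return nn.ModuleList(
        [SelfAttention(embed_dim, num_heads, pos_embeddings=rope) for _ in range(num_layers)]
    )


def build_layers_per_layer_rope(num_layers: int, embed_dim: int, num_heads: int) -> nn.ModuleList:
    # Pattern currently used by the Llama2/Llama3 builders: a fresh RoPE per layer.
    return nn.ModuleList(
        [
            SelfAttention(embed_dim, num_heads, pos_embeddings=RoPE(head_dim=embed_dim // num_heads))
            for _ in range(num_layers)
        ]
    )
```

The shared variant avoids allocating duplicate RoPE buffers per layer; the per-layer variant keeps each layer self-contained but repeats the same construction work for every layer.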