Commit 71e8fac

add docs to week 1 day 2
Signed-off-by: Alex Chi <[email protected]>
1 parent 7f7527c commit 71e8fac

11 files changed: +154 -138 lines

book/src/SUMMARY.md

+1 -1
@@ -7,7 +7,7 @@
 
 - [Week 1: From Matmul to Text](./week1-overview.md)
 - [Attention and Multi-Head Attention](./week1-01-attention.md)
-- [Positional Embeddings and RoPE]()
+- [Positional Encodings and RoPE](./week1-02-positional-encodings.md)
 - [Grouped/Multi Query Attention]()
 - [Multilayer Perceptron Layer and Transformer]()
 - [Wiring the Qwen2 Model]()

book/src/week1-01-attention.md

+15 -1
@@ -22,6 +22,13 @@ we will pass a tensor of the shape `N.. x 1024 x 512` to the attention layer.
 
 ## Task 1: Implement `scaled_dot_product_attention`
 
+In this task, we will implement the scaled dot product attention function. You can test your implementation by running the following command:
+
+```
+poetry run pytest tests -k week_1_day_1_task_1 -v
+```
+
+
 **📚 Readings**
 
 * [Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/)
@@ -51,6 +58,7 @@ K: 1 x H x L x D
 V: 1 x H x L x D
 Q: 1 x H x L x D
 output: 1 x H x L x D
+mask: 1 x H x L x L
 ```
 
 .. though the attention layer only cares about the last two dimensions. The test case will test any shape of the batching dimension.
@@ -64,6 +72,12 @@ poetry run pytest tests -k test_attention_with_mask
 
 ## Task 2: Implement `MultiHeadAttention`
 
+In this task, we will implement the multi-head attention layer. You will need to modify the following file:
+
+```
+src/tiny_llm/attention.py
+```
+
 **📚 Readings**
 
 * [Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/)
@@ -100,7 +114,7 @@ W_o: (H x D) x E
 At the end of the day, you should be able to pass the following tests:
 
 ```
-poetry run pytest tests -k test_multi_head_attention
+poetry run pytest tests -k week_1_day_1_task_2 -v
 ```
 
 {{#include copyright.md}}
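
The shape conventions above are easier to follow in code. Below is a minimal numpy sketch of scaled dot product attention and of the head split in multi-head attention. It is only an illustration, not the reference solution: it assumes the projection weights act as `x @ w.T`, which should be checked against the course's `linear` helper in the starter code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, scale=None, mask=None):
    # q, k, v: (..., L, D); mask (additive): broadcastable to (..., L, L)
    d = q.shape[-1]
    scale = scale if scale is not None else 1.0 / np.sqrt(d)
    scores = (q @ np.swapaxes(k, -1, -2)) * scale  # (..., L, L)
    if mask is not None:
        scores = scores + mask
    return softmax(scores, axis=-1) @ v            # (..., L, D)

def multi_head_attention(x_q, x_k, x_v, wq, wk, wv, wo, num_heads, mask=None):
    # x_*: (N, L, E); w*: (E, E) with E = num_heads * head_dim, applied as x @ w.T
    N, L, E = x_q.shape
    D = E // num_heads

    def split_heads(x, w):
        # (N, L, E) -> (N, L, H, D) -> (N, H, L, D)
        return (x @ w.T).reshape(N, L, num_heads, D).transpose(0, 2, 1, 3)

    q, k, v = split_heads(x_q, wq), split_heads(x_k, wk), split_heads(x_v, wv)
    out = scaled_dot_product_attention(q, k, v, scale=1.0 / np.sqrt(D), mask=mask)
    out = out.transpose(0, 2, 1, 3).reshape(N, L, E)  # merge heads back into E
    return out @ wo.T
```

The reference implementation operates on mlx arrays instead of numpy, but the shape bookkeeping is the same.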

book/src/week1-02-positional-embeddings.md

-1
This file was deleted.

book/src/week1-02-positional-encodings.md

+80
@@ -0,0 +1,80 @@
+# Week 1 Day 2: Positional Encodings and RoPE
+
+In day 2, we will implement the positional embedding used in the Qwen2 model: Rotary Positional Encoding. In a transformer
+model, we need a way to embed the information of the position of a token into the input of the attention layers. In Qwen2,
+the positional embedding is applied within the multi-head attention layer on the query and key vectors.
+
+**📚 Readings**
+
+- [You could have designed state of the art positional encoding](https://huggingface.co/blog/designing-positional-encoding)
+- [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864)
+
+## Task 1: Implement Rotary Positional Encoding "RoPE"
+
+You will need to modify the following file:
+
+```
+src/tiny_llm/positional_encoding.py
+```
+
+In traditional RoPE (as described in the readings), the positional encoding is applied to each head of the query and key vectors.
+You can pre-compute the frequencies when initializing the `RoPE` class.
+
+If `offset` is not provided, the positional encoding is applied to the entire sequence: the 0th frequency is applied to the
+0th token, up to the (L-1)-th token. Otherwise, the positional encoding is applied to the sequence according to the
+offset slice. If the offset slice is 5..10, the sequence length provided to the layer would be 5, and the 0th token
+will be applied with the 5th frequency.
+
+```
+x: (N, L, H, D)
+cos/sin_freqs: (MAX_SEQ_LEN, D // 2)
+```
+
+In the traditional form of RoPE, each head's `D` dimensions are viewed as consecutive complex pairs. That is to
+say, if D = 8, then x[0] and x[1] are a pair, x[2] and x[3] are another pair, and so on. A pair gets the same frequency
+from `cos/sin_freqs`.
+
+```
+output[0] = x[0] * cos_freqs[0] + x[1] * sin_freqs[0]
+output[1] = x[0] * -sin_freqs[0] + x[1] * cos_freqs[0]
+...and so on
+```
+
+You can do this by reshaping `x` to (N, L, H, D // 2, 2) and then applying the above formula to each pair.
+
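
To make the pairing concrete, here is a minimal numpy sketch of the traditional form. It follows the sign convention written above (reference implementations differ on which component carries the minus sign) and assumes `cos_freqs`/`sin_freqs` are precomputed `(MAX_SEQ_LEN, D // 2)` tables; the function name and the integer `offset` handling are illustrative, not the required API.

```python
import numpy as np

def rope_traditional(x, cos_freqs, sin_freqs, offset=0):
    # x: (N, L, H, D); cos/sin_freqs: (MAX_SEQ_LEN, D // 2)
    N, L, H, D = x.shape
    cos = cos_freqs[offset : offset + L][None, :, None, :]  # (1, L, 1, D // 2)
    sin = sin_freqs[offset : offset + L][None, :, None, :]

    pairs = x.reshape(N, L, H, D // 2, 2)   # consecutive (even, odd) pairs
    x0, x1 = pairs[..., 0], pairs[..., 1]

    # the formula from the text above, applied to every pair at once
    out0 = x0 * cos + x1 * sin
    out1 = -x0 * sin + x1 * cos

    return np.stack([out0, out1], axis=-1).reshape(N, L, H, D)
```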
+**📚 Readings**
+
+- [PyTorch RotaryPositionalEmbeddings API](https://pytorch.org/torchtune/stable/generated/torchtune.modules.RotaryPositionalEmbeddings.html)
+- [MLX Implementation of RoPE before the custom metal kernel implementation](https://github.com/ml-explore/mlx/pull/676/files)
+
+You can test your implementation by running the following command:
+
+```
+poetry run pytest tests -k week_1_day_2_task_1 -v
+```
+
+## Task 2: Implement `RoPE` in the non-traditional form
+
+The Qwen2 model uses a non-traditional form of RoPE. In this form, the head embedding dimension is split into two halves,
+and element `i` of the first half is paired with element `i` of the second half (rather than consecutive elements), with one frequency per pair.
+
+```
+output[0] = x[0] * cos_freqs[0] + x[HALF_DIM] * sin_freqs[0]
+output[HALF_DIM] = x[0] * -sin_freqs[0] + x[HALF_DIM] * cos_freqs[0]
+output[1] = x[1] * cos_freqs[1] + x[HALF_DIM + 1] * sin_freqs[1]
+output[HALF_DIM + 1] = x[1] * -sin_freqs[1] + x[HALF_DIM + 1] * cos_freqs[1]
+...and so on
+```
+
+You can do this by directly taking the first half and the second half of the embedding dimension of `x` and applying the
+frequencies to each half separately.
+
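
A matching numpy sketch of this half-split form, under the same assumptions as the previous sketch (precomputed `(MAX_SEQ_LEN, D // 2)` tables, integer `offset`, and the sign convention written above):

```python
import numpy as np

def rope_non_traditional(x, cos_freqs, sin_freqs, offset=0):
    # x: (N, L, H, D); cos/sin_freqs: (MAX_SEQ_LEN, D // 2)
    N, L, H, D = x.shape
    half = D // 2
    cos = cos_freqs[offset : offset + L][None, :, None, :]  # (1, L, 1, D // 2)
    sin = sin_freqs[offset : offset + L][None, :, None, :]

    x0 = x[..., :half]   # first half pairs element-wise with ...
    x1 = x[..., half:]   # ... the second half, one frequency per pair

    out0 = x0 * cos + x1 * sin
    out1 = -x0 * sin + x1 * cos

    return np.concatenate([out0, out1], axis=-1)  # (N, L, H, D)
```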
+You can test your implementation by running the following command:
+
+```
+poetry run pytest tests -k week_1_day_2_task_2 -v
+```
+
+**📚 Readings**
+
+- [vLLM implementation of RoPE](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/rotary_embedding.py)

book/src/week1-overview.md

+17
@@ -28,4 +28,21 @@ To make the journey as interesting as possible, we will skip a few things for no
 * Loading the model weights -- I don't think it's an interesting thing to learn how to decode those tensor dump files, so
 we will use the `mlx_lm` to load the model and steal the weights from the loaded model into our layer implementations.
 
+## Qwen2 Models
+
+You can try the Qwen2 model with MLX/vLLM. You can read the blog post below to get some idea of what we will build
+within this course. At the end of this week, we will be able to chat with the model -- that is to say, use Qwen2 to
+generate text, as a causal language model.
+
+The reference implementation of the Qwen2 model can be found in Huggingface Transformers, vLLM, and mlx-lm. You may
+use these resources to better understand the internals of the model and what we will implement this week.
+
+**📚 Readings**
+
+- [Qwen2.5: A Party of Foundation Models!](https://qwenlm.github.io/blog/qwen2.5/)
+- [Key Concepts of the Qwen2 Model](https://qwen.readthedocs.io/en/latest/getting_started/concepts.html)
+- [Huggingface Transformers - Qwen2](https://github.com/huggingface/transformers/tree/main/src/transformers/models/qwen2)
+- [vLLM Qwen2](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/qwen2.py)
+- [mlx-lm Qwen2](https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/models/qwen2.py)
+
 {{#include copyright.md}}

src/tiny_llm/multi_head_attention.py

-28
This file was deleted.

src/tiny_llm_week1_ref/attention.py

+9 -9
@@ -81,22 +81,22 @@ def __call__(
         value: mx.array,
         mask: mx.array | None = None,
     ) -> mx.array:
-        n_batches = query.shape[0]
-        seq_len = query.shape[1]
+        N, L, E = query.shape
+        assert query.shape == key.shape == value.shape
         projection_q = (
             linear(query, self.wq)
-            .reshape(n_batches, self.num_heads, seq_len, self.head_dim)
-            .transpose(1, 0, 2, 3)
+            .reshape(N, L, self.num_heads, self.head_dim)
+            .transpose(0, 2, 1, 3)
         )
         projection_k = (
             linear(key, self.wk)
-            .reshape(n_batches, self.num_heads, seq_len, self.head_dim)
-            .transpose(1, 0, 2, 3)
+            .reshape(N, L, self.num_heads, self.head_dim)
+            .transpose(0, 2, 1, 3)
         )
         projection_v = (
             linear(value, self.wv)
-            .reshape(n_batches, self.num_heads, seq_len, self.head_dim)
-            .transpose(1, 0, 2, 3)
+            .reshape(N, L, self.num_heads, self.head_dim)
+            .transpose(0, 2, 1, 3)
         )
         x = scaled_dot_product_attention(
             projection_q,
@@ -105,5 +105,5 @@ def __call__(
             scale=self.scale,
             mask=mask,
         )
-        x = x.transpose(1, 0, 2).reshape(n_batches, seq_len, self.hidden_size)
+        x = x.transpose(0, 2, 1, 3).reshape(N, L, self.hidden_size)
         return linear(x, self.wo)
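
The shape fix is the substance of this diff: the projection output is `(N, L, H * D)`, so the head split has to reshape to `(N, L, H, D)` first and only then move the head axis in front of the sequence axis. A small numpy check of why the order matters (illustrative only, not repository code):

```python
import numpy as np

N, L, H, D = 1, 3, 2, 4
x = np.arange(N * L * H * D).reshape(N, L, H * D)    # (N, L, E) with E = H * D

# reshape first, then transpose: each token's slice stays together per head
heads = x.reshape(N, L, H, D).transpose(0, 2, 1, 3)  # (N, H, L, D)
print(np.array_equal(heads[0, 0, 1], x[0, 1, :D]))   # True

# reshaping straight to (N, H, L, D), as the old code did, mixes tokens and heads
wrong = x.reshape(N, H, L, D)
print(np.array_equal(wrong[0, 0, 1], x[0, 1, :D]))   # False
```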

src/tiny_llm_week1_ref/multi_head_attention.py

-65
This file was deleted.

tests/test_attention.py

+29 -30
@@ -8,7 +8,7 @@
 
 @pytest.mark.parametrize("stream", AVAILABLE_STREAMS, ids=AVAILABLE_STREAMS_IDS)
 @pytest.mark.parametrize("precision", PRECISIONS, ids=PRECISION_IDS)
-def test_attention_simple(stream: mx.Stream, precision: np.dtype):
+def test_attention_week_1_day_1_task_1(stream: mx.Stream, precision: np.dtype):
     with mx.stream(stream):
         BATCH_SIZE = 3
         DIM_N = 4
@@ -35,18 +35,18 @@ def test_attention_simple(stream: mx.Stream, precision: np.dtype):
 @pytest.mark.parametrize(
     "qkv_shape", [True, False], ids=["with_seq_len", "without_seq_len"]
 )
-def test_attention_with_mask(stream: mx.Stream, precision: np.dtype, qkv_shape: bool):
+def test_attention_with_mask_week_1_day_1_task_1(stream: mx.Stream, precision: np.dtype, qkv_shape: bool):
     with mx.stream(stream):
         BATCH_SIZE = 3
         SEQ_LEN = 10
-        DIM_N = 4
-        DIM_M = 5
+        H = 4
+        D = 5
         if qkv_shape:
-            qkv_shape = (BATCH_SIZE, SEQ_LEN, DIM_N, DIM_M)
-            mask_shape = (BATCH_SIZE, SEQ_LEN, DIM_N, DIM_N)
+            qkv_shape = (BATCH_SIZE, H, SEQ_LEN, D)
+            mask_shape = (BATCH_SIZE, H, SEQ_LEN, SEQ_LEN)
         else:
-            qkv_shape = (BATCH_SIZE, DIM_N, DIM_M)
-            mask_shape = (BATCH_SIZE, DIM_N, DIM_N)
+            qkv_shape = (BATCH_SIZE, H, SEQ_LEN, D)
+            mask_shape = (BATCH_SIZE, H, SEQ_LEN, SEQ_LEN)
         for _ in range(100):
             query = np.random.rand(*qkv_shape).astype(precision)
             key = np.random.rand(*qkv_shape).astype(precision)
@@ -72,33 +72,31 @@ def test_attention_with_mask(stream: mx.Stream, precision: np.dtype, qkv_shape:
 
 @pytest.mark.parametrize("stream", AVAILABLE_STREAMS, ids=AVAILABLE_STREAMS_IDS)
 @pytest.mark.parametrize("precision", PRECISIONS, ids=PRECISION_IDS)
-def test_multi_head_attention(stream: mx.Stream, precision: np.dtype):
+def test_multi_head_attention_week_1_day_1_task_2(stream: mx.Stream, precision: np.dtype):
     with mx.stream(stream):
-        BATCH_SIZE = 7
-        DIM_N = 11
-        DIM_M = 9
-        NUM_HEADS = 3
+        SEQ_LEN = 11
+        D = 9
+        H = 3
+        BATCH_SIZE = 10
         for _ in range(100):
-            query = np.random.rand(BATCH_SIZE, DIM_N, DIM_M).astype(precision)
-            key = np.random.rand(BATCH_SIZE, DIM_N, DIM_M).astype(precision)
-            value = np.random.rand(BATCH_SIZE, DIM_N, DIM_M).astype(precision)
-            q_proj_weight = np.random.rand(DIM_M, DIM_M).astype(precision)
-            k_proj_weight = np.random.rand(DIM_M, DIM_M).astype(precision)
-            v_proj_weight = np.random.rand(DIM_M, DIM_M).astype(precision)
-            out_proj_weight = np.random.rand(DIM_M, DIM_M).astype(precision)
-            mask = np.random.rand(DIM_N * NUM_HEADS, BATCH_SIZE, BATCH_SIZE).astype(
-                precision
-            )
+            query = np.random.rand(BATCH_SIZE, SEQ_LEN, H * D).astype(precision)
+            key = np.random.rand(BATCH_SIZE, SEQ_LEN, H * D).astype(precision)
+            value = np.random.rand(BATCH_SIZE, SEQ_LEN, H * D).astype(precision)
+            q_proj_weight = np.random.rand(H * D, H * D).astype(precision)
+            k_proj_weight = np.random.rand(H * D, H * D).astype(precision)
+            v_proj_weight = np.random.rand(H * D, H * D).astype(precision)
+            out_proj_weight = np.random.rand(H * D, H * D).astype(precision)
+            mask = np.random.rand(SEQ_LEN, SEQ_LEN).astype(precision)
             reference_output, _ = torch.nn.functional.multi_head_attention_forward(
-                torch.tensor(query, device=TORCH_DEVICE),
-                torch.tensor(key, device=TORCH_DEVICE),
-                torch.tensor(value, device=TORCH_DEVICE),
-                num_heads=NUM_HEADS,
+                torch.tensor(query, device=TORCH_DEVICE).transpose(0, 1),
+                torch.tensor(key, device=TORCH_DEVICE).transpose(0, 1),
+                torch.tensor(value, device=TORCH_DEVICE).transpose(0, 1),
+                num_heads=H,
                 q_proj_weight=torch.tensor(q_proj_weight, device=TORCH_DEVICE),
                 k_proj_weight=torch.tensor(k_proj_weight, device=TORCH_DEVICE),
                 v_proj_weight=torch.tensor(v_proj_weight, device=TORCH_DEVICE),
                 out_proj_weight=torch.tensor(out_proj_weight, device=TORCH_DEVICE),
-                embed_dim_to_check=DIM_M,
+                embed_dim_to_check=H * D,
                 in_proj_weight=None,
                 in_proj_bias=None,
                 bias_k=None,
@@ -109,9 +107,10 @@ def test_multi_head_attention(stream: mx.Stream, precision: np.dtype):
                 use_separate_proj_weight=True,
                 attn_mask=torch.tensor(mask, device=TORCH_DEVICE),
            )
+            reference_output = reference_output.transpose(0, 1)
             user_output = MultiHeadAttention(
-                DIM_M,
-                NUM_HEADS,
+                H * D,
+                H,
                 mx.array(q_proj_weight),
                 mx.array(k_proj_weight),
                 mx.array(v_proj_weight),