McGill-NLP · insop · Apr 8, 2025
diff --git a/nano_r1.ipynb b/nano_r1.ipynb
@@ -74,7 +74,7 @@
     "     $ y_1, y_2, \\cdots, y_G \\sim \\pi_\\theta(y|x) $\n",
     "\n",
     "     These $G$ responses form what is called a *group* in GRPO.\n",
-    "   - Compute a reward $R_i$ for each response and normalize them tocalculate the GRPO advantage within each group.\n",
+    "   - Compute a reward $R_i$ for each response and normalize them to calculate the GRPO advantage within each group.\n",
     "   - Create a list of $N \\times G$ episodes, i.e., pairs of $(x_i, y_i)$ along with their corresponding advantages.\n",
     "   - Estimate the policy gradient $\\vec{g}_{pg}$ from these episodes.\n",
     "   - Update the model parameters:  \n",