diff --git a/nano_r1.ipynb b/nano_r1.ipynb
index b9719dc..a59b56f 100644
--- a/nano_r1.ipynb
+++ b/nano_r1.ipynb
@@ -74,7 +74,7 @@
  " $ y_1, y_2, \\cdots, y_G \\sim \\pi_\\theta(y|x) $\n",
  "\n",
  " These $G$ responses form what is called a *group* in GRPO.\n",
- " - Compute a reward $R_i$ for each response and normalize them tocalculate the GRPO advantage within each group.\n",
+ " - Compute a reward $R_i$ for each response and normalize them to calculate the GRPO advantage within each group.\n",
  " - Create a list of $N \\times G$ episodes, i.e., pairs of $(x_i, y_i)$ along with their corresponding advantages.\n",
  " - Estimate the policy gradient $\\vec{g}_{pg}$ from these episodes.\n",
  " - Update the model parameters: \n",