nano_r1.ipynb: 2 changes (1 addition, 1 deletion)
@@ -74,7 +74,7 @@
" $ y_1, y_2, \\cdots, y_G \\sim \\pi_\\theta(y|x) $\n",
"\n",
" These $G$ responses form what is called a *group* in GRPO.\n",
- " - Compute a reward $R_i$ for each response and normalize them tocalculate the GRPO advantage within each group.\n",
+ " - Compute a reward $R_i$ for each response and normalize them to calculate the GRPO advantage within each group.\n",
" - Create a list of $N \\times G$ episodes, i.e., pairs of $(x_i, y_i)$ along with their corresponding advantages.\n",
" - Estimate the policy gradient $\\vec{g}_{pg}$ from these episodes.\n",
" - Update the model parameters: \n",
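The advantage and episode steps described in this hunk can be sketched as follows. This is a minimal illustration, assuming the common GRPO normalization $A_i = (R_i - \text{mean}(R)) / (\text{std}(R) + \epsilon)$ computed over each group of $G$ rewards. The notebook's actual code is not part of this diff, and the helper names `grpo_advantages` and `build_episodes`, the `eps` value, and the use of population (rather than sample) standard deviation are all assumptions.

```python
from typing import List, Sequence, Tuple


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-4) -> List[float]:
    """Normalize the G rewards of one group: A_i = (R_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Population standard deviation over the group (an assumption; some
    # implementations use the unbiased sample estimate instead).
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


def build_episodes(
    prompts: Sequence[str],
    responses: Sequence[Sequence[str]],
    rewards: Sequence[Sequence[float]],
) -> List[Tuple[str, str, float]]:
    """Flatten N groups of G responses into N*G (x, y, advantage) episodes.

    Hypothetical helper: responses[i] and rewards[i] hold the G sampled
    responses and their rewards for prompts[i].
    """
    episodes: List[Tuple[str, str, float]] = []
    for x, ys, rs in zip(prompts, responses, rewards):
        for y, adv in zip(ys, grpo_advantages(rs)):
            episodes.append((x, y, adv))
    return episodes


# Toy example: N = 2 prompts, G = 4 responses each, binary correctness rewards.
prompts = ["2+2=?", "3*3=?"]
responses = [["4", "5", "4", "22"], ["9", "6", "9", "9"]]
rewards = [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 1.0]]
episodes = build_episodes(prompts, responses, rewards)
```

Because each group's advantages sum to zero, a response is rewarded only relative to the other responses sampled for the same prompt; a group whose rewards are all equal gets zero advantage (up to floating-point rounding) and so contributes nothing to $\vec{g}_{pg}$.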