Fix(colab): correct spelling errors in training_apg notebook #2587

Open · wants to merge 1 commit into main
12 changes: 6 additions & 6 deletions mjx/training_apg.ipynb
@@ -60,7 +60,7 @@
"Where $R(\\tau)$ is some function depending on the rollout $\\tau = \\{x_t, a_t\\}_{t=0}^{T}$. Despite this method's popularity and extensive research into its refinement, a fundamental property is that the gradient has high variance. This allows the optimizer to thoroughly explore the space of policies, leading to the robust and often surprisingly good policies that have been achieved. However, the variance comes at the cost of requiring many samples $(x_t, a_t)$ to converge.\n",
"\n",
"#### First-Order Policy Gradients (FoPG)\n",
"On the other hand, if you assume a deterministic state transition model $x_{t+1} = f(x_t, a_t)$, you end up with the first-order policy gradient. Other common names include Analytical Policy Gradients (APG) and Backpropogation through Time (BPTT). Unlike ZoPG methods, which model the state evolution as a probabilistic black box, the FoPG explicitly contains the jacobians of the simulation function f. For example, let's look at the gradient of the reward $r_t$, in the case that it only depends on state.\n",
"On the other hand, if you assume a deterministic state transition model $x_{t+1} = f(x_t, a_t)$, you end up with the first-order policy gradient. Other common names include Analytical Policy Gradients (APG) and Backpropagation through Time (BPTT). Unlike ZoPG methods, which model the state evolution as a probabilistic black box, the FoPG explicitly contains the jacobians of the simulation function f. For example, let's look at the gradient of the reward $r_t$, in the case that it only depends on state.\n",
"$$\n",
"\\frac{\\partial r_t}{\\partial \\theta} = \\frac{\\partial r_t}{\\partial x_t}\\frac{\\partial x_t}{\\partial \\theta} \n",
"$$\n",
@@ -75,15 +75,15 @@
"\n",
"<img src=\"../doc/images/mjx/apg_diagram.png\" alt=\"drawing\" width=\"300\"/>\n",
"\n",
"Note that there three distinct gradient chains in this example. The red pathway considers how the immediately prior action affected the state. The blue path explains the name *Backpropogation through Time*, capturing how actions affect downstream rewards. The least intuitive may be the green chain, which shows how the reward depends on how actions depend on previous actions.Experience shows that blocking *any* of these three pathways via jax.lax.stop_grad can badly hinder policy learning. As the length of $x_t$ backbone increases, [gradient explosion](https://arxiv.org/abs/2111.05803) becomes a crucial consideration. In practice, this can be resolved via decaying downstream gradients or periodically truncating the gradient.\n",
"Note that there three distinct gradient chains in this example. The red pathway considers how the immediately prior action affected the state. The blue path explains the name *Backpropagation through Time*, capturing how actions affect downstream rewards. The least intuitive may be the green chain, which shows how the reward depends on how actions depend on previous actions.Experience shows that blocking *any* of these three pathways via jax.lax.stop_grad can badly hinder policy learning. As the length of $x_t$ backbone increases, [gradient explosion](https://arxiv.org/abs/2111.05803) becomes a crucial consideration. In practice, this can be resolved via decaying downstream gradients or periodically truncating the gradient.\n",
"\n",
"**The Sharp Bits of FoPG's**\n",
"\n",
"While FoPG's have been shown to be very sample efficient, especially as the [dimension of the state space increases](https://arxiv.org/abs/2204.07137), one fundamental shortcoming is that due to the lower gradient variance, FoPG's also have less exploration power than ZoPG's and benefit from the practioner being more explicit in the problem formulation.\n",
"While FoPG's have been shown to be very sample efficient, especially as the [dimension of the state space increases](https://arxiv.org/abs/2204.07137), one fundamental shortcoming is that due to the lower gradient variance, FoPG's also have less exploration power than ZoPG's and benefit from the practitioner being more explicit in the problem formulation.\n",
"\n",
"Additionally, discontinuous reward formulations are ubiquitious in RL, for instance, a large penalty when the robot falls. It can be significantly more [challenging](https://arxiv.org/abs/2403.14864) to design robust policies with FoPG's, since they cannot backprop through such penalties.\n",
"Additionally, discontinuous reward formulations are ubiquitous in RL, for instance, a large penalty when the robot falls. It can be significantly more [challenging](https://arxiv.org/abs/2403.14864) to design robust policies with FoPG's, since they cannot backprop through such penalties.\n",
"\n",
"Last, despite the sample efficiency, FoPG methods can still struggle with wall-clock time. Because the gradients have low variance, they do not benefit significantly from massive parallelization of data collection - unlike [RL](https://arxiv.org/abs/2109.11978). Additionally, the policy gradient is typically calculated via autodifferentiation. This can be 3-5x slower than unrolling the simulation forward, and memory intensive, with memory requirements scaling with $O(m \\cdot (m+n) \\cdot T)$, where m and n are the state and control dimensions, $m \\cdot (m+n)$ is the jacobian dimension, and T is the number of steps propogated through.\n",
"Last, despite the sample efficiency, FoPG methods can still struggle with wall-clock time. Because the gradients have low variance, they do not benefit significantly from massive parallelization of data collection - unlike [RL](https://arxiv.org/abs/2109.11978). Additionally, the policy gradient is typically calculated via autodifferentiation. This can be 3-5x slower than unrolling the simulation forward, and memory intensive, with memory requirements scaling with $O(m \\cdot (m+n) \\cdot T)$, where m and n are the state and control dimensions, $m \\cdot (m+n)$ is the jacobian dimension, and T is the number of steps propagated through.\n",
"\n",
"Note that with certain models, using autodifferentiation through mjx.step currently causes [nan gradients](https://github.com/google-deepmind/mujoco/issues/1517). For now, we address this issue by using double-precision floats, at the cost of doubling the memory requirements and training time.\n"
]
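A minimal sketch of enabling double precision in JAX via the standard `jax_enable_x64` flag (the notebook's exact configuration may differ); it should run before any arrays are created:

```python
import jax

jax.config.update("jax_enable_x64", True)  # use float64 throughout
print(jax.numpy.ones(1).dtype)             # float64
```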
@@ -178,7 +178,7 @@
"print('Setting environment variable to use GPU rendering:')\n",
"%env MUJOCO_GL=egl\n",
"\n",
"# Check if installation was succesful.\n",
"# Check if installation was successful.\n",
"try:\n",
" print('Checking that the installation succeeded:')\n",
" import mujoco\n",