
Commit 1a92702

Update docstring of normalize reward (#1136)
1 parent 6554907 commit 1a92702

File tree

2 files changed: +14 −2 lines changed

gymnasium/wrappers/stateful_reward.py
gymnasium/wrappers/vector/stateful_reward.py

Diff for: gymnasium/wrappers/stateful_reward.py

+7 −1
@@ -20,8 +20,9 @@
 class NormalizeReward(
     gym.Wrapper[ObsType, ActType, ObsType, ActType], gym.utils.RecordConstructorArgs
 ):
-    r"""Normalizes immediate rewards such that their exponential moving average has a fixed variance.
+    r"""This wrapper will scale rewards s.t. the discounted returns have a mean of 0 and std of 1.

+    In a nutshell, the rewards are divided through by the standard deviation of a rolling discounted sum of the reward.
     The exponential moving average will have variance :math:`(1 - \gamma)^2`.

     The property `_update_running_mean` allows to freeze/continue the running mean calculation of the reward
@@ -30,6 +31,11 @@ class NormalizeReward(

     A vector version of the wrapper exists :class:`gymnasium.wrappers.vector.NormalizeReward`.

+    Important note:
+        Contrary to what the name suggests, this wrapper does not normalize the rewards to have a mean of 0 and a standard
+        deviation of 1. Instead, it scales the rewards such that **discounted returns** have approximately unit variance.
+        See [Engstrom et al.](https://openreview.net/forum?id=r1etN1rtPB) on "reward scaling" for more information.
+
     Note:
         In v0.27, NormalizeReward was updated as the forward discounted reward estimate was incorrectly computed in Gym v0.25+.
         For more detail, read [#3154](https://github.com/openai/gym/pull/3152).
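For context, the updated docstring describes reward *scaling* rather than normalization: each immediate reward is divided by the standard deviation of a rolling discounted sum of rewards. Below is a minimal, self-contained sketch of that scheme; it illustrates the idea only and is not the wrapper's actual implementation, which uses Gymnasium's `RunningMeanStd`, resets the rolling sum when an episode terminates, and supports vectorized rewards.

```python
import numpy as np


class RewardScalerSketch:
    """Sketch of the scaling scheme described in the new docstring (not the wrapper's exact code)."""

    def __init__(self, gamma: float = 0.99, epsilon: float = 1e-8):
        self.gamma = gamma            # discount factor of the rolling return
        self.epsilon = epsilon        # numerical-stability constant
        self.discounted_return = 0.0  # rolling discounted sum of rewards
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0                 # sum of squared deviations (Welford)

    def scale(self, reward: float) -> float:
        # Rolling discounted sum of the rewards seen so far.
        self.discounted_return = self.gamma * self.discounted_return + reward

        # Welford update of the mean/variance of the rolling return.
        self.count += 1
        delta = self.discounted_return - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (self.discounted_return - self.mean)
        var = self.m2 / self.count if self.count > 1 else 1.0

        # Only the standard deviation is used; the reward is not mean-centred,
        # which is why this is "reward scaling" rather than true normalization.
        return reward / np.sqrt(var + self.epsilon)
```

Because only the standard deviation is used, scaled rewards keep their sign and relative ordering; they are not centred to zero mean, which is exactly the point the new "Important note" makes.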

Diff for: gymnasium/wrappers/vector/stateful_reward.py

+7 −1
@@ -19,14 +19,20 @@


 class NormalizeReward(VectorWrapper, gym.utils.RecordConstructorArgs):
-    r"""This wrapper will normalize immediate rewards s.t. their exponential moving average has a fixed variance.
+    r"""This wrapper will scale rewards s.t. the discounted returns have a mean of 0 and std of 1.

+    In a nutshell, the rewards are divided through by the standard deviation of a rolling discounted sum of the reward.
     The exponential moving average will have variance :math:`(1 - \gamma)^2`.

     The property `_update_running_mean` allows to freeze/continue the running mean calculation of the reward
     statistics. If `True` (default), the `RunningMeanStd` will get updated every time `self.normalize()` is called.
     If False, the calculated statistics are used but not updated anymore; this may be used during evaluation.

+    Important note:
+        Contrary to what the name suggests, this wrapper does not normalize the rewards to have a mean of 0 and a standard
+        deviation of 1. Instead, it scales the rewards such that **discounted returns** have approximately unit variance.
+        See [Engstrom et al.](https://openreview.net/forum?id=r1etN1rtPB) on "reward scaling" for more information.
+
     Note:
         The scaling depends on past trajectories and rewards will not be scaled correctly if the wrapper was newly
         instantiated or the policy was changed recently.
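For completeness, here is how the two wrappers touched by this commit are typically applied. The environment id, `gamma`, and `num_envs` values are illustrative placeholders, and `gym.make_vec` assumes a recent Gymnasium release.

```python
import gymnasium as gym
from gymnasium.wrappers import NormalizeReward
from gymnasium.wrappers.vector import NormalizeReward as VectorNormalizeReward

# Single-environment wrapper (gymnasium/wrappers/stateful_reward.py):
# rewards are divided by the std of a rolling discounted return.
env = gym.make("CartPole-v1")
env = NormalizeReward(env, gamma=0.99)

# Vector version (gymnasium/wrappers/vector/stateful_reward.py).
envs = gym.make_vec("CartPole-v1", num_envs=4)
envs = VectorNormalizeReward(envs, gamma=0.99)
```

Setting the `_update_running_mean` property to `False` freezes the statistics for evaluation, as described in both docstrings.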

0 commit comments