FIX: penalty for episode length termination #453
TheWill-Of-D wants to merge 1 commit into kscalelabs:master from
Conversation
Episode-length termination penalized the policy for reaching the end of the episode.
Have you tested this change? Whether an episode-length termination penalizes or not can be controlled by a setting; see Lines 187 to 203 in 88c8c2d.
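For context, here is a minimal sketch of how such a flag typically gates the penalty. The names `TerminationConfig`, `penalize_time_limit`, and `failure_penalty` are hypothetical stand-ins, not the actual fields in the referenced lines:

```python
from dataclasses import dataclass


@dataclass
class TerminationConfig:
    """Hypothetical config; the real fields live in the referenced Lines 187 to 203."""

    penalize_time_limit: bool = False  # if True, timeouts receive the failure penalty
    failure_penalty: float = -1.0


def termination_reward(cfg: TerminationConfig, failed: bool, timed_out: bool) -> float:
    """Return the extra reward applied when an episode ends."""
    if failed:
        return cfg.failure_penalty  # genuine failures are always penalized
    if timed_out and cfg.penalize_time_limit:
        return cfg.failure_penalty  # optionally penalize hitting the time limit
    return 0.0
```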
I ran the pytest suite and am currently training with the modification. The chaotic movements are gone; so far it's OK. I can update you after the training run. You're right; regarding this equation, check section 3 in this paper.
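Presumably the equation referred to is the partial-episode bootstrapping target from section 3 of the linked paper: when an episode ends only because the time limit was hit, the update bootstraps from the value estimate of the next state instead of treating it as terminal,

$$
G_t = r_t + \gamma \, \hat{v}(s_{t+1}) \qquad \text{(at a timeout, instead of } G_t = r_t \text{)}.
$$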
Here: Lines 937 to 957 in 88c8c2d. Seems like you're right that terminating episodes absent of failures will discourage learning. That's why we bootstrap from the value function and add those values to the final reward, if and only if this happens. To the model, it looks as if the termination never happened. See this PR: #410. That said, it would be interesting to see how your changes affect the critic loss and the total reward in an A/B test. Interesting paper, btw.
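A minimal sketch of that bootstrapping trick, for illustration only; the repo's actual implementation is in the referenced Lines 937 to 957 and PR #410, and the `terminated`/`truncated` naming here follows the Gymnasium convention rather than this codebase:

```python
import numpy as np


def bootstrap_timeout_rewards(
    rewards: np.ndarray,      # per-step rewards, shape (T,)
    terminated: np.ndarray,   # bool, True where the episode ended in a real failure
    truncated: np.ndarray,    # bool, True where the episode hit the time limit
    next_values: np.ndarray,  # critic estimates V(s_{t+1}), shape (T,)
    gamma: float,
) -> np.ndarray:
    """Add gamma * V(s') to the reward at time-limit truncations only.

    To the learner, a truncation patched this way looks as if the episode
    had simply continued; genuine failures keep their unmodified reward.
    """
    patched = rewards.copy()
    timeout_only = truncated & ~terminated
    patched[timeout_only] += gamma * next_values[timeout_only]
    return patched
```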
EDIT: We already do this, so this can be disregarded.
Found the same issue discussed in the SB3/Gymnasium documentation and in https://arxiv.org/pdf/1712.00378. Bootstrapping only "values" across episodes doesn't negatively affect training, as the properties of the environment itself haven't changed much. This is the standard method for non-finite, long-horizon tasks (as shown in the links above).
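For reference, the standard Gymnasium pattern that SB3 builds on separates the two end-of-episode signals precisely so the TD target can bootstrap on truncation. A self-contained sketch under the Gymnasium >= 0.26 API; the environment name is just an example:

```python
import gymnasium as gym

env = gym.make("Pendulum-v1")  # any env wrapped with a TimeLimit
obs, info = env.reset(seed=0)

for _ in range(1000):
    action = env.action_space.sample()  # stand-in for the policy
    obs, reward, terminated, truncated, info = env.step(action)
    # terminated: a real terminal state -> TD target is r
    # truncated:  time limit only       -> TD target is r + gamma * V(s')
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```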
Episode-length termination penalized the policy for reaching the episode's end. The policy may learn to avoid it, or to avoid learning gaits that are stable over the long term.