1 file changed: +4 −6 lines changed
@@ -1089,15 +1089,13 @@ def _update_step(
         # just for logging, no functional role
         self._policy_update_time += training_stat.train_time

-        # Note 1: this is the main difference to the off-policy trainer!
-        # The second difference is that batches of data are sampled without replacement
-        # during training, whereas in off-policy or offline training, the batches are
-        # sampled with replacement (and potentially custom prioritization).
         # Note 2: in the policy-update we modify the buffer, which is not very clean.
         # currently the modification will erase previous samples but keep things like
-        # _ep_rew and _ep_len. This means that such quantities can no longer be computed
+        # _ep_rew and _ep_len (b/c keep_statistics=True). This is needed since the collection might have stopped
+        # in the middle of an episode and in the next collect iteration we need these numbers to compute correct
+        # return and episode length values. With the current code structure, this means that after an update and buffer reset
+        # such quantities can no longer be computed
         # from samples still contained in the buffer, which is also not clean
-        # TODO: improve this situation
         self.params.train_collector.reset_buffer(keep_statistics=True)

         # The step is the number of mini-batches used for the update, so essentially