
Commit cf1a34d

v2: block comment

1 parent 0fa36cd commit cf1a34d

File tree

1 file changed: 4 additions & 6 deletions


tianshou/trainer/trainer.py

Lines changed: 4 additions & 6 deletions
@@ -1089,15 +1089,13 @@ def _update_step(
         # just for logging, no functional role
         self._policy_update_time += training_stat.train_time

-        # Note 1: this is the main difference to the off-policy trainer!
-        # The second difference is that batches of data are sampled without replacement
-        # during training, whereas in off-policy or offline training, the batches are
-        # sampled with replacement (and potentially custom prioritization).
         # Note 2: in the policy-update we modify the buffer, which is not very clean.
         # currently the modification will erase previous samples but keep things like
-        # _ep_rew and _ep_len. This means that such quantities can no longer be computed
+        # _ep_rew and _ep_len (b/c keep_statistics=True). This is needed since the collection might have stopped
+        # in the middle of an episode and in the next collect iteration we need these numbers to compute correct
+        # return and episode length values. With the current code structure, this means that after an update and buffer reset
+        # such quantities can no longer be computed
         # from samples still contained in the buffer, which is also not clean
-        # TODO: improve this situation
         self.params.train_collector.reset_buffer(keep_statistics=True)

         # The step is the number of mini-batches used for the update, so essentially
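The behavior the new comment describes can be illustrated with a minimal, hypothetical buffer. This is not Tianshou's actual `ReplayBuffer` (its real API is richer); the class and attribute names below merely mirror the `_ep_rew`/`_ep_len` identifiers mentioned in the comment, to show why clearing samples while keeping running episode statistics lets the next collect iteration compute correct returns for an episode that was interrupted mid-way:

```python
class MiniBuffer:
    """Toy sketch of a collector buffer with a keep_statistics-style reset."""

    def __init__(self) -> None:
        self.samples: list[float] = []  # stored transitions (rewards only, for brevity)
        self._ep_rew = 0.0              # running reward of the current, unfinished episode
        self._ep_len = 0                # running length of the current, unfinished episode

    def add(self, reward: float) -> None:
        # Record a transition and update the running episode statistics.
        self.samples.append(reward)
        self._ep_rew += reward
        self._ep_len += 1

    def reset(self, keep_statistics: bool = False) -> None:
        # Erase the stored transitions; optionally keep the running episode
        # statistics so that when collection resumes mid-episode, the final
        # return and episode length are still computed correctly.
        self.samples = []
        if not keep_statistics:
            self._ep_rew = 0.0
            self._ep_len = 0


buf = MiniBuffer()
buf.add(1.0)
buf.add(2.0)                       # collection stops here, mid-episode
buf.reset(keep_statistics=True)    # samples are gone, statistics survive
buf.add(3.0)                       # collection resumes, episode finishes
print(buf.samples, buf._ep_rew, buf._ep_len)
```

Note the flip side the comment also points out: after the reset, `_ep_rew` can no longer be recomputed from `samples` (only the post-reset transitions remain), which is the "not clean" aspect the code acknowledges.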
