Hi,
I have a question regarding the implementation of the advantage calculation. The code snippet is as follows:
(Lines 252 to 274 in 07d906e)

```python
advantage = tf.stop_gradient(target[1:] - self._target_critic(seq['feat'][:-2]).mode())
```
Based on my understanding:

- `seq['feat']` contains time steps from `0` to `horizon`.
- `target` contains time steps from `0` to `horizon-1`, since the value at the last step is only used as a bootstrap for `lambda_return` (see the sketch after this list).
- Therefore, `baseline` in Line 271 includes time steps from `0` to `horizon-2`, and `target[1:]` includes time steps from `1` to `horizon-1`.
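
To make the shape argument concrete, here is how I read the λ-return recursion. This is not the repository's `lambda_return` helper, just an assumed plain-Python sketch; the names `lambda_return_sketch`, `rewards`, and `values` are mine.

```python
# Assumed sketch of the lambda-return recursion, not the repo's helper.
# With values v_0..v_H and rewards r_0..r_{H-1}, the last value v_H only
# bootstraps the recursion, so the returned targets cover steps 0..H-1.
def lambda_return_sketch(rewards, values, discount=0.99, lam=0.95):
    horizon = len(rewards)                 # H
    assert len(values) == horizon + 1      # v_0 .. v_H
    targets = [0.0] * horizon
    next_return = values[-1]               # bootstrap with v_H
    for t in reversed(range(horizon)):
        next_return = rewards[t] + discount * (
            (1 - lam) * values[t + 1] + lam * next_return)
        targets[t] = next_return
    return targets                         # lambda returns for steps 0..H-1
```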
If I understand correctly, the code therefore uses `target[t+1] - v(s_t)` as the advantage for step `t`, not `target[t] - v(s_t)`. Is this shift intentional?
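
Here is the index alignment I am assuming for the advantage itself, as plain Python for illustration only; `H`, `feat_steps`, and the other names are hypothetical and not from the code:

```python
# Hypothetical illustration of the slicing, not the repository code.
H = 5                                   # imagination horizon, example value

feat_steps = list(range(H + 1))         # seq['feat']            -> steps 0..H
target_steps = list(range(H))           # target (lambda return) -> steps 0..H-1

baseline_steps = feat_steps[:-2]        # seq['feat'][:-2]       -> steps 0..H-2
adv_target_steps = target_steps[1:]     # target[1:]             -> steps 1..H-1

# The advantage at position t pairs the return of step t+1 with the value of step t:
for t, (tgt, base) in enumerate(zip(adv_target_steps, baseline_steps)):
    print(f"advantage[{t}] = target[step {tgt}] - value[step {base}]")
```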
I am very impressed with the Dreamer series of reinforcement learning algorithms. Thank you for your hard work!