
Question about advantage calculation #60

Open
@leeacord

Description

Hi,

I have a question regarding the implementation of the advantage calculation. The code snippet is as follows:

```python
def actor_loss(self, seq, target):
  # Actions:      0   [a1]  [a2]   a3
  #                  ^  |   ^  |   ^  |
  #                 /   v  /   v  /   v
  # States:     [z0]->[z1]-> z2 -> z3
  # Targets:     t0   [t1]  [t2]
  # Baselines:  [v0]  [v1]   v2    v3
  # Entropies:        [e1]  [e2]
  # Weights:    [ 1]  [w1]   w2    w3
  # Loss:              l1    l2
  metrics = {}
  # Two states are lost at the end of the trajectory, one for the bootstrap
  # value prediction and one because the corresponding action does not lead
  # anywhere anymore. One target is lost at the start of the trajectory
  # because the initial state comes from the replay buffer.
  policy = self.actor(tf.stop_gradient(seq['feat'][:-2]))
  if self.config.actor_grad == 'dynamics':
    objective = target[1:]
  elif self.config.actor_grad == 'reinforce':
    baseline = self._target_critic(seq['feat'][:-2]).mode()
    advantage = tf.stop_gradient(target[1:] - baseline)
    action = tf.stop_gradient(seq['action'][1:-1])
    objective = policy.log_prob(action) * advantage
```

In the `reinforce` branch, the advantage therefore reduces to:

```python
advantage = tf.stop_gradient(target[1:] - self._target_critic(seq['feat'][:-2]).mode())
```

Based on my understanding:

- `seq['feat']` contains time steps 0 through `horizon`.
- `target` contains time steps 0 through `horizon - 1`, since the value at the last step is used as the bootstrap for `lambda_return`.
- Therefore, `baseline` above covers time steps 0 through `horizon - 2`, while `target[1:]` covers time steps 1 through `horizon - 1` (see the sketch after this list).
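To double-check my reading, here is a minimal NumPy sketch of the index alignment. It is purely illustrative: scalar indices stand in for the features and lambda-returns, and `H = 5` is an arbitrary horizon I picked for the example.

```python
import numpy as np

H = 5                      # imagination horizon (arbitrary, for illustration)
feat = np.arange(H + 1)    # stand-ins for imagined states z_0 ... z_H
target = np.arange(H)      # stand-ins for lambda-returns V^lambda_0 ... V^lambda_{H-1}

baseline_steps = feat[:-2]   # steps fed to the critic: z_0 ... z_{H-2}
return_steps = target[1:]    # returns used in the objective: V^lambda_1 ... V^lambda_{H-1}

for z, r in zip(baseline_steps, return_steps):
    print(f'advantage pairs V^lambda_{r} with v(z_{z})')
# advantage pairs V^lambda_1 with v(z_0)
# advantage pairs V^lambda_2 with v(z_1)
# ...
```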

If I understand correctly, the code computes the advantage as $V_{t+1}^{\lambda} - v_\xi(\hat{z}_t)$, not $V_t^{\lambda} - v_\xi(\hat{z}_t)$ as stated in the paper?
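Spelling out the indexing from the snippet, this is the alignment I arrive at:

$$
\texttt{advantage}_t = \texttt{target[1:]}_t - \texttt{baseline}_t = V^{\lambda}_{t+1} - v_\xi(\hat{z}_t), \qquad t = 0, \ldots, H-2,
$$

where $H$ is the imagination horizon.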

I am very impressed with the Dreamer series of reinforcement learning algorithms. Thank you for your hard work!
