Hi,
I have a question regarding the implementation of the advantage calculation. The code snippet is as follows:
(Lines 252 to 274 in 07d906e)

```python
advantage = tf.stop_gradient(target[1:] - self._target_critic(seq['feat'][:-2]).mode())
```
Based on my understanding:

- `seq['feat']` contains time steps from `0` to `horizon`.
- `target` contains time steps from `0` to `horizon-1`, since the value at the last step is only used as a bootstrap for `lambda_return` (see the sketch after this list).
- Therefore, `baseline` in Line 271 includes time steps from `0` to `horizon-2`, and `target[1:]` includes time steps from `1` to `horizon-1`.
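
To make the shape argument concrete, here is how I read the λ-return recursion. This is not the repository's `lambda_return` helper, just an assumed plain-Python sketch; the names `lambda_return_sketch`, `rewards`, and `values` are mine.

```python
# Assumed sketch of the lambda-return recursion, not the repo's helper.
# With values v_0..v_H and rewards r_0..r_{H-1}, the last value v_H only
# bootstraps the recursion, so the returned targets cover steps 0..H-1.
def lambda_return_sketch(rewards, values, discount=0.99, lam=0.95):
    horizon = len(rewards)                 # H
    assert len(values) == horizon + 1      # v_0 .. v_H
    targets = [0.0] * horizon
    next_return = values[-1]               # bootstrap with v_H
    for t in reversed(range(horizon)):
        next_return = rewards[t] + discount * (
            (1 - lam) * values[t + 1] + lam * next_return)
        targets[t] = next_return
    return targets                         # lambda returns for steps 0..H-1
```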
If I understand correctly, the code therefore uses `target[t+1] - v(s_t)` as the advantage for step `t`, not `target[t] - v(s_t)`. Is this shift intentional?
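
Here is the index alignment I am assuming for the advantage itself, as plain Python for illustration only; `H`, `feat_steps`, and the other names are hypothetical and not from the code:

```python
# Hypothetical illustration of the slicing, not the repository code.
H = 5                                   # imagination horizon, example value

feat_steps = list(range(H + 1))         # seq['feat']            -> steps 0..H
target_steps = list(range(H))           # target (lambda return) -> steps 0..H-1

baseline_steps = feat_steps[:-2]        # seq['feat'][:-2]       -> steps 0..H-2
adv_target_steps = target_steps[1:]     # target[1:]             -> steps 1..H-1

# The advantage at position t pairs the return of step t+1 with the value of step t:
for t, (tgt, base) in enumerate(zip(adv_target_steps, baseline_steps)):
    print(f"advantage[{t}] = target[step {tgt}] - value[step {base}]")
```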
I am very impressed with the Dreamer series of reinforcement learning algorithms. Thank you for your hard work!