We found a bug in the code of rl_trainer/main.py while using this repository to train our own model.
# ... lines 58-66
state_to_training = state[0]  # you define state_to_training here
# ... lines 68-78
while True:
    # ... lines 80-86
    actions = logits_greedy(state_to_training, logits, height, width)  # here you use state_to_training to generate the greedy policy
    # ... lines 87-90
    next_state, reward, done, _, info = env.step(env.encode(actions))
    next_state_to_training = next_state[0]  # a new variable next_state_to_training is created
    next_obs = get_observations(next_state_to_training, ctrl_agent_index, obs_dim, height, width)
    # ... lines 90-116
    model.replay_buffer.push(obs, logits, step_reward, next_obs, done)
    model.update()
    obs = next_obs
    step += 1
# ... lines 123-146

You define state_to_training at the beginning of the code, above the training-episode loop, and during an episode you use state_to_training as the observation for the greedy policy. However, the updated state next_state_to_training is never assigned back to state_to_training, so the greedy policy keeps observing the state from the very beginning of the episode. This does not affect the training of our own model, because the argument passed to get_observations is next_state_to_training, but we suspect this bug effectively breaks the greedy policy and may make it perform worse than a random one.
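To illustrate why a greedy policy acting on a stale observation can degrade this badly, here is a purely hypothetical toy sketch (none of this code is from the repository; greedy_action, run, and the drifting-target setup are invented for illustration): a greedy agent chasing a randomly drifting target catches it far less often when its observation is frozen at the start of the episode, and can even fall behind a random policy.

import random

def greedy_action(agent_pos, target_pos):
    # step toward the target position contained in the observation
    return 1 if target_pos > agent_pos else -1

def run(mode, steps=500, seed=0):
    rng = random.Random(seed)
    agent, target = 0, 10
    obs = (agent, target)              # observation taken at the start of the episode
    hits = 0
    for _ in range(steps):
        a = rng.choice([-1, 1]) if mode == "random" else greedy_action(*obs)
        agent += a
        target += rng.choice([-1, 1])  # the target drifts randomly
        hits += int(agent == target)
        if mode == "greedy_fresh":
            obs = (agent, target)      # correct: re-observe the current state
        # in "greedy_stale" mode obs is never refreshed, mirroring how
        # state_to_training is never overwritten with next_state_to_training
    return hits

for mode in ("greedy_fresh", "greedy_stale", "random"):
    print(mode, run(mode))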
So state_to_training should also be updated at the point where obs is updated. The proposed fix is as follows:
# ... lines 58-66
state_to_training = state[0]  # you define state_to_training here
# ... lines 68-78
while True:
    # ... lines 80-86
    actions = logits_greedy(state_to_training, logits, height, width)  # here you use state_to_training to generate the greedy policy
    # ... lines 87-90
    next_state, reward, done, _, info = env.step(env.encode(actions))
    next_state_to_training = next_state[0]  # a new variable next_state_to_training is created
    next_obs = get_observations(next_state_to_training, ctrl_agent_index, obs_dim, height, width)
    # ... lines 90-116
    model.replay_buffer.push(obs, logits, step_reward, next_obs, done)
    model.update()
    obs = next_obs
    state_to_training = next_state_to_training  # fix: keep the greedy policy's observation in sync with the current state
    step += 1
# ... lines 123-146
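With this one-line change, the greedy policy is computed from the current state at every step rather than from the state at the start of the episode, so it should behave as a genuine greedy opponent again instead of the effectively frozen policy described above.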