Description
Hi, thanks for the questions! It's expected that the end step for all episodes is 25 (I believe the max number of steps is set to 25 by default, and it can stay at 25 even if you enable early stopping when the goal is achieved).

As for the difference between `test_reward` and `test_bench/step_reward`, it comes down to two things. First, the reward and benchmark loggers log slightly differently: (as far as I remember from my notes) the reward logger resets at the end of each episode, whereas the benchmark logger resets only once at the collector's init(), so the trends can differ. Second, `test_bench/step_reward` additionally divides the episode reward by the number of steps in that episode (i.e., the average reward per step). Please check the code for the reward and benchmark loggers as well as `offpolicy_trainer` for your own understanding, and feel free to write your own logger for your purposes! Let me know if you have any other questions, thanks!
Originally posted by @zixianma in #3 (comment)
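For reference, here is a minimal sketch of how I read the two computations described above. The class, method, and variable names below are placeholders for illustration, not the actual Tianshou logger attributes:

```python
# Rough sketch (assumed names, not the actual logger code) of how I understand
# the difference between test_reward and test_bench/step_reward.

class RewardLoggerSketch:
    """Resets at the end of every episode; tracks per-episode returns."""
    def __init__(self):
        self.episode_return = 0.0
        self.episode_returns = []          # one entry per finished episode

    def on_step(self, reward, done):
        self.episode_return += reward
        if done:
            self.episode_returns.append(self.episode_return)  # ~ test_reward
            self.episode_return = 0.0      # reset at episode end


class BenchmarkLoggerSketch:
    """Resets only once (at the collector's init()); tracks avg reward per step."""
    def __init__(self):
        self.step_rewards = []             # never cleared between episodes

    def on_episode_end(self, episode_return, episode_steps):
        # Episode return divided by its length -> ~ test_bench/step_reward
        self.step_rewards.append(episode_return / episode_steps)
```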
Thanks for your reply to the previous question. Following your suggestion, I checked the code for `SimpleSpreadBenchmarkLogger` and found a line that might explain the difference between the two metrics (i.e., `test_reward` and `test_bench/step_reward`). Here is the code:
`alignment/map/tianshou/env/utils.py`, line 232 in 58754e4:

`bench_data = elem['n'][0]`
Here, only the first agent's info (i.e., `elem['n'][0]`) is added. However, with the default setting there are 5 agents, so `elem['n']` has length 5, and each element of `elem['n']` carries a different agent's info, so the rewards can differ across agents. This does not happen in the computation of `test_reward`, which is why the two trends differ. Could you check whether my understanding is correct? Thanks!
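To illustrate with a toy example (the `'reward'` key and the numbers below are made up, only to show that taking `elem['n'][0]` drops the other agents' info):

```python
# Hypothetical per-step info for the 5 agents in simple_spread; the values
# and the 'reward' key are illustrative, not taken from the actual env.
elem = {'n': [
    {'reward': -1.2},
    {'reward': -0.8},
    {'reward': -1.5},
    {'reward': -0.9},
    {'reward': -1.1},
]}

# Current behavior (utils.py line 232): only the first agent's info is kept.
bench_data = elem['n'][0]
print(bench_data['reward'])                               # -1.2

# A team-level alternative: average (or sum) over all 5 agents' rewards.
team_avg = sum(a['reward'] for a in elem['n']) / len(elem['n'])
print(team_avg)                                           # -1.1
```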