I ran `train_baseline.py`, and after some iterations the training output showed that `policy_reward_mean` is always `0`. I do not know whether this result is correct.