Open
Description
Here is my situation:
- I finished step 2 with the cohere/zhihu_query dataset. The final reward score is 5.07, the rejected score is 0.8, and the accuracy is 0.79, so step 2 seems successful.
- When I attempted step 3, I ran into a loss-scale maximum problem, which I solved by changing the learning rates (actor & critic). Then I met another problem: the critic loss does not decrease. Across many experiments it either rises from 4 to 7 or stays around 5 (a sketch of the critic loss I have in mind is below).
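For reference, here is a minimal sketch of the clipped value-function (critic) loss commonly used in PPO-style step-3 training, which is my understanding of what the logged critic loss measures. The tensor names and clip range are illustrative assumptions, not taken from the actual training code.

```python
import torch

def critic_loss(values, old_values, returns, cliprange_value=0.2):
    # Clip the new value predictions to stay near the old (rollout-time) values.
    values_clipped = torch.min(
        torch.max(values, old_values - cliprange_value),
        old_values + cliprange_value,
    )
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Take the pessimistic (larger) squared error per token, then average,
    # so the reported number is essentially a mean squared error between
    # the critic's predictions and the returns.
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```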
Here are my questions:
- I tested the model (the actor) and found that its performance is better than the SFT model's. Is that normal?
- The actor loss is roughly -advantage * clip(ratio) (see the sketch after this list). In my logs the actor loss moved from -0.1 to -2. Since clip(ratio) stays around 0.8-1.2, this means the advantage is greater than 0 and increased during training. The advantage measures whether the action taken by the actor model is better or worse than average (the baseline). So is a bigger advantage better, and is a smaller (more negative) actor loss better, since a bigger advantage makes the actor loss smaller?
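For clarity, this is the clipped actor (policy) loss I am reasoning about. The full PPO form takes the pessimistic maximum over the unclipped and clipped terms; the tensor names and clip range here are illustrative assumptions rather than the exact training code.

```python
import torch

def actor_loss(logprobs, old_logprobs, advantages, cliprange=0.2):
    # Probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = -advantages * ratio
    clipped = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
    # PPO minimizes the element-wise maximum of the two (a pessimistic bound).
    # With advantage > 0 and ratio near 1, the loss is negative, and it becomes
    # more negative as the advantage grows.
    return torch.max(unclipped, clipped).mean()
```

With this form, a growing positive advantage drives the loss further below zero, which is how I read the drop from -0.1 to -2 in my logs.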
Looking forward to your reply.
Thanks.