Description
I am testing the 1.3B training. Steps 1 and 2 have already passed, but there is no change in reward after completing step 3.
I used LoRA to train for one iteration, and the results of steps 1 and 2 are as follows:
step 1:
ppl: 2.18959641456604
I had ChatGPT extract the step 3 logs and compare them with the demo logs provided in the project. I found that the absolute value of my loss is significantly smaller, and the reward seems to be completely random, with no noticeable increase (it stays flat).