When I distill with chain-of-thought (thinking) traces and the loss covers both the thinking process and the answer, the checkpoint selected at the lowest loss turns out to have very low prediction accuracy. Should I change the evaluation so that the loss is computed only over the answer tokens and not over the thinking tokens?
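For concreteness, here is a minimal sketch of what "loss on the answer only" could look like. It assumes per-token log-probabilities are already available and that the index where the answer begins is known. The helper names (`mask_thinking_labels`, `mean_nll`) and the toy numbers are hypothetical, made up for illustration and not taken from any particular training framework. The `-100` sentinel is chosen only because it matches the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
import math

# Assumed sentinel for "do not score this position"; -100 mirrors the
# default ignore_index of PyTorch's CrossEntropyLoss.
IGNORE_INDEX = -100


def mask_thinking_labels(token_ids, answer_start):
    """Return labels where every position before answer_start is masked out."""
    return [
        IGNORE_INDEX if i < answer_start else tok
        for i, tok in enumerate(token_ids)
    ]


def mean_nll(token_logprobs, labels):
    """Mean negative log-likelihood over positions whose label is not masked."""
    kept = [-lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX]
    if not kept:
        raise ValueError("no unmasked tokens to score")
    return sum(kept) / len(kept)


# Toy sequence: 4 thinking tokens followed by 2 answer tokens (made-up ids).
token_ids = [11, 12, 13, 14, 21, 22]
answer_start = 4

# Made-up log-probs: the model fits the thinking tokens well (p = 0.9)
# but is unsure about the answer tokens (p = 0.5).
token_logprobs = [math.log(0.9)] * 4 + [math.log(0.5)] * 2

# Loss over thinking + answer: dominated by the many easy thinking tokens.
full_loss = mean_nll(token_logprobs, token_ids)

# Loss over the answer only: reflects how well the answer itself is predicted.
answer_loss = mean_nll(token_logprobs, mask_thinking_labels(token_ids, answer_start))

print(f"full loss:   {full_loss:.4f}")
print(f"answer loss: {answer_loss:.4f}")
```

In this toy case the full-sequence loss looks low because the thinking tokens outnumber the answer tokens, while the answer-only loss is noticeably higher. That is the kind of mismatch that could make a lowest-loss checkpoint look better than its answer accuracy warrants.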