Hello, thank you for your great work! I have a question about Step-DPO. The dataset on HF ("xinlai/Math-Step-DPO-10K") seems to take "prompt" as the input and use "chosen" and "rejected" during training ("full_chosen" and "full_rejected" are presumably not used). Under these circumstances, won't the model tend to generate only partial responses at inference time? I'm not sure whether my understanding is correct here, so feel free to correct me. Thank you!
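To make sure I'm describing my understanding clearly, here is a minimal sketch of what I believe happens with those fields. The example record below is made up by me (only the field names come from "xinlai/Math-Step-DPO-10K"), and `dpo_pair` is just my hypothetical stand-in for how a DPO trainer would assemble the two sequences it scores:

```python
# Illustrative record: field names match the HF dataset, but the contents
# here are invented for the example. My understanding is that "prompt"
# already contains the question plus a verified reasoning prefix, while
# "chosen"/"rejected" hold only the next step.
example = {
    "prompt": "Question: 2 + 3 * 4 = ?\nStep 1: Apply order of operations.\n",
    "chosen": "Step 2: 3 * 4 = 12.\n",
    "rejected": "Step 2: 2 + 3 = 5.\n",
    "full_chosen": "Step 2: 3 * 4 = 12.\nStep 3: 2 + 12 = 14.\nAnswer: 14",
}

def dpo_pair(record):
    """Hypothetical sketch: build the (chosen, rejected) sequences that a
    DPO-style trainer would compare log-probabilities on."""
    return (record["prompt"] + record["chosen"],
            record["prompt"] + record["rejected"])

chosen_seq, rejected_seq = dpo_pair(example)
# If this is right, the preference loss is applied to a single partial step,
# not the full solution -- which is exactly what my question is about.
assert chosen_seq.endswith(example["chosen"])
assert not chosen_seq.endswith(example["full_chosen"])
```

If the trainer really only ever sees `prompt + chosen` like this, that is the source of my worry about the model learning to stop after one step.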