Hi, thanks for your great work! I have a question about the DPO vs. Step-DPO ablation experiment. How was the 5K training dataset constructed? Was it sampled from your publicly available 10K Step-DPO dataset? And is it correct that for Step-DPO you used the `chosen` and `rejected` fields from that dataset, while for DPO you used `full_chosen` and `full_rejected`?