Can I use Llama-Factory to train StepDPO in the same way as training DPO if I concatenate `initial_reason_steps` into `prompt`?
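To make the proposed conversion concrete, here is a minimal sketch that folds `initial_reason_steps` into the prompt so each Step-DPO record becomes an ordinary pairwise (chosen/rejected) record. It assumes the Step-DPO records carry `prompt`, `initial_reason_steps`, `chosen` and `rejected` fields, and it targets an alpaca-style pairwise layout (`instruction`, `input`, `chosen`, `rejected`). The function name `step_dpo_to_pairwise` and the `"\n"` separator are hypothetical choices, not part of either project.

```python
import json


def step_dpo_to_pairwise(record: dict) -> dict:
    """Hypothetical converter: fold the shared reasoning prefix into the prompt.

    Assumes Step-DPO-style fields: prompt, initial_reason_steps, chosen, rejected.
    """
    prompt = record["prompt"]
    prefix = record.get("initial_reason_steps", "")
    # Both responses continue from the same reasoning prefix, so it is
    # attached to the prompt (separator "\n" is an assumption).
    instruction = f"{prompt}\n{prefix}" if prefix else prompt
    return {
        "instruction": instruction,
        "input": "",
        "chosen": record["chosen"],      # preferred next reasoning step
        "rejected": record["rejected"],  # dispreferred next reasoning step
    }


if __name__ == "__main__":
    sample = {
        "prompt": "What is 2 + 3?",
        "initial_reason_steps": "Step 1: Add the two numbers.",
        "chosen": "Step 2: 2 + 3 = 5.",
        "rejected": "Step 2: 2 + 3 = 6.",
    }
    print(json.dumps(step_dpo_to_pairwise(sample), indent=2))
```

One detail worth checking before relying on this: once a chat template is applied, the prefix ends up inside the user turn rather than at the start of the assistant turn, which may not match how Step-DPO conditions the chosen and rejected steps. The pairwise dataset registration (e.g. marking it as a ranking dataset in `dataset_info.json`) should be confirmed against the LLaMA-Factory data documentation.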