Direct Preference Optimization (DPO) is a training method for fine-tuning language models on preference data — pairs of responses labeled as preferred vs. rejected — without requiring reinforcement learning or a separate reward model. DPO was introduced by Rafailov et al. (2023) in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model".
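To make the idea concrete, here is a minimal sketch of the per-example DPO loss as defined in the paper. The function name, argument names, and the `beta` value are illustrative assumptions, not part of any particular library's API; the inputs are the summed token log-probabilities of each full response under the policy being trained and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    beta (assumed hyperparameter) controls how strongly the policy
    is kept close to the reference model.
    """
    # Implicit rewards: scaled log-ratio of policy to reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Logistic loss pushing the chosen reward above the rejected one.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Hypothetical log-probs where the policy already slightly prefers
# the chosen response relative to the reference model.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=0.1)
```

When the policy assigns a higher relative log-probability to the preferred response, the margin is positive and the loss drops below log 2; minimizing this loss over a dataset of preference pairs is the entire training objective, with no reward model or RL loop.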
For more information on using our DPO implementation, visit its model page in our documentation.