Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) is a method for fine-tuning language models directly on preference data — pairs of responses labeled as preferred vs. rejected — without reinforcement learning or a separately trained reward model. DPO was introduced in Rafailov et al. (2023), *Direct Preference Optimization: Your Language Model is Secretly a Reward Model*.
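Concretely, DPO minimizes `-log σ(β · [(log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x))])`, where `y_w`/`y_l` are the preferred and rejected responses, `π_θ` is the policy being trained, and `π_ref` is a frozen reference model. Below is a minimal sketch of that loss for a single preference pair; the function name and the numeric log-probabilities are illustrative, not part of any particular implementation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of the preferred
    ("chosen") or rejected response under the trainable policy or
    the frozen reference model. beta scales how strongly the policy
    is pushed away from the reference model.
    """
    # Implicit reward of each response: how much more likely the
    # policy makes it, relative to the reference model (in log space).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin): the loss falls below log(2) once the
    # policy favors the chosen response more than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs: policy favors the chosen response -> low loss.
low = dpo_loss(-10.0, -14.0, -11.0, -12.0)
# Policy favors the rejected response -> loss above log(2).
high = dpo_loss(-14.0, -10.0, -12.0, -11.0)
```

At a preference margin of zero the loss equals `log(2)`, so values above or below that threshold indicate whether the policy currently agrees with the labeled preference.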

For more information on using our DPO implementation, visit its model page in our documentation.