Approved by @SalmanMohammadi. This issue's purpose is to determine which contrastive optimization methods should be added to torchtune and to track the methods that are currently being implemented. Reference implementations exist for all of the methods listed. Here is the list (most methods are from A* alignment workshops) with their implementation status and some info about each (HF models trained with the method, limitations, notes on current relevance):
- DPO: The classic method against which contrastive methods are usually compared (a minimal loss sketch is given after this list). Models: >6k, Status: Implemented, DPO #645
- SimPO: Current SOTA method according to their paper (see the loss sketch after this list). Models: ~252, Limitations: The most important one is a performance drop on reasoning datasets. The method actually lacks robustness on tasks where specific tokens matter; for instance, `2 + 2 = 5` is a bad result compared to `2 + 2 = 4`. As of v0.2 it cannot really be used where structured outputs are required, but that is also the version where robustness to the learning rate is maximized. See also general limitations. Status: Implemented, SimPO (Simple Preference Optimisation) #1223
- IPO: Basic log-ratio method; I am not sure how relevant it still is. Models: <180, Limitations: Not really clear from the paper. See also general limitations. Status: Stale, remove ipo loss + small fixed #1615
- KTO: Method based on the Kahneman-Tversky value function, which looks really interesting and has also been used for reasoning tasks. Models: ~531, Limitations: Not a hard limitation, but if your data is of high enough quality, DPO will work better than this method. Status: Not implemented, issue Adding a KTO Optimizer #1793
- ORPO: Pretty popular and stable log-ratio method; it actually performs worse than SimPO with the same computational and memory efficiency, but it does not require an SFT-initialized model. Models: ~1220, Limitations: Problems with reasoning datasets (verified empirically). See also general limitations. Status: Not implemented.
- R-DPO: Really interesting but non-trivial method, closest to SimPO in performance. Generally, it is a reward-model method (the reward model is incorporated into the DPO loss) with teacher/student distillation. Models: 29, Limitations: Not really clear from the paper. It is fully robust to noise that might be generated while building synthetic data, and it looks like it is more robust to hyperparameters than log-ratio methods. The reward model itself might count as a potential limitation. Status: Not implemented.
- CPO: Log-ratio method that was actually designed for translation tasks (see the sketch after this list); not sure how relevant it still is. Models: <120, Limitations: Probably pretty similar to the limitations of other log-ratio methods. See also: General limitations. Status: Not implemented, Implement CPO (Contrastive Preference Optimization) #1290
- SMPO: The newest method on this list, which beats SimPO in some experiments. There is an official implementation but no published paper or pre-print yet, so we will wait for it. Limitations: See also: General limitations. Status: Not implemented.
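
To make the DPO vs. SimPO comparison concrete, here is a minimal per-example sketch of the two losses. This is not torchtune's implementation: the function names, argument layout, and the default `beta`/`gamma` values are illustrative only, and it assumes you already have summed token log-probabilities for the chosen and rejected responses.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: implicit reward is the log-ratio of the policy against a frozen reference model."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits)  # per-example loss; mean over the batch when training

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths, beta=2.0, gamma=0.5):
    """SimPO: length-normalized average log-prob margin, no reference model at all."""
    chosen_avg = policy_chosen_logps / chosen_lengths
    rejected_avg = policy_rejected_logps / rejected_lengths
    logits = beta * (chosen_avg - rejected_avg) - gamma  # gamma is the target reward margin
    return -F.logsigmoid(logits)
```

Both return per-example losses and would be averaged over the batch in practice; dropping the reference model is where SimPO's memory savings come from.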
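
In the same spirit, rough sketches of the IPO and CPO objectives mentioned above, again with illustrative names and hyperparameter defaults (`tau`, `beta`, `nll_weight` are not torchtune configuration options):

```python
import torch.nn.functional as F

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, tau=0.1):
    """IPO: regress the reference log-ratio margin toward 1/(2*tau) instead of maximizing it."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    margin = chosen_logratios - rejected_logratios
    return (margin - 1.0 / (2 * tau)) ** 2

def cpo_loss(policy_chosen_logps, policy_rejected_logps,
             chosen_lengths, beta=0.1, nll_weight=1.0):
    """CPO: DPO-style preference term without a reference model, plus NLL on the chosen response."""
    preference = -F.logsigmoid(beta * (policy_chosen_logps - policy_rejected_logps))
    nll = -policy_chosen_logps / chosen_lengths  # behaviour-cloning style regularizer
    return preference + nll_weight * nll
```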
General limitations:
For all log-ratio methods, the obvious one is sensitivity to hyperparameters, especially the most critical one, the learning rate. Accurately transcribing them from the original paper is crucial for getting good results.