Approved by @SalmanMohammadi. This issue's purpose is to determine which contrastive optimization methods should be added to torchtune and to track the methods that are currently being implemented. Reference implementations exist for all of the methods listed. Here is the list (most methods are from A* alignment workshops) with their implementation status and some info about each (HF models trained with the method, limitations, notes on current relevance):
- DPO: The classic method against which contrastive methods are usually compared (a minimal loss sketch is given after this list). Models: >6k, Status: Implemented, DPO #645
- SimPO: Current SOTA method according to their paper (see the loss sketch after this list). Models: ~252, Limitations: The most important one is a performance drop on reasoning datasets. The method actually lacks robustness on tasks where specific tokens matter; for instance, `2 + 2 = 5` is a bad result compared to `2 + 2 = 4`. As of v0.2 it cannot really be used where structured outputs are required, but that is also the version where robustness to the learning rate is maximized. See also general limitations. Status: Implemented, SimPO (Simple Preference Optimisation) #1223
- IPO: Basic log-ratio method; I am not sure how relevant it still is. Models: <180, Limitations: Not really clear from the paper. See also general limitations. Status: Stale, remove ipo loss + small fixed #1615
- KTO: Method based on the Kahneman-Tversky value function, which looks really interesting and has also been used for reasoning tasks. Models: ~531, Limitations: Not a hard limitation, but if your data is of high enough quality, DPO will work better than this method. Status: Not implemented, issue Adding a KTO Optimizer #1793
- ORPO: Pretty popular and stable log-ratio method; it actually performs worse than SimPO with the same computational and memory efficiency, but it does not require an SFT-initialized model. Models: ~1220, Limitations: Problems with reasoning datasets (verified empirically). See also general limitations. Status: Not implemented.
- R-DPO: Really interesting but non-trivial method, closest to SimPO in performance. Generally, it is a reward-model method (the reward model is incorporated into the DPO loss) with teacher/student distillation. Models: 29, Limitations: Not really clear from the paper. It is fully robust to noise that might be generated while building synthetic data, and it looks like it is more robust to hyperparameters than log-ratio methods. The reward model itself might count as a potential limitation. Status: Not implemented.
- CPO: Log-ratio method that was actually designed for translation tasks (see the sketch after this list); not sure how relevant it still is. Models: <120, Limitations: Probably pretty similar to the limitations of other log-ratio methods. See also: General limitations. Status: Not implemented, Implement CPO (Contrastive Preference Optimization) #1290
- SMPO: The newest method on this list, which beats SimPO in some experiments. There is an official implementation but no published paper or pre-print yet, so we will wait for it. Limitations: See also: General limitations. Status: Not implemented.
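
To make the DPO vs. SimPO comparison concrete, here is a minimal per-example sketch of the two losses. This is not torchtune's implementation: the function names, argument layout, and the default `beta`/`gamma` values are illustrative only, and it assumes you already have summed token log-probabilities for the chosen and rejected responses.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: implicit reward is the log-ratio of the policy against a frozen reference model."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits)  # per-example loss; mean over the batch when training

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths, beta=2.0, gamma=0.5):
    """SimPO: length-normalized average log-prob margin, no reference model at all."""
    chosen_avg = policy_chosen_logps / chosen_lengths
    rejected_avg = policy_rejected_logps / rejected_lengths
    logits = beta * (chosen_avg - rejected_avg) - gamma  # gamma is the target reward margin
    return -F.logsigmoid(logits)
```

Both return per-example losses and would be averaged over the batch in practice; dropping the reference model is where SimPO's memory savings come from.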
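
In the same spirit, rough sketches of the IPO and CPO objectives mentioned above, again with illustrative names and hyperparameter defaults (`tau`, `beta`, `nll_weight` are not torchtune configuration options):

```python
import torch.nn.functional as F

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, tau=0.1):
    """IPO: regress the reference log-ratio margin toward 1/(2*tau) instead of maximizing it."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    margin = chosen_logratios - rejected_logratios
    return (margin - 1.0 / (2 * tau)) ** 2

def cpo_loss(policy_chosen_logps, policy_rejected_logps,
             chosen_lengths, beta=0.1, nll_weight=1.0):
    """CPO: DPO-style preference term without a reference model, plus NLL on the chosen response."""
    preference = -F.logsigmoid(beta * (policy_chosen_logps - policy_rejected_logps))
    nll = -policy_chosen_logps / chosen_lengths  # behaviour-cloning style regularizer
    return preference + nll_weight * nll
```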
General limitations:
For all log-ratio methods, the obvious one is sensitivity to hyperparameters, especially the most critical one, the learning rate. Accurately transcribing them from the original paper is crucial for getting good results.