Commit d8d4fa2

Merge branch 'microsoft:master' into master
2 parents: 7ec112a + b965b9c

File tree: 25 files changed, +7496 -52 lines


Diff for: applications/DeepSpeed-Chat/dschat/utils/data/data_utils.py (-5 lines)

@@ -211,12 +211,7 @@ def create_dataset_split(current_dataset, raw_dataset, train_phase, tokenizer,
                     padding="max_length",
                     truncation=True,
                     return_tensors="pt")
-                chosen_token["input_ids"] = chosen_token["input_ids"]
-                chosen_token["attention_mask"] = chosen_token["attention_mask"]
                 chosen_dataset.append(chosen_token)
-
-                reject_token["input_ids"] = reject_token["input_ids"]
-                reject_token["attention_mask"] = reject_token["attention_mask"]
                 reject_dataset.append(reject_token)
         print(
             f'Creating dataset {raw_dataset.dataset_name_clean} for {train_phase=} size={len(chosen_dataset)}'
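
The removed lines were no-op self-assignments (re-assigning `input_ids` and `attention_mask` to themselves), so the tokenizer output is now appended directly. A minimal sketch of the resulting flow, with `tokenizer`, `chosen_sentence`, `max_seq_len`, and `chosen_dataset` assumed from the surrounding `create_dataset_split` code:

```python
# Sketch only: the variable names are assumed from the surrounding
# function context, not copied verbatim from the repository.
chosen_token = tokenizer(chosen_sentence,
                         max_length=max_seq_len,
                         padding="max_length",
                         truncation=True,
                         return_tensors="pt")
# The tokenizer output already carries "input_ids" and "attention_mask",
# so it can be appended as-is.
chosen_dataset.append(chosen_token)
```
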
Diff for: new file (+26 lines)

# 🐕 Direct Preference Optimization (DPO) finetuning

[Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290) is a novel approach to preference learning, which directly optimizes the policy without explicit reward modeling or reinforcement learning. It leverages a specific parameterization of the reward model that enables the extraction of the corresponding optimal policy in closed form. By using a simple classification loss, DPO aligns language models with human preferences, avoiding the complexity and instability often associated with RLHF.
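
The "simple classification loss" is a logistic loss on the difference of policy-vs-reference log-probability ratios for the chosen and rejected responses. Below is a minimal PyTorch-style sketch of that objective; it is not the exact implementation in this repository, and the function name, argument names, and `beta` default are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: -log sigmoid(beta * (r_chosen - r_rejected)), where each
    r_* is the log-ratio between the policy and the frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi(y_l|x) - log pi_ref(y_l|x)
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Here the per-sequence log-probabilities are summed over the response tokens, and the reference model stays frozen; only the policy receives gradients.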

As the paper says, "Your Language Model is Secretly a Reward Model." Therefore, the training arguments and the training process of DPO are largely the same as those for the reward model, as shown in [step2 "Reward Model (RM) finetuning"](../step2_reward_model_finetuning/README.md). After DPO training, you will have a model aligned with human preferences.

## 🏃 How to train the model

We provide a script for OPT-350m, which you can test by launching the command

```bash
training_scripts/opt/single_node/run_350m.sh
```

We also provide a script for llama2, which you can test by launching the command

```bash
training_scripts/llama2/run_llama2_7b.sh
```

## 🏃 How to evaluate the DPO checkpoint?

The DPO checkpoint is simply a language model, so it can be evaluated in the same way as in [step1 "Supervised Finetuning"](../step1_supervised_finetuning/README.md).

## 💁 Datasets

Because DPO treats the language model as a reward model, the dataset for DPO is in the same format as that used for reward model fine-tuning. Each item in the dataset includes one "chosen" and one "rejected" output for the same input.
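
For illustration, a single preference pair might look like the example below. The field names follow the common prompt/chosen/rejected convention of RM-style datasets (e.g. Dahoas/rm-static); the exact schema depends on the raw dataset class, and the values here are made up:

```python
# Hypothetical example of one preference pair in the RM-style format.
sample = {
    "prompt": "Human: How do I brew a good cup of coffee? Assistant:",
    # Preferred (chosen) continuation for the prompt
    "chosen": " Start with freshly ground beans, use water just off the boil, ...",
    # Dispreferred (rejected) continuation for the same prompt
    "rejected": " I have no idea.",
}
```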
