Commit 90bc0c9

Update hyperparameters to reflect InstructGPT (#1966)
I noticed that there are some non-standard hyperparameter values (namely the Adam betas and weight decay), so I suggest experimenting with the values proposed by InstructGPT [1], which are quite standard for LLM training AFAIK:

> All models are trained with the Adam optimizer, with β1 = 0.9 and β2 = 0.95.

> We train our SFT models for 16 epochs with residual dropout of 0.2. We use a cosine LR schedule down to 10% of the original learning rate, with no learning rate warmup. For our 1.3B and 6B models, we use an LR of 9.65e-6 and a batch size of 32. For 175B, we use a LR of 5.03e-6 and a batch size of 8.

Unfortunately, I don't have easy access to compute power, so I will have to let someone else validate whether or not these changes have any (desired) effect.

[1] https://arxiv.org/abs/2203.02155
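The learning-rate schedule described in the quoted passage (cosine decay to 10% of the starting rate, no warmup) can be sketched in plain Python as follows. This is only an illustration of the formula: the function name `cosine_lr` and its parameters are hypothetical, and this commit does not change the scheduler (the configs below still set `warmup_steps`).

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, final_ratio: float = 0.1) -> float:
    """Cosine decay from peak_lr down to final_ratio * peak_lr, with no warmup.

    Hypothetical helper for illustration; not part of the repository.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so values outside [0, total_steps] stay on the curve's ends.
    progress = min(max(step, 0), total_steps) / total_steps
    # Goes from 1.0 at progress=0 to 0.0 at progress=1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = final_ratio * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine


# Learning rate quoted for the 1.3B and 6B models; total_steps is an arbitrary choice here.
start_lr = cosine_lr(0, 1000, 9.65e-6)
end_lr = cosine_lr(1000, 1000, 9.65e-6)
```

With these inputs the schedule starts at the full learning rate and ends at one tenth of it.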
2 parents 34b9535 + 051e025 commit 90bc0c9

2 files changed

Lines changed: 13 additions & 7 deletions

model/model_training/configs/config.yaml

Lines changed: 10 additions & 7 deletions
@@ -4,6 +4,9 @@ defaults:
   gradient_accumulation_steps: 32
   per_device_train_batch_size: 2
   per_device_eval_batch_size: 2
+  adam_beta1: 0.9
+  adam_beta2: 0.95
+  adam_epsilon: 1e-8
   weight_decay: 0.00
   warmup_steps: 600
   eval_steps: 500
@@ -64,7 +67,7 @@ oa_dataset_only:
 pythia:
   learning_rate: 8e-6
   model_name: EleutherAI/pythia-70m-deduped
-  weight_decay: 0.01
+  weight_decay: 0.0
   max_length: 520
   warmup_steps: 1000
   gradient_checkpointing: false
@@ -76,7 +79,7 @@ pythia:
 pythia-1B:
   learning_rate: 8e-6
   model_name: EleutherAI/pythia-1b-deduped
-  weight_decay: 0.01
+  weight_decay: 0.0
   max_length: 520
   warmup_steps: 1000
   gradient_checkpointing: false
@@ -87,28 +90,28 @@ pythia-1B:
 galactica-125m:
   learning_rate: 5e-5
   model_name: facebook/galactica-125m
-  weight_decay: 0.01
+  weight_decay: 0.0
   warmup_steps: 600
   gradient_checkpointing: false
   gradient_accumulation_steps: 2
   per_device_train_batch_size: 4
   per_device_eval_batch_size: 4

 gpt-jt:
-  learning_rate: 2e-6
+  learning_rate: 8e-6
   model_name: togethercomputer/GPT-JT-6B-v1
-  weight_decay: 0.01
+  weight_decay: 0.0
   max_length: 1024
   warmup_steps: 600
   gradient_checkpointing: false
-  gradient_accumulation_steps: 2
+  gradient_accumulation_steps: 8
   per_device_train_batch_size: 4
   per_device_eval_batch_size: 4

 codegen:
   learning_rate: 8e-6
   model_name: Salesforce/codegen-2B-multi
-  weight_decay: 0.01
+  weight_decay: 0.0
   max_length: 520
   warmup_steps: 1000
   gradient_checkpointing: false
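To see what the `gradient_accumulation_steps` change means for `gpt-jt`, the effective batch size per optimizer step can be worked out as below. This is a rough sketch: the helper `effective_batch_size` is hypothetical, and `num_devices=1` is an assumption, since the diff does not say how many devices are used. Under that assumption the new value gives 32, which happens to equal the batch size InstructGPT reports for its 1.3B and 6B models; the commit itself does not state that this was the motivation.

```python
def effective_batch_size(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_devices: int = 1,  # assumption: device count is not given in the diff
) -> int:
    """Number of samples contributing to one optimizer step (hypothetical helper)."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# gpt-jt values taken from the diff above.
gpt_jt_before = effective_batch_size(per_device_train_batch_size=4, gradient_accumulation_steps=2)
gpt_jt_after = effective_batch_size(per_device_train_batch_size=4, gradient_accumulation_steps=8)

# Values from the `defaults:` section.
defaults_batch = effective_batch_size(per_device_train_batch_size=2, gradient_accumulation_steps=32)
```

Multiplying by the real device count gives the global batch size for a multi-GPU run.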

model/model_training/trainer_sft.py

Lines changed: 3 additions & 0 deletions
@@ -252,6 +252,9 @@ def argument_parsing(notebook=False, notebook_args=None):
         gradient_accumulation_steps=training_conf.gradient_accumulation_steps,
         per_device_train_batch_size=training_conf.per_device_train_batch_size,
         per_device_eval_batch_size=training_conf.per_device_eval_batch_size,
+        adam_beta1=training_conf.adam_beta1,
+        adam_beta2=training_conf.adam_beta2,
+        adam_epsilon=float(training_conf.adam_epsilon),
         weight_decay=training_conf.weight_decay,
         max_grad_norm=training_conf.max_grad_norm,
         logging_steps=training_conf.logging_steps,
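The `float(...)` cast on `adam_epsilon` is plausibly there because PyYAML resolves an exponent literal without a decimal point, such as `1e-8`, to a string rather than a float. The sketch below assumes that behavior and uses a plain dict as a hypothetical stand-in for the loaded config; `optimizer_kwargs` is an illustrative helper, not a function from the repository.

```python
# Hypothetical stand-in for the loaded config. Assumption: the YAML loader
# returned the exponent-only literal `1e-8` as the string "1e-8".
training_conf = {"adam_beta1": 0.9, "adam_beta2": 0.95, "adam_epsilon": "1e-8"}


def optimizer_kwargs(conf: dict) -> dict:
    """Normalize Adam settings to floats before handing them to a trainer."""
    return {
        "adam_beta1": float(conf["adam_beta1"]),
        "adam_beta2": float(conf["adam_beta2"]),
        # float() accepts both a real float and a string such as "1e-8".
        "adam_epsilon": float(conf["adam_epsilon"]),
    }


kwargs = optimizer_kwargs(training_conf)
```

Casting defensively like this works whether the loader hands back a string or a number, which is why it is harmless to apply to the betas as well.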
