Update hyperparameters to reflect InstructGPT (#1966)

sanagno · web-flow · commit 90bc0c9b42c6 · 2023-03-05T12:50:05.000+01:00
I noticed that there are some non-standard hyperparameter values (namely: Adam betas and weight decay), so I suggest experimenting with the values proposed by InstructGPT [1], which are quite standard for LLM training AFAIK:

> All models are trained with the Adam optimizer, with β1 = 0.9 and β2 = 0.95.

> We train our SFT models for 16 epochs with residual dropout of 0.2. We use a cosine LR schedule down to 10% of the original learning rate, with no learning rate warmup. For our 1.3B and 6B models, we use an LR of 9.65e-6 and a batch size of 32. For 175B, we use a LR of 5.03e-6 and a batch size of 8.

Unfortunately, I don't have easy access to compute power, so I will have to let someone else validate whether or not these changes have any (desired) effect.

[1] https://arxiv.org/abs/2203.02155
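For reference, the schedule described in the quoted InstructGPT passage (cosine decay down to 10% of the original learning rate, with no warmup) can be sketched in plain Python. This is only an illustration and not code from this repository: the function name `cosine_lr` and its `total_steps` and `floor_ratio` parameters are assumptions, and the config touched by this change keeps its existing `warmup_steps` values.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, floor_ratio: float = 0.1) -> float:
    """Cosine decay from peak_lr at step 0 to floor_ratio * peak_lr at total_steps.

    Hypothetical helper for illustration; there is no warmup phase.
    """
    # Clamp progress to [0, 1] so steps past the end stay at the floor value.
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Cosine factor goes from 1.0 (start) to 0.0 (end).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = floor_ratio * peak_lr
    return min_lr + (peak_lr - min_lr) * cosine
```

With the learning rate quoted for the 1.3B and 6B models, `cosine_lr(0, 1000, 9.65e-6)` returns the full rate and `cosine_lr(1000, 1000, 9.65e-6)` returns one tenth of it.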
diff --git a/model/model_training/configs/config.yaml b/model/model_training/configs/config.yaml
@@ -4,6 +4,9 @@ defaults:
   gradient_accumulation_steps: 32
   per_device_train_batch_size: 2
   per_device_eval_batch_size: 2
+  adam_beta1: 0.9
+  adam_beta2: 0.95
+  adam_epsilon: 1e-8
   weight_decay: 0.00
   warmup_steps: 600
   eval_steps: 500
@@ -64,7 +67,7 @@ oa_dataset_only:
 pythia:
   learning_rate: 8e-6
   model_name: EleutherAI/pythia-70m-deduped
-  weight_decay: 0.01
+  weight_decay: 0.0
   max_length: 520
   warmup_steps: 1000
   gradient_checkpointing: false
@@ -76,7 +79,7 @@ pythia:
 pythia-1B:
   learning_rate: 8e-6
   model_name: EleutherAI/pythia-1b-deduped
-  weight_decay: 0.01
+  weight_decay: 0.0
   max_length: 520
   warmup_steps: 1000
   gradient_checkpointing: false
@@ -87,28 +90,28 @@ pythia-1B:
 galactica-125m:
   learning_rate: 5e-5
   model_name: facebook/galactica-125m
-  weight_decay: 0.01
+  weight_decay: 0.0
   warmup_steps: 600
   gradient_checkpointing: false
   gradient_accumulation_steps: 2
   per_device_train_batch_size: 4
   per_device_eval_batch_size: 4
 
 gpt-jt:
-  learning_rate: 2e-6
+  learning_rate: 8e-6
   model_name: togethercomputer/GPT-JT-6B-v1
-  weight_decay: 0.01
+  weight_decay: 0.0
   max_length: 1024
   warmup_steps: 600
   gradient_checkpointing: false
-  gradient_accumulation_steps: 2
+  gradient_accumulation_steps: 8
   per_device_train_batch_size: 4
   per_device_eval_batch_size: 4
 
 codegen:
   learning_rate: 8e-6
   model_name: Salesforce/codegen-2B-multi
-  weight_decay: 0.01
+  weight_decay: 0.0
   max_length: 520
   warmup_steps: 1000
   gradient_checkpointing: false
diff --git a/model/model_training/trainer_sft.py b/model/model_training/trainer_sft.py
@@ -252,6 +252,9 @@ def argument_parsing(notebook=False, notebook_args=None):
         gradient_accumulation_steps=training_conf.gradient_accumulation_steps,
         per_device_train_batch_size=training_conf.per_device_train_batch_size,
         per_device_eval_batch_size=training_conf.per_device_eval_batch_size,
+        adam_beta1=training_conf.adam_beta1,
+        adam_beta2=training_conf.adam_beta2,
+        adam_epsilon=float(training_conf.adam_epsilon),
         weight_decay=training_conf.weight_decay,
         max_grad_norm=training_conf.max_grad_norm,
         logging_steps=training_conf.logging_steps,
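
The three added keyword arguments pass the new config values through to the training setup, and only `adam_epsilon` is wrapped in `float(...)`. A likely reason, assuming the YAML config is read with PyYAML, is that a value written as `1e-8` (without a decimal point) is loaded as a string rather than a number. A minimal, library-free sketch of the same normalization follows; the helper name `adam_kwargs` and the plain-dict config are hypothetical and not part of this change.

```python
def adam_kwargs(conf: dict) -> dict:
    """Collect Adam settings from a config mapping, coercing each to float.

    Hypothetical helper for illustration only.
    """
    return {
        "betas": (float(conf["adam_beta1"]), float(conf["adam_beta2"])),
        # float() accepts both the number 1e-8 and the string "1e-8".
        "eps": float(conf["adam_epsilon"]),
        "weight_decay": float(conf["weight_decay"]),
    }


# Values from the `defaults` section of the diff. Epsilon is given as a string
# to mimic a loader that did not parse "1e-8" as a number.
conf = {"adam_beta1": 0.9, "adam_beta2": 0.95, "adam_epsilon": "1e-8", "weight_decay": 0.0}
print(adam_kwargs(conf))
```

Under the same PyYAML assumption, writing the value as `1.0e-8` in the config would be another way to get a float, but the explicit cast works regardless of how the value is spelled.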