Description
- In the SFT step, the model I used is llama-7b-hf downloaded from Hugging Face, and all datasets are the defaults. Here is my launch shell (a small check of the saved output follows the script):
deepspeed --hostfile=$hosts_file train_sft.py \
--data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
--data_split 2,4,4 \
--model_name_or_path $path_to'llama-7b-hf/' \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--max_seq_len 512 \
--learning_rate 1e-4 \
--weight_decay 0.1 \
--num_train_epochs 2 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--gradient_checkpointing \
--zero_stage $ZERO_STAGE \
--lora_dim 128 \
--lora_module_name layers. \
--deepspeed \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
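For reference, a minimal sketch (assuming step 1 wrote a Hugging Face-style config.json into the output directory) to confirm the saved actor keeps the LLaMA vocabulary size:

```python
# Sketch: confirm the saved SFT (actor) model keeps the LLaMA vocabulary size.
# Assumes step 1 wrote a Hugging Face-style config.json into its output dir.
from transformers import AutoConfig

sft_output = "./step1_supervised_finetuning/output/"  # used as ACTOR_MODEL_PATH later
print(AutoConfig.from_pretrained(sft_output).vocab_size)  # expect 32000 for llama-7b-hf
```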
- In the reward model step, I use OPT-350m and my launch shell is the default (a checkpoint-inspection sketch follows the script):
deepspeed --hostfile=$hostfile main.py \
--data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
--data_split 2,4,4 \
--model_name_or_path facebook/opt-350m \
--num_padding_at_beginning 1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--max_seq_len 512 \
--learning_rate 5e-5 \
--weight_decay 0.1 \
--disable_dropout \
--num_train_epochs 1 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--zero_stage $ZERO_STAGE \
--deepspeed \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
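A similar sketch (assuming step 2 saved a pytorch_model.bin in its output directory) to see which embedding shape the reward checkpoint actually stores; for facebook/opt-350m the embedding is 512-dimensional, so I would expect (50272, 512):

```python
# Sketch: inspect the embedding weight stored in the step-2 reward checkpoint.
# Assumes step 2 saved a pytorch_model.bin in its output directory.
import torch

ckpt_path = "./step2_reward_model_finetuning/output/pytorch_model.bin"
state_dict = torch.load(ckpt_path, map_location="cpu")
for name, tensor in state_dict.items():
    if "embed_tokens.weight" in name:
        print(name, tuple(tensor.shape))  # expect (50272, 512) for an OPT-350m reward model
```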
- In the last PPO step, I set the SFT result model as ACTOR_MODEL_PATH and the reward model as CRITIC_MODEL_PATH. In the launch shell, I disabled hybrid_engine. Here is my launch shell (a quick check of the critic path follows the script):
ACTOR_MODEL_PATH=$path_to'/step1_supervised_finetuning/output/'
CRITIC_MODEL_PATH=$path_to'/step2_reward_model_finetuning/output/'
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT
Num_Padding_at_Beginning=1 # this is model related
Actor_Lr=5e-4
Critic_Lr=5e-6
hostfile='hostfile'
deepspeed --hostfile=$hostfile main.py \
--data_path Dahoas/rm-static \
--data_split 2,4,4 \
--actor_model_name_or_path $ACTOR_MODEL_PATH \
--critic_model_name_or_path $CRITIC_MODEL_PATH \
--num_padding_at_beginning 1 \
--per_device_train_batch_size 4 \
--per_device_mini_train_batch_size 4 \
--generation_batch_numbers 1 \
--ppo_epochs 1 \
--max_answer_seq_len 256 \
--max_prompt_seq_len 256 \
--actor_learning_rate ${Actor_Lr} \
--critic_learning_rate ${Critic_Lr} \
--actor_weight_decay 0.1 \
--critic_weight_decay 0.1 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 1 \
--num_warmup_steps 100 \
--deepspeed --seed 1234 \
--inference_tp_size 8 \
--tp_gather_partition_size 4 \
--actor_zero_stage $ACTOR_ZERO_STAGE \
--critic_zero_stage $CRITIC_ZERO_STAGE \
--actor_gradient_checkpointing \
--disable_actor_dropout \
--actor_lora_dim 128 \
--actor_lora_module_name layers. \
--output_dir $OUTPUT \
&> $OUTPUT/training.log
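Before launching step 3, a quick sketch (just a config read; the path is the one from my script) to double-check what CRITIC_MODEL_PATH actually contains:

```python
# Sketch: double-check what the critic path actually contains before launching step 3.
import json
import os

critic_path = "./step2_reward_model_finetuning/output/"  # CRITIC_MODEL_PATH in my script
with open(os.path.join(critic_path, "config.json")) as f:
    cfg = json.load(f)
print(cfg.get("model_type"), cfg.get("vocab_size"))  # "opt", 50272 in my case
```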
In steps 1 & 2 everything works well, but in step 3 there is an error:
RuntimeError: Error(s) in loading state_dict for RewardModel:
size mismatch for rwtranrsformer.decoder.embed_tokens.weight: copying a param with shape torch.Size([50272, 512]) from checkpoint, the shape in current model is torch.Size([32000, 512]).
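My understanding (hedged; this is only my reading of the error, not the repo's actual code path) is that step 3 rebuilds the critic around the actor's LLaMA tokenizer (32000 tokens) and then tries to load the OPT-350m step-2 weights (50272-token embedding) into it, so the shapes can never match. The same error can be reproduced outside DeepSpeed-Chat like this:

```python
# Sketch: reproduce the size mismatch by resizing an OPT-350m model to a
# LLaMA-sized vocabulary and then loading the original OPT weights back in.
# Illustrative only; this is not DeepSpeed-Chat's actual code path.
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/opt-350m")
original_state = {k: v.clone() for k, v in model.state_dict().items()}

model.resize_token_embeddings(32000)   # actor (llama-7b-hf) vocabulary size
model.load_state_dict(original_state)  # RuntimeError: size mismatch for decoder.embed_tokens.weight
```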
I found issue #358 about this, but even after pulling the latest master branch it still doesn't work.
Has anyone met this before?