Description
Hi~
While running the step 2 reward model training, I got a strange result after one epoch of training:
***** Evaluating reward, Epoch 1/1 *****
chosen_last_scores (higher is better) : -9.388486862182617, acc (higher is better) : 0.5991161465644836.
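For context, my understanding is that acc here is the fraction of eval pairs where the chosen response is scored higher than the rejected one, so random guessing sits around 0.5 and 0.599 is barely above chance; the reward scale itself is unconstrained, so the negative chosen_last_scores is less telling on its own. A minimal sketch of that metric, assuming I'm reading the eval loop right (pairwise_accuracy is a hypothetical name, not DeepSpeed-Chat's actual function):

import torch

def pairwise_accuracy(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> float:
    # Fraction of pairs where the chosen response outscores the rejected one;
    # 0.5 is chance level for an untrained model.
    return (chosen_scores > rejected_scores).float().mean().item()

# e.g. scores for 4 eval pairs, 3 ranked correctly -> acc = 0.75
chosen = torch.tensor([-9.1, -8.7, -10.2, -9.5])
rejected = torch.tensor([-9.8, -9.0, -9.9, -11.0])
print(pairwise_accuracy(chosen, rejected))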
I wonder what's wrong with my training script. It is as follows:
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
ZERO_STAGE=0
fi
mkdir -p $OUTPUT
export CUDA_VISIBLE_DEVICES=1
deepspeed --master_port 29501 --include localhost:1 main.py --model_name_or_path facebook/opt-350m \
   --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP \
   --num_padding_at_beginning 1 --gradient_accumulation_steps 2 --zero_stage $ZERO_STAGE \
   --per_device_train_batch_size 8 --per_device_eval_batch_size 16 --num_train_epochs 1 \
   --deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log
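When the script is run without arguments, OUTPUT falls back to ./output and ZERO_STAGE to 0 (i.e., no ZeRO partitioning), per the checks at the top.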
The training runs on a single 32 GB Tesla V100 GPU.