
OOM problem when fine-tuning the reward model with LLaMA in step 2 #521

Open
@kiseliu

Description

I run:

cd training/step2_reward_model_finetuning/
bash training_scripts/single_node/run_llama.sh

run_llama.sh contains

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=0
fi
mkdir -p $OUTPUT

deepspeed main.py \
   --data_path some_data \
   --data_split 2,4,4 \
   --model_name_or_path path_to_llama \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --gradient_checkpointing \
   --max_seq_len 512 \
   --learning_rate 5e-5 \
   --weight_decay 0.1 \
   --num_train_epochs 1 \
   --disable_dropout \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
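
Note that since I call the script with no arguments, OUTPUT falls back to ./output and ZERO_STAGE falls back to 0, i.e. training runs without any ZeRO partitioning. For reference, requesting a higher ZeRO stage only means passing it as the second positional argument (a sketch of the invocation, not what I actually ran):

# Hypothetical invocation: output dir and ZeRO stage as positional arguments.
bash training_scripts/single_node/run_llama.sh ./output 3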

Even with per_device_train_batch_size set to 1 and gradient_checkpointing enabled, I still run out of GPU memory. Any solutions?
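
A direction that might reduce memory pressure (a minimal sketch, assuming step-2 main.py also accepts an --offload flag for ZeRO-Offload, as other DeepSpeed-Chat example scripts do; not verified to resolve this OOM): run with ZeRO stage 3, which partitions optimizer states, gradients, and parameters across GPUs, and offload optimizer state to CPU memory.

# Sketch only: same arguments as run_llama.sh above, but with ZeRO-3 and
# (if supported by main.py) ZeRO-Offload; the remaining flags are unchanged.
deepspeed main.py \
   --data_path some_data \
   --model_name_or_path path_to_llama \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --gradient_checkpointing \
   --max_seq_len 512 \
   --zero_stage 3 \
   --offload \
   --deepspeed \
   --output_dir $OUTPUT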
