
OOM error during distributed training on 80GB GPUs with Mistral-7b #59

@TracyPlus

I ran the following train.sh on Mistral-7B:

accelerate launch finetune.py \
    --output-dir output/yarn-mistral-7b-64k \
    --model mistralai/Mistral-7B-v0.1 \
    --architecture mistral \
    --scaling-factor 8 \
    --max-position-embeddings 4096 \
    --dataset emozilla/yarn-train-tokenized-16k-mistral \
    --sliding-window-attention-schedule 65536 \
    --lr-schedule constant \
    --learning-rate 0.000001 \
    --max-train-steps 1000

with the following accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 2,3,4,5,6,7
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

but I encountered an OutOfMemory error on my 80GB A800s:
(Screenshots from 2024-04-06 showing the out-of-memory traceback.)

I don't know if there's something wrong with my distributed training configuration 🥺
Hope someone can help 🙏🙏
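
If I understand it correctly, with distributed_type: MULTI_GPU accelerate runs plain DDP, so every GPU holds a full copy of the model weights, gradients, and Adam optimizer states — for 7B parameters that's roughly 28 GB (fp32 weights) + 28 GB (gradients) + 56 GB (the two Adam moments) ≈ 112 GB per GPU before any activations, which is already over 80 GB. Would a sharded setup (e.g. DeepSpeed ZeRO-3 through accelerate) be the intended way to run this? Below is the config I would guess at — just a sketch on my side, I'm not sure it matches what this repo expects:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED
deepspeed_config:
  # shard parameters, gradients and optimizer states across the 6 GPUs
  zero_stage: 3
  offload_optimizer_device: none
  offload_param_device: none
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  zero3_init_flag: true
  zero3_save_16bit_model: true
downcast_bf16: 'no'
gpu_ids: 2,3,4,5,6,7
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false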
