
OOM error during distributed training on 80GB GPUs with Mistral-7b #59

@TracyPlus

I ran the following train.sh on Mistral-7B:

accelerate launch finetune.py \
    --output-dir output/yarn-mistral-7b-64k \
    --model mistralai/Mistral-7B-v0.1 \
    --architecture mistral \
    --scaling-factor 8 \
    --max-position-embeddings 4096 \
    --dataset emozilla/yarn-train-tokenized-16k-mistral \
    --sliding-window-attention-schedule 65536 \
    --lr-schedule constant \
    --learning-rate 0.000001 \
    --max-train-steps 1000

with the following accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 2,3,4,5,6,7
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

but I encountered an OutOfMemory error on my 80GB A800s:
(Screenshots from 2024-04-06 showing the out-of-memory traceback.)

I don't know if there's something wrong with my distributed training configuration 🥺
Hope someone can help 🙏🙏
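
If I understand it correctly, with distributed_type: MULTI_GPU accelerate runs plain DDP, so every GPU holds a full copy of the model weights, gradients, and Adam optimizer states — for 7B parameters that's roughly 28 GB (fp32 weights) + 28 GB (gradients) + 56 GB (the two Adam moments) ≈ 112 GB per GPU before any activations, which is already over 80 GB. Would a sharded setup (e.g. DeepSpeed ZeRO-3 through accelerate) be the intended way to run this? Below is the config I would guess at — just a sketch on my side, I'm not sure it matches what this repo expects:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED
deepspeed_config:
  # shard parameters, gradients and optimizer states across the 6 GPUs
  zero_stage: 3
  offload_optimizer_device: none
  offload_param_device: none
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  zero3_init_flag: true
  zero3_save_16bit_model: true
downcast_bf16: 'no'
gpu_ids: 2,3,4,5,6,7
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false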
