
SFT training, single GPU (V100 32G): how to adjust my parameters to avoid OOM? Thanks. #389

Open
@Modas-Li

Description


```
OutOfMemoryError: CUDA out of memory. Tried to allocate 3.82 GiB (GPU 0; 31.75 GiB total capacity; 23.21 GiB already allocated; 2.43 GiB free; 25.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-21 19:09:43,054] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15706
[2023-04-21 19:09:43,055] [ERROR] [launch.py:434:sigkill_handler] ['/data/anaconda3/bin/python', '-u', '/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py', '--local_rank=0', '--model_name_or_path', '/data/bloom-1b1', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--deepspeed', '--output_dir', '/data/deepspeed_output/step1/output_sft_0421_bloom1b1', '--per_device_train_batch_size', '1', '--num_train_epochs', '1', '--data_path', 'xc_data', '--gradient_checkpointing', '--zero_stage', '3'] exits with return code = 1
```
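
The allocator note in the traceback points at fragmentation: 25.59 GiB is reserved but only 23.21 GiB is actually allocated, so the 3.82 GiB request fails even though the gap nominally exists. A cheap first thing to try is the error message's own suggestion, set before launching; 128 MB below is a commonly tried starting value, not a tuned one.

```bash
# Follow the traceback's own hint before touching training flags.
# Smaller values fight fragmentation harder but can slow allocation;
# 128 (MB) is just a starting point.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```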
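
If fragmentation alone does not close the gap, the step-1 script exposes several knobs that cut activation and optimizer memory. Below is a sketch of a lower-memory relaunch using the same paths as the failing command; the flag names (`--max_seq_len`, `--offload`, `--only_optimize_lora`) are from the April 2023 DeepSpeed-Chat `main.py` and should be verified against `python main.py --help` on your checkout, and the concrete values are assumptions to start from, not tuned settings. Note that early versions of the script refused to combine `--only_optimize_lora` with `--gradient_checkpointing`, so this sketch drops checkpointing in favor of freezing the base weights; if your version allows both, keep checkpointing as well.

```bash
# Hedged sketch of a lower-memory relaunch (same model, data, and output dir).
# Key changes vs. the failing command: shorter sequences, smaller LoRA rank,
# LoRA-only optimizer state, and ZeRO-3 CPU offload.
deepspeed --num_gpus 1 \
    /data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py \
    --model_name_or_path /data/bloom-1b1 \
    --data_path xc_data \
    --output_dir /data/deepspeed_output/step1/output_sft_0421_bloom1b1 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --max_seq_len 256 \
    --lora_dim 8 \
    --only_optimize_lora \
    --zero_stage 3 \
    --offload \
    --deepspeed
```

Raising `--gradient_accumulation_steps` from 2 to 8 keeps the effective batch size up while per-step activation memory stays at batch size 1. `--max_seq_len` usually has the largest single effect, since activation memory grows with sequence length, and `--lora_dim 128` is unusually large for a 1.1B model; a small rank plus `--only_optimize_lora` shrinks both the trainable parameter count and the optimizer state.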

Labels: deespeed chat (DeepSpeed Chat), new-config (A modified config from the given example)
