Describe the bug
Hi, I was trying to fine-tune the pythia-12b model with the following command using DeepSpeed-Chat's step 1 code.
main.py is from DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py
deepspeed main.py \
--sft_only_data_path {my_dataset} \
--data_split 10,0,0 \
--model_name_or_path EleutherAI/pythia-12b-deduped \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--max_seq_len 512 \
--learning_rate 9.65e-6 \
--weight_decay 0. \
--num_train_epochs 16 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--lora_dim 64 \
--gradient_checkpointing \
--zero_stage 3 \
--deepspeed \
--output_dir $OUTPUT_PATH \
&> $OUTPUT_PATH/training.log
According to ZeRO-3's memory estimation, fine-tuning this model should only require the following resources (a sketch for reproducing this estimate follows the table):
Some weights of the model checkpoint at EleutherAI/pythia-12b-deduped were not used when initializing GPTNeoXModel: ['embed_out.weight']
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 11586M total params, 259M largest layer params.
per CPU | per GPU | Options
291.35GB | 0.97GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
517.96GB | 0.97GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
258.98GB | 3.66GB | offload_param=none, offload_optimizer=cpu , zero_init=1
517.96GB | 3.66GB | offload_param=none, offload_optimizer=cpu , zero_init=0
11.60GB | 25.25GB | offload_param=none, offload_optimizer=none, zero_init=1
517.96GB | 25.25GB | offload_param=none, offload_optimizer=none, zero_init=0
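For reference, the table above can be reproduced outside the training script with DeepSpeed's documented estimator. A minimal sketch, assuming the model is first loaded on CPU (which itself needs roughly 45-50 GB of host RAM for pythia-12b in fp32):

```python
# Minimal sketch (not part of DeepSpeed-Chat): reproduce the ZeRO-3 memory
# estimate for EleutherAI/pythia-12b-deduped on 1 node with 8 GPUs.
from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModel.from_pretrained("EleutherAI/pythia-12b-deduped")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
```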
Since I have 720 GB of CPU RAM and 8x 32 GB V100 GPUs in total, this spec looks sufficient to run even with such a small batch size (only 1 for now).
However, I still got an OOM error, with memory usage close to 100% (30-31 GB out of 32 GB) on each GPU. Any idea why it consumes so much memory?
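(For reference, here is a minimal sketch of the documented ZeRO-3 offload options that the cheaper per-GPU rows in the estimate above assume. DeepSpeed-Chat builds its DeepSpeed config internally, so this only illustrates the relevant keys, not the config the script actually generates.)

```python
# Sketch only: documented ZeRO-3 offload keys matching the
# "offload_param=cpu, offload_optimizer=cpu" row of the estimate above.
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# model and optimizer would come from the training script:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, optimizer=optimizer, config=ds_config)
```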
As an alternative way to deal with the OOM, I tried replacing --gradient_checkpointing with --only_optimize_lora,
but this resulted in an IndexError, which I guess is another bug.
The error message is quite long, so I have pasted it in the additional context section.
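(For what it's worth, below is a minimal standalone guess at how the IndexError in that traceback could arise: if the grouped parameter list handed to the base optimizer contains a group whose params list ends up empty, for example because --only_optimize_lora is set but --lora_module_name matches no modules, then param_groups[0]['params'][0] has nothing to index. This is only my guess at the mechanism, not a confirmed diagnosis.)

```python
# Standalone illustration (my guess, not DeepSpeed code): an optimizer built
# from grouped parameters where one group ends up empty constructs fine, but
# param_groups[0]["params"][0] then has nothing to index -- the same access
# that stage3.py performs to read the optimizer dtype.
import torch

weight = torch.nn.Parameter(torch.randn(4, 4))
grouped_params = [
    {"params": [], "weight_decay": 0.0},        # group that matched no modules
    {"params": [weight], "weight_decay": 0.0},  # group with the real parameters
]
optimizer = torch.optim.AdamW(grouped_params, lr=9.65e-6)

try:
    optimizer.param_groups[0]["params"][0].dtype
except IndexError as err:
    print(err)  # "list index out of range"
```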
To Reproduce
Steps to reproduce the behavior:
Just run main.py with the environment settings and command shown above.
Expected behavior
Training should complete successfully.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.14.0a0+44dac51
deepspeed install path ........... ['/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 12.0
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 1.14, cuda 12.0
System info (please complete the following information):
- OS: Ubuntu 20.04.5 LTS (Focal Fossa)
- GPU count and types: one machine with 8x 32 GB V100
- Python version: 3.8.10
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Yes, with the deepspeed launcher.
Docker context
Are you using a specific docker image that you can share? No.
Additional context
This is the IndexError I get when I use --only_optimize_lora instead of --gradient_checkpointing.
The message is long, so I only pasted the latter part and removed the repeated messages from the other subprocesses; if you need the whole message, please tell me!
[2023-04-26 15:22:51,673] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning
[2023-04-26 15:22:51,673] [INFO] [utils.py:786:see_memory_usage] MA 3.86 GB Max_MA 4.83 GB CA 11.34 GB Max_CA 11 GB
[2023-04-26 15:22:51,674] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 110.99 GB, percent = 14.7%
[2023-04-26 15:22:51,676] [INFO] [stage3.py:113:__init__] Reduce bucket size 500,000,000
[2023-04-26 15:22:51,676] [INFO] [stage3.py:114:__init__] Prefetch bucket size 30000000
Using /home/twsfphn198/.cache/torch_extensions/py38_cu120 as PyTorch extensions root...
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.5340464115142822 seconds
Traceback (most recent call last):
File "main.py", line 345, in <module>
main()
File "main.py", line 290, in main
model, optimizer, _, lr_scheduler = deepspeed.initialize(
File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
Loading extension module utils...
self._configure_optimizer(optimizer, model_parameters)
File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1167, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1445, in _configure_zero_optimizer
Time to load utils op: 0.10356998443603516 seconds
Traceback (most recent call last):
File "main.py", line 345, in <module>
optimizer = DeepSpeedZeroOptimizer_Stage3(
File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 133, in __init__
self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range
main()
File "main.py", line 290, in main
model, optimizer, _, lr_scheduler = deepspeed.initialize(
File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1167, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1445, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer_Stage3(
File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 133, in __init__
self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range
[2023-04-26 15:22:59,098] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9668
[2023-04-26 15:22:59,102] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9669
[2023-04-26 15:22:59,104] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9670
[2023-04-26 15:22:59,318] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9671
[2023-04-26 15:22:59,320] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9673
[2023-04-26 15:22:59,321] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9675
[2023-04-26 15:22:59,323] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9677
[2023-04-26 15:22:59,323] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9679
[2023-04-26 15:22:59,617] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python', '-u', 'main.py', '--local_rank=7', '--sft_only_data_path', 'appier/martechQA', 'sharegpt', '--data_split', '2,4,4', '--model_name_or_path', 'EleutherAI/pythia-12b-deduped', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--max_seq_len', '512', '--learning_rate', '1e-4', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '3', '--only_optimize_lora', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--deepspeed', '--output_dir', '/home/twsfphn198/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/pythia-12b-deduped'] exits with return code = 1