
training 12b model seems to require more memory than expected #447

Open
@ChaoChungWu-Johnson

Description

Describe the bug
Hi, I was trying to finetune the pythia-12b model with the following command using DeepSpeed-Chat's step 1 code.
main.py is from DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py

deepspeed main.py \
   --sft_only_data_path {my_dataset} \
   --data_split 10,0,0 \
   --model_name_or_path EleutherAI/pythia-12b-deduped \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 9.65e-6 \
   --weight_decay 0. \
   --num_train_epochs 16  \
   --gradient_accumulation_steps 8 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --lora_dim 64 \
   --gradient_checkpointing \
   --zero_stage 3 \
   --deepspeed \
   --output_dir $OUTPUT_PATH \
   &> $OUTPUT_PATH/training.log

According to the ZeRO-3 memory estimator, finetuning this model should only need resources like the following:

Some weights of the model checkpoint at EleutherAI/pythia-12b-deduped were not used when initializing GPTNeoXModel: ['embed_out.weight']
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 11586M total params, 259M largest layer params.
  per CPU  |  per GPU |   Options
  291.35GB |   0.97GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  517.96GB |   0.97GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  258.98GB |   3.66GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  517.96GB |   3.66GB | offload_param=none, offload_optimizer=cpu , zero_init=0
   11.60GB |  25.25GB | offload_param=none, offload_optimizer=none, zero_init=1
  517.96GB |  25.25GB | offload_param=none, offload_optimizer=none, zero_init=0
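For reference, the table above seems to match DeepSpeed's offline estimator; here is a small sketch to reproduce it, with the 11586M / 259M parameter counts taken from the log above:

# Sketch: reproduce the ZeRO-3 memory estimate without loading the model.
# Parameter counts come from the log above (11586M total, 259M largest layer).
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_cold

estimate_zero3_model_states_mem_needs_all_cold(
    total_params=11586e6,
    largest_layer_params=259e6,
    num_gpus_per_node=8,
    num_nodes=1,
)

# My reading of the no-offload row: roughly 4 bytes * 259M (largest layer)
# + 18 bytes * 11586M / 8 GPUs ~= 0.97GB + 24.3GB ~= 25.25GB per GPU.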

Since I have 720 GB of CPU RAM and 8x 32GB V100 GPUs in total, this spec looks sufficient to run even with such a small batch size (only 1 for now),
but I still get an OOM error, and memory usage climbs to almost 100% (30GB~31GB / 32GB) on each GPU. Any idea why it consumes so much memory?
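For what it's worth, my run does not enable any offload, so I believe only the 25.25GB-per-GPU row applies, and the estimate explicitly covers only params, optimizer states, and gradients (not activations or buffers). If I wanted to match the cheaper rows, my understanding is that the ZeRO-3 section of the DeepSpeed config would need something like the sketch below (standard ZeRO-3 config keys; the exact dict that DeepSpeed-Chat's step 1 script builds may differ):

import deepspeed  # assuming the usual deepspeed.initialize() flow

# Sketch of a ZeRO-3 config with CPU offload for params and optimizer states,
# i.e. the "offload_param=cpu, offload_optimizer=cpu" rows of the estimate.
# Keys follow the standard DeepSpeed config schema; batch sizes mirror my command line.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)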

Another alternative I tried to work around the OOM: replacing --gradient_checkpointing with --only_optimize_lora,
but this resulted in an IndexError, which I guess is another bug.
The error message is quite long, so I'll paste it in the additional context.

To Reproduce
Steps to reproduce the behavior:
Just run main.py with the environment settings and command shown above.

Expected behavior
Training should complete successfully.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.14.0a0+44dac51
deepspeed install path ........... ['/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 12.0
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 1.14, cuda 12.0


System info (please complete the following information):

  • OS: "Ubuntu" VERSION="20.04.5 LTS (Focal Fossa)"
  • GPU count and types: one machine with 8x 32GB V100 GPUs
  • Python version: python 3.8.10

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Yes, with the deepspeed launcher.

Docker context
Are you using a specific docker image that you can share? No.

Additional context
IndexError when I use --only_optimize_lora instead of --gradient_checkpointing.
The message is long, so I've pasted the latter part of it and removed the repeated messages from the other subprocesses; if you need the whole message, please tell me!

[2023-04-26 15:22:51,673] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning
[2023-04-26 15:22:51,673] [INFO] [utils.py:786:see_memory_usage] MA 3.86 GB         Max_MA 4.83 GB         CA 11.34 GB         Max_CA 11 GB
[2023-04-26 15:22:51,674] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 110.99 GB, percent = 14.7%
[2023-04-26 15:22:51,676] [INFO] [stage3.py:113:__init__] Reduce bucket size 500,000,000
[2023-04-26 15:22:51,676] [INFO] [stage3.py:114:__init__] Prefetch bucket size 30000000
Using /home/twsfphn198/.cache/torch_extensions/py38_cu120 as PyTorch extensions root...
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.5340464115142822 seconds
Traceback (most recent call last):
  File "main.py", line 345, in <module>
    main()
  File "main.py", line 290, in main
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1167, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1445, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 133, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range

(the other ranks print the same traceback, interleaved with their own "Loading extension module utils..." lines)

[2023-04-26 15:22:59,098] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9668
[2023-04-26 15:22:59,102] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9669
[2023-04-26 15:22:59,104] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9670
[2023-04-26 15:22:59,318] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9671
[2023-04-26 15:22:59,320] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9673
[2023-04-26 15:22:59,321] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9675
[2023-04-26 15:22:59,323] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9677
[2023-04-26 15:22:59,323] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9679
[2023-04-26 15:22:59,617] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python', '-u', 'main.py', '--local_rank=7', '--sft_only_data_path', 'appier/martechQA', 'sharegpt', '--data_split', '2,4,4', '--model_name_or_path', 'EleutherAI/pythia-12b-deduped', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--max_seq_len', '512', '--learning_rate', '1e-4', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '3', '--only_optimize_lora', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--deepspeed', '--output_dir', '/home/twsfphn198/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/pythia-12b-deduped'] exits with return code = 1
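One observation on the IndexError (my guess, not confirmed): stage3.py is reading optimizer.param_groups[0]['params'][0], so the optimizer apparently receives an empty parameter list. A quick check I plan to run right before deepspeed.initialize (a sketch; the gpt_neox.layers. module prefix for pythia is my assumption):

# Sketch: verify that --only_optimize_lora leaves some trainable parameters.
# If this prints 0, the ZeRO stage-3 optimizer gets an empty param list and
# fails with exactly the IndexError shown above.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"trainable params: {len(trainable)}")
print(trainable[:5])  # check whether the LoRA layers were actually injected

# Note: the failing run passed --lora_module_name decoder.layers. , but pythia
# is a GPTNeoX model whose blocks live under gpt_neox.layers. (my assumption),
# so the LoRA modules may never have been matched.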

Labels: deespeed chat (DeepSpeed Chat), new-config (A modified config from the given example)
