
training 12b model seems to require more memory than expected #447

Open
@ChaoChungWu-Johnson

Description

Describe the bug
Hi, I was trying to finetune the pythia-12b model with the following command using DeepSpeed-Chat's step 1 code.
main.py is from DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py

deepspeed main.py \
   --sft_only_data_path {my_dataset} \
   --data_split 10,0,0 \
   --model_name_or_path EleutherAI/pythia-12b-deduped \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 9.65e-6 \
   --weight_decay 0. \
   --num_train_epochs 16  \
   --gradient_accumulation_steps 8 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --lora_dim 64 \
   --gradient_checkpointing \
   --zero_stage 3 \
   --deepspeed \
   --output_dir $OUTPUT_PATH \
   &> $OUTPUT_PATH/training.log

According to the ZeRO-3 memory estimator, finetuning this model should only need resources like the following:

Some weights of the model checkpoint at EleutherAI/pythia-12b-deduped were not used when initializing GPTNeoXModel: ['embed_out.weight']
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 11586M total params, 259M largest layer params.
  per CPU  |  per GPU |   Options
  291.35GB |   0.97GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  517.96GB |   0.97GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  258.98GB |   3.66GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  517.96GB |   3.66GB | offload_param=none, offload_optimizer=cpu , zero_init=0
   11.60GB |  25.25GB | offload_param=none, offload_optimizer=none, zero_init=1
  517.96GB |  25.25GB | offload_param=none, offload_optimizer=none, zero_init=0
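For reference, the table above seems to match DeepSpeed's offline estimator; here is a small sketch to reproduce it, with the 11586M / 259M parameter counts taken from the log above:

# Sketch: reproduce the ZeRO-3 memory estimate without loading the model.
# Parameter counts come from the log above (11586M total, 259M largest layer).
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_cold

estimate_zero3_model_states_mem_needs_all_cold(
    total_params=11586e6,
    largest_layer_params=259e6,
    num_gpus_per_node=8,
    num_nodes=1,
)

# My reading of the no-offload row: roughly 4 bytes * 259M (largest layer)
# + 18 bytes * 11586M / 8 GPUs ~= 0.97GB + 24.3GB ~= 25.25GB per GPU.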

Since I have 720 GB of CPU RAM and 8x 32GB V100 GPUs in total, this spec looks sufficient to run even with such a small batch size (only 1 for now),
but I still get an OOM error, and memory usage climbs to almost 100% (30GB~31GB / 32GB) on each GPU. Any idea why it consumes so much memory?
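For what it's worth, my run does not enable any offload, so I believe only the 25.25GB-per-GPU row applies, and the estimate explicitly covers only params, optimizer states, and gradients (not activations or buffers). If I wanted to match the cheaper rows, my understanding is that the ZeRO-3 section of the DeepSpeed config would need something like the sketch below (standard ZeRO-3 config keys; the exact dict that DeepSpeed-Chat's step 1 script builds may differ):

import deepspeed  # assuming the usual deepspeed.initialize() flow

# Sketch of a ZeRO-3 config with CPU offload for params and optimizer states,
# i.e. the "offload_param=cpu, offload_optimizer=cpu" rows of the estimate.
# Keys follow the standard DeepSpeed config schema; batch sizes mirror my command line.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)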

Another alternative I tried to work around the OOM: replacing --gradient_checkpointing with --only_optimize_lora,
but this resulted in an IndexError, which I guess is another bug.
The error message is quite long, so I'll paste it in the additional context.

To Reproduce
Steps to reproduce the behavior:
Just run main.py with the environment settings and command shown above.

Expected behavior
Training should complete successfully.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.14.0a0+44dac51
deepspeed install path ........... ['/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 12.0
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 1.14, cuda 12.0


System info (please complete the following information):

  • OS: "Ubuntu" VERSION="20.04.5 LTS (Focal Fossa)"
  • GPU count and types: one machine with 8x 32GB V100 GPUs
  • Python version: python 3.8.10

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Yes, with the deepspeed launcher.

Docker context
Are you using a specific docker image that you can share? No.

Additional context
IndexError when I use --only_optimize_lora instead of --gradient_checkpointing.
The message is long, so I've pasted the latter part of it and removed the repeated messages from the other subprocesses; if you need the whole message, please tell me!

[2023-04-26 15:22:51,673] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning
[2023-04-26 15:22:51,673] [INFO] [utils.py:786:see_memory_usage] MA 3.86 GB         Max_MA 4.83 GB         CA 11.34 GB         Max_CA 11 GB
[2023-04-26 15:22:51,674] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 110.99 GB, percent = 14.7%
[2023-04-26 15:22:51,676] [INFO] [stage3.py:113:__init__] Reduce bucket size 500,000,000
[2023-04-26 15:22:51,676] [INFO] [stage3.py:114:__init__] Prefetch bucket size 30000000
Using /home/twsfphn198/.cache/torch_extensions/py38_cu120 as PyTorch extensions root...
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.5340464115142822 seconds
Traceback (most recent call last):
  File "main.py", line 345, in <module>
    main()
  File "main.py", line 290, in main
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1167, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1445, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/home/twsfphn198/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 133, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range

(the other ranks print the same traceback, interleaved with their own "Loading extension module utils..." lines)

[2023-04-26 15:22:59,098] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9668
[2023-04-26 15:22:59,102] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9669
[2023-04-26 15:22:59,104] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9670
[2023-04-26 15:22:59,318] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9671
[2023-04-26 15:22:59,320] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9673
[2023-04-26 15:22:59,321] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9675
[2023-04-26 15:22:59,323] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9677
[2023-04-26 15:22:59,323] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9679
[2023-04-26 15:22:59,617] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python', '-u', 'main.py', '--local_rank=7', '--sft_only_data_path', 'appier/martechQA', 'sharegpt', '--data_split', '2,4,4', '--model_name_or_path', 'EleutherAI/pythia-12b-deduped', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--max_seq_len', '512', '--learning_rate', '1e-4', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '3', '--only_optimize_lora', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--deepspeed', '--output_dir', '/home/twsfphn198/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/pythia-12b-deduped'] exits with return code = 1
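One observation on the IndexError (my guess, not confirmed): stage3.py is reading optimizer.param_groups[0]['params'][0], so the optimizer apparently receives an empty parameter list. A quick check I plan to run right before deepspeed.initialize (a sketch; the gpt_neox.layers. module prefix for pythia is my assumption):

# Sketch: verify that --only_optimize_lora leaves some trainable parameters.
# If this prints 0, the ZeRO stage-3 optimizer gets an empty param list and
# fails with exactly the IndexError shown above.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"trainable params: {len(trainable)}")
print(trainable[:5])  # check whether the LoRA layers were actually injected

# Note: the failing run passed --lora_module_name decoder.layers. , but pythia
# is a GPTNeoX model whose blocks live under gpt_neox.layers. (my assumption),
# so the LoRA modules may never have been matched.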

Labels: deespeed chat (DeepSpeed Chat), new-config (A modified config from the given example)
