
Much more memory used in step 3 when using multiple GPUs compared to using a single GPU #529


Description

@cokuehuang

System Info:
Memory: 500 GB
GPU: 8 × A100 80 GB
Question:
Why does initializing DeepSpeedRLHFEngine on multiple GPUs use so much more memory than initializing it on a single GPU?
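For context, model_load.py in the attached files.zip essentially just constructs the engine. A rough sketch of what that amounts to is below; it is an assumption based on the DeepSpeedRLHFEngine constructor in step3_rlhf_finetuning/rlhf_engine.py, and the attached script may differ in details:

# Sketch (assumed) of the engine construction exercised by model_load.py.
# Every rank started by the deepspeed launcher runs this independently, so
# with 8 GPUs and ZeRO stage 2 the host presumably ends up holding up to
# 8 full copies of the actor/ref/critic/reward checkpoints in CPU RAM
# while they are being loaded and moved to the GPUs.
from transformers import AutoTokenizer
from rlhf_engine import DeepSpeedRLHFEngine

def build_engine(args):
    tokenizer = AutoTokenizer.from_pretrained(
        args.actor_model_name_or_path, use_fast=True)
    return DeepSpeedRLHFEngine(
        actor_model_name_or_path=args.actor_model_name_or_path,
        critic_model_name_or_path=args.critic_model_name_or_path,
        tokenizer=tokenizer,
        num_total_iters=100,  # placeholder; does not affect model loading
        args=args)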

Reproduce:
Copy model_load.py to DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning
Copy test_model_load.sh to DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node
Test with 8 GPUs:
cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning
bash training_scripts/single_node/test_model_load.sh
max memory used: 500 GB (all of host RAM; the processes are then killed with return code -9)
logs:

[2023-05-16 18:41:16,882] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
	[2023-05-16 18:41:17,031] [INFO] [runner.py:541:main] cmd = /opt/conda/envs/dschat/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=12346 --enable_each_rank_log=None model_load.py --data_path Dahoas/rm-static --data_split 2,4,4 --actor_model_name_or_path /models/actor_models/llama-13B-lora --critic_model_name_or_path /models/reward_models/llama-7B --num_padding_at_beginning 0 --per_device_train_batch_size 4 --per_device_mini_train_batch_size 4 --generation_batch_numbers 1 --ppo_epochs 1 --max_answer_seq_len 512 --max_prompt_seq_len 512 --actor_learning_rate 5e-4 --critic_learning_rate 5e-6 --num_train_epochs 1 --lr_scheduler_type cosine --gradient_accumulation_steps 1 --disable_actor_dropout --num_warmup_steps 100 --deepspeed --seed 1234 --actor_zero_stage 2 --critic_zero_stage 2 --actor_lora_dim 128 --critic_lora_dim 128 --critic_lora_module_name layers. --actor_lora_module_name layers. --only_optimize_lora --output_dir ./output
	[2023-05-16 18:41:19,234] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
	[2023-05-16 18:41:19,234] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
	[2023-05-16 18:41:19,234] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
	[2023-05-16 18:41:19,234] [INFO] [launch.py:247:main] dist_world_size=8
	[2023-05-16 18:41:19,235] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
	[2023-05-16 18:41:23,339] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
	************************[start] Initializing Actor Model [start] *************************
	[2023-05-16 18:43:03,035] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93127
	[2023-05-16 18:43:06,403] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93128
	[2023-05-16 18:43:09,065] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93129
	[2023-05-16 18:43:09,066] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93130
	[2023-05-16 18:43:12,093] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93131
	[2023-05-16 18:43:14,519] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93132
	[2023-05-16 18:43:17,460] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93133
	[2023-05-16 18:43:20,163] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93134
	[2023-05-16 18:43:23,026] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/envs/dschat/bin/python', '-u', 'model_load.py', '--local_rank=7', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--actor_model_name_or_path', '/models/actor_models/llama-13B-lora', '--critic_model_name_or_path', '/models/reward_models/llama-7B', '--num_padding_at_beginning', '0', '--per_device_train_batch_size', '4', '--per_device_mini_train_batch_size', '4', '--generation_batch_numbers', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '512', '--max_prompt_seq_len', '512', '--actor_learning_rate', '5e-4', '--critic_learning_rate', '5e-6', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--gradient_accumulation_steps', '1', '--disable_actor_dropout', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--actor_zero_stage', '2', '--critic_zero_stage', '2', '--actor_lora_dim', '128', '--critic_lora_dim', '128', '--critic_lora_module_name', 'layers.', '--actor_lora_module_name', 'layers.', '--only_optimize_lora', '--output_dir', './output'] exits with return code = -9
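
To see how much host RAM each rank allocates during engine construction, a small logger like the following hypothetical snippet (not part of the attached files) can be added to model_load.py:

# Hypothetical helper to track per-process host memory during init.
import os
import psutil

def log_host_memory(tag):
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    rank = os.environ.get("RANK", "0")  # set by the deepspeed launcher
    print(f"[rank {rank}] {tag}: RSS = {rss_gb:.1f} GiB", flush=True)

# Usage inside model_load.py, around the engine construction:
#   log_host_memory("before DeepSpeedRLHFEngine")
#   rlhf_engine = DeepSpeedRLHFEngine(...)
#   log_host_memory("after DeepSpeedRLHFEngine")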

Test with 1 GPU:
cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning
CUDA_VISIBLE_DEVICES=0 bash training_scripts/single_node/test_model_load.sh

max memory used: 80 GB (the run still fails with CUDA out of memory while initializing the reward model, see the traceback below)
logs:

[2023-05-16 19:29:44,923] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
	Detected CUDA_VISIBLE_DEVICES=1: setting --include=localhost:1
	[2023-05-16 19:29:45,592] [INFO] [runner.py:541:main] cmd = /opt/conda/envs/dschat/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMV19 --master_addr=127.0.0.1 --master_port=12346 --enable_each_rank_log=None model_load.py --data_path Dahoas/rm-static --data_split 2,4,4 --actor_model_name_or_path /models/actor_models/llama-13B-lora --critic_model_name_or_path /models/reward_models/llama-7B-new --num_padding_at_beginning 0 --per_device_train_batch_size 4 --per_device_mini_train_batch_size 4 --generation_batch_numbers 1 --ppo_epochs 1 --max_answer_seq_len 512 --max_prompt_seq_len 512 --actor_learning_rate 5e-4 --critic_learning_rate 5e-6 --num_train_epochs 1 --lr_scheduler_type cosine --gradient_accumulation_steps 1 --disable_actor_dropout --num_warmup_steps 100 --deepspeed --seed 1234 --actor_zero_stage 2 --critic_zero_stage 2 --actor_lora_dim 128 --critic_lora_dim 128 --critic_lora_module_name layers. --actor_lora_module_name layers. --only_optimize_lora --output_dir ./output
	[2023-05-16 19:29:47,689] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [1]}
	[2023-05-16 19:29:47,689] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
	[2023-05-16 19:29:47,689] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
	[2023-05-16 19:29:47,689] [INFO] [launch.py:247:main] dist_world_size=1
	[2023-05-16 19:29:47,689] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=1
	[2023-05-16 19:29:51,316] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
	************************[start] Initializing Actor Model [start] *************************
	...
	*****************[end] Initialized Actor Model [end] (duration: 1162.76s)*****************
	*************************[start] Initializing Ref Model [start] **************************
	...
	******************[end] Initialized Ref Model [end] (duration: 100.52s)*******************
	************************[start] Initializing Critic Model [start] ************************
	...
	...
        *****************[end] Initialized Critic Model [end] (duration: 344.25s)*****************
        ************************[start] Initializing Reward Model [start] ************************
        Traceback (most recent call last):
          File "model_load.py", line 352, in <module>
            main()
          File "model_load.py", line 336, in main
            rlhf_engine = DeepSpeedRLHFEngine(
          File "/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 59, in __init__
            self.reward = self._init_reward(
          File "/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 269, in _init_reward
            reward_engine, *_ = deepspeed.initialize(model=reward_model,
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
            engine = DeepSpeedEngine(args=args,
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 266, in __init__
            self._configure_distributed_model(model)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1037, in _configure_distributed_model
            self.module.to(self.device)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 927, in to
            return self._apply(convert)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
            module._apply(fn)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
            module._apply(fn)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
            module._apply(fn)
          [Previous line repeated 2 more times]
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
            param_applied = fn(param)
          File "/opt/conda/envs/dschat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 925, in convert
            return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
        RuntimeError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 79.15 GiB total capacity; 78.03 GiB already allocated; 51.69 MiB free; 78.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
        [2023-05-16 20:31:26,346] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 121057
        [2023-05-16 20:31:27,001] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/envs/dschat/bin/python', '-u', 'model_load.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--actor_model_name_or_path', '/models/actor_models/llama-13B-lora', '--critic_model_name_or_path', '/models/reward_models/llama-7B', '--num_padding_at_beginning', '0', '--per_device_train_batch_size', '4', '--per_device_mini_train_batch_size', '4', '--generation_batch_numbers', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '512', '--max_prompt_seq_len', '512', '--actor_learning_rate', '5e-4', '--critic_learning_rate', '5e-6', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--gradient_accumulation_steps', '1', '--disable_actor_dropout', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--actor_zero_stage', '2', '--critic_zero_stage', '2', '--actor_lora_dim', '128', '--critic_lora_dim', '128', '--critic_lora_module_name', 'layers.', '--actor_lora_module_name', 'layers.', '--only_optimize_lora', '--output_dir', './output'] exits with return code = 1
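
For what it's worth, a back-of-the-envelope estimate of the fp16 weight footprint (my own arithmetic, based only on the nominal model sizes in the command line) is consistent with both failures: the four models barely fit on a single 80 GB GPU, and eight ranks loading the same checkpoints into host RAM approach ~600 GB, more than the 500 GB available.

# Rough fp16 weight footprint; ignores activations, LoRA and optimizer
# state, and uses nominal parameter counts from the model names above.
params_billion = {"actor (13B)": 13, "ref (13B)": 13,
                  "critic (7B)": 7, "reward (7B)": 7}
bytes_per_param = 2  # fp16
total_gib = sum(b * 1e9 * bytes_per_param
                for b in params_billion.values()) / 1024**3
print(f"fp16 weights on one GPU: ~{total_gib:.0f} GiB")      # ~75 GiB
print(f"host RAM for 8 ranks:    ~{8 * total_gib:.0f} GiB")  # ~596 GiB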

Attachment: files.zip (model_load.py and test_model_load.sh)

Labels: deespeed chat (DeepSpeed Chat), llama (Questions related to llama model), system (An issue with an environment/system setup)
