DeepSpeed stage 3 crashing with student + teacher #17319

@andrasiani

Bug description

Hi,
I have a pretrained 1.5B-parameter GPT-XL teacher network in fp16 with requires_grad=False. The student network is a small GPT with 142M parameters.
I use PyTorch Lightning, and in the training step I call the teacher first, then the student. The build_net method returns only the student network, so the optimizer should contain only the student's weights.
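
For illustration, the setup described above can be sketched in plain PyTorch. This is a minimal sketch, not the actual training code: `DistillModule`, the tiny `nn.Linear` layers standing in for the two GPT models, the KL-divergence loss, and the learning rate are all hypothetical, and the teacher is left in float32 so the snippet also runs on CPU (the real teacher is fp16).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistillModule(nn.Module):
    """Toy stand-in for the module described above (all names are hypothetical)."""

    def __init__(self, dim=16, vocab=32):
        super().__init__()
        self.teacher = nn.Linear(dim, vocab)  # stands in for the 1.5B GPT-XL teacher
        self.student = nn.Linear(dim, vocab)  # stands in for the 142M student
        # Freeze the teacher: its parameters never receive gradients.
        self.teacher.requires_grad_(False)
        self.teacher.eval()

    def training_step(self, x):
        # Teacher is called first, without building an autograd graph.
        with torch.no_grad():
            teacher_logits = self.teacher(x)
        student_logits = self.student(x)
        # Assumed distillation loss: KL divergence between output distributions.
        return F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )

    def configure_optimizers(self):
        # Only the student's parameters are handed to the optimizer.
        return torch.optim.AdamW(self.student.parameters(), lr=1e-3)


torch.manual_seed(0)
module = DistillModule()
optimizer = module.configure_optimizers()
loss = module.training_step(torch.randn(4, 16))
loss.backward()
optimizer.step()

# Collect the identities of every parameter the optimizer manages.
optimizer_params = {id(p) for g in optimizer.param_groups for p in g["params"]}
```

Here the teacher is still a submodule of the same module even though it is absent from the optimizer, which is the situation the questions below are about.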

Training works with DeepSpeed stage 2, but DeepSpeed stage 3 crashes.

Is there a way to partition only the student's weights? Will DeepSpeed stage 3 partition the teacher's weights too?

In the future I would also like to reduce the teacher's memory footprint. Can DeepSpeed be used to partition the teacher's weights in that case?
I'd really appreciate your guidance, thanks!

What version are you seeing the problem on?

No response

How to reproduce the bug

No response

Error messages and logs

    transformer_outputs = self.teacher_transformer(x)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1109, in _call_impl
    result = hook(self, input)
  File "/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 348, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 478, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module)
  File "/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 349, in fetch_sub_module
    self.__all_gather_params(params_to_prefetch)
  File "/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 399, in __all_gather_params
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params)
  File "/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 861, in all_gather_coalesced
    for p in params),
  File "/lib/python3.7/site-packages/deepspeed/runtime/utils.py", line 870, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7fe4338e69d0>

ERROR    (RANK-0) RuntimeError occurred: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7fe4338e69d0>
Traceback (most recent call last):
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 349, in fetch_sub_module
    self.__all_gather_params(params_to_prefetch)
  File "/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 399, in __all_gather_params
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params)
  File "/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 861, in all_gather_coalesced
    for p in params),
  File "/lib/python3.7/site-packages/deepspeed/runtime/utils.py", line 870, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7fe4338e69d0>

ERROR    Error running step in dev mode:
RuntimeError occurred: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7fe4338e69d0>
Traceback (most recent call last):
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 349, in fetch_sub_module
    self.__all_gather_params(params_to_prefetch)
  File "lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 399, in __all_gather_params
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params)
  File "lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 861, in all_gather_coalesced
    for p in params),
  File "/lib/python3.7/site-packages/deepspeed/runtime/utils.py", line 870, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7fe4338e69d0>
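
To make the final frame of the traceback easier to read, here is a rough pure-Python sketch of what a check like `get_only_unique_item` appears to do. Only the function name and the error message come from the traceback; the function body is a reconstruction, and `FakeParam` plus the choice of `dtype` as the compared attribute are assumptions (the log only shows `for p in params),`). If that assumption holds, gathering the fp16 teacher's parameters together with student parameters of a different precision could trigger it, but nothing in the log confirms this.

```python
def get_only_unique_item(items):
    """Return the single distinct value in `items`; raise if there is not exactly one.

    Reconstructed from the function name and error message in the traceback.
    """
    item_set = set(items)
    if len(item_set) != 1:
        raise RuntimeError(f"expected there to be only one unique element in {items}")
    (unique_item,) = item_set
    return unique_item


class FakeParam:
    """Hypothetical stand-in for a partitioned parameter."""

    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype


student_only = [FakeParam("student.w", "float32"), FakeParam("student.b", "float32")]
mixed = student_only + [FakeParam("teacher.w", "float16")]

# All parameters agree on the attribute: the check passes.
same_dtype = get_only_unique_item(p.dtype for p in student_only)

# Parameters disagree: the check raises, and because a generator was passed in,
# the message shows the generator's repr rather than the conflicting values.
try:
    get_only_unique_item(p.dtype for p in mixed)
    mixed_error = None
except RuntimeError as err:
    mixed_error = str(err)
```

This also explains why the message in the log ends in `<generator object ...>`: the generator itself is formatted into the f-string, so the values that actually differed are not printed.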


Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @awaelchli
