Labels: question (further information is requested), strategy: deepspeed, ver: 2.0.x
Bug description
Hi,
I have a pretrained 1.5B-parameter GPT-XL teacher network in fp16, frozen with requires_grad=False. The student network is a small GPT with 142M parameters.
I use PyTorch Lightning; in the training step I run the teacher forward first, then the student. The build_net method returns only the student network, so the optimizer should contain only the student's weights.
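Roughly, the setup looks like the following minimal sketch. The class name, the constructors, and the loss are simplified placeholders for my actual code:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class DistillationModule(pl.LightningModule):
    def __init__(self, teacher, student):
        super().__init__()
        # Pretrained fp16 teacher (~1.5B params), frozen: no gradients, eval mode.
        self.teacher = teacher.half().eval()
        self.teacher.requires_grad_(False)
        # Trainable student (~142M params).
        self.student = student

    def build_net(self):
        # Returns only the student, so the optimizer below sees student weights only.
        return self.student

    def training_step(self, batch, batch_idx):
        x, _ = batch
        # Teacher first, under no_grad since it is frozen ...
        with torch.no_grad():
            teacher_logits = self.teacher(x)
        # ... then the student.
        student_logits = self.student(x)
        # Simplified soft-target distillation loss.
        loss = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.build_net().parameters(), lr=1e-4)
```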
Training runs fine with DeepSpeed ZeRO stage 2, but stage 3 crashes inside the teacher's forward pass.
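The strategy string is the only thing I change between the working and the crashing run (accelerator, device count, and precision here are illustrative):

```python
import pytorch_lightning as pl

# Works:
trainer = pl.Trainer(accelerator="gpu", devices=2, precision=16,
                     strategy="deepspeed_stage_2")

# Crashes in the teacher's forward pass (traceback below):
trainer = pl.Trainer(accelerator="gpu", devices=2, precision=16,
                     strategy="deepspeed_stage_3")
```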
Is there a way to have stage 3 partition only the student's weights, or will DeepSpeed stage 3 try to partition the teacher's weights as well?
Looking ahead, I am also interested in reducing the teacher's memory footprint: can DeepSpeed be used to partition the teacher's weights in this case? A sketch of what I have in mind follows.
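This is only a sketch of my intent, not working code. I am assuming that Lightning's configure_sharded_model hook is the right place for this and that deepspeed.zero.Init can wrap a module that never receives gradients; build_teacher and build_student are placeholder factories:

```python
import deepspeed
import pytorch_lightning as pl


class DistillationModule(pl.LightningModule):
    def configure_sharded_model(self):
        # Build the networks here so the DeepSpeed strategy can shard
        # them at construction time instead of materializing them whole.
        self.student = build_student()  # placeholder factory
        # Assumption: zero.Init partitions the frozen teacher's weights
        # across ranks even though the teacher is never optimized.
        with deepspeed.zero.Init():
            self.teacher = build_teacher().half().eval()  # placeholder factory
        self.teacher.requires_grad_(False)
```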
I'd really appreciate your guidance, thanks!
What version are you seeing the problem on?
No response
How to reproduce the bug
No response
Error messages and logs
transformer_outputs = self.teacher_transformer(x)
  File "/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1109, in _call_impl
    result = hook(self, input)
  File "/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 348, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 478, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module)
  File "/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 349, in fetch_sub_module
    self.__all_gather_params(params_to_prefetch)
  File "/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 399, in __all_gather_params
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params)
  File "/lib/python3.7/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 861, in all_gather_coalesced
    for p in params),
  File "/lib/python3.7/site-packages/deepspeed/runtime/utils.py", line 870, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7fe4338e69d0>
ERROR (RANK-0) RuntimeError occurred: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7fe4338e69d0>
ERROR Error running step in dev mode: RuntimeError occurred: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7fe4338e69d0>
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response
cc @awaelchli