ModelPruning and PyTorch Profiler incompatibility #12393

Open
@austinmw

🐛 Bug

When I use the pruning callback (ModelPruning('l1_unstructured', amount=0.5)) together with profiler='pytorch', I get the following error:

zz2hyoqiv6-algo-1-vqegg | Traceback (most recent call last):
zz2hyoqiv6-algo-1-vqegg | File "train.py", line 51, in <module>
zz2hyoqiv6-algo-1-vqegg | main(args)
zz2hyoqiv6-algo-1-vqegg | File "train.py", line 44, in main
zz2hyoqiv6-algo-1-vqegg | else: trainer.fit(model, dm)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
zz2hyoqiv6-algo-1-vqegg | self._call_and_handle_interrupt(
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
zz2hyoqiv6-algo-1-vqegg | return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
zz2hyoqiv6-algo-1-vqegg | return function(*args, **kwargs)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
zz2hyoqiv6-algo-1-vqegg | results = self._run(model, ckpt_path=self.ckpt_path)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1174, in _run
zz2hyoqiv6-algo-1-vqegg | self._call_setup_hook() # allow user to setup lightning_module in accelerator environment
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1493, in _call_setup_hook
zz2hyoqiv6-algo-1-vqegg | self._call_callback_hooks("setup", stage=fn)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1636, in _call_callback_hooks
zz2hyoqiv6-algo-1-vqegg | fn(self, self.lightning_module, *args, **kwargs)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/pruning.py", line 378, in setup
zz2hyoqiv6-algo-1-vqegg | self.original_layers.setdefault(id, _LayerRef(data=deepcopy(module), names=[]))
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = _reconstruct(x, memo, *rv)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
zz2hyoqiv6-algo-1-vqegg | state = deepcopy(state, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = copier(x, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 230, in _deepcopy_dict
zz2hyoqiv6-algo-1-vqegg | y[deepcopy(key, memo)] = deepcopy(value, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = _reconstruct(x, memo, *rv)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 296, in _reconstruct
zz2hyoqiv6-algo-1-vqegg | value = deepcopy(value, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = _reconstruct(x, memo, *rv)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 264, in _reconstruct
zz2hyoqiv6-algo-1-vqegg | y = func(*args)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 263, in <genexpr>
zz2hyoqiv6-algo-1-vqegg | args = (deepcopy(arg, memo) for arg in args)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = copier(x, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 237, in _deepcopy_method
zz2hyoqiv6-algo-1-vqegg | return type(x)(x.__func__, deepcopy(x.__self__, memo))
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = _reconstruct(x, memo, *rv)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
zz2hyoqiv6-algo-1-vqegg | state = deepcopy(state, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = copier(x, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 230, in _deepcopy_dict
zz2hyoqiv6-algo-1-vqegg | y[deepcopy(key, memo)] = deepcopy(value, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = _reconstruct(x, memo, *rv)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
zz2hyoqiv6-algo-1-vqegg | state = deepcopy(state, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = copier(x, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 230, in _deepcopy_dict
zz2hyoqiv6-algo-1-vqegg | y[deepcopy(key, memo)] = deepcopy(value, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = _reconstruct(x, memo, *rv)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 296, in _reconstruct
zz2hyoqiv6-algo-1-vqegg | value = deepcopy(value, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = _reconstruct(x, memo, *rv)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
zz2hyoqiv6-algo-1-vqegg | state = deepcopy(state, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = copier(x, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 210, in _deepcopy_tuple
zz2hyoqiv6-algo-1-vqegg | y = [deepcopy(a, memo) for a in x]
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 210, in <listcomp>
zz2hyoqiv6-algo-1-vqegg | y = [deepcopy(a, memo) for a in x]
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = copier(x, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 210, in _deepcopy_tuple
zz2hyoqiv6-algo-1-vqegg | y = [deepcopy(a, memo) for a in x]
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 210, in <listcomp>
zz2hyoqiv6-algo-1-vqegg | y = [deepcopy(a, memo) for a in x]
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = _reconstruct(x, memo, *rv)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
zz2hyoqiv6-algo-1-vqegg | state = deepcopy(state, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = copier(x, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 230, in _deepcopy_dict
zz2hyoqiv6-algo-1-vqegg | y[deepcopy(key, memo)] = deepcopy(value, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = _reconstruct(x, memo, *rv)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
zz2hyoqiv6-algo-1-vqegg | state = deepcopy(state, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = copier(x, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 230, in _deepcopy_dict
zz2hyoqiv6-algo-1-vqegg | y[deepcopy(key, memo)] = deepcopy(value, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = copier(x, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 205, in _deepcopy_list
zz2hyoqiv6-algo-1-vqegg | append(deepcopy(a, memo))
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 172, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = _reconstruct(x, memo, *rv)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 270, in _reconstruct
zz2hyoqiv6-algo-1-vqegg | state = deepcopy(state, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 146, in deepcopy
zz2hyoqiv6-algo-1-vqegg | y = copier(x, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 230, in _deepcopy_dict
zz2hyoqiv6-algo-1-vqegg | y[deepcopy(key, memo)] = deepcopy(value, memo)
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/copy.py", line 161, in deepcopy
zz2hyoqiv6-algo-1-vqegg | rv = reductor(4)
zz2hyoqiv6-algo-1-vqegg | TypeError: cannot pickle '_io.TextIOWrapper' object
zz2hyoqiv6-algo-1-vqegg | Exception ignored in: <function BaseProfiler.__del__ at 0x7fd53b5c7940>
zz2hyoqiv6-algo-1-vqegg | Traceback (most recent call last):
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/profiler/base.py", line 199, in __del__
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/profiler/pytorch.py", line 509, in teardown
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/profiler/pytorch.py", line 494, in _delete_profilers
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/profiler/pytorch.py", line 489, in _cache_functions_events
zz2hyoqiv6-algo-1-vqegg | File "/opt/conda/lib/python3.8/site-packages/torch/profiler/profiler.py", line 382, in events
zz2hyoqiv6-algo-1-vqegg | AssertionError:
zz2hyoqiv6-algo-1-vqegg | 2022-03-21 12:47:50,301 sagemaker-training-toolkit ERROR Reporting training FAILURE
zz2hyoqiv6-algo-1-vqegg | 2022-03-21 12:47:50,301 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
zz2hyoqiv6-algo-1-vqegg | ExitCode 1
zz2hyoqiv6-algo-1-vqegg | ErrorMessage "TypeError: cannot pickle '_io.TextIOWrapper' object
zz2hyoqiv6-algo-1-vqegg | Exception ignored in: <function BaseProfiler.__del__ at 0x7fd53b5c7940> Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/profiler/base.py", line 199, in __del__ File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/profiler/pytorch.py", line 509, in teardown File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/profiler/pytorch.py", line 494, in _delete_profilers File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/profiler/pytorch.py", line 489, in _cache_functions_events File "/opt/conda/lib/python3.8/site-packages/torch/profiler/profiler.py", line 382, in events AssertionError:"

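When I comment out profiler='pytorch', the fit call runs fine. The traceback shows the failure inside ModelPruning.setup, where deepcopy(module) reaches an object holding an open text stream (_io.TextIOWrapper). I think this may be related to pytorch/pytorch#37322.
If it's not fixable, could Lightning emit a warning and disable one of the two?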
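
For illustration, here is a minimal plain-Python sketch of what I suspect is happening; it does not use Lightning or PyTorch. The traceback passes through _deepcopy_method, which suggests the copied module holds a bound method whose owner transitively references an open file. FakeProfiler, FakeModule and deepcopy_error are hypothetical names invented for this sketch, not Lightning or PyTorch APIs, and the real object graph may differ.

```python
import copy
import os
import tempfile


class FakeProfiler:
    """Hypothetical stand-in for a profiler that keeps a text stream open."""

    def __init__(self, path):
        self.stream = open(path, "w")  # an _io.TextIOWrapper

    def on_forward(self, *args):
        """Hypothetical hook; only its binding to `self` matters here."""

    def close(self):
        self.stream.close()


class FakeModule:
    """Hypothetical stand-in for a layer that stores registered hooks."""

    def __init__(self):
        self.hooks = {}


def deepcopy_error(obj):
    """Return the TypeError message raised by deepcopy, or None on success."""
    try:
        copy.deepcopy(obj)
    except TypeError as exc:
        return str(exc)
    return None


with tempfile.TemporaryDirectory() as tmp:
    profiler = FakeProfiler(os.path.join(tmp, "profile.txt"))
    try:
        module = FakeModule()
        # A bound method keeps a reference to its owner, so deepcopy follows
        # module -> hooks -> on_forward -> profiler -> stream.
        module.hooks[0] = profiler.on_forward
        error_with_hook = deepcopy_error(module)

        # With the hook detached, the open stream is no longer reachable.
        module.hooks.clear()
        error_without_hook = deepcopy_error(module)
    finally:
        profiler.close()

print(error_with_hook)     # a TypeError message naming TextIOWrapper
print(error_without_hook)  # None
```

If the same mechanism applies here, one possible direction would be for the pruning callback to copy only the tensors it needs rather than the whole module, but that is an assumption I have not verified against the Lightning code.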

To Reproduce

I have not yet been able to reproduce this with the BoringModel. If I manage to, I will update this issue.

Expected behavior

No error.

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): mainline
  • PyTorch Version (e.g., 1.10): 1.10
  • Python version (e.g., 3.9): 3.8
  • OS (e.g., Linux): Ubuntu
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: V100
  • How you installed PyTorch (conda, pip, source): pip git mainline

cc @carmocca @kaushikb11 @ninginthecloud @rohitgr7 @nbcsm @guotuofeng
