🐛 Bug
Whenever I try to restore the state of a previous (broken) run with EarlyStopping configured, I get the following error:
```
RuntimeError: Early stopping conditioned on metric `val/f1/default` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: ``
```
It seems the logged metrics are not restored together with the checkpoint data, and the EarlyStopping check runs immediately on resume, before any metric has been logged.
To Reproduce
- Start training a model with checkpointing (best and last) and EarlyStopping on a validation metric. I used the PyTorch Lightning CLI for this.
- The run crashes after some checkpoints have been saved.
- Re-run the training, passing `ckpt_path` with the last checkpoint path to the trainer (with the same EarlyStopping settings).
- The error above occurs.
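For reference, a minimal Lightning CLI configuration matching this setup could look like the sketch below. The monitored metric name is taken from the error message above; the `class_path` values follow the pytorch-lightning 1.6 package layout, and all other values (patience, mode, max_epochs) are illustrative assumptions, not the exact settings from my run:

```yaml
# Illustrative config for `python train.py fit --config config.yaml`
trainer:
  max_epochs: 100          # assumed value
  callbacks:
    - class_path: pytorch_lightning.callbacks.EarlyStopping
      init_args:
        monitor: val/f1/default   # metric from the error above
        mode: max
        patience: 5               # assumed value
    - class_path: pytorch_lightning.callbacks.ModelCheckpoint
      init_args:
        monitor: val/f1/default
        mode: max
        save_top_k: 1             # keep the best checkpoint
        save_last: true           # also keep the last checkpoint
```

The resume step is then `python train.py fit --config config.yaml --ckpt_path <path/to/last.ckpt>`, which is where the error is triggered.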
Expected behavior
Training should resume from the checkpoint, restore the best metric value seen so far, and apply EarlyStopping as appropriate.
Environment
- CUDA:
- GPU:
- NVIDIA A100-SXM4-80GB
- available: True
- version: 11.4
- Packages:
- numpy: 1.21.2
- pyTorch_debug: False
- pyTorch_version: 1.10.0a0+0aef44c
- pytorch-lightning: 1.6.1
- tqdm: 4.62.3
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.12
- version: #103-Ubuntu SMP Fri Nov 26 16:13:00 UTC 2021
Additional context
Running using the PyTorch Lightning CLI.