
Restoring Trainer State with Early Stop fails #13225


Description

@lersouza

🐛 Bug

Whenever I try to restore the state of a previous (broken) run and I have Early Stopping configured, I get the following error:

RuntimeError: Early stopping conditioned on metric `val/f1/default` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: ``

It seems the logged metrics are not saved together with the checkpoint data, so the EarlyStopping callback is applied straight away on resume, before any metric is available.
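
My rough mental model of the failure (a simplified sketch, not the actual Lightning source) is that the callback looks up its monitored key in the metrics the Trainer currently holds, and that dictionary is still empty right after the checkpoint has been restored:

```python
# Simplified illustration of the failure mode (assumption, not Lightning's code):
# right after restoring a checkpoint, the trainer has no callback metrics yet,
# so the EarlyStopping monitor lookup fails before any new validation has run.
metrics: dict = {}              # stands in for the trainer's metrics right after restore
monitor = "val/f1/default"      # the monitored metric from my config

if monitor not in metrics:
    raise RuntimeError(
        f"Early stopping conditioned on metric `{monitor}` which is not available. "
        "Pass in or modify your `EarlyStopping` callback to use any of the following: "
        f"`{'`, `'.join(metrics)}`"
    )
```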

To Reproduce

  1. Start training a model with checkpointing (best and last) and Early Stopping on a validation metric. I used the PyTorch Lightning CLI for this (see the sketch after this list for a plain-Trainer approximation).
  2. The run crashes after some checkpoints have been saved.
  3. Re-run the training, passing the path of the last checkpoint as ckpt_path to the Trainer (with the same EarlyStopping settings).
  4. The error above is raised.
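
If it helps, here is a rough, self-contained approximation of the setup. My actual run uses LightningCLI and a real F1 metric; the model, dataset, metric values, and `ckpts` directory below are placeholders, and the "crash" is only simulated by running a short first fit before resuming:

```python
import os

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint


class RandomDataset(Dataset):
    """Small random dataset, just enough to drive training and validation."""

    def __init__(self, n: int = 64):
        self.data = torch.randn(n, 32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class DemoModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train/loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # stand-in for the real `val/f1/default` metric monitored by the callbacks
        self.log("val/f1/default", loss)

    def train_dataloader(self):
        return DataLoader(RandomDataset(), batch_size=8)

    def val_dataloader(self):
        return DataLoader(RandomDataset(), batch_size=8)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def make_trainer(max_epochs: int) -> pl.Trainer:
    # Checkpointing (best and last) plus Early Stopping on the same validation metric.
    return pl.Trainer(
        max_epochs=max_epochs,
        callbacks=[
            ModelCheckpoint(dirpath="ckpts", monitor="val/f1/default", mode="min",
                            save_top_k=1, save_last=True),
            EarlyStopping(monitor="val/f1/default", mode="min", patience=3),
        ],
    )


# First run: train for a couple of epochs so `ckpts/last.ckpt` exists
# (in the real setup this run crashed partway through).
make_trainer(max_epochs=2).fit(DemoModel())

# Resume with the same callbacks: this is where the RuntimeError appears for me on 1.6.1.
make_trainer(max_epochs=10).fit(DemoModel(), ckpt_path=os.path.join("ckpts", "last.ckpt"))
```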

Expected behavior

Training should resume, the EarlyStopping callback should remember the last best value, and early stopping should be applied as appropriate.

Environment

  • CUDA:
    • GPU:
      • NVIDIA A100-SXM4-80GB
    • available: True
    • version: 11.4
  • Packages:
    • numpy: 1.21.2
    • pyTorch_debug: False
    • pyTorch_version: 1.10.0a0+0aef44c
    • pytorch-lightning: 1.6.1
    • tqdm: 4.62.3
  • System:

Additional context

Running using the PyTorch Lightning CLI (LightningCLI).

cc @carmocca @awaelchli @rohitgr7
