fix: make --continue_path work again #131
Merged
TLDR: fixes loading of models via `--continue_path`

The issue

When resuming training via the `--continue_path` argument, first the following error is logged, but training continues:

Then the following error occurs at the end of the epoch and training stops:
This has been observed multiple times:
There are multiple open PRs to fix some aspects of this issue.
Others have fixed it in their Trainer forks:
The reason
This error occurs because #121 changed `Trainer/trainer/trainer.py` (line 1924 in 47781f5) to treat `model_loss` as a dict instead of just a float. However, `Trainer/trainer/io.py` (line 195 in 47781f5) still saves a float in `model_loss`, so loading the best model would still work fine. Loading a model via `--restore_path` also works fine because in that case the best loss is reset and not initialised from the saved model.
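To illustrate the mismatch, here is a minimal, self-contained sketch. The names (`is_new_best`, the `train_loss`/`eval_loss` keys) are illustrative assumptions, not the actual Trainer internals:

```python
# Minimal sketch of the mismatch (illustrative, not the actual Trainer code).

# What trainer.py expects after #121: model_loss as a dict.
def is_new_best(current: dict, best: dict) -> bool:
    # Comparing the loss entries of two dicts works fine ...
    return current["eval_loss"] < best["eval_loss"]

# ... but io.py still wrote a plain float into the checkpoint:
checkpoint = {"model_loss": 0.42}  # old float format

# Resuming via --continue_path initialises the best loss from the
# checkpoint, so the next comparison indexes into a float and crashes:
try:
    is_new_best({"train_loss": 0.40, "eval_loss": 0.39}, checkpoint["model_loss"])
except TypeError as e:
    print(e)  # 'float' object is not subscriptable
```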
This fix

- Changes `save_best_model()` to also save a dict with train and eval loss, so that this is consistent everywhere
- Keeps support for loading a float `model_loss` for backwards compatibility (see the sketch below)
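A minimal sketch of both sides of the fix, assuming the dict uses `train_loss`/`eval_loss` keys; the helper names `save_best_model_loss` and `load_best_loss` are hypothetical, not the actual Trainer API:

```python
# Minimal sketch of the fix (names are illustrative).

def save_best_model_loss(train_loss: float, eval_loss: float) -> dict:
    # Save a dict with both losses so the saved format matches what
    # trainer.py expects when resuming.
    return {"model_loss": {"train_loss": train_loss, "eval_loss": eval_loss}}

def load_best_loss(saved_model_loss) -> dict:
    # Accept the old float format for backwards compatibility by
    # normalising it into the new dict format on load.
    if isinstance(saved_model_loss, (int, float)):
        return {"train_loss": float(saved_model_loss),
                "eval_loss": float(saved_model_loss)}
    return saved_model_loss

# Old checkpoints (float) and new checkpoints (dict) both resume cleanly:
print(load_best_loss(0.42))
print(load_best_loss({"train_loss": 0.40, "eval_loss": 0.39}))
```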