fix: make --continue_path work again #131
Merged
TLDR: fixes loading of models via `--continue_path`

The issue

When resuming training via the `--continue_path` argument, first the following error is logged, but training continues:

Then the following error occurs at the end of the epoch and training stops:
This has been observed multiple times:
There are multiple open PRs to fix some aspects of this issue.
Others have fixed it in their Trainer forks:
The reason
This error occurs because #121 changed `Trainer/trainer/trainer.py` (line 1924 in 47781f5) to treat `model_loss` as a dict instead of just a float. However, `Trainer/trainer/io.py` (line 195 in 47781f5) still saves a float in `model_loss`, so loading the best model would still work fine. Loading a model via `--restore_path` also works fine because in that case the best loss is reset and not initialised from the saved model.
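To illustrate the mismatch, here is a minimal, self-contained sketch. The names (`is_new_best`, the `train_loss`/`eval_loss` keys) are illustrative assumptions, not the actual Trainer internals:

```python
# Minimal sketch of the mismatch (illustrative, not the actual Trainer code).

# What trainer.py expects after #121: model_loss as a dict.
def is_new_best(current: dict, best: dict) -> bool:
    # Comparing the loss entries of two dicts works fine ...
    return current["eval_loss"] < best["eval_loss"]

# ... but io.py still wrote a plain float into the checkpoint:
checkpoint = {"model_loss": 0.42}  # old float format

# Resuming via --continue_path initialises the best loss from the
# checkpoint, so the next comparison indexes into a float and crashes:
try:
    is_new_best({"train_loss": 0.40, "eval_loss": 0.39}, checkpoint["model_loss"])
except TypeError as e:
    print(e)  # 'float' object is not subscriptable
```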
This fix

- Changes `save_best_model()` to also save a dict with train and eval loss, so that this is consistent everywhere
- Keeps support for loading a float `model_loss` for backwards compatibility (see the sketch below)
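A minimal sketch of both sides of the fix, assuming the dict uses `train_loss`/`eval_loss` keys; the helper names `save_best_model_loss` and `load_best_loss` are hypothetical, not the actual Trainer API:

```python
# Minimal sketch of the fix (names are illustrative).

def save_best_model_loss(train_loss: float, eval_loss: float) -> dict:
    # Save a dict with both losses so the saved format matches what
    # trainer.py expects when resuming.
    return {"model_loss": {"train_loss": train_loss, "eval_loss": eval_loss}}

def load_best_loss(saved_model_loss) -> dict:
    # Accept the old float format for backwards compatibility by
    # normalising it into the new dict format on load.
    if isinstance(saved_model_loss, (int, float)):
        return {"train_loss": float(saved_model_loss),
                "eval_loss": float(saved_model_loss)}
    return saved_model_loss

# Old checkpoints (float) and new checkpoints (dict) both resume cleanly:
print(load_best_loss(0.42))
print(load_best_loss({"train_loss": 0.40, "eval_loss": 0.39}))
```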