Conversation

eginhard (Contributor) commented Dec 1, 2023

TLDR: fixes loading of models via --continue_path

The issue

When resuming training via the --continue_path argument, the following
error is logged first, but training continues:

Traceback (most recent call last):
  File ".../lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File ".../lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File ".../lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File ".../lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: must be real number, not dict
Call stack:
  File ".../train_vits.py", line 120, in <module>
    trainer.fit()
  File ".../trainer/trainer.py", line 1826, in fit
    self._fit()
  File ".../trainer/trainer.py", line 1764, in _fit
    self._restore_best_loss()
  File ".../trainer/trainer.py", line 1728, in _restore_best_loss
    logger.info(" > Starting with loaded last best loss %f", self.best_loss)
Message: ' > Starting with loaded last best loss %f'
Arguments: {'train_loss': 18.22130616851475, 'eval_loss': None}

Then the following error occurs at the end of the epoch and training stops:

  File ".../trainer/io.py", line 183, in save_best_model
    if current_loss < best_loss:
TypeError: '<' not supported between instances of 'float' and 'dict'
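
The type mismatch behind the second traceback can be reproduced in isolation. The values below are illustrative (the dict is taken from the log above, the float is made up):

  # best_loss as restored from a checkpoint written after coqui-ai#121: a dict
  best_loss = {"train_loss": 18.22130616851475, "eval_loss": None}
  # current_loss as passed to save_best_model(): still a plain float (hypothetical value)
  current_loss = 17.5

  # raises TypeError: '<' not supported between instances of 'float' and 'dict'
  if current_loss < best_loss:
      pass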

This has been observed multiple times: there are multiple open PRs to fix some aspects of this issue, and others have fixed it in their Trainer forks.

The reason

This error occurs because #121 changed
https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/trainer.py#L1924

  model_loss={"train_loss": train_loss, "eval_loss": eval_loss},

to save the model_loss as a dict instead of just a float, while save_best_model() in
https://github.com/coqui-ai/Trainer/blob/47781f58d2714d8139dc00f57dbf64bcc14402b7/trainer/io.py#L195

  model_loss=current_loss,

still saves a float in model_loss, so loading the best model would still work fine. Loading a model via --restore_path also works fine because in that case the best loss is reset and not initialised from the saved model.
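
Put differently, the two code paths leave a different type under the model_loss key of the saved checkpoint. A rough sketch, with all other checkpoint fields omitted:

  # checkpoint written by trainer.py after coqui-ai#121
  checkpoint["model_loss"] = {"train_loss": train_loss, "eval_loss": eval_loss}

  # best-model checkpoint written by save_best_model() in trainer/io.py
  checkpoint["model_loss"] = current_loss  # a plain float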

This fix

  • changes save_best_model() to also save a dict with train and eval loss, so that this is consistent everywhere
  • ensures that the model loader can handle both float and dict model_loss for backwards compatibility (see the sketch after this list)
  • adds relevant test cases that would have previously failed
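
The backwards-compatibility part boils down to accepting both formats when restoring the best loss. A minimal sketch of that idea, using a hypothetical helper name rather than the exact code merged in this PR:

  def restore_best_loss(model_loss):
      """Return the best loss as a dict, accepting both checkpoint formats."""
      if isinstance(model_loss, dict):
          # new format written after coqui-ai#121: {"train_loss": ..., "eval_loss": ...}
          return model_loss
      # old format: a single float, interpreted as the training loss
      return {"train_loss": model_loss, "eval_loss": None}

Passing the stored model_loss through such a helper wherever the trainer restores its best loss keeps old float-based checkpoints loadable.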

erogol (Member) commented Dec 4, 2023

@eginhard thanks for the PR. There is a small fixup needed for a style issue. Else good to go.

eginhard (Contributor, Author) commented Dec 4, 2023

@erogol I simplified that if statement, should be good now.

@erogol erogol merged commit 8d4849c into coqui-ai:main Dec 5, 2023
erogol added a commit that referenced this pull request Dec 7, 2023
eginhard added a commit to eginhard/coqui-trainer that referenced this pull request Dec 8, 2023
erogol pushed a commit that referenced this pull request Dec 12, 2023
eginhard deleted the fix-continue branch on April 3, 2024