Add Validation Splits and Logging, Refactor Dataset Blueprints, and Improve Float8 Support #63
Conversation
Force-pushed from 27a633a to 3bc1a71.
Thank you! I think this is a great starting point. As for the dataset settings, just an idea: how about adding the …
```python
if mask is None:
    context_aware_representations = x.mean(dim=1)
else:
    mask_float = mask.float().unsqueeze(-1)  # [b, s1, 1]
```
I was getting a crash on my 4090 about type casting, I think; this resolved it.
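For reference, here is a minimal sketch of how the full masked-mean computation with the explicit float cast might look. Everything outside the four diff lines above (the shapes, the sum/normalize step, the cast back to the input dtype) is an assumption, not the PR's exact code:

```python
import torch

def masked_mean(x: torch.Tensor, mask: torch.Tensor | None) -> torch.Tensor:
    # x: [b, s1, d]; mask: [b, s1], possibly bool/int, where 1 = keep token
    if mask is None:
        return x.mean(dim=1)
    # Casting the mask to float before multiplying avoids dtype-promotion
    # errors when the mask arrives as bool/int (the crash reported above).
    mask_float = mask.float().unsqueeze(-1)      # [b, s1, 1]
    summed = (x.float() * mask_float).sum(dim=1)  # [b, d]
    count = mask_float.sum(dim=1).clamp(min=1.0)  # [b, 1], avoid div-by-zero
    return (summed / count).to(x.dtype)
```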
I missed this earlier. Will take a look at what it would involve, but I don't see why that wouldn't work.
I've implemented this change, so there's no longer a need to modify the caching scripts. Good callout!
After re-testing, it looks like the cache_latents.py script is failing. It might need some changes to accommodate the …
Ok, just needed a small change to those caching files.
… new is_val dataset
I am trying out the PR and it looks like a great instrument to add to the toolkit. However, two changes, if practical, would enhance its effectiveness:
I got reports that this isn't working when you don't provide a validation dataset, which it was previously; validation should of course be optional. I'll see why that broke with the latest commit. It also seems I need to add support for JSON configs in addition to TOML; I missed that there are a variety of ways to specify datasets.
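On the TOML/JSON point, a minimal sketch of the idea (the actual loader in config_utils.py may look different) is to dispatch on the file extension:

```python
import json
from pathlib import Path

try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport for older Pythons

def load_dataset_config(path: str) -> dict:
    """Load a dataset blueprint from either a TOML or a JSON file.

    This is a sketch of the approach, not the repo's actual loader.
    """
    p = Path(path)
    if p.suffix == ".toml":
        with p.open("rb") as f:  # tomllib requires a binary file
            return tomllib.load(f)
    if p.suffix == ".json":
        with p.open("r", encoding="utf-8") as f:
            return json.load(f)
    raise ValueError(f"Unsupported config format: {p.suffix}")
```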
Any chance of revisiting this implementation? Validation seems to be an important element for verifying training, and it appears to work well in conjunction with the TensorBoard loss chart.
• Introduces separate train/val dataset groups in the blueprint config and uses them for both cache_latents.py and cache_text_encoder_outputs.py (see the blueprint sketch after this list).
• Extends hv_train_network.py with a validate() function that runs on the val_dataset_group each epoch, computes MSE loss, and logs val_loss via Accelerate (e.g., accelerator.log(...)); a sketch follows the list.
• Refactors config_utils.py to return separate train_dataset_group and val_dataset_group, combining them when needed (all_datasets).
• Adds optional float8 fallback handling in token_refiner.py, safely casting float8 → float for calculations and then back to float8 (see the fallback sketch below).
• Adjusts cache skipping/keeping logic to handle old cache files, and changes the code to enumerate all_datasets instead of only the train datasets.
• Overall, this PR makes it possible to run a distinct validation pass each epoch, log validation performance, and unify caching for both train and val.
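To make the train/val split concrete, here is a hedged sketch of a blueprint and how it could be partitioned. The is_val flag is inferred from the commit message above, and the exact field names are assumptions, not the PR's actual schema:

```python
import tomllib  # Python 3.11+; older versions can use the tomli backport

BLUEPRINT = """
[[datasets]]
image_directory = "/data/train"

[[datasets]]
image_directory = "/data/val"
is_val = true  # assumed flag marking a validation-only dataset
"""

config = tomllib.loads(BLUEPRINT)
train_datasets = [d for d in config["datasets"] if not d.get("is_val", False)]
val_datasets = [d for d in config["datasets"] if d.get("is_val", False)]
all_datasets = train_datasets + val_datasets  # the caching scripts enumerate both
```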
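The validate() pass itself, per the description, computes MSE on the val group and logs val_loss via Accelerate. A minimal sketch, assuming a diffusion-style objective where the model predicts a target from noisy latents; names like val_dataloader and the batch layout are placeholders, not the PR's code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def validate(accelerator, model, val_dataloader, global_step):
    model.eval()
    losses = []
    for batch in val_dataloader:
        noisy_latents, timesteps, target = batch  # placeholder batch layout
        pred = model(noisy_latents, timesteps)
        losses.append(F.mse_loss(pred.float(), target.float()))
    model.train()
    val_loss = torch.stack(losses).mean().item()
    accelerator.log({"val_loss": val_loss}, step=global_step)
    return val_loss
```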
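And for the float8 fallback: many reductions are not implemented for float8 dtypes, so the pattern described above upcasts for the computation and casts back afterward. A sketch of that pattern under those assumptions (the actual token_refiner.py code may differ):

```python
import torch

# float8 dtypes only exist on recent PyTorch builds, so probe for them.
FLOAT8_DTYPES = {
    getattr(torch, name)
    for name in ("float8_e4m3fn", "float8_e5m2")
    if hasattr(torch, name)
}

def mean_with_float8_fallback(x: torch.Tensor, dim: int) -> torch.Tensor:
    # float8 tensors don't support mean(); compute in float32, then cast
    # back so downstream layers still see the original dtype.
    if x.dtype in FLOAT8_DTYPES:
        return x.float().mean(dim=dim).to(x.dtype)
    return x.mean(dim=dim)
```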
Training run 1

Training run 2
