
Conversation

@SamTyurenkov
Contributor

I'm just about to test it and will post whether it works or not; I'll update the branch if needed.

@SamTyurenkov
Contributor Author

Well, it seems to be working:

  "ss_datasets": [
    {
      "batch_size_per_device": 1,
      "bucket_no_upscale": false,
      "caption_count": 2,
      "caption_extension": null,
      "enable_bucket": true,
      "frame_extraction": "full",
      "frame_sample": 1,
      "frame_stride": 1,
      "has_control": false,
      "max_frames": 81,
      "num_repeats": 2,
      "resolution": [
        1024,
        512
      ],
      "sample_captions": [
        "My caption 1.",
        "My caption 2."
      ],
      "source_fps": 16,
      "target_frames": [
        1
      ],
      "video_jsonl_file": "dataset_16fps_1024x512.jsonl"
    },
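For reference, the new metadata can be read back from the trained file. A minimal sketch (the file name is hypothetical):

import json
from safetensors import safe_open

# ss_datasets is stored as a JSON string in the safetensors metadata
with safe_open("my_lora.safetensors", framework="pt") as f:
    meta = f.metadata() or {}
    for ds in json.loads(meta.get("ss_datasets", "[]")):
        print(ds["resolution"], ds.get("sample_captions"))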

@SamTyurenkov SamTyurenkov marked this pull request as ready for review August 27, 2025 10:43
Owner

@kohya-ss kohya-ss left a comment

Thank you for this great PR!

However, considering compatibility with sd-scripts, accessible files during training, and maintainability, I would appreciate it if you could make some modifications.

This would require fairly complex changes, which is why tag_frequency metadata has not been implemented yet...

However, I have once again recognized the need for metadata, so if it is difficult for you to finalize it, I would like to implement it as soon as I have time.

Owner

I'm sorry this is a specification that isn't in the documentation, but image files, caption files, and JSONL files are not necessary for training; training will work using only the cached files. So we cannot rely on caption files or JSONL files being accessible, but we can get captions from the Text Encoder cache files. See:

def save_text_encoder_output_cache_common(item_info: ItemInfo, sd: dict[str, torch.Tensor], arch_fullname: str):
    for key, value in sd.items():
        # NaN check and show warning, replace NaN with 0
        if torch.isnan(value).any():
            logger.warning(f"{key} tensor has NaN: {item_info.item_key}, replace NaN with 0")
            value[torch.isnan(value)] = 0

    metadata = {
        "architecture": arch_fullname,
        "caption1": item_info.caption,
        "format_version": "1.0.1",
    }

    if os.path.exists(item_info.text_encoder_output_cache_path):
        # load existing cache and update metadata
        with safetensors_utils.MemoryEfficientSafeOpen(item_info.text_encoder_output_cache_path) as f:
            existing_metadata = f.metadata()
            for key in f.keys():
                if key not in sd:  # avoid overwriting by existing cache, we keep the new one
                    sd[key] = f.get_tensor(key)

        assert existing_metadata["architecture"] == metadata["architecture"], "architecture mismatch"
        if existing_metadata["caption1"] != metadata["caption1"]:
            logger.warning(f"caption mismatch: existing={existing_metadata['caption1']}, new={metadata['caption1']}, overwrite")
        # TODO verify format_version

        existing_metadata.pop("caption1", None)
        existing_metadata.pop("format_version", None)

        metadata.update(existing_metadata)  # copy existing metadata except caption and format_version
    else:
        text_encoder_output_dir = os.path.dirname(item_info.text_encoder_output_cache_path)
        os.makedirs(text_encoder_output_dir, exist_ok=True)

    safetensors_utils.mem_eff_save_file(sd, item_info.text_encoder_output_cache_path, metadata=metadata)
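For reference, reading a caption back out of a cache file only touches the safetensors metadata. A minimal sketch (not code from the repository), assuming the "caption1" key written above:

from safetensors import safe_open

def read_cached_caption(cache_path: str) -> str | None:
    # the caption lives in the safetensors metadata of the Text Encoder cache file
    with safe_open(cache_path, framework="pt") as f:
        metadata = f.metadata() or {}
        return metadata.get("caption1")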

Also, since accessing all cache files to extract metadata can be time-consuming for large datasets, and users may not want captions stored as metadata, it is better to enable this feature only when a certain option is specified.
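A rough sketch of such an opt-in, assuming an argparse parser, a json import, and the get_metadata hook mentioned below; the flag name --save_dataset_metadata and the surrounding wiring are illustrative, not an existing option:

parser.add_argument(
    "--save_dataset_metadata",
    action="store_true",
    help="collect captions and store per-dataset metadata in the trained model"
    " (reads all cache files, which can be slow for large datasets)",
)

# later, when assembling the model metadata:
if args.save_dataset_metadata:
    metadata["ss_datasets"] = json.dumps([ds.get_metadata() for ds in datasets])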

It would also be appropriate to aggregate captions on the DataSet side, since the dataset is also used for fine tuning. Since ItemInfo is held by BucketBatchManager, it would be a good idea to delegate the process to that class from get_metadata.
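As a rough sketch of that delegation (names other than ItemInfo, BucketBatchManager, and get_metadata are assumptions about the actual class layout):

class BucketBatchManager:
    # self.buckets is assumed to map bucket resolution -> list[ItemInfo]
    def collect_captions(self) -> list[str]:
        # aggregate the caption of every ItemInfo this manager holds
        return [info.caption for infos in self.buckets.values() for info in infos if info.caption]

# on the DataSet side, get_metadata delegates the aggregation:
def get_metadata(self) -> dict:
    metadata: dict = {}  # existing per-dataset fields omitted here
    metadata["sample_captions"] = self.batch_manager.collect_captions()
    return metadata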

Contributor Author

@kohya-ss I don't think fetching captions from the cache should be implemented. Even if we use the text encoder cache for training and don't re-encode on every run, the caption files are most likely still present on disk, so we can just access them directly even when the text encoder output is already cached.

Even if you do want that, it could be a separate PR, since using it as it is now probably won't break the process.

Owner

Accessing the text encoder cache is safe because training is not possible without the text encoder cache, and training is possible without captions.
In fact, for example, with cloud training, training is possible simply by uploading the cache files, and it is assumed that there will be no caption files.

Contributor Author

> Accessing the text encoder cache is safe because training is not possible without the text encoder cache, and training is possible without captions. In fact, for example, with cloud training, training is possible simply by uploading the cache files, and it is assumed that there will be no caption files.

Why would you upload cache files instead of caption files? That seems like unrealistic behavior: if you train on a server, you prepare the cache on the server. It's not really reasonable to split the work across two machines, preparing the cache on one and training on the other.

Also, you often want to change a caption or add items to / remove items from the dataset, and then creating a new cache is a must.
I don't even have the text encoder locally, but even if I did, I wouldn't split the process.

Owner

Caching is a relatively light task, so it's practical to do it on a local GPU, and many users would prefer to reduce the time spent on cloud servers.

@SamTyurenkov SamTyurenkov marked this pull request as draft August 31, 2025 10:34
@arledesma
Contributor

Sweet! I had this same enhancement in my mental backlog. 👏🏽

