trying to implement adding captions to model metadata #506
Conversation
Well, it seems to be working:
Thank you for this great PR!
However, considering compatibility with sd-scripts, which files are accessible during training, and maintainability, I would appreciate it if you could make some modifications.
This would require fairly complex changes, which is why the tag_frequency metadata has not been implemented yet...
However, this has reminded me of the need for metadata, so if it is difficult for you to finalize it, I will implement it as soon as I have time.
I'm sorry this is a specification that isn't in the documentation, but image files, caption files, and JSONL files are not required for training; training works using only the cached files. So we cannot rely on accessing caption files or JSONL files, but we can get captions from the Text Encoder cache files. See:
musubi-tuner/src/musubi_tuner/dataset/image_video_dataset.py
Lines 401 to 434 in fec404c
def save_text_encoder_output_cache_common(item_info: ItemInfo, sd: dict[str, torch.Tensor], arch_fullname: str):
    for key, value in sd.items():
        # NaN check and show warning, replace NaN with 0
        if torch.isnan(value).any():
            logger.warning(f"{key} tensor has NaN: {item_info.item_key}, replace NaN with 0")
            value[torch.isnan(value)] = 0

    metadata = {
        "architecture": arch_fullname,
        "caption1": item_info.caption,
        "format_version": "1.0.1",
    }

    if os.path.exists(item_info.text_encoder_output_cache_path):
        # load existing cache and update metadata
        with safetensors_utils.MemoryEfficientSafeOpen(item_info.text_encoder_output_cache_path) as f:
            existing_metadata = f.metadata()
            for key in f.keys():
                if key not in sd:  # avoid overwriting by existing cache, we keep the new one
                    sd[key] = f.get_tensor(key)

        assert existing_metadata["architecture"] == metadata["architecture"], "architecture mismatch"
        if existing_metadata["caption1"] != metadata["caption1"]:
            logger.warning(f"caption mismatch: existing={existing_metadata['caption1']}, new={metadata['caption1']}, overwrite")
        # TODO verify format_version

        existing_metadata.pop("caption1", None)
        existing_metadata.pop("format_version", None)
        metadata.update(existing_metadata)  # copy existing metadata except caption and format_version
    else:
        text_encoder_output_dir = os.path.dirname(item_info.text_encoder_output_cache_path)
        os.makedirs(text_encoder_output_dir, exist_ok=True)

    safetensors_utils.mem_eff_save_file(sd, item_info.text_encoder_output_cache_path, metadata=metadata)
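For reference, reading those captions back only needs each file's metadata, not its tensors. A minimal sketch using the MemoryEfficientSafeOpen helper shown above (the collect_captions name, the glob pattern, and the safetensors_utils import path are assumptions for illustration, not the actual API):

import glob
import os

from musubi_tuner.utils import safetensors_utils  # import path assumed from the snippet above


def collect_captions(cache_dir: str) -> dict[str, str]:
    # Hypothetical helper: scan Text Encoder cache files and read the caption
    # stored under the "caption1" metadata key written by
    # save_text_encoder_output_cache_common above.
    captions = {}
    for path in sorted(glob.glob(os.path.join(cache_dir, "*.safetensors"))):
        with safetensors_utils.MemoryEfficientSafeOpen(path) as f:
            metadata = f.metadata() or {}
        if "caption1" in metadata:
            captions[os.path.basename(path)] = metadata["caption1"]
    return captions

Note that this still opens every cache file, which is exactly the cost concern raised below.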
Also, since accessing every cache file to extract metadata can be time-consuming for large datasets, and some users may not want captions stored as metadata, it would be better to enable this feature only when a dedicated option is specified.
It would also be appropriate to aggregate captions on the DataSet side, since the dataset is also used for fine-tuning. Since the ItemInfo objects are held by BucketBatchManager, it would be a good idea for get_metadata to delegate the aggregation to that class.
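A rough sketch of that shape, assuming a hypothetical opt-in flag and using only the ItemInfo fields visible in the snippet above (everything else here is illustrative, not the actual musubi-tuner API):

import json


class BucketBatchManager:
    # Only the parts relevant to caption aggregation are sketched here.
    def __init__(self, buckets: dict, save_caption_metadata: bool = False):
        self.buckets = buckets  # bucket key -> list[ItemInfo]
        self.save_caption_metadata = save_caption_metadata  # hypothetical opt-in flag

    def get_caption_metadata(self) -> dict[str, str]:
        # Aggregate captions from the held ItemInfo objects; get_metadata would
        # call this only when the user explicitly enabled the option.
        if not self.save_caption_metadata:
            return {}
        captions = {item.item_key: item.caption for items in self.buckets.values() for item in items}
        return {"ss_captions": json.dumps(captions)}  # metadata key name is an assumption

Gating the aggregation behind the flag keeps the default behavior unchanged and avoids any extra work when the user has not opted in.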
tag_frequency counting is implemented here in the sd-scripts repo: https://github.com/kohya-ss/sd-scripts/blob/1470cb8508a34637aea8dcba7afb286b7caf961f/library/train_util.py#L709-L718
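The logic there is essentially a per-directory counter over comma-separated tags; a self-contained paraphrase (not a verbatim copy of the linked code):

def set_tag_frequency(tag_frequency: dict[str, dict[str, int]], dir_name: str, captions: list[str]) -> None:
    # Count how often each comma-separated tag appears in the captions of one
    # dataset directory; tags are stripped and lowercased before counting.
    frequency_for_dir = tag_frequency.setdefault(dir_name, {})
    for caption in captions:
        for tag in caption.split(","):
            tag = tag.strip().lower()
            if tag:
                frequency_for_dir[tag] = frequency_for_dir.get(tag, 0) + 1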
@kohya-ss I don't think fetching captions from the cache should be implemented. Even though we use the text encoder cache for training and don't re-encode the text on every run, the caption files are probably still present on disk, so we can just access them directly even when the text encoder output is already cached.
Even if you do want it, it could be a different PR, since using this feature as it is now probably won't break the process.
Accessing the text encoder cache is safe because training is not possible without the text encoder cache, and training is possible without captions.
In fact, for example, with cloud training, training is possible simply by uploading the cache files, and it is assumed that there will be no caption files.
> Accessing the text encoder cache is safe because training is not possible without the text encoder cache, and training is possible without captions. In fact, for example, with cloud training, training is possible simply by uploading the cache files, and it is assumed that there will be no caption files.
Why would you upload cache files instead of caption files? That seems like unrealistic behavior: if you train on a server, you prepare the cache on the server. It's not really reasonable to split the work across two machines, preparing the cache on one machine and training on the other.
Also, you often might want to change captions or add items to / remove items from the dataset, and then creating a new cache is a must.
I don't even have the text encoder locally, but even if I did, I wouldn't split the process.
Captioning is a relatively light task, so it's practical to do it on a local GPU, and many users would prefer to reduce the time spent on cloud servers.
Sweet! I had this same enhancement in my mental backlog. 👏🏽
I'm just about to test it and will post whether it works or not, and update the branch if needed.