trying to implement adding captions to model metadata #506
Conversation
Well, it seems to be working:
Thank you for this great PR!
However, considering compatibility with sd-scripts, which files are accessible during training, and maintainability, I would appreciate it if you could make some modifications.
This would require fairly complex changes, which is why the tag_frequency metadata has not been implemented yet...
However, this has reminded me of the need for metadata, so if it is difficult for you to finalize it, I will implement it as soon as I have time.
I'm sorry this is a specification that isn't in the documentation, but image files, caption files, and JSONL files are not required for training; training works using only the cached files. So we cannot rely on accessing caption files or JSONL files, but we can get captions from the Text Encoder cache files. See:
musubi-tuner/src/musubi_tuner/dataset/image_video_dataset.py
Lines 401 to 434 in fec404c
def save_text_encoder_output_cache_common(item_info: ItemInfo, sd: dict[str, torch.Tensor], arch_fullname: str):
    for key, value in sd.items():
        # NaN check and show warning, replace NaN with 0
        if torch.isnan(value).any():
            logger.warning(f"{key} tensor has NaN: {item_info.item_key}, replace NaN with 0")
            value[torch.isnan(value)] = 0

    metadata = {
        "architecture": arch_fullname,
        "caption1": item_info.caption,
        "format_version": "1.0.1",
    }

    if os.path.exists(item_info.text_encoder_output_cache_path):
        # load existing cache and update metadata
        with safetensors_utils.MemoryEfficientSafeOpen(item_info.text_encoder_output_cache_path) as f:
            existing_metadata = f.metadata()
            for key in f.keys():
                if key not in sd:  # avoid overwriting by existing cache, we keep the new one
                    sd[key] = f.get_tensor(key)

        assert existing_metadata["architecture"] == metadata["architecture"], "architecture mismatch"
        if existing_metadata["caption1"] != metadata["caption1"]:
            logger.warning(f"caption mismatch: existing={existing_metadata['caption1']}, new={metadata['caption1']}, overwrite")
        # TODO verify format_version

        existing_metadata.pop("caption1", None)
        existing_metadata.pop("format_version", None)
        metadata.update(existing_metadata)  # copy existing metadata except caption and format_version
    else:
        text_encoder_output_dir = os.path.dirname(item_info.text_encoder_output_cache_path)
        os.makedirs(text_encoder_output_dir, exist_ok=True)

    safetensors_utils.mem_eff_save_file(sd, item_info.text_encoder_output_cache_path, metadata=metadata)
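For reference, reading those captions back only needs each file's metadata, not its tensors. A minimal sketch using the MemoryEfficientSafeOpen helper shown above (the collect_captions name, the glob pattern, and the safetensors_utils import path are assumptions for illustration, not the actual API):

import glob
import os

from musubi_tuner.utils import safetensors_utils  # import path assumed from the snippet above


def collect_captions(cache_dir: str) -> dict[str, str]:
    # Hypothetical helper: scan Text Encoder cache files and read the caption
    # stored under the "caption1" metadata key written by
    # save_text_encoder_output_cache_common above.
    captions = {}
    for path in sorted(glob.glob(os.path.join(cache_dir, "*.safetensors"))):
        with safetensors_utils.MemoryEfficientSafeOpen(path) as f:
            metadata = f.metadata() or {}
        if "caption1" in metadata:
            captions[os.path.basename(path)] = metadata["caption1"]
    return captions

Note that this still opens every cache file, which is exactly the cost concern raised below.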
Also, since accessing every cache file to extract metadata can be time-consuming for large datasets, and some users may not want captions stored as metadata, it would be better to enable this feature only when a dedicated option is specified.
It would also be appropriate to aggregate captions on the DataSet side, since the dataset is also used for fine-tuning. Since the ItemInfo objects are held by BucketBatchManager, it would be a good idea for get_metadata to delegate the aggregation to that class.
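A rough sketch of that shape, assuming a hypothetical opt-in flag and using only the ItemInfo fields visible in the snippet above (everything else here is illustrative, not the actual musubi-tuner API):

import json


class BucketBatchManager:
    # Only the parts relevant to caption aggregation are sketched here.
    def __init__(self, buckets: dict, save_caption_metadata: bool = False):
        self.buckets = buckets  # bucket key -> list[ItemInfo]
        self.save_caption_metadata = save_caption_metadata  # hypothetical opt-in flag

    def get_caption_metadata(self) -> dict[str, str]:
        # Aggregate captions from the held ItemInfo objects; get_metadata would
        # call this only when the user explicitly enabled the option.
        if not self.save_caption_metadata:
            return {}
        captions = {item.item_key: item.caption for items in self.buckets.values() for item in items}
        return {"ss_captions": json.dumps(captions)}  # metadata key name is an assumption

Gating the aggregation behind the flag keeps the default behavior unchanged and avoids any extra work when the user has not opted in.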
tag_frequency counting is implemented here in the sd-scripts repo: https://github.com/kohya-ss/sd-scripts/blob/1470cb8508a34637aea8dcba7afb286b7caf961f/library/train_util.py#L709-L718
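The logic there is essentially a per-directory counter over comma-separated tags; a self-contained paraphrase (not a verbatim copy of the linked code):

def set_tag_frequency(tag_frequency: dict[str, dict[str, int]], dir_name: str, captions: list[str]) -> None:
    # Count how often each comma-separated tag appears in the captions of one
    # dataset directory; tags are stripped and lowercased before counting.
    frequency_for_dir = tag_frequency.setdefault(dir_name, {})
    for caption in captions:
        for tag in caption.split(","):
            tag = tag.strip().lower()
            if tag:
                frequency_for_dir[tag] = frequency_for_dir.get(tag, 0) + 1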
@kohya-ss I don't think fetching captions from the cache should be implemented. Even though we use the text encoder cache for training and don't re-encode the text on every run, the caption files are probably still present on disk, so we can just access them directly even when the text encoder output is already cached.
Even if you do want it, it could be a different PR, since using this feature as it is now probably won't break the process.
Accessing the text encoder cache is safe because training is not possible without the text encoder cache, and training is possible without captions.
In fact, for example, with cloud training, training is possible simply by uploading the cache files, and it is assumed that there will be no caption files.
> Accessing the text encoder cache is safe because training is not possible without the text encoder cache, and training is possible without captions. In fact, for example, with cloud training, training is possible simply by uploading the cache files, and it is assumed that there will be no caption files.
Why would you upload cache files instead of caption files? That seems like unrealistic behavior: if you train on a server, you prepare the cache on the server. It's not really reasonable to split the work across two machines, preparing the cache on one machine and training on the other.
Also, you often might want to change captions or add items to / remove items from the dataset, and then creating a new cache is a must.
I don't even have the text encoder locally, but even if I did, I wouldn't split the process.
Captioning is a relatively light task, so it's practical to do it on a local GPU, and many users would prefer to reduce the time spent on cloud servers.
Sweet! I had this same enhancement in my mental backlog. 👏🏽
I'm just about to test it and will post whether it works or not, and update the branch if needed.