Skip to content

Update collect_features to allow different modalities more easily in the Trainer #3276

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tomaarsen
Copy link
Collaborator

@tomaarsen tomaarsen commented Mar 21, 2025

Alternative to #3225

Hello!

Pull Request overview

  • Update collect_features to allow different modalities more easily in the Trainer

Details

This is a follow-up of #3225 by @mengerj. He noticed that if you want to train with more custom modules that don't tokenize with input_ids, pixel_values, or sentence_embedding, then collect_features won't recognize your features. He proposed to allow users to provide more options than those 3, but I think a refactor of collect_features might be more useful:

Instead of relying on the suffixes in e.g. query_input_ids, or sentence_1_pixel_values, we could also rely on the data collator to provide us with information about which feature columns exist (e.g. query, sentence_1) - the data collator has easy access to this.

This means that all (custom) features should work out of the box, without the user having to specify anything special. What do you think, @mengerj?

cc @NohTow I should preserve backwards compatibility here, but I'm giving you a heads up that this change might go through for v4.0.

  • Tom Aarsen

@mengerj
Copy link

mengerj commented Mar 23, 2025

Hi @tomaarsen,

I think the idea is good and I would like to work on it with you. I am currently on vacation, but in April I will take a closer look at it. Maybe we could also have a video chat to talk about the specifics and other ways to improve support for multiple modalities?

Best,
Jonatan

@tomaarsen tomaarsen marked this pull request as draft March 24, 2025 09:03
@tomaarsen
Copy link
Collaborator Author

Thanks for having a quick look - sadly it seems like accelerate doesn't like it when I try to return a list of non-tensors - it will try and concatenate them, but that fails with strings. In short, this fix doesn't work as well as I would have hoped. An alternative is to look at the dataset column_names as those are also the features, but set_transform makes it so it's possible for a dataset to return data with different names than column_names, which complicates things.

I'll put this on draft for now, so we can tackle it after v4.0+.

As for a video chat - I prefer working asynchronously over chat on GitHub, so that would be my preference.

  • Tom Aarsen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants