Update collect_features to allow different modalities more easily in the Trainer #3276

tomaarsen · 2025-03-21T11:05:40Z

Alternative to #3225

Hello!

Pull Request overview

Update collect_features to allow different modalities more easily in the Trainer

Details

This is a follow-up to #3225 by @mengerj. He noticed that if you train with custom modules whose tokenized output does not include input_ids, pixel_values, or sentence_embedding, then collect_features won't recognize your features. He proposed letting users specify additional keys beyond those three, but I think a refactor of collect_features might be more useful:

Instead of relying on suffixes such as those in query_input_ids or sentence_1_pixel_values, we could rely on the data collator to tell us which feature columns exist (e.g. query, sentence_1); the data collator has easy access to this.
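To make the difference concrete, here is a minimal sketch of the two strategies. It is not the actual collect_features implementation: the function names collect_by_suffix and collect_by_columns, the audio_waveform key, and the batch layout are all assumptions made for illustration.

```python
from typing import Any

# The three suffixes named in this PR description.
KNOWN_SUFFIXES = ("input_ids", "pixel_values", "sentence_embedding")


def collect_by_suffix(batch: dict[str, Any]) -> list[dict[str, Any]]:
    """Hypothetical suffix-based grouping: a column is only found if one of
    its keys ends with a known suffix."""
    features = []
    seen_prefixes = set()
    for key in batch:
        for suffix in KNOWN_SUFFIXES:
            if key.endswith("_" + suffix):
                prefix = key[: -len(suffix)]  # e.g. "query_"
                if prefix not in seen_prefixes:
                    seen_prefixes.add(prefix)
                    features.append(
                        {k[len(prefix):]: v for k, v in batch.items() if k.startswith(prefix)}
                    )
                break
    return features


def collect_by_columns(batch: dict[str, Any], column_names: list[str]) -> list[dict[str, Any]]:
    """Hypothetical column-based grouping: the caller (e.g. the data collator)
    supplies the column names, so no suffix needs to be recognized."""
    features = []
    for column in column_names:
        prefix = column + "_"
        features.append(
            {k[len(prefix):]: v for k, v in batch.items() if k.startswith(prefix)}
        )
    return features


# Illustrative batch: "audio_waveform" stands in for a custom modality.
batch = {
    "query_input_ids": [1, 2],
    "query_attention_mask": [1, 1],
    "audio_waveform": [0.1],
    "label": 0,
}
by_suffix = collect_by_suffix(batch)
by_columns = collect_by_columns(batch, ["query", "audio"])
```

In this sketch the suffix-based variant finds only the query column, while the column-based variant also picks up the audio column without any extra configuration.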

This means that all (custom) features should work out of the box, without the user having to specify anything special. What do you think, @mengerj?

cc @NohTow I should preserve backwards compatibility here, but I'm giving you a heads up that this change might go through for v4.0.

Tom Aarsen


mengerj · 2025-03-23T18:45:34Z

Hi @tomaarsen,

I think the idea is good and I would like to work on it with you. I am currently on vacation, but in April I will take a closer look at it. Maybe we could also have a video chat to talk about the specifics and other ways to improve support for multiple modalities?

Best,
Jonatan

tomaarsen · 2025-03-24T09:07:26Z

Thanks for having a quick look. Sadly, it seems accelerate doesn't like it when I try to return a list of non-tensors: it tries to concatenate them, but that fails with strings. In short, this fix doesn't work as well as I had hoped. An alternative is to look at the dataset's column_names, as those are also the features, but set_transform makes it possible for a dataset to return data with different names than column_names, which complicates things.
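The column_names complication can be illustrated with a small pure-Python stand-in. ToyDataset below is a hypothetical class, not the real dataset implementation, and it applies the transform per row for brevity; it only mimics the behavior described above, where column_names keeps reporting the stored columns while the returned data uses whatever keys the transform produces.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset with an on-the-fly transform."""

    def __init__(self, rows):
        self._rows = rows
        self._transform = None

    @property
    def column_names(self):
        # Reports the stored columns, regardless of any transform.
        return list(self._rows[0].keys())

    def set_transform(self, transform):
        # The transform is applied lazily, when a row is accessed.
        self._transform = transform

    def __getitem__(self, index):
        row = dict(self._rows[index])
        return self._transform(row) if self._transform else row


def to_model_inputs(row):
    # Illustrative transform that renames the columns.
    return {"query_text": row["question"], "document_text": row["answer"]}


dataset = ToyDataset([{"question": "What is 2+2?", "answer": "4"}])
dataset.set_transform(to_model_inputs)

stored_columns = dataset.column_names
returned_keys = sorted(dataset[0].keys())
```

Here stored_columns and returned_keys no longer agree, so column_names alone cannot be trusted to name the features that reach the Trainer.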

I'll put this on draft for now, so we can tackle it after v4.0+.

As for a video chat: I prefer working asynchronously here on GitHub.

Tom Aarsen

Update collect_features to allow different modalities more easily in …

16fc8dc

…the Trainer

tomaarsen mentioned this pull request Mar 21, 2025

Additional Trainer Argument for features of different modalities #3225

Open

tomaarsen marked this pull request as draft March 24, 2025 09:03
