Separate `ToMessages` transforms for text+image

Discussion for context: https://github.com/pytorch/torchtune/pull/1820#discussion_r1804488044

Both `ShareGPTToMessages` and `InputOutputToMessages` support image data if specified. However, this is making the logic in these classes increasingly complex, and it could get more bloated as we add support for more modalities. For example, to support audio we would need to add an audio column to the column map and add logic for loading the audio files into the `Message`, and this compounds with every modality we add. It's also not obvious that these transforms support image data unless a user reads the docstrings carefully.

An alternative approach would be to have separate transforms for text-only, text+image, text+other modality, etc.:
- `ShareGPTToMessages`, `ShareGPTImageToMessages`
- `InputOutputToMessages`, `InputOutputImageToMessages`
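
To make the proposal concrete, here is a rough sketch of what the split could look like for the input/output pair. It is not torchtune's actual code: the `Message` dataclass is a simplified stand-in, the `_user_content` hook and the use of subclassing are assumptions for illustration, and the image is kept as a raw column value rather than being loaded.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Mapping, Optional


@dataclass
class Message:
    """Simplified stand-in for torchtune's Message (hypothetical, not the real class)."""

    role: str
    content: List[Dict[str, Any]]


class InputOutputToMessages:
    """Text-only transform: contains no image or other modality logic."""

    def __init__(self, column_map: Optional[Dict[str, str]] = None):
        self.column_map = {"input": "input", "output": "output"}
        self.column_map.update(column_map or {})

    def _user_content(self, sample: Mapping[str, Any]) -> List[Dict[str, Any]]:
        # Hypothetical hook that modality-specific subclasses can extend.
        return [{"type": "text", "content": sample[self.column_map["input"]]}]

    def __call__(self, sample: Mapping[str, Any]) -> Dict[str, Any]:
        output_text = sample[self.column_map["output"]]
        messages = [
            Message(role="user", content=self._user_content(sample)),
            Message(
                role="assistant",
                content=[{"type": "text", "content": output_text}],
            ),
        ]
        return {"messages": messages}


class InputOutputImageToMessages(InputOutputToMessages):
    """Text+image transform: image support is visible in the class name."""

    def __init__(self, column_map: Optional[Dict[str, str]] = None):
        super().__init__(column_map)
        # Default the image column name only if the caller did not map it.
        self.column_map.setdefault("image", "image")

    def _user_content(self, sample: Mapping[str, Any]) -> List[Dict[str, Any]]:
        # Assumption: the image value is passed through as-is (no file loading).
        image = {"type": "image", "content": sample[self.column_map["image"]]}
        return [image] + super()._user_content(sample)
```

With this layout, a text-only dataset uses `InputOutputToMessages()` and a dataset with an image column uses something like `InputOutputImageToMessages(column_map={"image": "img"})`, so the text-only class never needs to know about image columns.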

The drawback of this is that more modalities means more permutations we may have to create message transforms for. Open to other options.

cc @SalmanMohammadi @joecummings @krammnic 

Separate `ToMessages` transforms for text+image #1862
