Description
Discussion for context: #1820 (comment)
Both ShareGPTToMessages and InputOutputToMessages support image data if specified. However, this is leading to increasingly complex logic in these classes, and they could become further bloated once we add support for more modalities. For example, to support audio we would need to add an audio column to the column map and add logic for loading the audio files into the Message, and this cost compounds with every modality we support. It's also not immediately clear that these transforms support image data unless a user reads the docstrings carefully.
An alternative approach would be to have separate transforms for text-only, text+image, text+other modality, etc.:
- ShareGPTToMessages, ShareGPTImageToMessages
- InputOutputToMessages, InputOutputImageToMessages
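
To make the split concrete, here is a rough sketch of what the InputOutput pair could look like. It is only an illustration under stated assumptions: the Message dataclass is a simplified stand-in rather than the library's actual class, and the content dict layout, the `_user_content` hook, and the `load_image` parameter are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Mapping, Optional


@dataclass
class Message:
    """Simplified stand-in for a chat message (assumption, not the real class)."""

    role: str
    content: List[Dict[str, Any]]


class InputOutputToMessages:
    """Text-only transform: knows nothing about images or other modalities."""

    default_columns = {"input": "input", "output": "output"}

    def __init__(self, column_map: Optional[Dict[str, str]] = None):
        # User-supplied column names override the defaults.
        self.column_map = {**self.default_columns, **(column_map or {})}

    def _user_content(self, sample: Mapping[str, Any]) -> List[Dict[str, Any]]:
        return [{"type": "text", "content": sample[self.column_map["input"]]}]

    def __call__(self, sample: Mapping[str, Any]) -> Dict[str, List[Message]]:
        output = sample[self.column_map["output"]]
        return {
            "messages": [
                Message("user", self._user_content(sample)),
                Message("assistant", [{"type": "text", "content": output}]),
            ]
        }


class InputOutputImageToMessages(InputOutputToMessages):
    """Text+image transform: all image handling lives here, not in the base."""

    default_columns = {"input": "input", "output": "output", "image": "image"}

    def __init__(
        self,
        column_map: Optional[Dict[str, str]] = None,
        # Hypothetical hook; a real version would open the file or URL.
        load_image: Callable[[Any], Any] = lambda ref: ref,
    ):
        super().__init__(column_map)
        self.load_image = load_image

    def _user_content(self, sample: Mapping[str, Any]) -> List[Dict[str, Any]]:
        image = self.load_image(sample[self.column_map["image"]])
        # Place the image before the text of the user turn.
        return [{"type": "image", "content": image}] + super()._user_content(sample)
```

In this sketch the class name alone tells the user whether image data is handled, and the text-only class carries no modality-specific branches.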
The drawback of this is that each additional modality means more permutations we may have to create message transforms for. Open to other options.