Skip to content

Separate ToMessages transforms for text+image #1862

Open
@RdoubleA

Description

Discussion for context: #1820 (comment)

Both ShareGPTToMessages and InputOutputToMessages support image data if specified. However, this is leading to increasingly complex logic in the class and could potentially become further bloated once we add support for more modalities. For example, if we added support for audio, we would need to add an audio column to column map, add logic for loading the audio files into the Message, and this compounds the more modalities we support. It's also not immediately clear that these support image data unless a user looks through docstrings carefully.

An alternative approach would be to have separate transforms for text-only, text+image, text+other modality, etc:

  • ShareGPTToMessages, ShareGPTImageToMessages
  • InputOutputToMessages, InputOutputImageToMessages

The drawback of this is that more modalities means more permutations we may have to create message transforms for. Open to other options.

cc @SalmanMohammadi @joecummings @krammnic

Metadata

Assignees

No one assigned

    Labels

    better engineeringTasks which help improve eng productivity e.g. building tools, cleaning up code, writing docsdiscussionStart a discussion

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions