
LMInput restricts model input to a single collection of images and video frames #282

Open
@davidkoski


See #277 and #276

The UserInput struct can represent a series of messages with media attached to each message:

        return UserInput(
            chat: [
                .system(generate.system),
                .user(prompt, images: media.images, videos: media.videos),
            ],
            processing: media.processing
        )

A chat could also include back-and-forth between the user and the assistant, with additional media attached along the way.

The UserInputProcessor converts this to an LMInput:

public struct LMInput {
    public let text: Text
    public let image: ProcessedImage?
    public let video: ProcessedVideo?

but that allows only a single collection of images and a single collection of video frames. It should probably have:

    public let images: [ProcessedImage]
    public let videos: [ProcessedVideo]

though the model would have to be updated to take advantage of that.
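
A minimal sketch of what the array-based shape might look like. Everything here is hypothetical: `MultiMediaLMInput` is an illustrative name rather than a proposed API, `ProcessedImage` and `ProcessedVideo` are simplified stand-ins for the real types (which carry pixel data), and `text` is a plain `String` instead of `LMInput.Text`. The convenience initializer shows one way existing single-image call sites could keep working.

```swift
// Stand-ins for the real types, which hold pixel arrays and frame metadata.
struct ProcessedImage { let id: String }
struct ProcessedVideo { let id: String }

// Hypothetical array-based variant of LMInput (not the library's actual type).
struct MultiMediaLMInput {
    let text: String  // stand-in for LMInput.Text
    let images: [ProcessedImage]
    let videos: [ProcessedVideo]

    // Array-based initializer: one entry per attached image or video.
    init(text: String, images: [ProcessedImage], videos: [ProcessedVideo] = []) {
        self.text = text
        self.images = images
        self.videos = videos
    }

    // Compatibility initializer mirroring today's optional image/video pair.
    init(text: String, image: ProcessedImage? = nil, video: ProcessedVideo? = nil) {
        self.text = text
        self.images = image.map { [$0] } ?? []
        self.videos = video.map { [$0] } ?? []
    }
}

// Old-style call site: a single optional image becomes a one-element array.
let single = MultiMediaLMInput(text: "what animal is this?", image: ProcessedImage(id: "img"))

// New-style call site: each image stays a separate entry.
let multi = MultiMediaLMInput(
    text: "describe the second image",
    images: [ProcessedImage(id: "img"), ProcessedImage(id: "img2")]
)
```

Keeping the images as separate entries is what would let a model line each one up with its own marker instead of receiving one merged collection.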

Consider this chat:

> /image /tmp/img.jpeg


> what animal is in the image?
[["role": "system", "content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]]], ["role": "user", "content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**]]]
The animal in the image is a dog.

> /image /tmp/img2.jpeg


> describe the second image
[["content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]], "role": "system"], ["content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**], "role": "user"], ["content": [["type": "text", "text": "The animal in the image is a dog."]], "role": "assistant"], ["content": [["type": "text", "text": "describe the second image"], **["type": "image"]**], "role": "user"]]
The image shows a dog wearing a Santa hat.

Ideally, the second image would be presented at the second image marker. As it is today, both images are combined and injected at the first marker.
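
The desired behavior can be sketched as pairing image markers with images in order of appearance. This is an illustration only: `Part`, `Message`, and `assignImages` are hypothetical names that model the message structure printed above, not types or functions from the library.

```swift
// Hypothetical model of one content entry: text, or an image marker
// like ["type": "image"] in the dumps above.
enum Part {
    case text(String)
    case image
}

struct Message {
    let role: String
    let content: [Part]
}

// Pairs the n-th image marker in the chat with the n-th image.
// Returns (message index, image index) pairs in order of appearance.
func assignImages(to messages: [Message], imageCount: Int) -> [(message: Int, image: Int)] {
    var next = 0
    var result: [(message: Int, image: Int)] = []
    for (index, message) in messages.enumerated() {
        // Visit only the image markers in this message.
        for case .image in message.content {
            // Stop if there are more markers than images.
            guard next < imageCount else { return result }
            result.append((message: index, image: next))
            next += 1
        }
    }
    return result
}

// The second chat from the issue: two user turns, each with one image marker.
let chat = [
    Message(role: "system", content: [.text("You are a helpful assistant who answers questions in English.")]),
    Message(role: "user", content: [.text("what animal is in the image?"), .image]),
    Message(role: "assistant", content: [.text("The animal in the image is a dog.")]),
    Message(role: "user", content: [.text("describe the second image"), .image]),
]

let pairs = assignImages(to: chat, imageCount: 2)
```

With this pairing, the first image belongs to the message at index 1 and the second image to the message at index 3, rather than both landing on the first marker.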
