The `UserInput` struct can represent a series of messages with media attached to each message:
    return UserInput(
        chat: [
            .system(generate.system),
            .user(prompt, images: media.images, videos: media.videos),
        ],
        processing: media.processing
    )
The chat could include a back-and-forth between the user and the assistant, with additional media attached in later turns.
The `UserInputProcessor` converts this to an `LMInput`:
    public struct LMInput {
        public let text: Text
        public let image: ProcessedImage?
        public let video: ProcessedVideo?
but that only allows for one set of images and videos. It should probably have:
        public let images: [ProcessedImage]
        public let videos: [ProcessedVideo]
though the model would have to be updated to take advantage of that.
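To make the proposal concrete, here is a minimal sketch of an input type that keeps one entry per attached image or video. Everything in it is a hypothetical stand-in: `MultiMediaInput` is not the real `LMInput`, the `ProcessedImage` and `ProcessedVideo` types here only carry a name, and the single-item accessors are just one possible way to keep existing models working while they are updated.

```swift
// Hypothetical stand-ins for the real processed media types.
// The real ones would carry pixel data and frame metadata.
struct ProcessedImage: Equatable { let name: String }
struct ProcessedVideo: Equatable { let name: String }

// Sketch of an input that keeps one entry per attached image or video.
struct MultiMediaInput {
    let text: String
    let images: [ProcessedImage]
    let videos: [ProcessedVideo]

    // Compatibility accessors for models that still expect a single item.
    var image: ProcessedImage? { images.first }
    var video: ProcessedVideo? { videos.first }
}

let input = MultiMediaInput(
    text: "describe the second image",
    images: [ProcessedImage(name: "img.jpeg"), ProcessedImage(name: "img2.jpeg")],
    videos: []
)
```

A model that has not been updated would keep reading `input.image` and see only the first image, while an updated model could iterate over `input.images`.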
Consider this chat:
    > /image /tmp/img.jpeg
    > what animal is in the image?
    [["role": "system", "content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]]], ["role": "user", "content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**]]]
    The animal in the image is a dog.
    > /image /tmp/img2.jpeg
    > describe the second image
    [["content": [["text": "You are a helpful assistant who answers questions in English.", "type": "text"]], "role": "system"], ["content": [["text": "what animal is in the image?", "type": "text"], **["type": "image"]**], "role": "user"], ["content": [["type": "text", "text": "The animal in the image is a dog."]], "role": "assistant"], ["content": [["type": "text", "text": "describe the second image"], **["type": "image"]**], "role": "user"]]
    The image shows a dog wearing a Santa hat.
Ideally the second image would be presented at the second image marker. As it stands today, both images are combined and injected at the first marker.
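The desired behavior could be sketched roughly as follows: walk the chat in order and bind each image marker to the image attached to the same message, so the Nth marker gets the Nth image. `Message`, `MarkerBinding`, and `bindMarkers(in:)` are hypothetical names used only for illustration, not part of the existing API, and images are represented by their file paths to keep the example self-contained.

```swift
// Hypothetical, simplified message type: each message owns its images.
struct Message {
    let role: String
    let text: String
    let images: [String]  // image identifiers, here file paths
}

// One entry per image marker, in the order the markers appear in the prompt.
struct MarkerBinding: Equatable {
    let messageIndex: Int
    let image: String
}

// Walk the chat in order so each marker is bound to the image from its own
// message, instead of pooling every image at the first marker.
func bindMarkers(in chat: [Message]) -> [MarkerBinding] {
    var bindings: [MarkerBinding] = []
    for (index, message) in chat.enumerated() {
        for image in message.images {
            bindings.append(MarkerBinding(messageIndex: index, image: image))
        }
    }
    return bindings
}

let chat = [
    Message(role: "system", text: "You are a helpful assistant who answers questions in English.", images: []),
    Message(role: "user", text: "what animal is in the image?", images: ["/tmp/img.jpeg"]),
    Message(role: "assistant", text: "The animal in the image is a dog.", images: []),
    Message(role: "user", text: "describe the second image", images: ["/tmp/img2.jpeg"]),
]
let bindings = bindMarkers(in: chat)
```

With the chat from this issue, the first marker (message index 1) is bound to `/tmp/img.jpeg` and the second marker (message index 3) to `/tmp/img2.jpeg`.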