Add initial support for images in the ai chat #15410
base: master
Conversation
Initial implementation of #15407
This is awesome! Thank you!
I'm wondering whether we could integrate image handling into the existing context concept, extending it with images, rather than building it alongside that concept? Conceptually it would feel like a good fit, next to other context such as files.
Yes, I thought about that. Basically, we need to decide how we want to handle non-message data, e.g. images and audio: as part of the request, or just as "additional information" and thus part of the context?
Maybe even both should be possible?
I tested this PR a bit more and I think it is a great feature that I really would like to get in!
Besides cleaning up the change (some unnecessary comments, commented-out code, and formatting changes), I'm mostly thinking about how to best represent images in the chat session and in the UI. Currently, they are added orthogonally to the infrastructure we already have.
I'm wondering whether we couldn't just reuse the file context variable or create an image context variable adjacent to it. This would give us the UI more or less for free:
- The drag'n'drop support to add images
- The + button support to add images
- Show images just like files in the chat input if they have been attached with the existing context variable UI
A minor disadvantage would be that we wouldn't show a preview of the image in the chat input. But I'm not sure this is a big deal.
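As a rough illustration of the image-context-variable option (the ids, names, and shapes below are hypothetical, not the actual Theia AI variable API):

```ts
// Hypothetical sketch only; names and shapes are illustrative, not real Theia AI types.
const IMAGE_CONTEXT_VARIABLE = {
    id: 'image',
    name: 'image',
    description: 'Attaches an image to the chat request as context',
};

// A resolved instance could carry the image payload, much like a resolved
// file variable carries the file content.
interface ResolvedImageContext {
    variableId: 'image';
    mimeType: string;    // e.g. 'image/png'
    base64Data: string;  // encoded image bytes
}
```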
We could still handle the images in the generic chat agent code (as in this PR now) but simply filter all image context variables out of the context instead of reading the new request.images. I like this because it remains customizable by concrete chat agents. It might be worth, though, giving the getMessages method access to the language model that is being used and adding a property indicating whether it even supports images. If it doesn't, we could avoid creating the image messages and show a warning to the user?
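A minimal sketch of what that could look like, assuming a hypothetical supportsImages flag and simplified message and context shapes (none of these are the real Theia AI signatures):

```ts
// Hypothetical sketch; getMessages, supportsImages, and the message shapes
// are assumptions for illustration, not existing Theia AI APIs.
interface ImageContext { variableId: 'image'; mimeType: string; base64Data: string }
interface OtherContext { variableId: string }
type ContextElement = ImageContext | OtherContext;

function getMessages(
    request: { text: string; context: ContextElement[] },
    model: { supportsImages?: boolean }
): { messages: unknown[]; remainingContext: ContextElement[] } {
    // Split the image context elements from the rest of the context.
    const images = request.context.filter((c): c is ImageContext => c.variableId === 'image');
    const remainingContext = request.context.filter(c => c.variableId !== 'image');

    const messages: unknown[] = [{ type: 'text', text: request.text }];
    if (model.supportsImages) {
        // Turn image context elements into image messages for the model.
        for (const img of images) {
            messages.push({ type: 'image', mimeType: img.mimeType, data: img.base64Data });
        }
    } else if (images.length > 0) {
        // The model cannot handle images: warn instead of silently dropping them.
        console.warn('The selected language model does not support image input.');
    }
    // remainingContext would be rendered into the prompt as before.
    return { messages, remainingContext };
}
```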
What do you think?
Great work with this PR, in any case!!! Looking forward to having this available in Theia AI!
So my thoughts about the context and why I didn't add it directly to it:
Yes, I agree: language models need a way to provide a feature list so that we can check what is supported.
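Roughly along these lines, as a sketch (all names are made up for illustration; this is not an existing Theia AI interface):

```ts
// Hypothetical sketch of a capability/feature list on the language model
// description; property names are illustrative only.
interface LanguageModelCapabilities {
    imageInput: boolean;
    audioInput: boolean;
    toolCalls: boolean;
}

interface LanguageModelDescription {
    id: string;
    capabilities: LanguageModelCapabilities;
}

// A chat agent could then check model.capabilities.imageInput before
// adding image messages to the request, and warn the user otherwise.
```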
@eneufeld Thanks for the feedback! That makes sense. The way we use context so far is indeed a bit different in semantics, especially when thinking about audio. For images, though, I feel the advantages of reusing the existing infrastructure outweigh the semantic mismatch. For audio, I agree: it's semantically equivalent to the user input, like a spoken prompt. Images and files, on the other hand, usually augment the input.

Sure, it may also make sense to send an image without a user prompt (spoken or written). This may actually also be true for any file, e.g. when I copy a log file and want to send it to the LLM without writing a prompt. I don't see it as part of this PR, but we may want to consider enabling the send button if there is at least one context element (image or file).

All in all, going the context-element route seems reasonable to me, especially as it simplifies integrating the UI behaviors (e.g. drag and drop, the + button, a consistent view in the chat input). But I'm totally open to your thoughts here and happy to go with whatever direction you feel is more robust long term. Thank you!
I totally see the advantages of the context. We should just define a consistent logic for what goes into the context (and thus what the context is) and what stays outside of it. @sdirix, any ideas here?
What it does
Initial implementation of #15407
How to test
Copy an image into the clipboard and paste it using CTRL+V.
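For reference, this is roughly how pasted image data can be read in the browser, using the standard ClipboardEvent and FileReader APIs (the attachImage callback is a hypothetical placeholder, not part of this PR's API):

```ts
// Sketch of extracting image blobs from a paste event; attachImage is a
// hypothetical callback that would hand the image to the chat input.
function handlePaste(event: ClipboardEvent, attachImage: (dataUrl: string) => void): void {
    const items = event.clipboardData?.items;
    if (!items) {
        return;
    }
    for (let i = 0; i < items.length; i++) {
        const item = items[i];
        if (item.kind === 'file' && item.type.startsWith('image/')) {
            const file = item.getAsFile();
            if (!file) {
                continue;
            }
            const reader = new FileReader();
            // Deliver the image as a base64 data URL once it has been read.
            reader.onload = () => attachImage(reader.result as string);
            reader.readAsDataURL(file);
            event.preventDefault();
        }
    }
}
```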
Follow-ups
As this is the first introduction of non-text data to the request, it must be decided where and how this kind of data should be stored. All adaptations to the communication from the input widget to the LLM must be reviewed and discussed, also with regard to further kinds of input data (e.g. audio).
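One possible direction, purely as a sketch and not an existing Theia AI API, would be to model the request content as a union of typed parts so that text, images, and later audio can travel through the same path from the input widget to the LLM:

```ts
// Hypothetical sketch of a multimodal request content model; none of these
// types exist as written, they only illustrate the follow-up discussion.
interface TextPart { type: 'text'; text: string }
interface ImagePart { type: 'image'; mimeType: string; base64Data: string }
interface AudioPart { type: 'audio'; mimeType: string; base64Data: string }

type RequestPart = TextPart | ImagePart | AudioPart;

interface MultimodalChatRequest {
    parts: RequestPart[];
}
```

Agents and language model providers could then pick out the part types they support and ignore, or warn about, the rest.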
Breaking changes
Attribution
Review checklist
Reminder for reviewers